SENG 474: Data Mining - Spring 2009

Instructor: Alex Thomo
Phone: (250) 472-5786
Office: ECS 556
Office Hours: T 2:30 - 3:30 p.m., F 1:30 - 2:30 p.m.
TA: Marina Barsky
Course Outline: Link


Introduction to Data Mining (First Edition)
by Pang-Ning Tan, Michael Steinbach, Vipin Kumar.
Addison Wesley, 2005. (PSK)
On 2-hour reserve in the library.

Data Mining: Practical Machine Learning Tools and Techniques
by Ian H. Witten, Eibe Frank.
Morgan Kaufmann; 2nd edition, 2005. (WF)
On 2-hour reserve in the library.

Programming Collective Intelligence
by Toby Segaran
O'Reilly; 1st edition, 2007. (SEG)
Accessible online through the UVic library: link.

Marks so far: link.

Midterm Solutions: link.

Reading list: link


Assignment 1. Hints. Solutions.
Assignment 2. Solutions.
Assignment 3. Solutions.

Project: Description.

Labs (by Marina Barsky): Lab1, Lab2, Lab3, Lab4, Lab5, Lab6, Lab7, Lab8

Lecture Handouts:

Predictive Data Mining

  • Intro to Data Mining Slides.
  • Applying Decision Trees. Learning Decision Trees. Measures of Node Impurity, Entropy. Information Gain. Decision Trees with Numerical Attributes. Regression Trees. Slides (1). Slides (2). Python Code.
  • SLIQ and SPRINT Decision Tree Algorithms. Slides. SLIQ paper. SPRINT paper.
  • MapReduce Framework. Slides. MapReduce paper. Python test code. Word count example.
  • Rule-Based Classifiers. Coverage and Accuracy. Decision Trees vs. Rules. Ordered Rule Sets. Separate-and-Conquer Algorithms. PRISM and RIPPER Algorithms. Slides.
  • Uncertain knowledge. Belief and Probability. Conditional probability. Bayes' Rule. Conditional Independence. Normalization constant. Naive Bayes Classifier. Text Categorization. Slides.
  • Bayesian Belief Networks: Semantics, Inference, Classification, Construction, Complexity. Slides.
  • Bayesian Belief Networks: Practice. Slides (a). Slides (b).
  • Credibility: Evaluating what's been learned. Predicting performance. Confidence intervals. Holdout estimation. Cross-validation. The bootstrap. Counting the cost. Slides.
  • ROC curves. Slides. (A more concise version is here. A useful page with tutorials and code is here.)
  • Linear Separators: Hyperplane Geometry, Margin, Perceptron Algorithm. Beyond Linear Separability: Kernel Trick. Support Vectors. Slides. See also Point-Line Distance.
  • Beyond Linear Separability: Artificial Neural Networks. Slides.
  • Genetic Algorithms. Slides.
  • Instance-Based Learning. Slides.
  • Recommender Systems. Slides.

Association Analysis

  • Frequent Itemset Generation: The Apriori Principle, Apriori Algorithm, Candidate Generation and Pruning, Support Counting. Slides.
  • More on the Apriori Algorithm. Rule Generation: Confidence-Based Pruning, Rule Generation in the Apriori Algorithm. Compact Representation of Frequent Itemsets: Maximal Frequent Itemsets, Closed Frequent Itemsets. Slides.
  • Alternative Methods for Frequent Itemset Generation. FP-Growth Algorithm: FP-Tree Representation, Frequent Itemset Generation in the FP-Growth Algorithm. Slides.
  • FP-Tree/FP-Growth Complete Example. Slides.
  • Evaluation of Association Patterns: Objective Measures of Interestingness. Simpson's Paradox. Skewed Support Distribution, Cross-Support Patterns, Lowest Confidence Rule. Slides.
  • Data Engineering: Transforming attributes. Multi-level Association Rules. Mining word associations. Min-Apriori. Slides.
  • Mining of sequences. Candidate Generation. Timing Constraints. Slides.
  • Mining Graphs. Frequent Subgraph Mining. Edge Growing. Multiplicity of Candidates. Slides.
  • Finding Similar Items. Minhashing. Locality Sensitive Hashing. Slides.

Cluster Analysis

  • Applications of Cluster Analysis. Types of Clusters. K-means Algorithm. Problems with Selecting Initial Points. Bisecting K-means. Limitations of K-means. Slides.
  • Agglomerative Hierarchical Clustering. Density-Based Clustering: DBSCAN. Slides.
  • Self-Organizing Maps. HICAP: Hierarchical Clustering with Pattern Preservation.

Mining the Web

There will be three assignments.

Interesting Links:
A Map Reduce Framework for Programming Graphics Processors
Mars: A MapReduce Framework on Graphics Processors