SENG 474: Data Mining - Spring 2009

Instructor:	Alex Thomo
Phone:	(250) 472-5786
Office:	ECS 556
Office Hours:	T 2:30 - 3:30 p.m., F 1:30 - 2:30 p.m.
Email:	thomo@cs.uvic.ca
TA:	Marina Barsky
Email:	mgbarsky@uvic.ca
Course Outline:	Link

Books:

Introduction to Data Mining (First Edition)
by Pang-Ning Tan, Michael Steinbach, Vipin Kumar.
Addison Wesley, 2005. (PSK)
2 hours reserve in the library.

Data Mining: Practical Machine Learning Tools and Techniques
by Ian H. Witten, Eibe Frank.
Morgan Kaufmann; 2nd edition, 2005. (WF)
2 hours reserve in the library.

Programming Collective Intelligence
by Toby Segaran
O'Reilly; 1st edition, 2007. (SEG)
Accessible online through the UVic library: link.

Marks so far: link.

Midterm Solutions: link.

Reading list: link.

Assignments:

Assignment 1. Hints. Solutions.
Assignment 2. Solutions.
Assignment 3. Solutions.

Project: Description.

Labs (by Marina Barsky): Lab1, Lab2, Lab3, Lab4, Lab5, Lab6, Lab7, Lab8

Lecture Handouts:

Predictive Data Mining

Intro to Data Mining Slides.
Applying Decision Trees. Learning Decision Trees. Measures of Node Impurity, Entropy. Information Gain. Decision Trees with Numerical Attributes. Regression Trees. Slides (1). Slides (2). Python Code.
SLIQ and SPRINT Decision Tree Algorithms. Slides. SLIQ paper. SPRINT paper.
MapReduce Framework. Slides. MapReduce paper. Python test code. Word count example.
Rule-Based Classifiers. Coverage and Accuracy. Decision Trees vs. rules. Ordered Rule Set. Separate-and-conquer algorithms. PRISM and RIPPER algorithms. Slides.
Uncertain knowledge. Belief and Probability. Conditional probability. Bayes' Rule. Conditional Independence. Normalization constant. Naive Bayes Classifier. Text Categorization. Slides.
Bayesian Belief Networks: Semantics, Inference, Classification, Construction, Complexity. Slides.
Bayesian Belief Networks: Practice. Slides (a). Slides (b).
Credibility: Evaluating what's been learned. Predicting performance. Confidence intervals. Holdout estimation. Cross-validation. The bootstrap. Counting the cost. Slides.
ROC curves. Slides. (A more concise version is here. A useful page with tutorials and code is here.)
Linear Separators: Hyperplane Geometry, Margin, Perceptron Algorithm. Beyond Linear Separability: Kernel Trick. Support Vectors. Slides. See also Point-LineDistance.
Beyond Linear Separability: Artificial Neural Networks. Slides.
Genetic Algorithms. Slides.
Instance Based Learning. Slides.
Recommender Systems. Slides.

Association Analysis

Frequent Itemset Generation: The Apriori Principle, Apriori Algorithm, Candidate Generation and Pruning, Support Counting. Slides.
More on Apriori Algorithm. Rule Generation: Confidence-Based Pruning, Rule Generation in Apriori Algorithm. Compact Representation of Frequent Itemsets: Maximal Frequent Itemsets, Closed Frequent Itemsets. Slides.
Alternative Methods for Frequent Itemset Generation. FP-Growth Algorithm: FP-Tree Representation, Frequent Itemset Generation in FP-Growth Algorithm. Slides.
FPTree/FPGrowth Complete Example. Slides.
Evaluation of Association Patterns: Objective Measures of Interestingness. Simpson's Paradox. Skewed distribution, Cross-support patterns, Lowest confidence rule. Slides.
Data Engineering: Transforming attributes. Multi-level Association Rules. Mining word associations. Min-Apriori. Slides.
Mining of sequences. Candidate Generation. Timing Constraints. Slides.
Mining Graphs. Frequent Subgraph Mining. Edge Growing. Multiplicity of Candidates. Slides.
Finding Similar Items. Minhashing. Locality Sensitive Hashing. Slides.

Cluster Analysis

Applications of Cluster Analysis. Types of Clusters. K-means Algorithm. Problems with Selecting Initial Points. Bisecting K-means. Limitations of K-means. Slides.
Agglomerative Hierarchical Clustering. Density-Based Clustering: DBSCAN. Slides.
Self Organizing Maps. HICAP: Hierarchical Clustering with Pattern Preservation.

Mining the Web

Information Retrieval. PageRank. Web Link Matrix. Dead Ends and Spider Traps. Slides.
Latent Semantic Analysis. Slides. Tutorial. Excel spreadsheet for the example.

Assignments:
There will be three assignments.

Interesting Links:
A Map Reduce Framework for Programming Graphics Processors
Mars: A MapReduce Framework on Graphics Processors