Instructor: 
Alex Thomo 
Phone: 
(250) 472-5786
Office: 
ECS 556 
Office Hours: 
T 2:30 - 3:30 p.m., F 1:30 - 2:30
Email: 
thomo@cs.uvic.ca 
TA: 
Marina Barsky 
Email: 
mgbarsky@uvic.ca 
Course Outline: 
Link 
Books:
Introduction to Data Mining (First Edition)
by Pang-Ning Tan, Michael Steinbach, Vipin Kumar.
Addison Wesley, 2005. (PSK)
2 hours reserve in the library.
Data Mining: Practical Machine Learning Tools and Techniques
by Ian H. Witten, Eibe Frank.
Morgan Kaufmann; 2nd edition, 2005. (WF)
2 hours reserve in the library.
Programming Collective Intelligence
by Toby Segaran
O'Reilly; 1st edition, 2007. (SEG)
Accessible online through the UVic library: link.
Marks so far:
link.
Midterm Solutions:
link.
Reading list:
link
Assignments:
Assignment 1.
Hints.
Solutions.
Assignment 2.
Solutions.
Assignment 3.
Solutions.
Project:
Description.
Labs (by Marina Barsky):
Lab1,
Lab2,
Lab3,
Lab4,
Lab5,
Lab6,
Lab7,
Lab8
Lecture Handouts:
Predictive Data Mining
 Intro to Data Mining
Slides.
 Applying Decision Trees. Learning Decision Trees. Measures of Node Impurity, Entropy. Information Gain.
Decision Trees with Numerical Attributes. Regression Trees.
Slides (1).
Slides (2).
Python Code.
 SLIQ and SPRINT Decision Tree Algorithms.
Slides.
SLIQ paper.
SPRINT paper.
 MapReduce Framework.
Slides.
MapReduce paper.
Python test code.
Word count example.
 Rule-Based Classifiers. Coverage and Accuracy.
Decision Trees vs. Rules. Ordered Rule Set.
Separate-and-conquer algorithms. PRISM and RIPPER algorithms.
Slides.
 Uncertain knowledge. Belief and Probability. Conditional
probability. Bayes' Rule. Conditional Independence. Normalization constant.
Naive Bayes Classifier. Text Categorization.
Slides.
 Bayesian Belief Networks: Semantics, Inference, Classification, Construction, Complexity.
Slides.
 Bayesian Belief Networks: Practice.
Slides (a).
Slides (b).
 Credibility: Evaluating what's been learned. Predicting performance. Confidence intervals.
Holdout estimation. Cross-validation. The bootstrap.
Counting the cost.
Slides.
 ROC curves.
Slides.
(A more concise version is
here.
A useful page with tutorials and code is
here.)
 Linear Separators: Hyperplane Geometry, Margin, Perceptron Algorithm.
Beyond Linear Separability: Kernel Trick. Support Vectors.
Slides.
See also
PointLineDistance.
 Beyond Linear Separability: Artificial Neural Networks.
Slides.
 Genetic Algorithms.
Slides.
 Instance-Based Learning.
Slides.
 Recommender Systems.
Slides.
Association Analysis
 Frequent Itemset Generation: The Apriori Principle, Apriori Algorithm, Candidate Generation and Pruning, Support Counting.
Slides.
 More on Apriori Algorithm. Rule Generation: Confidence-Based Pruning, Rule Generation in Apriori
Algorithm. Compact Representation of Frequent Itemsets: Maximal Frequent Itemsets, Closed Frequent Itemsets.
Slides.

Alternative Methods for Frequent Itemset Generation.
FP-Growth Algorithm: FP-Tree Representation, Frequent Itemset Generation in FP-Growth Algorithm.
Slides.
 FP-Tree/FP-Growth Complete Example.
Slides.
 Evaluation of Association Patterns: Objective Measures of Interestingness.
Simpson's Paradox.
Skewed distribution, Cross-support patterns, Lowest confidence rule.
Slides.

Data Engineering: Transforming attributes. Multilevel Association Rules.
Mining word associations. Min-Apriori.
Slides.
 Mining of sequences. Candidate Generation. Timing Constraints.
Slides.

Mining Graphs.
Frequent Subgraph Mining. Edge Growing. Multiplicity of Candidates.
Slides.

Finding Similar Items. Minhashing. Locality-Sensitive Hashing.
Slides.
Cluster Analysis
 Applications of Cluster Analysis. Types of Clusters. K-means Algorithm.
Problems with Selecting Initial Points. Bisecting K-means.
Limitations of K-means.
Slides.
 Agglomerative Hierarchical Clustering. Density-based clustering: DBSCAN.
Slides.
 Self-Organizing Maps. HICAP: Hierarchical Clustering with Pattern Preservation.
Mining the Web
Assignments: There will be three assignments.
Interesting Links:
A Map Reduce Framework for Programming Graphics Processors
Mars: A MapReduce Framework on Graphics Processors
