| 
        
        
          | Instructor: | Alex Thomo |  
          | Phone: | (250) 472-5786 |  
          | Office: | ECS 556 |  
          | Office Hours: | T 2:30 - 3:30 p.m., F 1:30 - 2:30 |  
          | Email: | thomo@cs.uvic.ca |  
          | TA: | Marina Barsky |  
          | Email: | mgbarsky@uvic.ca |  
          | Course Outline: | Link |  
       
       
       Books: 
       
	  Introduction to Data Mining (First Edition) by Pang-Ning Tan, Michael Steinbach, Vipin Kumar.
 Addison-Wesley, 2005. (PSK)
 2 hours reserve in the library.
 		
        Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank.
 Morgan Kaufmann; 2nd edition, 2005. (WF)
 2 hours reserve in the library.
         
	Programming Collective Intelligence by Toby Segaran
 O'Reilly; 1st edition, 2007. (SEG)
 Accessible online through the UVic library: link.
 Marks so far: link.
 Midterm Solutions: link.
 Reading list: link.
 Assignments:
 Assignment 1. Hints. Solutions.
 Assignment 2. Solutions.
 Assignment 3. Solutions.
 Project: Description.
 Labs (by Marina Barsky): Lab1, Lab2, Lab3, Lab4, Lab5, Lab6, Lab7, Lab8.
       Lecture Handouts: 
	    Predictive Data Mining
       
        Intro to Data Mining 
		Slides. 
        Applying Decision Trees. Learning Decision Trees. Measures of Node Impurity, Entropy. Information Gain.
		Decision Trees with Numerical Attributes. Regression Trees.
		Slides (1).
		Slides (2).
		Python Code. 
	SLIQ and SPRINT Decision Tree Algorithms.
	Slides.
	SLIQ paper.
	SPRINT paper.
	MapReduce Framework.
	Slides.
	MapReduce paper.
	Python test code. 
	Word count example.
		
		
		 Rule-Based Classifiers. Coverage and Accuracy. 
			Decision Trees vs. rules. Ordered Rule Set. 
			Separate-and-conquer algorithms. PRISM and RIPPER algorithms.
		Slides.
		 Uncertain knowledge. Belief and Probability. Conditional probability. Bayes' Rule. Conditional Independence. Normalization constant.
Naive Bayes Classifier. Text Categorization.
Slides.
		Bayesian Belief Networks: Semantics, Inference, Classification, Construction, Complexity.
		Slides.
                Bayesian Belief Networks: Practice.
                Slides (a).
		Slides (b).
Credibility: Evaluating what's been learned. Predicting performance. Confidence intervals.
Holdout estimation. Cross-validation. The bootstrap. 
Counting the cost. 
Slides.
 ROC curves. 
Slides.
(A more concise version is here. A useful page with tutorials and code is here.)
 
Linear Separators: Hyperplane Geometry, Margin, Perceptron Algorithm.
Beyond Linear Separability: Kernel Trick. Support Vectors.
Slides. 
See also Point-LineDistance.
Beyond Linear Separability: Artificial Neural Networks.
Slides.
Genetic Algorithms.
Slides.
Instance-Based Learning.
Slides.
Recommender Systems.                               
Slides.
  Association Analysis
       
        Frequent Itemset Generation: The Apriori Principle, Apriori Algorithm, Candidate Generation and Pruning, Support Counting.
	Slides.
		
		More on Apriori Algorithm. Rule Generation: Confidence-Based Pruning, Rule Generation in Apriori Algorithm.
Compact Representation of Frequent Itemsets: Maximal Frequent Itemsets, Closed Frequent Itemsets.
Slides.
Alternative Methods for Frequent Itemset Generation.
FP-Growth Algorithm: FP-Tree Representation, Frequent Itemset Generation in FP-Growth Algorithm.
Slides.
FPTree/FPGrowth Complete Example. 
Slides.
Evaluation of Association Patterns: Objective Measures of Interestingness. 
Simpson's Paradox. 
Skewed distribution, Cross-support patterns, Lowest confidence rule.
Slides.
Data Engineering: Transforming attributes. Multi-level Association Rules.
Mining word associations. Min-Apriori. 
Slides.
Mining of sequences. Candidate Generation. Timing Constraints.
Slides.
Mining Graphs. 
Frequent Subgraph Mining. Edge Growing. Multiplicity of Candidates. 
Slides.
Finding Similar Items. Minhashing. Locality Sensitive Hashing. 
Slides.
	  Cluster Analysis
       
        Applications of Cluster Analysis. Types of Clusters. K-means Algorithm. 
            Problems with Selecting Initial Points. Bisecting K-means.
		Limitations of K-means. 
	    Slides.
	Agglomerative Hierarchical Clustering. Density-Based Clustering: DBSCAN.
	    Slides.
	Self-Organizing Maps. HICAP: Hierarchical Clustering with Pattern Preservation.
	  Mining the Web
 Assignments: There will be three assignments.
 Interesting Links:
 A Map Reduce Framework for Programming Graphics Processors
 Mars: A MapReduce Framework on Graphics Processors
 |