CISC333 Data Mining

CISC333 Data Mining: Fall 2013

Announcements

Introduction

This course is offered in Fall 2013 in slot 2 in Ontario 207. The prerequisites are CISC121 and a statistics course. Although the course is numbered as a third-year course, second-year students who wish to take it may do so (this was more likely when the course was offered in the Winter Term).

Course Material

This course is available on Moodle, so look for most of the information there.

Tutorials

The tool we will use for most of the practical work in the course is RapidMiner, a development from the Weka toolkit. You can find extensive tutorial material on the Rapid-I website.

RapidMiner is available as a free download at http://rapid-i.com/; you need the RapidMiner Community Edition (which should be RapidMiner 5).

RapidMiner is also available on the Caslab machines, under either Windows or Linux.

Tutorial on using Matlab (useful for visualization) and the SVD and ICA matrix decompositions.

Exercises and Assignments

Exercises are a chance for you to get some hands-on experience. The exercise questions will often be open-ended. You might expect to spend 3 or 4 hours on these sheets each week. Each one is marked on this scale: acceptable; inadequate; or not seriously attempted. There will be five exercise sheets in the first half of the term.

(Exercise sheets will appear a week or ten days before they are due.)

Outline

This table describes what we will cover, keyed to the modules and text.

Avail means that the basic PowerPoint is available. Done means that the marked-up PowerPoint is available.

Text
Tan, Steinbach, Kumar, Introduction to Data Mining, Addison-Wesley, 2006, ISBN 0-321-32136-7 ($105 at Amazon, $97.95 at Campus Bookstore).

Assessment
There are 3 deliverables:

Five weekly exercise sheets, one due in each of weeks 2-6. These exercise sheets are not marked as such. However, for each exercise sheet that is not adequate, your final mark in the course will be reduced by a factor of 0.02. The exercise sheets are useful both for understanding the material and as groundwork for the project, so they are worth doing!
A project in which you can apply data mining techniques to a real dataset and report on what you discover, due at the end of week 11, and worth 30% of your final grade.
A final examination, worth the remaining 70% of your grade. You must pass the exam to pass the course. You may bring one 8.5x11 sheet of notes into the exam.
Note that the assessment in the course is backloaded, so please take this into account when planning your procrastination.
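As a rough illustration of how the three deliverables combine, the sketch below computes a final mark from a project mark and an exam mark, both on a 0-100 scale. It assumes that "reduced by a factor of 0.02" means the weighted mark is multiplied by (1 - 0.02 x number of inadequate sheets). That reading, the function name final_mark, and its parameter names are illustrative assumptions, not the official formula. The rule that you must pass the exam is not modelled, because the pass mark is not stated here.

```python
def final_mark(project, exam, inadequate_sheets, penalty_per_sheet=0.02):
    """Combine a project mark (30%) and an exam mark (70%), both 0-100.

    ASSUMPTION: each inadequate exercise sheet removes 2% of the weighted
    mark, i.e. the mark is scaled by (1 - 0.02 * inadequate_sheets).
    This is one possible reading of the penalty, not a confirmed rule.
    """
    if not 0 <= inadequate_sheets <= 5:
        raise ValueError("there are only five exercise sheets")
    # Weighted sum of the two graded deliverables.
    weighted = 0.30 * project + 0.70 * exam
    # Apply the assumed penalty for inadequate exercise sheets.
    return weighted * (1 - penalty_per_sheet * inadequate_sheets)


if __name__ == "__main__":
    # Hypothetical marks, chosen only to show the arithmetic.
    print(final_mark(project=80, exam=70, inadequate_sheets=0))
    print(final_mark(project=80, exam=70, inadequate_sheets=2))
```

With these hypothetical marks, the weighted mark is 0.30 x 80 + 0.70 x 70 = 73, and two inadequate sheets would bring it to about 70.08 under the assumed penalty.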

Your instruction team
Instructor
David Skillicorn
528 Goodwin Hall
skill cs queensu ca
533 6065

Questions? Try asking me before or after class, or come and find me at my office any time I'm there. I have a schedule posted by my door.
I will schedule an office hour once term has started.
Teaching Assistant

Resources
Study skills. You probably know all of the conventional wisdom about how to learn, but perhaps you don't actually use it. Here is an excellent link: Study Hacks.
You may also want to subscribe to, or read, KDnuggets.

Module	Content	Text Refs
Module 0	Introduction	Chapter 1
Module 1	Prediction and Exploration	Bits of Chapter 2
Module 2	Data preparation and model quality	Chapter 2
Module 3	Simple predictors
Module 4	Decision trees	Chapter 4
Module 5	More decision trees
Module 6	Neural networks	Chapter 5.4
Module 7	Support Vector Machines	Chapter 5.5
Module 8	Rule-based systems	Chapter 5.1
Module 9	Object selection: sampling, ensemble techniques	Chapter 5.6
Module 10	Attribute selection	Section 2.3.4
Module 11	Prediction case studies
Module 12	Clustering I: similarities, k-means	Chapter 8.2
Module 13	Clustering II: Expectation-Maximization	Section 9.2.2
Module 14	Clustering III: Top-down and bottom-up clustering	Chapter 8.3
Module 15	Clustering IV: Matrix decompositions
Module 16	Clustering V: Clustering Large Datasets	Section 9.5.2, Section 8.4.2
Module 17	Visualization	Chapter 3.3
Module 18	Clustering Case Study
Module 19	Clustering VI: Biclustering
Module 20	Biclustering Case Study: Topic Detection
Module 21	Mining the Web
Module 22	Collaborative Filtering
Module 23	Adversarial Data Mining
Module 24	Summary