CISC333 Data Mining: Fall 2011

2010 final exam

Fall 2008 final exam

Winter 2008 final exam

2007 final exam

2006 final exam

2005 final exam

Announcements

Introduction

This course is offered in Fall 2011 in slot 14 in Botterell B148. The prerequisites are CISC121 and a stats course. Although the course is numbered at the third-year level, second-year students who wish to take it may do so.

Course Material

The outline form of the class presentations is available here in the subdirectory announced in class. File names are of the form modulei.ppt, where i starts at 0. You may wish to print these and take notes on them, or load them onto your laptop and bring them to class to follow along. I will add substantial content to these outlines in class.

When each module is completed, the annotated version will also be available in the other subdirectory announced in class, with file names of the form donemodulei.ppt. (Some of the PowerPoint slides have been adapted, with permission, from a college data mining course by Piatetsky-Shapiro and Parker.)

Tutorials

The tool we will use for most of the practical work in the course is Rapidminer, a development from the Weka toolkit. You can find extensive tutorial material on the Rapid-I website.

Rapidminer is available for free download at http://rapid-i.com/; you need the Rapidminer Community Edition (this should be Rapidminer 5).

Rapidminer is also available on the Caslab machines, under either Windows or Linux.

Tutorial on using Matlab (useful for visualization) and the SVD and ICA matrix decompositions.

Exercises and Assignments

Exercises are a chance for you to get some hands-on experience. The exercise questions will often be open-ended. You might expect to spend 3 or 4 hours on these sheets each week. Each one is marked on this scale: acceptable, inadequate, or not seriously attempted. There will be five exercise sheets in the first half of the term.

(Exercise sheets will appear a week or ten days before they are due.)

Outline

This table describes what we will cover, keyed to the modules and text.

Module | Status | Content | Text Refs
Module 0 | Done | Introduction | Chapter 1
Module 1 | Done | Prediction and Exploration | Bits of Chapter 2
Module 2 | Done | Data preparation and model quality | Chapter 2
Module 3 | Done | Simple predictors |
Module 4 | Done | Decision trees | Chapter 4
Module 5 | Done | More decision trees |
Module 6 | Done | Neural networks | Chapter 5.4
Module 7 | Done | Support Vector Machines | Chapter 5.5
Module 8 | Done | Rule-based systems | Chapter 5.1
Module 9 | Done | Object selection: sampling, ensemble techniques | Chapter 5.6
Module 10 | Done | Attribute selection | Section 2.3.4
Module 11 | Done | Prediction case studies |
Module 12 | Done | Clustering I: similarities, k-means | Chapter 8.2
Module 13 | Done | Clustering II: Expectation-Maximization | Section 9.2.2
Module 14 | Done | Clustering III: Top-down and bottom-up clustering | Chapter 8.3
Module 15 | Done | Clustering IV: Matrix decompositions |
Module 16 | Done | Clustering V: Clustering Large Datasets | Section 9.5.2, Section 8.4.2
Module 17 | Done | Visualization | Chapter 3.3
Module 18 | Done | Clustering Case Study |
Module 19 | Done | Clustering VI: Biclustering |
Module 20 | Skipped | Biclustering Case Study: Topic Detection |
Module 21 | Done | Mining the Web |
Module 22 | Done | Collaborative Filtering |
Module 23 | Done | Counterterrorism and Fraud |
Module 24 | Skipped | Summary |

Avail means that the basic powerpoint is available. Done means that the marked-up powerpoint is available.

Text

Tan, Steinbach, Kumar, Introduction to Data Mining, Addison-Wesley, 2006, ISBN 0-321-32136-7 ($104 at Amazon, $97.95 at Campus Bookstore).

Assessment

There are 3 deliverables:

  1. Five weekly exercise sheets, one due in each of weeks 2-6. These exercise sheets are not marked as such. However, for each exercise sheet that is not adequate, your final mark in the course will be reduced by a factor of 0.02. The exercise sheets are useful both for understanding the material and as groundwork for the project, so they are worth doing!
  2. A project in which you can apply data mining techniques to a real dataset and report on what you discover, due at the end of week 11, and worth 30% of your final grade.
  3. A final examination, worth the remaining 70% of your grade. You must pass the exam to pass the course. You may bring one 8.5x11 sheet of notes into the exam.
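
As a rough illustration of how these weights combine, here is a minimal sketch in Python. It rests on assumptions that are not stated above: marks are on a 0-100 scale, and "reduced by a factor of 0.02" is read as a 2% cut of the weighted mark for each inadequate sheet (other readings are possible). The names final_mark and inadequate_sheets are hypothetical, and the rule that you must pass the exam is not modelled because no pass mark is given here.

```python
def final_mark(project: float, exam: float, inadequate_sheets: int = 0) -> float:
    """Combine project and exam marks (each assumed to be on a 0-100 scale)."""
    if not 0 <= inadequate_sheets <= 5:
        raise ValueError("there are only five exercise sheets")
    # Weights from the assessment list: project 30%, exam 70%.
    weighted = 0.30 * project + 0.70 * exam
    # ASSUMPTION: each inadequate sheet removes 2% of the weighted mark.
    return weighted * (1 - 0.02 * inadequate_sheets)


if __name__ == "__main__":
    # Example: project 80, exam 70, two inadequate exercise sheets.
    print(round(final_mark(80, 70, inadequate_sheets=2), 2))
```

Under this reading, missing all five sheets would cost 10% of the weighted mark, which is why the sheets are worth doing even though they carry no marks of their own.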

Note that the assessment in the course is backloaded, so please take this into account when planning your procrastination.

Your instruction team

Instructor

David Skillicorn
528 Goodwin Hall
skill cs queensu ca
533 6065

Questions? Try asking me before or after class, or come and find me at my office any time I'm there. I have a schedule posted by my door. I'm teaching this course, another in slot 2, and all Friday afternoon, so there's no use trying around those times.

I have a scheduled office hour from 1-2 on Tuesdays.

Teaching Assistant

Unless the enrollment picks up, we won't have one. Why not encourage your friends to take the course?

Resources

Study skills. You probably know all of the conventional wisdom about how to learn, but perhaps you don't actually use it. Here is an excellent link: Study Hacks.

You may also want to subscribe to, or read, KDnuggets.