CISC873 Data Mining

This course is offered in Fall 2017. Classes are Wednesdays from 2:00-6:00 in Goodwin 521.

Data mining builds inductive models from data. Almost all organisations, and many individuals, accumulate data from their interactions, and can use this data to improve service, and sometimes profit. Some examples:

- placing foods that are often bought at the same time on the same shelf can increase supermarket revenue;
- spotting customers who are unlikely to renew a subscription or contract, and offering them a small incentive, can improve retention rates;
- noticing unusual patterns in bank account or credit card use can detect fraud;
- associating medical tests that commonly occur together helps to find doctors who are ordering bogus tests.
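As a minimal sketch of the first example, the following Python counts how often each pair of items appears in the same basket. The transactions and the function name `pair_counts` are made up for illustration; a real analysis would use actual sales data and a proper association-mining technique.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is the contents of one basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always recorded in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

In this toy data the most frequent pair is bread and butter, which would be the candidates for placing on the same shelf.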

This course is a project course. We will examine a number of datasets, with each participant using a particular technique to investigate each dataset and see what structure the technique discovers. You will have a chance to try several different techniques during the course.

A good working knowledge of standard software environments is required, especially the ability to develop scripts and plot data (e.g. Excel, Matlab, OpenGL + Perl, Python, Awk). Some elementary knowledge of statistics and probability is also required.

The course is an applications token for the Ph.D. programme.

Each participant will use two techniques during the course: the first for datasets 1 and 2, and the second for dataset 3. You will choose your combination in the early weeks of the term. Each combination includes one supervised technique and one unsupervised technique. Here are some possible combinations:

- Independent component analysis -- supervised neural networks
- Random forests -- Bayesian approaches (Bayesian networks)
- Support vector machines -- visualization techniques
- Supervised neural networks -- singular value decomposition/semidiscrete decomposition
- Rule-based systems (not association rules) -- non-negative matrix factorization
- Bayesian approaches (Bayesian networks) -- hierarchical clustering
- Singular value decomposition/semidiscrete decomposition -- rule-based systems (not association rules)
- Visualization techniques -- autoclass/k-means/EM
- Non-negative matrix factorization -- latent Dirichlet allocation
- Autoclass/k-means/EM -- genetic algorithms
- Hierarchical clustering -- support vector machines
- Latent Dirichlet allocation -- random forests
- Genetic algorithms -- independent component analysis

Assessment will be based partly on performance in class (quality of results, and quality of presentation and discussion). Marks will be generated using input from all class participants. There may be a take-home exam at the end of term in which you will be given a dataset and asked to report on what you can find out about it. The exam would be worth 30% of your mark.

A possible reference for this course is: Hand, Mannila, Smyth, "Principles of Data Mining", MIT Press, ISBN 026208290X. You may also be interested in the text for the undergraduate data mining course: Tan, Steinbach, Kumar, "Introduction to Data Mining", Addison-Wesley, 2006, ISBN 0-321-32136-7.

I will make the datasets available as we go.

Here is an approximation to the weekly schedule that I expect to follow, at least for the first few weeks:

**Weeks 1 and 2**

I will present introductory material on data mining, particularly about
ways in which data needs to be prepared before mining, and how your
results should be presented.

**Week 2**

You will choose your first data mining technique.
I will explain the protocol for doing this in the first class.
As soon as you have been allocated a method, you
should look for software that will help you. I will be able to give some
advice about this.

**Weeks 2 and 3**

You should spend these weeks finding out about your chosen technique.
Be prepared to make a brief explanation of what your technique does
and how it works in class during Week 3.

**Weeks 4 to 11**

You should be prepared to make some kind of presentation every week,
probably a brief PowerPoint presentation.
We will spend time on each
dataset in turn. It's hard to estimate how long we will spend on
each one, because it depends on how successful the modelling is.

Some questions about your technique

**Week 12**

What have you learned about each of the techniques that you've seen being used? Which technique would you use for problems of the following kind:

- a supermarket with a customer loyalty card that wants to improve its profit margins;
- a manufacturing plant that wants to reduce the amount of waste from improperly built products;
- an airport security company that wants to screen for potential terrorists;
- a satellite company that wants to spot unusually warm locations on the earth's surface?

If you wanted to spend the next ten years using data mining as your source of livelihood, would you (a) develop a product and sell it, or (b) provide consulting to organisations wanting to use data mining themselves, or (c) something else? Why?

I will ask each person to evaluate the performance of each other member of the class based on their contribution to the course. This might be based on help with data manipulation and software, as well as the quality of their work and presentations. Note that it isn't fair to base performance on the quality of the results obtained, since some techniques are intrinsically more powerful than others.

If we have an exam, I'll introduce the dataset to be used for it. Notice that the exam task is different from what you've been doing during the term -- you may choose any technique, but you must justify your choice.