CISC 333 Large Project



Due beginning of class, Friday November 25th
Automatic extension to beginning of class, Friday December 2nd, to allow for illness, overwork, etc.
This project is expected to take you 15 hours.
You must hand in 1-2 pages describing your progress on the Friday of Week 9 (November 11th).

Questions:

This dataset contains gene expression data for people who have rhabdomyosarcoma, a soft-tissue cancer that is common in children. You can read up on gene expression analysis if you want to, but all you need to know is that the values represent measurements of levels of proteins in cells.

The basic data is in the file patientsvs8000genes.csv. This is a stripped-down version of a much wider dataset: the mechanism that measures these protein concentrations collects data about 22,000+ of them, and even that is tiny by today's standards.

The patient labels are in the file simplepatientlabels.txt, which contains A for those patients who were alive when the data was collected, and D for those who had died. One of the complications with this kind of data is that the labels are necessarily uncertain, since some of those who were alive when the data was collected might now be dead. This uncertainty is, of course, one-sided: an A may really be a D, but a D is never really an A. It motivates sophisticated clustering, since those who subsequently died should resemble those who had already died, rather than those who survived.

The identifiers for the genes are in the file 8000genenames.txt.
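
A minimal loading sketch in Python. It assumes (this is not stated above) that the CSV holds only comma-separated numbers with no header row or ID column, and that the two text files hold one entry per line. The helper names are hypothetical. The orientation of the matrix is not stated either, so patient_major checks it against the label and gene-name counts instead of guessing.

```python
import csv
import io

def load_matrix(handle):
    """Parse a purely numeric comma-separated table into a list of float rows."""
    return [[float(v) for v in row] for row in csv.reader(handle) if row]

def load_lines(handle):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    return [line.strip() for line in handle if line.strip()]

def patient_major(matrix, labels, gene_names):
    """Return the matrix with one row per patient, transposing if necessary."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if (n_rows, n_cols) == (len(labels), len(gene_names)):
        return matrix
    if (n_rows, n_cols) == (len(gene_names), len(labels)):
        return [list(col) for col in zip(*matrix)]
    raise ValueError(f"{n_rows}x{n_cols} matrix does not match "
                     f"{len(labels)} labels and {len(gene_names)} gene names")

# Tiny made-up stand-in data so the sketch runs without the real files.
demo_csv = io.StringIO("1.0,2.0\n3.0,4.0\n5.0,6.0\n")   # 3 genes x 2 patients
demo_labels = io.StringIO("A\nD\n")
demo_genes = io.StringIO("g1\ng2\ng3\n")

data = patient_major(load_matrix(demo_csv),
                     load_lines(demo_labels),
                     load_lines(demo_genes))
print(len(data), "patients x", len(data[0]), "genes")
```

With the real files, the StringIO objects would be replaced by open("patientsvs8000genes.csv", newline=""), open("simplepatientlabels.txt") and open("8000genenames.txt"), provided the format assumptions above hold.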

Step 0: You know what you should do first, right? You might want to try some visualization as well to get a feel for the data.

Step 1: Apply whatever clustering techniques or other unsupervised analysis you think is appropriate to the dataset. Try to understand the structure that you see in the data, and think about how it will guide the choice and construction of a predictor. For example, you might decide that class labels need to be handled differently.
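
As one possible starting point for Step 1, here is a sketch of plain k-means in NumPy, followed by a count of how the A and D labels fall into each cluster. The choice of k-means, the z-score scaling, k = 2 and the synthetic demo data are all assumptions made for illustration; nothing above prescribes them. A cluster that mixes A patients in with mostly D patients would be one hint that some A labels deserve different handling.

```python
import numpy as np

def zscore(X):
    """Standardise each column (gene) to mean 0 and unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0            # constant columns stay at 0 instead of dividing by 0
    return (X - X.mean(axis=0)) / std

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (cluster index per row, cluster centres)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every centre.
        dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        # Recompute centres; an empty cluster keeps its previous centre.
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return assign, centres

def label_counts(assign, labels):
    """Count how many patients with each label fall into each cluster."""
    counts = {}
    for c, lab in zip(assign, labels):
        per_cluster = counts.setdefault(int(c), {})
        per_cluster[lab] = per_cluster.get(lab, 0) + 1
    return counts

# Synthetic stand-in: two well-separated groups of 20 "patients", 5 "genes".
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 5)),
               rng.normal(10.0, 1.0, size=(20, 5))])
labels = ["A"] * 20 + ["D"] * 20

assign, centres = kmeans(zscore(X), k=2)
print(label_counts(assign, labels))
```

On the real data, X would be the patients-by-genes matrix and labels the contents of simplepatientlabels.txt. Other values of k and other clustering techniques are worth comparing.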

Step 2: Think about whether and how you can do some attribute selection to cope with the large number of attributes, each of which is only weakly predictive of the target. You might even want to go back and cluster using a smaller number of attributes.
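
One simple way to cut down the number of attributes is to score each gene by how strongly it separates the two classes and keep only the top few. The sketch below uses the absolute Welch t-statistic as the score; that choice, the function names and the synthetic data are assumptions for illustration, and many other ranking or projection methods would be equally defensible.

```python
import numpy as np

def welch_t(X, y):
    """Absolute Welch t-statistic of every column of X between classes A and D."""
    alive, dead = X[y == "A"], X[y == "D"]
    diff = alive.mean(axis=0) - dead.mean(axis=0)
    se = np.sqrt(alive.var(axis=0, ddof=1) / len(alive)
                 + dead.var(axis=0, ddof=1) / len(dead))
    se[se == 0] = np.inf           # constant genes get a score of 0
    return np.abs(diff / se)

def top_genes(X, y, k):
    """Column indices of the k highest-scoring genes, best first."""
    return np.argsort(welch_t(X, y))[::-1][:k]

# Synthetic stand-in: 30 "patients", 50 "genes", only gene 3 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))
y = np.array(["A"] * 15 + ["D"] * 15)
X[y == "D", 3] += 5.0

best = top_genes(X, y, 5)
print(best)
```

Note that with thousands of genes and few patients, some genes will score well purely by chance, so any ranking like this should be computed on training data only and checked on held-out patients.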

Step 3: Build a predictor that will predict the target class on new data and discuss the results. (You will probably build several predictors, using different prediction techniques, and then choose the one that seems to perform the best.)
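
When comparing predictors, it helps to estimate accuracy on patients that were not used for training or for choosing attributes. The sketch below shows leave-one-out cross-validation around a nearest-centroid classifier, with gene selection repeated inside every fold so that the held-out patient never influences which genes are kept. The classifier, the scoring rule, k = 5 and the synthetic data are all assumptions chosen for brevity, not a recommended solution.

```python
import numpy as np

def select_genes(X, y, k):
    """Indices of the k genes with the largest standardised class-mean gap."""
    a, d = X[y == "A"], X[y == "D"]
    spread = np.sqrt((a.var(axis=0) + d.var(axis=0)) / 2)
    spread[spread == 0] = np.inf   # constant genes get a score of 0
    score = np.abs(a.mean(axis=0) - d.mean(axis=0)) / spread
    return np.argsort(score)[::-1][:k]

def fit_centroids(X, y):
    """Nearest-centroid 'model': the class names and the mean row of each class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict(model, X):
    """Assign each row of X to the class with the closest centroid."""
    classes, centroids = model
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy, reselecting genes on each training fold."""
    hits = 0
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        genes = select_genes(X[train], y[train], k)
        model = fit_centroids(X[train][:, genes], y[train])
        hits += int(predict(model, X[i:i + 1][:, genes])[0] == y[i])
    return hits / len(y)

def majority_baseline(y):
    """Accuracy obtained by always predicting the most common class."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / len(y)

# Synthetic stand-in: 40 "patients", 200 "genes", the first 5 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))
y = np.array(["A"] * 20 + ["D"] * 20)
X[y == "D", :5] += 3.0

acc = loo_accuracy(X, y, k=5)
print("leave-one-out accuracy:", acc, "baseline:", majority_baseline(y))
```

Reporting the majority-class baseline alongside each predictor makes it easier to judge whether an accuracy figure reflects real signal. Because some A labels may be out of date, apparent errors on A patients also deserve a closer look before they are counted against a predictor.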

Your mark will depend on both how sensible your process is and how well you have carried it out. What you hand in should reflect this: you should explain what your process was, but also why you followed the process you did (you don't need to sanitise this -- explaining the blind alleys you tried isn't a bad thing unless they were clearly silly from the beginning). The results from each step of the process are part of the explanation, and so you need to attach these so you can explain their significance.

Remember that you aren't restricted to RapidMiner, and that you can use techniques other than those we've discussed in class or that you've used on previous exercise sheets.

The overall project report should be 5-7 pages, not counting the space you need to display your results.