CISC873 Final Exam 2003

The final exam will be available at noon on Friday November 28th and is due by noon on Wednesday December 3rd. Solutions are to be returned to my office.

The answers to the exam can have a maximum length of 6 pages at 11pt or 7 pages at 12pt, including figures and references. As there won't be room to include everything you might want to say, you will have to be selective, and cite references for anything that is standard rather than including text about it.

The exam will be in the form of a data mining problem. I expect you to do a little actual mining, but what I'm mostly looking for is a coherent explanation of how you would attack the specified problem, and what you would expect to find at each stage. You might want to use data mining tools to validate your assumptions and to check that your reasoning does indeed seem to agree with the models you generate.

Above all, quality not quantity.

Please attach a page with the following content:

I certify that the submitted work was done by me without help from any other person.
and sign it.

The CEO of a health insurance supplier has received an anonymous message telling her that a set of doctors are defrauding the health insurer by sending in bills for services that have not been performed. These doctors are cooperating with each other so that they don't all send in fraudulent bills for the same services.

The CEO has two months' worth of billing data in which each row represents a particular doctor and each column represents a particular service. So the ijth entry represents the dollar amount paid to doctor i for service j.
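To make the layout concrete, here is a minimal sketch of a doctor-by-service matrix in Python with NumPy. The values, the array name `paid`, and the helper `amount_paid` are all hypothetical and are not taken from the exam datasets. The sketch also assumes 0-based indexing, whereas the description above numbers doctors and services from 1.

```python
import numpy as np

# Hypothetical toy data: 3 doctors (rows) x 4 services (columns), in dollars.
# These numbers are invented for illustration only.
paid = np.array([
    [120.0,   0.0, 45.0, 300.0],
    [ 80.0,  60.0,  0.0,   0.0],
    [  0.0, 150.0, 30.0,  90.0],
])

def amount_paid(matrix, doctor, service):
    """Return the amount paid to one doctor for one service (0-based indices)."""
    return float(matrix[doctor, service])

# Row sums give the total paid to each doctor across all services.
total_per_doctor = paid.sum(axis=1)

# Column sums give the total paid for each service across all doctors.
total_per_service = paid.sum(axis=0)
```

With this layout, `amount_paid(paid, 0, 3)` reads the entry for the first doctor and the fourth service, and the two sums give simple per-doctor and per-service summaries of the same matrix.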

She asks you to use the first month's data to determine if there seems to be fraudulent billing going on. She then wants you to build a predictor for fraud and test it on the second month's data.

You must present an interim report to the board at noon next Wednesday, telling them everything that you've learned from these datasets.

(These datasets represent a real problem faced by many in the health insurance industry, although the datasets have been simplified because this is an exam.)

June dataset

July dataset