Software engineering data (such as code bases, execution traces, historical code changes, mailing lists, and bug databases) contains a wealth of information about a project's status and history. Using well-established data mining techniques, researchers can gain empirically based understanding of software development practices, and practitioners can better manage, maintain and evolve complex software projects.

This seminar course explores leading research in mining Software Engineering (SE) data, discusses challenges associated with mining SE data, highlights SE data mining success stories, and outlines future research directions. Students will acquire the knowledge needed to do research or work as practitioners in the field. On completing the course, students should be able to integrate SE data mining techniques into their own research or practice.

Classes are held on Thursdays (9:30AM-12:30PM) in Goodwin 521. In each class, students will present and discuss around three papers. A detailed schedule is available here. Each class will cover papers along one of the following themes:

  • Visualization of Data in Software Repositories
  • Guiding Software Development
  • Bug Prediction
  • Bug Detection
  • Mining Process and Social Data
  • Code Duplication
  • Code Reuse, Reuse Patterns, and Code Searching
  • Tools and Mining Challenges
  • Software Evolution and Decay

Students are expected to have some background in software development and software engineering. Knowledge of data mining techniques is beneficial but not required.

Students will be evaluated on:
1. Classroom participation (10%):
Students are expected to read all papers covered in a week, come to class prepared to discuss their thoughts, and take part in the classroom discussions.

2. Paper presentation and discussion (20%):
Each paper will be assigned to one student, who will act as both presenter and discussant. The presentation will last 15 mins and the discussion will last 15-30 mins. Each student should upload their slides to WebCT before class.

  • Role of presenter: As a presenter, you should not simply repeat the paper's content (remember you only have 15 mins). Instead, you should point out the main findings of the work. You should highlight any novel contributions, any surprises, and other possible applications of the proposed techniques. You should check the authors' other work related to the presented paper. Finally, you should place the work relative to other papers covered in the course (especially the papers covered in that particular week).
  • Role of discussant: As a discussant, you should take an adversarial position by pointing out weak and controversial positions in the paper. You should present a short rebuttal of the paper. You should come prepared with problems and counterexamples for the presented work. Make sure you list three things you liked and disliked about the paper.

3. Weekly critique (10%):
Each week, each student should pick one of the papers for that week and submit a one-page critique of it by email before the start of class. The critique should offer a brief summary of the paper, points in favor, points against, and comments for improvement. You do not need to submit a critique if you are presenting that week.

4. Assignment (20%):
One or two assignments, done in groups of 2 or 3 students. More details will be given in class. The assignments will involve using the WEKA or R toolkits on software engineering data.

5. Project (40%):
One original project (20-25 pages) done alone or in a group of 2 or 3 students. The project will explore one or more of the themes covered in the course. Your group needs to submit a project proposal (5 pages) around 1.5 months before the end of term. Your project will be graded on its originality and interest, the depth of your work, and the presentation quality of your written report and class presentation (a 30 min presentation given in the last 2 weeks of classes). The MSR Challenge 2007 data, MSR Challenge 2008 data, MSR Challenge 2009 data, and the PROMISE data sets are possible sources of data for the project.