Software engineering data (such as code bases, execution traces, historical code changes, mailing lists, and bug databases) contains a wealth of information about a project's status and history. Using well-established data mining techniques, researchers can gain an empirically based understanding of software development practices, and practitioners can better manage, maintain, and evolve complex software projects.

This seminar course explores leading research in mining Software Engineering (SE) data, discusses challenges associated with mining SE data, highlights SE data mining success stories, and outlines future research directions. Students will acquire the knowledge needed to perform research or conduct practice in the field. Upon completing the course, students should be able to integrate SE data mining techniques into their own research or practice.


Classes are held on Thursdays (10:00AM-1:00PM) in Goodwin 521.
Classes start Sept 11.

Each class, students will present and discuss around three papers. A detailed schedule is available here. Each class will cover papers along one of the following themes:

  • Visualization of Data in Software Repositories
  • Guiding Software Development
  • Bug Prediction
  • Bug Detection
  • Mining Process and Social Data
  • Code Duplication
  • Code Reuse, Reuse Patterns, and Code Searching
  • Tools and Mining Challenges
  • Software Evolution and Decay

Students are expected to have some background in software development and software engineering. Knowledge of data mining techniques is beneficial but not required.

Students will be evaluated using the following breakdown:

1. Classroom participation (10%):
Students are expected to read all papers covered in a week, come to class prepared to discuss their thoughts, and take part in the classroom discussions.

2. Paper presentation and discussion (20%):
Each paper will be assigned to one student who will act as both presenter and discussant. The presentation is strictly limited to 15 minutes, and the discussion will last 15-20 minutes. Each student should upload their slides to the course account before class.

  • Role of presenter: As a presenter, you should not simply repeat the paper's content (remember, you only have 15 minutes). Instead, you should point out the main findings of the work. You should highlight any novel contributions, any surprises, and other possible applications of the proposed techniques. You should check the authors' other work related to the presented paper. Finally, you should place the work relative to other papers covered in the course (especially the papers covered in that particular week).
  • Role of discussant: As a discussant, you should take an adversarial position by pointing out weak and controversial positions in the paper. You should present a short rebuttal of the paper. You should come prepared with problems and counterexamples for the presented work.
Your presentation should have:
  • one slide that lists the main contributions of the paper.
  • one slide that places the paper relative to any recent work done by the authors of the paper.
  • one slide that places the paper relative to other papers presented that week.
  • as the final slide, a listing of at least three technical points that you liked and three areas that should be improved.
3. Weekly critique (10%):
Each week, each student should pick one of the papers for that week and submit a one-page critique of it via email before the start of class. The critique should offer a brief summary of the paper, points in favor, points against, and comments for improvement. You do not need to submit a critique if you are presenting that week. Additional advice for critiquing papers is here.

The one-page document should have your name at the top. The document name should follow this template: Week#_Paper#_YourName.

4. Assignment (20%):
One assignment done in a group of 3 or 4 students. More details will be given in class. The assignment will involve using the WEKA or R toolkits on software engineering data.

5. Project (40%):
One original project (10 pages IEEE format) done alone or in a group of 2 or 3 students. The project will explore one or more of the themes covered in the course.
Your group needs to submit a project proposal (2 pages IEEE format) around 1.5 months before the end of term. The proposal should provide a brief motivation for the project, a detailed discussion of the data that will be used, a timeline of milestones, and the expected outcome. Make sure you cite at least 3 papers in your proposal. Additional advice for project proposals is here.

Your project will be graded on its originality and interest, the depth of your work, the correctness of your analysis, and the presentation quality of your written report and class presentation (a 20-minute presentation given in the last 2 weeks of classes). The MSR Challenge 2007 data, MSR Challenge 2008 data, MSR Challenge 2009 data, MSR Challenge 2010 data, MSR Challenge 2011 data, MSR Challenge 2012 data, MSR Challenge 2013 data, MSR Challenge 2014 data, and the PROMISE data sets are possible sources of data to use for your project. Advice on writing a project report is here.