A large amount of data is recorded in software repositories throughout the lifetime of long-lived projects. These repositories represent a detailed and rich record of the historical progress of large projects. For example, source control systems store changes to the source code as development progresses, defect tracking systems follow the resolution of software defects, and archived communications between project personnel record the rationale for decisions throughout the life of a project. Most software projects carefully preserve their repositories; until recently, however, data from these repositories was used primarily as a historical record, supporting activities such as retrieving old versions of the source code or examining the status of a defect.

This seminar course explores the mining of data in software repositories in order to help researchers gain an empirically based understanding of software development practices, and to support practitioners in managing, maintaining, and evolving complex software projects. The course will discuss leading research in the areas of mining software repositories, software evolution, empirical software engineering, and reverse engineering. Papers discussed in this course will give students a glimpse of leading research that transforms software repositories from static record-keeping stores into active repositories, which researchers and practitioners use to better understand and predict software development activities instead of depending on ad hoc guesses and gut feelings. Students who complete the course successfully will be well positioned to perform research in the mining of software engineering data.

Classes are held on Thursdays (2:30-5:30) in CLEA201. Each class, students will present and discuss around four papers. A detailed schedule is available here. Each class will cover papers along one of the following themes:

  • Visualization of Data in Software Repositories
  • Guiding Software Development
  • Bug Prediction
  • Bug Detection
  • Mining Process and Social Data
  • Code Duplication
  • Code Reuse, Reuse Patterns, and Code Searching
  • Tools and Mining Challenges
  • Software Evolution and Decay

Students are expected to have some background in software development and software engineering. Knowledge of data mining techniques is beneficial but not required.

Students will be evaluated on
1. Classroom participation (10%):
Students are expected to read all papers covered in a week, come to class prepared to discuss their thoughts, and take part in the classroom discussions.

2. Paper presentation and discussion (20%):
Each paper will be assigned a presenter and a discussant. Both presenter and discussant will be assigned the same mark, so they should work together as a team. The presentation will last 15 mins and the discussion will last 15-30 mins. After each class, the presenter and discussant should email me their slides so I can post them online.

  • Role of presenter: As a presenter, you should not simply repeat the paper's content (remember, you only have 15 mins); instead, you should point out the main findings of the work. You should highlight any novel contributions, any surprises, and other possible applications of the proposed techniques. You should check the authors' other work related to the presented paper. Finally, you should place the work relative to other papers covered in the course (especially the papers covered in that particular week).
  • Role of discussant: As a discussant, you should take an adversarial position by pointing out weak and controversial positions in the paper. You should present a short rebuttal of the paper. You should come prepared with problems and counterexamples for the presented work.
3. Weekly critique (10%):
Each week, each student should pick one of the papers for that week and email a one-page critique of the paper before the start of class. The critique should offer a brief summary of the paper, points in favor, points against, and comments for improvement.

4. Assignments (20%):
Two assignments done in a group of 2 or 3 students.
  1. Assignment 0: Email a list of your group members and pick one open source project to be used for Assignments 1 and 2. Explain your motivation for picking the project and give a brief overview of the available repositories for that project. Hand in a 1-2 page report.
  2. Assignment 1: Pick ten hypotheses that you can conveniently test using data from the repositories of the project that you picked in Assignment 0. Hand in a 10-12 page report. The report should motivate why you picked these hypotheses. The report should also explain your plans for testing each of your hypotheses and your expectations. Make sure you map each hypothesis to the project repositories that you presented in Assignment 0. You may want to refer to these two papers for possible hypotheses or derive others based on the literature and your intuition.
  3. Assignment 2: Test your ten hypotheses. Make sure you explain how you collected your data and summarize your findings. Compare your findings against your expectations in Assignment 1 and against other findings in the literature. Hand in a 10-12 page report.
5. Project (40%):
One original project (20-25 pages) done alone or in a group of 2 or 3 students. The project will explore one or more of the themes covered in the course. Your group needs to submit a project proposal (5 pages) around one month before the end of term. Your project will be graded according to its originality and interest, the depth of your work, and the quality of your written report and class presentation (a 30-minute presentation given in the last 2 weeks of classes).
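
As a rough illustration of the kind of analysis Assignments 1 and 2 call for, the sketch below tests one made-up hypothesis ("bug-fix commits touch fewer files than other commits") against a tiny, invented commit log. Everything in it is an assumption for illustration only: the hypothesis, the sample data, the keyword pattern used to spot fix commits, and the function names. The log layout is loosely modeled on `git log --pretty=format:'@%h|%s' --numstat`, and the parser skips blank lines so that small differences in that layout do not matter.

```python
import re
from statistics import mean

# Invented sample data; a real study would export the log of the project
# chosen in Assignment 0. Each commit starts with "@hash|subject" and is
# followed by one "added<TAB>deleted<TAB>path" line per touched file.
SAMPLE_LOG = """\
@a1b2c3d|Fix null pointer in parser
3\t1\tsrc/parser.c

@d4e5f6a|Add export command
40\t2\tsrc/export.c
10\t0\tsrc/cli.c
5\t0\tdocs/export.md

@0f9e8d7|Fix typo in bug report template
1\t1\tdocs/template.md

@1234abc|Refactor storage layer
120\t95\tsrc/storage.c
30\t28\tsrc/storage.h
"""

# Assumed heuristic: a commit is a "fix" if its subject contains one of
# these words. Real projects often need a more careful classification.
FIX_PATTERN = re.compile(r"\b(fix|fixes|fixed|bug)\b", re.IGNORECASE)


def parse_commits(log_text):
    """Return a list of (subject, files_changed) pairs."""
    commits = []
    for line in log_text.splitlines():
        if line.startswith("@"):
            _, subject = line[1:].split("|", 1)
            commits.append([subject, 0])
        elif line.strip() and commits:
            commits[-1][1] += 1  # one non-blank line per touched file
    return [(subject, count) for subject, count in commits]


def mean_files_by_kind(commits):
    """Return (mean files per fix commit, mean files per other commit).

    Raises statistics.StatisticsError if either group is empty.
    """
    fix = [n for subject, n in commits if FIX_PATTERN.search(subject)]
    other = [n for subject, n in commits if not FIX_PATTERN.search(subject)]
    return mean(fix), mean(other)


if __name__ == "__main__":
    fix_mean, other_mean = mean_files_by_kind(parse_commits(SAMPLE_LOG))
    print(f"fix commits: {fix_mean} files on average")
    print(f"other commits: {other_mean} files on average")
```

Comparing two means on four commits proves nothing by itself; a report would need a much larger sample, a suitable statistical test, and a discussion of how reliable the fix-commit classification is.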