COURSE OVERVIEW
Software engineering data (such as code bases, execution traces, historical code changes, mailing lists, and bug databases) contains a wealth of information about a project's status and history. Using well-established data mining techniques, researchers can gain an empirically based understanding of software development practices, and practitioners can better manage, maintain, and evolve complex software projects.

COURSE OBJECTIVES
This seminar course explores leading research in mining Software Engineering (SE) data, discusses challenges associated with mining SE data, highlights SE data mining success stories, and outlines future research directions. Students will acquire the knowledge needed to conduct research or work as practitioners in the field. On completing the course, students should be able to integrate SE data mining techniques into their own research or practice.

COURSE SCHEDULE

Classes are held on Tuesdays (2:30PM-5:30PM).
The first class, on Sept 14, will be held in Goodwin 544.


Each class, students will present and discuss around three papers. A detailed schedule is available here. Each class will cover papers along one of the following themes:

  • Visualization of Data in Software Repositories
  • Guiding Software Development
  • Bug Prediction
  • Bug Detection
  • Mining Process and Social Data
  • Code Duplication
  • Code Reuse, Reuse Patterns, and Code Searching
  • Tools and Mining Challenges
  • Software Evolution and Decay

COURSE REQUIREMENTS
Students are expected to have some background in software development and software engineering. Knowledge of data mining techniques is beneficial but not required.

Students will be evaluated on
1. Classroom participation (10%):
Students are expected to read all papers covered in a week, come to class prepared to share their thoughts, and take part in the classroom discussions.

2. Paper presentation and discussion (20%):
Each paper will be assigned to one student who will act as both presenter and discussant. The presentation is strictly limited to 15 minutes, and the discussion will last 15-20 minutes. Each student should upload the slides to Moodle before class.

  • Role of presenter: As a presenter, you should not simply repeat the paper's content (remember, you only have 15 minutes); instead, you should point out the main findings of the work. You should highlight any novel contributions, any surprises, and other possible applications of the proposed techniques. You should check the authors' other work related to the presented paper. Finally, you should place the work relative to other papers covered in the course (especially the papers covered in that particular week).
  • Role of discussant: As a discussant, you should take an adversarial position by pointing out weak and controversial positions in the paper. You should present a short rebuttal of the paper. You should come prepared with problems and counterexamples for the presented work. Make sure you identify three things you liked and three things you disliked about the paper. You MUST have a slide that lists at least three things you liked and three areas that should be improved.

3. Weekly critique (10%):
Each week, each student should pick one of the papers for that week and submit a one-page critique of it via email before the start of class. The critique should offer a brief summary of the paper, points in favor, points against, and suggestions for improvement. You do not need to submit a critique if you are presenting that week.

4. Assignment (20%):
One assignment, done in a group of 3 or 4 students. More details will be given in class. The assignment will involve using the WEKA or R toolkits on software engineering data.

5. Project (40%):
One original project (10 pages, IEEE format), done alone or in a group of 2 or 3 students. The project will explore one or more of the themes covered in the course.
Your group needs to submit a project proposal (2 pages, IEEE format) around 1.5 months before the end of term. The proposal should provide a brief motivation for the project, a detailed discussion of the data that will be used, a timeline of milestones, and the expected outcome.
Your project will be graded on its originality and interest, the depth of your work, and the presentation quality of your written report and class presentation (a 20-minute presentation given in the last 2 weeks of classes). The MSR Challenge 2007 data, MSR Challenge 2008 data, MSR Challenge 2009 data, MSR Challenge 2010 data, MSR Challenge 2011 data, and the PROMISE data sets are possible sources of data to use for your project.