SCAM 2010: Estimating the Optimal Number of Latent Concepts in Source Code Analysis

Source code, data, and information for "Estimating the Optimal Number of Latent Concepts in Source Code Analysis"
Presented at SCAM 2010 by Scott Grant and James R. Cordy

Concept and feature location techniques are designed to extract related subsets of program code in order to aid program comprehension. These location techniques, whether supervised or not, seek to identify related blocks of code and aim to ease the difficult process of making sense of large code bases. This can remove a great deal of overhead when trying to understand a body of code, and can even prevent related methods from going unnoticed while a developer builds an understanding of unfamiliar source code. The roots of concept and feature location can be traced back to program comprehension theories. These early works attempted to determine how a programmer develops the comprehension necessary to debug, modify, or document code. From these, Biggerstaff identified the concept assignment problem, and described it as the problem of discovering individual human-oriented concepts and assigning them to their implementation-oriented counterparts for a given program.

Access the paper here

A subset of the data used in the study, in XML format and split by function, can be downloaded here.

The proof-of-concept Python code can be downloaded here. It is uncommented, but I am happy to answer any questions about its usage or intent. It assumes the source data has already been placed in lda.txt, and with the parameters at the top of the file, you can specify the location of the required applications and of the data itself.
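
As a rough illustration of the setup described above, the sketch below shows one way a block of parameters at the top of a script could be laid out and checked before a run. Only the file name lda.txt comes from the description. The variable names DATA_FILE and TOOL_PATHS, the placeholder tool path, and the helper missing_paths are hypothetical and are not taken from the actual script.

```python
import os

# Hypothetical settings; the real script's parameter names may differ.
DATA_FILE = "lda.txt"                  # input file named in the description above
TOOL_PATHS = {"lda": "/path/to/lda"}   # placeholder location of an external application


def missing_paths(data_file, tool_paths):
    """Return the configured paths that do not exist on disk."""
    candidates = [data_file] + list(tool_paths.values())
    return [p for p in candidates if not os.path.exists(p)]


if __name__ == "__main__":
    # Report anything that needs to be fixed before running the analysis.
    for path in missing_paths(DATA_FILE, TOOL_PATHS):
        print("Not found:", path)
```

Checking the paths up front makes a misconfigured location show up as a clear message instead of a failure partway through a run.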

Please note that this code is being replaced by a new system called COLE (COncept Location Evaluation), and a link to the new tool will be added once it's ready for distribution (and merciful evaluation). :)

If you have any questions or comments, please direct them to Scott Grant.