On the Effectiveness of Simhashing in Clone Detection on Large Scale Software System

Uddin, S.; Roy, C.K.; Schneider, K.A.; Hindle, Abram

doi:10.7939/r3-cg2f-k683

Abstract
Clone detection techniques essentially cluster textually, syntactically, and/or semantically similar code fragments within or across software systems. For large datasets, similarity identification is costly in both time and memory, especially when detecting near-miss clones, where lines may be modified, added, and/or deleted in the copied fragments. The capability and effectiveness of a clone detection tool depend mostly on the code similarity measurement technique it uses. A variety of similarity measurement approaches have been used for clone detection, including fingerprint-based approaches, which have had varying degrees of success notwithstanding some limitations. In this paper, we investigate the effectiveness of simhash, a state-of-the-art fingerprint-based data similarity measurement technique, for detecting both exact and near-miss clones in large-scale software systems. Our experimental data show that simhash is indeed effective in identifying various types of clones in a software system despite wide variations in experimental circumstances. The approach is also suitable as a core capability for building other tools, such as tools for incremental clone detection, code searching, and clone management.
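
To illustrate the kind of fingerprint the abstract refers to, the following is a minimal sketch of a generic simhash computed over code tokens. It is not the authors' implementation: the tokenizer, the 64-bit fingerprint width, the use of MD5 as the per-token hash, and the function names `tokens`, `simhash`, and `hamming` are all assumptions made for illustration.

```python
import hashlib
import re


def tokens(code: str) -> list[str]:
    """Split a fragment into identifiers, numbers, and single symbols.

    Whitespace is discarded, so layout-only changes do not affect the result.
    (Hypothetical tokenizer, chosen for brevity.)
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def simhash(code: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint (at most 64 bits) over the fragment's tokens."""
    weights = [0] * bits
    for tok in tokens(code):
        # Stable per-token hash; MD5 is used only for determinism, not security.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Under this sketch, two fragments would be treated as clone candidates when the Hamming distance between their fingerprints falls below a chosen threshold. The threshold is a tuning parameter and is not specified here.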
Date created

2011
Subjects / Keywords
Type of Item

Conference/Workshop Presentation
DOI

https://doi.org/10.7939/r3-cg2f-k683
License

Attribution 4.0 International

Language
- English