Area Engineering
Tools and techniques to detect plagiarism in code
keywords INTELLECTUAL PROPERTY, PROGRAMMING LANGUAGES, TEXT ANALYSIS
Reference persons RENATO FERRERO, PAOLO GIACCONE, ENRICO MASALA
Research Groups DAUIN - GR-05 - ELECTRONIC CAD & RELIABILITY GROUP - CAD, DAUIN - GR-11 - INTERNET MEDIA GROUP - IMG, Telecommunication Networks Group
Thesis type RESEARCH AND DEVELOPMENT
Description Code plagiarism refers to the act of copying or reusing another person's code without proper permission or without giving credit to the original author. It is considered a violation of computer ethics and may result in legal and disciplinary consequences. In addition to direct copying, plagiarism also includes partial rewriting: the original code is rewritten and rearranged so that it looks different, but still retains a large part of the idea or logic behind the original.
Specialized tools designed to detect code plagiarism exist. These tools use different techniques to compare source files and identify similarities:
- String comparison: text strings within the source code are compared for exact or partial matches.
- Tokenization: the source code is broken down into tokens, such as keywords, operators and identifiers. The token sequences are then compared for similarities.
- Dependency analysis: dependencies between modules, functions or classes in the code are examined for structural similarities.
- Fingerprinting: unique "fingerprints" are generated for portions of code and these fingerprints are compared for matches.
- Neural networks: code characteristics can be learned by a neural network, which is then able to recognize similarities with other source files.
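As a rough illustration of how the first techniques in the list differ, the following minimal sketch compares two Python snippets by raw string comparison, by normalized token sequences, and by k-gram fingerprints. It is not taken from any specific anti-plagiarism tool: the function names, the placeholder tokens (ID, NUM, STR), the choice of k = 4, the CRC32 hash and the Jaccard score are all assumptions made for the example, and only the Python standard library is used.

```python
import difflib
import io
import keyword
import tokenize
import zlib


def string_similarity(a: str, b: str) -> float:
    """String comparison: ratio of matching characters in the raw text."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def normalized_tokens(source: str) -> list[str]:
    """Tokenization: split Python source into tokens, hiding names and literals."""
    skipped = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
               tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skipped:
            continue
        if tok.type == tokenize.NAME:
            # Keep keywords, replace every identifier with one placeholder.
            tokens.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")
        elif tok.type == tokenize.STRING:
            tokens.append("STR")
        else:
            tokens.append(tok.string)
    return tokens


def fingerprints(tokens: list[str], k: int = 4) -> set[int]:
    """Fingerprinting: hash every window of k consecutive tokens."""
    return {
        zlib.crc32(" ".join(tokens[i:i + k]).encode("utf-8"))
        for i in range(len(tokens) - k + 1)
    }


def jaccard(a: set[int], b: set[int]) -> float:
    """Share of fingerprints that two sets have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical inputs: RENAMED is ORIGINAL with identifiers changed.
ORIGINAL = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

RENAMED = """
def add_all(items):
    acc = 0
    for x in items:  # same logic, new names
        acc += x
    return acc
"""

DIFFERENT = """
def greet(name):
    print("hello", name)
"""

if __name__ == "__main__":
    fp_original = fingerprints(normalized_tokens(ORIGINAL))
    for label, other in (("renamed", RENAMED), ("different", DIFFERENT)):
        text_score = string_similarity(ORIGINAL, other)
        token_score = jaccard(fp_original, fingerprints(normalized_tokens(other)))
        print(f"{label}: string={text_score:.2f} fingerprints={token_score:.2f}")
```

In this sketch, renaming identifiers lowers the raw string score but leaves the normalized token sequence unchanged, which is why token-based and fingerprint-based comparisons are generally harder to evade than plain text matching. Existing tools typically add further steps, such as selecting only a subset of the fingerprints, that are omitted here.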
The proposed thesis activity first concerns the analysis of the advantages and disadvantages of state-of-the-art plagiarism detection techniques, with the aim of identifying which combination of approaches provides the most accurate results in detecting code plagiarism.
Furthermore, the thesis work will analyze the programming languages currently supported by anti-plagiarism software. The most common languages in desktop and system application development (e.g. Python, Java, C, C++) are certainly covered by anti-plagiarism software, but support may be lacking for more specific languages, such as MATLAB, JavaScript/JSX and assembly. In this context, the objective of the thesis is to understand which open-source anti-plagiarism tools are best suited to being modified to extend the set of supported languages.
Required skills programming skills, natural language processing, data analysis
Deadline 24/04/2025