Sherlock N-Overlap: normalization invasive and overlap coefficient for analysis of similarity between source code in programming disciplines

Danilo Leal Maciel

Bibliographic details
Main author:	Danilo Leal Maciel
Publication date:	2014
Document type:	Master's thesis
Language:	por
Source title:	Biblioteca Digital de Teses e Dissertações da UFC
Full text:	http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=12195
Abstract:	This work addresses the problem of detecting plagiarism among source codes in programming classes. Although many plagiarism detection tools are available, few can effectively identify all lexical and semantic similarities between pairs of programs, owing to the complexity inherent in this type of analysis. For this problem and setting, the main approaches to source code plagiarism detection discussed in the literature were therefore studied and, as the main contribution, a tool applicable to laboratory practice was designed. The tool is based on the Sherlock algorithm, which was improved from two perspectives: first, by modifying the similarity coefficient used by the algorithm to improve its sensitivity when comparing signatures; second, by proposing invasive preprocessing techniques that, besides eliminating irrelevant information, can also emphasize structural aspects of the programming language, joining or separating character sequences whose meaning is more significant for the comparison, or eliminating less relevant sequences so as to highlight others that allow better inference about the degree of similarity. The tool, called Sherlock N-Overlap, was subjected to a rigorous evaluation methodology, both in simulated scenarios and in programming classes, and produced better results than tools currently prominent in the plagiarism detection literature.
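
The abstract names an overlap coefficient but does not give its formula. As a rough illustration only, the sketch below uses the standard set-based definition, |A ∩ B| / min(|A|, |B|), and contrasts it with a Jaccard-style ratio. Whether the thesis uses exactly this form, and the integer "signature" values shown, are assumptions made for the example; neither function is taken from the Sherlock source code.

```python
def overlap_coefficient(a, b):
    """Standard overlap coefficient: |A & B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # convention chosen here for empty inputs
    return len(a & b) / min(len(a), len(b))


def jaccard(a, b):
    """Jaccard index, shown only for contrast: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical signature sets: a submission and a copy padded with extra code.
original = {101, 102, 103, 104}
padded_copy = {101, 102, 103, 104, 201, 202, 203, 204}

print(overlap_coefficient(original, padded_copy))  # 1.0
print(jaccard(original, padded_copy))              # 0.5
```

Because the overlap coefficient divides by the smaller set, a copy that is fully contained in a longer file still scores 1.0, whereas a union-based ratio is diluted by the added material.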

Item metadata

id	UFC_192a4322d822f62b0bf3cc309da74124
oai_identifier_str	oai:www.teses.ufc.br:8342
network_acronym_str	UFC
network_name_str	Biblioteca Digital de Teses e Dissertações da UFC
dc.title.en.fl_str_mv	Sherlock N-Overlap: normalization invasive and overlap coefficient for analysis of similarity between source code in programming disciplines
dc.title.alternative.pt.fl_str_mv	Sherlock N-Overlap: normalização invasiva e coeficiente de sobreposição para análise de similaridade entre códigos-fonte em disciplinas de programação
title	Sherlock N-Overlap: normalization invasive and overlap coefficient for analysis of similarity between source code in programming disciplines
title_short	Sherlock N-Overlap: normalization invasive and overlap coefficient for analysis of similarity between source code in programming disciplines
title_full	Sherlock N-Overlap: normalization invasive and overlap coefficient for analysis of similarity between source code in programming disciplines
title_fullStr	Sherlock N-Overlap: normalization invasive and overlap coefficient for analysis of similarity between source code in programming disciplines
title_full_unstemmed	Sherlock N-Overlap: normalization invasive and overlap coefficient for analysis of similarity between source code in programming disciplines
title_sort	Sherlock N-Overlap: normalization invasive and overlap coefficient for analysis of similarity between source code in programming disciplines
author	Danilo Leal Maciel
author_facet	Danilo Leal Maciel
author_role	author
dc.contributor.advisor1.fl_str_mv	José Marques Soares
dc.contributor.advisor1ID.fl_str_mv	82440026700
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/3186709749685737
dc.contributor.referee1.fl_str_mv	Danielo Gonçalves Gomes
dc.contributor.referee1ID.fl_str_mv	42593751304
dc.contributor.referee1Lattes.fl_str_mv	//lattes.cnpq.br/6303297687237256
dc.contributor.referee2.fl_str_mv	George André Pereira Thé
dc.contributor.referee2ID.fl_str_mv	62147390372
dc.contributor.referee2Lattes.fl_str_mv	http://lattes.cnpq.br/6398510210462764
dc.contributor.referee3.fl_str_mv	Ig Ibert Bittencourt Santana Pinto
dc.contributor.referee3ID.fl_str_mv	04306525422
dc.contributor.referee3Lattes.fl_str_mv	http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4265484E4
dc.contributor.authorID.fl_str_mv	60020864396
dc.contributor.authorLattes.fl_str_mv	http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4211286Z5
dc.contributor.author.fl_str_mv	Danilo Leal Maciel
contributor_str_mv	José Marques Soares Danielo Gonçalves Gomes George André Pereira Thé Ig Ibert Bittencourt Santana Pinto
dc.subject.por.fl_str_mv	Detecção de plágio Análise de similaridade
topic	Detecção de plágio Análise de similaridade plagiarism detection, similarity analysis, laboratory programming practices CIENCIA DA COMPUTACAO
dc.subject.eng.fl_str_mv	plagiarism detection, similarity analysis, laboratory programming practices
dc.subject.cnpq.fl_str_mv	CIENCIA DA COMPUTACAO
dc.description.sponsorship.fl_txt_mv	Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
dc.description.abstract.por.fl_txt_mv	This work addresses the problem of detecting plagiarism among source codes in programming classes. Although many plagiarism detection tools are available, few can effectively identify all lexical and semantic similarities between pairs of programs, owing to the complexity inherent in this type of analysis. For this problem and setting, the main approaches to source code plagiarism detection discussed in the literature were therefore studied and, as the main contribution, a tool applicable to laboratory practice was designed. The tool is based on the Sherlock algorithm, which was improved from two perspectives: first, by modifying the similarity coefficient used by the algorithm to improve its sensitivity when comparing signatures; second, by proposing invasive preprocessing techniques that, besides eliminating irrelevant information, can also emphasize structural aspects of the programming language, joining or separating character sequences whose meaning is more significant for the comparison, or eliminating less relevant sequences so as to highlight others that allow better inference about the degree of similarity. The tool, called Sherlock N-Overlap, was subjected to a rigorous evaluation methodology, both in simulated scenarios and in programming classes, and produced better results than tools currently prominent in the plagiarism detection literature.
description	This work addresses the problem of detecting plagiarism among source codes in programming classes. Although many plagiarism detection tools are available, few can effectively identify all lexical and semantic similarities between pairs of programs, owing to the complexity inherent in this type of analysis. For this problem and setting, the main approaches to source code plagiarism detection discussed in the literature were therefore studied and, as the main contribution, a tool applicable to laboratory practice was designed. The tool is based on the Sherlock algorithm, which was improved from two perspectives: first, by modifying the similarity coefficient used by the algorithm to improve its sensitivity when comparing signatures; second, by proposing invasive preprocessing techniques that, besides eliminating irrelevant information, can also emphasize structural aspects of the programming language, joining or separating character sequences whose meaning is more significant for the comparison, or eliminating less relevant sequences so as to highlight others that allow better inference about the degree of similarity. The tool, called Sherlock N-Overlap, was subjected to a rigorous evaluation methodology, both in simulated scenarios and in programming classes, and produced better results than tools currently prominent in the plagiarism detection literature.
publishDate	2014
dc.date.issued.fl_str_mv	2014-07-07
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
status_str	publishedVersion
format	masterThesis
dc.identifier.uri.fl_str_mv	http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=12195
url	http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=12195
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Universidade Federal do Ceará
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Engenharia de Teleinformática
dc.publisher.initials.fl_str_mv	UFC
dc.publisher.country.fl_str_mv	BR
publisher.none.fl_str_mv	Universidade Federal do Ceará
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da UFC instname:Universidade Federal do Ceará instacron:UFC
reponame_str	Biblioteca Digital de Teses e Dissertações da UFC
collection	Biblioteca Digital de Teses e Dissertações da UFC
instname_str	Universidade Federal do Ceará
instacron_str	UFC
institution	UFC
repository.name.fl_str_mv	-
repository.mail.fl_str_mv	mail@mail.com
_version_	1643295190275850240
