Uma medida de similaridade textual para identificação de plágio em fóruns educacionais
Main author: | CAVALCANTI, Anderson Pinheiro |
---|---|
Publication date: | 2018 |
Document type: | Master's thesis |
Language: | por |
Source title: | Biblioteca Digital de Teses e Dissertações da UFRPE |
Full text: | http://www.tede2.ufrpe.br:8080/tede2/handle/tede2/7868 |
Abstract: | As technology has become more widely used to support education, the use of Virtual Learning Environments (VLEs) has grown in recent years. These environments provide several tools that improve interaction between teachers and students, such as forums, blogs, and wikis. These tools have great potential for generating content that can support the teaching and learning process. However, because of the large number of interactions between students and the teacher, it is difficult for the teacher to evaluate and follow all the material that students post. The forum stands out among these tools for generating collaborative content, and one of its uses is assessment: many distance-learning courses use forum interaction as a form of student assessment. With the large amount of information posted, however, it often becomes impractical for the teacher to detect plagiarism in the responses manually. The foundation of an automatic plagiarism detection system is a similarity measure that quantifies the relationship between two texts. Text similarity is important in several Natural Language Processing (NLP) applications, such as information retrieval, text summarization, information extraction, and text clustering. In information retrieval, for example, the similarity measure is used to assign a ranking score between a query and a retrieved text. Many text similarity measures exist, but in general they are language dependent. For Portuguese, few measures were found, and most use only statistical techniques without taking the semantic aspects of the texts into account. In addition, existing work in the literature identifies plagiarism in assignments, scientific articles, or final course projects. In educational forums, identifying plagiarism is even more difficult, mainly because of the size of the texts and because formal language is not required. This work therefore proposes a measure that computes the similarity between sentences written in Portuguese while taking the semantics of the texts into account. The measure was evaluated on the dataset of the 2016 ASSIN workshop (Workshop de Avaliação de Similaridade Semântica e Inferência Textual). It achieved better results than the first-place system in the competition, reaching a Pearson correlation of 0.70 and a mean squared error of 0.47. A case study was also carried out to evaluate similarity in posts from educational forums in a Computer Science course. The results were evaluated by the course teachers, who confirmed the effectiveness of the tool. |
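
The abstract reports a Pearson correlation and a mean squared error on the ASSIN 2016 data. The following is a minimal sketch of how those two figures are commonly computed from gold and predicted similarity scores. The function names and the sample scores are illustrative assumptions; none of this is taken from the dissertation, and it does not implement the proposed similarity measure itself.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances are left unnormalized; the factors cancel out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_squared_error(gold: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of the squared differences between gold and predicted scores."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((g - p) ** 2 for g, p in zip(gold, predicted)) / len(gold)


# Made-up scores for four sentence pairs (hypothetical, for illustration only).
gold = [1.0, 2.5, 4.0, 5.0]
predicted = [1.5, 2.0, 4.5, 4.5]

if __name__ == "__main__":
    print("Pearson:", round(pearson(gold, predicted), 3))
    print("MSE:", mean_squared_error(gold, predicted))
```

A higher Pearson value and a lower mean squared error both indicate closer agreement with the human-annotated scores, which is why the two are reported together.
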
id |
URPE_855864fe62aad45028ebde9ec7f5cfe3 |
---|---|
oai_identifier_str |
oai:tede2:tede2/7868 |
network_acronym_str |
URPE |
network_name_str |
Biblioteca Digital de Teses e Dissertações da UFRPE |
repository_id_str |
|
dc.title.por.fl_str_mv |
Uma medida de similaridade textual para identificação de plágio em fóruns educacionais |
title |
Uma medida de similaridade textual para identificação de plágio em fóruns educacionais |
title_short |
Uma medida de similaridade textual para identificação de plágio em fóruns educacionais |
title_full |
Uma medida de similaridade textual para identificação de plágio em fóruns educacionais |
title_fullStr |
Uma medida de similaridade textual para identificação de plágio em fóruns educacionais |
title_full_unstemmed |
Uma medida de similaridade textual para identificação de plágio em fóruns educacionais |
title_sort |
Uma medida de similaridade textual para identificação de plágio em fóruns educacionais |
author |
CAVALCANTI, Anderson Pinheiro |
author_facet |
CAVALCANTI, Anderson Pinheiro |
author_role |
author |
dc.contributor.advisor1.fl_str_mv |
MELLO, Rafael Ferreira Leite de |
dc.contributor.advisor-co1.fl_str_mv |
MIRANDA, Péricles Barbosa Cunha de |
dc.contributor.referee1.fl_str_mv |
MIRANDA, Péricles Barbosa Cunha de |
dc.contributor.referee2.fl_str_mv |
LIMA, Rinaldo José de |
dc.contributor.referee3.fl_str_mv |
FREITAS, Frederico Luiz Gonçalves de |
dc.contributor.authorLattes.fl_str_mv |
http://lattes.cnpq.br/3833454062140432 |
dc.contributor.author.fl_str_mv |
CAVALCANTI, Anderson Pinheiro |
contributor_str_mv |
MELLO, Rafael Ferreira Leite de; MIRANDA, Péricles Barbosa Cunha de; MIRANDA, Péricles Barbosa Cunha de; LIMA, Rinaldo José de; FREITAS, Frederico Luiz Gonçalves de |
dc.subject.por.fl_str_mv |
Distance education; Educational forum; Semantic similarity; Plagiarism; Text mining |
topic |
Distance education; Educational forum; Semantic similarity; Plagiarism; Text mining; CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
dc.subject.cnpq.fl_str_mv |
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
publishDate |
2018 |
dc.date.issued.fl_str_mv |
2018-01-31 |
dc.date.accessioned.fl_str_mv |
2019-02-26T14:31:01Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
CAVALCANTI, Anderson Pinheiro. Uma medida de similaridade textual para identificação de plágio em fóruns educacionais. 2018. 88 f. Dissertação (Programa de Pós-Graduação em Informática Aplicada) - Universidade Federal Rural de Pernambuco, Recife. |
dc.identifier.uri.fl_str_mv |
http://www.tede2.ufrpe.br:8080/tede2/handle/tede2/7868 |
identifier_str_mv |
CAVALCANTI, Anderson Pinheiro. Uma medida de similaridade textual para identificação de plágio em fóruns educacionais. 2018. 88 f. Dissertação (Programa de Pós-Graduação em Informática Aplicada) - Universidade Federal Rural de Pernambuco, Recife. |
url |
http://www.tede2.ufrpe.br:8080/tede2/handle/tede2/7868 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.relation.program.fl_str_mv |
-8268485641417162699 |
dc.relation.confidence.fl_str_mv |
600 600 600 |
dc.relation.department.fl_str_mv |
-6774555140396120501 |
dc.relation.cnpq.fl_str_mv |
3671711205811204509 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Federal Rural de Pernambuco |
dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Informática Aplicada |
dc.publisher.initials.fl_str_mv |
UFRPE |
dc.publisher.country.fl_str_mv |
Brazil |
dc.publisher.department.fl_str_mv |
Departamento de Estatística e Informática |
publisher.none.fl_str_mv |
Universidade Federal Rural de Pernambuco |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da UFRPE instname:Universidade Federal Rural de Pernambuco (UFRPE) instacron:UFRPE |
instname_str |
Universidade Federal Rural de Pernambuco (UFRPE) |
instacron_str |
UFRPE |
institution |
UFRPE |
reponame_str |
Biblioteca Digital de Teses e Dissertações da UFRPE |
collection |
Biblioteca Digital de Teses e Dissertações da UFRPE |
bitstream.url.fl_str_mv |
http://www.tede2.ufrpe.br:8080/tede2/bitstream/tede2/7868/2/Anderson+Pinheiro+Cavalcanti.pdf http://www.tede2.ufrpe.br:8080/tede2/bitstream/tede2/7868/1/license.txt |
bitstream.checksum.fl_str_mv |
d2510e8043cac677443d65100e0f9663 bd3efa91386c1718a7f26a329fdcb468 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da UFRPE - Universidade Federal Rural de Pernambuco (UFRPE) |
repository.mail.fl_str_mv |
bdtd@ufrpe.br |
_version_ |
1810102255898066944 |
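
The `bitstream.checksum.fl_str_mv` and `bitstream.checksumAlgorithm.fl_str_mv` fields above list an MD5 digest for each bitstream. As a minimal sketch, assuming the PDF has already been downloaded to a local path, a file can be checked against its listed digest as follows. The function names and the example path are hypothetical and are not part of the repository software.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 digest with the value listed in the record."""
    return md5_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Hypothetical local copy of the thesis PDF; adjust the path as needed.
    local_pdf = "Anderson Pinheiro Cavalcanti.pdf"
    listed_md5 = "d2510e8043cac677443d65100e0f9663"  # value from the record
    try:
        print("checksum ok:", matches_checksum(local_pdf, listed_md5))
    except FileNotFoundError:
        print("file not found:", local_pdf)
```

MD5 is adequate here for detecting a corrupted or incomplete download, which is how the record uses it; it should not be relied on as protection against deliberate tampering.
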