Development of new models for authorship recognition using complex networks

Detalhes bibliográficos
Autor(a) principal: Marinho, Vanessa Queiroz
Data de Publicação: 2017
Tipo de documento: Dissertação
Idioma: eng
Título da fonte: Biblioteca Digital de Teses e Dissertações da USP
Texto Completo: http://www.teses.usp.br/teses/disponiveis/55/55134/tde-14112017-091805/
Resumo: Complex networks have been successfully applied to different fields, being the subject of study in different areas that include, for example, physics and computer science. The finding that methods of complex networks can be used to analyze texts in their different complexity levels has implied in advances in natural language processing (NLP) tasks. Examples of applications analyzed with the methods of complex networks are keyword identification, development of automatic summarizers, and authorship attribution systems. The latter task has been studied with some success through the representation of co-occurrence (or adjacency) networks that connect only the closest words in the text. Despite this success, only a few works have attempted to extend this representation or employ different ones. Moreover, many approaches use a similar set of measurements to characterize the networks and do not combine their techniques with the ones traditionally used for the authorship attribution task. This Masters research proposes some extensions to the traditional co-occurrence model and investigates new attributes and other representations (such as mesoscopic and named entity networks) for the task. The connectivity information of function words is used to complement the characterization of authors writing styles, as these words are relevant for the task. Finally, the main contribution of this research is the development of hybrid classifiers, called labelled motifs, that combine traditional factors with properties obtained with the topological analysis of complex networks. The relevance of these classifiers is verified in the context of authorship attribution and translationese identification. With this hybrid approach, we show that it is possible to improve the performance of networkbased techniques when they are combined with traditional ones usually employed in NLP. By adapting, combining and improving the model, not only the performance of authorship attribution systems was improved, but also it was possible to better understand what are the textual quantitative factors (measured through networks) that can be used in stylometry studies. The advances obtained during this project may be useful to study related applications, such as the analysis of stylistic inconsistencies and plagiarism, and the analysis of text complexity. Furthermore, most of the methods proposed in this work can be easily applied to many natural languages.
id USP_a1fc963e745e123e64c1e7d3d56deccf
oai_identifier_str oai:teses.usp.br:tde-14112017-091805
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
spelling Development of new models for authorship recognition using complex networksDesenvolvimento de novos modelos para reconhecimento de autoria com a utilização de redes complexasAuthorship attributionComplex networksNatural language processingProcessamento de línguas naturaisReconhecimento de autoriaRedes complexasComplex networks have been successfully applied to different fields, being the subject of study in different areas that include, for example, physics and computer science. The finding that methods of complex networks can be used to analyze texts in their different complexity levels has implied in advances in natural language processing (NLP) tasks. Examples of applications analyzed with the methods of complex networks are keyword identification, development of automatic summarizers, and authorship attribution systems. The latter task has been studied with some success through the representation of co-occurrence (or adjacency) networks that connect only the closest words in the text. Despite this success, only a few works have attempted to extend this representation or employ different ones. Moreover, many approaches use a similar set of measurements to characterize the networks and do not combine their techniques with the ones traditionally used for the authorship attribution task. This Masters research proposes some extensions to the traditional co-occurrence model and investigates new attributes and other representations (such as mesoscopic and named entity networks) for the task. The connectivity information of function words is used to complement the characterization of authors writing styles, as these words are relevant for the task. Finally, the main contribution of this research is the development of hybrid classifiers, called labelled motifs, that combine traditional factors with properties obtained with the topological analysis of complex networks. The relevance of these classifiers is verified in the context of authorship attribution and translationese identification. With this hybrid approach, we show that it is possible to improve the performance of networkbased techniques when they are combined with traditional ones usually employed in NLP. By adapting, combining and improving the model, not only the performance of authorship attribution systems was improved, but also it was possible to better understand what are the textual quantitative factors (measured through networks) that can be used in stylometry studies. The advances obtained during this project may be useful to study related applications, such as the analysis of stylistic inconsistencies and plagiarism, and the analysis of text complexity. Furthermore, most of the methods proposed in this work can be easily applied to many natural languages.Redes complexas vem sendo aplicadas com sucesso em diferentes domínios, sendo o tema de estudo de distintas áreas que incluem, por exemplo, a física e a computação. A descoberta de que métodos de redes complexas podem ser utilizados para analisar textos em seus distintos níveis de complexidade proporcionou avanços em tarefas de processamento de línguas naturais (PLN). Exemplos de aplicações analisadas com os métodos de redes complexas são a detecção de palavras-chave, a criação de sumarizadores automáticos e o reconhecimento de autoria. Esta última tarefa tem sido estudada com certo sucesso através da representação de redes de co-ocorrência (ou adjacência) de palavras que conectam apenas as palavras mais próximas no texto. Apesar deste sucesso, poucos trabalhos tentaram estender essas redes ou utilizar diferentes representações. Além disso, muitas das abordagens utilizam um conjunto semelhante de medidas de redes complexas e não combinam suas técnicas com as utilizadas tradicionalmente na tarefa de reconhecimento de autoria. Esta pesquisa de mestrado propõe extensões à modelagem tradicional de co-ocorrência e investiga a adequabilidade de novos atributos e de outras modelagens (como as redes mesoscópicas e de entidades nomeadas) para a tarefa. A informação de conectividade de palavras funcionais é utilizada para complementar a caracterização da escrita dos autores, uma vez que essas palavras são relevantes para a tarefa. Finalmente, a maior contribuição deste trabalho consiste no desenvolvimento de classificadores híbridos, denominados labelled motifs, que combinam fatores tradicionais com as propriedades fornecidas pela análise topológica de redes complexas. A relevância desses classificadores é verificada no contexto de reconhecimento de autoria e identificação de translationese. Com esta abordagem híbrida, mostra-se que é possível melhorar o desempenho de técnicas baseadas em rede ao combiná-las com técnicas tradicionais em PLN. Através da adaptação, combinação e aperfeiçoamento da modelagem, não apenas o desempenho dos sistemas de reconhecimento de autoria foi melhorado, mas também foi possível entender melhor quais são os fatores quantitativos textuais (medidos via redes) que podem ser utilizados na área de estilometria. Os avanços obtidos durante este projeto podem ser utilizados para estudar aplicações relacionadas, como é o caso da análise de inconsistências estilísticas e plagiarismos, e análise da complexidade textual. Além disso, muitos dos métodos propostos neste trabalho podem ser facilmente aplicados em diversas línguas naturais.Biblioteca Digitais de Teses e Dissertações da USPAmancio, Diego RaphaelMarinho, Vanessa Queiroz2017-07-14info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttp://www.teses.usp.br/teses/disponiveis/55/55134/tde-14112017-091805/reponame:Biblioteca Digital de Teses e Dissertações da USPinstname:Universidade de São Paulo (USP)instacron:USPLiberar o conteúdo para acesso público.info:eu-repo/semantics/openAccesseng2018-07-17T16:38:18Zoai:teses.usp.br:tde-14112017-091805Biblioteca Digital de Teses e Dissertaçõeshttp://www.teses.usp.br/PUBhttp://www.teses.usp.br/cgi-bin/mtd2br.plvirginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.bropendoar:27212018-07-17T16:38:18Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)false
dc.title.none.fl_str_mv Development of new models for authorship recognition using complex networks
Desenvolvimento de novos modelos para reconhecimento de autoria com a utilização de redes complexas
title Development of new models for authorship recognition using complex networks
spellingShingle Development of new models for authorship recognition using complex networks
Marinho, Vanessa Queiroz
Authorship attribution
Complex networks
Natural language processing
Processamento de línguas naturais
Reconhecimento de autoria
Redes complexas
title_short Development of new models for authorship recognition using complex networks
title_full Development of new models for authorship recognition using complex networks
title_fullStr Development of new models for authorship recognition using complex networks
title_full_unstemmed Development of new models for authorship recognition using complex networks
title_sort Development of new models for authorship recognition using complex networks
author Marinho, Vanessa Queiroz
author_facet Marinho, Vanessa Queiroz
author_role author
dc.contributor.none.fl_str_mv Amancio, Diego Raphael
dc.contributor.author.fl_str_mv Marinho, Vanessa Queiroz
dc.subject.por.fl_str_mv Authorship attribution
Complex networks
Natural language processing
Processamento de línguas naturais
Reconhecimento de autoria
Redes complexas
topic Authorship attribution
Complex networks
Natural language processing
Processamento de línguas naturais
Reconhecimento de autoria
Redes complexas
description Complex networks have been successfully applied to different fields, being the subject of study in different areas that include, for example, physics and computer science. The finding that methods of complex networks can be used to analyze texts in their different complexity levels has implied in advances in natural language processing (NLP) tasks. Examples of applications analyzed with the methods of complex networks are keyword identification, development of automatic summarizers, and authorship attribution systems. The latter task has been studied with some success through the representation of co-occurrence (or adjacency) networks that connect only the closest words in the text. Despite this success, only a few works have attempted to extend this representation or employ different ones. Moreover, many approaches use a similar set of measurements to characterize the networks and do not combine their techniques with the ones traditionally used for the authorship attribution task. This Masters research proposes some extensions to the traditional co-occurrence model and investigates new attributes and other representations (such as mesoscopic and named entity networks) for the task. The connectivity information of function words is used to complement the characterization of authors writing styles, as these words are relevant for the task. Finally, the main contribution of this research is the development of hybrid classifiers, called labelled motifs, that combine traditional factors with properties obtained with the topological analysis of complex networks. The relevance of these classifiers is verified in the context of authorship attribution and translationese identification. With this hybrid approach, we show that it is possible to improve the performance of networkbased techniques when they are combined with traditional ones usually employed in NLP. By adapting, combining and improving the model, not only the performance of authorship attribution systems was improved, but also it was possible to better understand what are the textual quantitative factors (measured through networks) that can be used in stylometry studies. The advances obtained during this project may be useful to study related applications, such as the analysis of stylistic inconsistencies and plagiarism, and the analysis of text complexity. Furthermore, most of the methods proposed in this work can be easily applied to many natural languages.
publishDate 2017
dc.date.none.fl_str_mv 2017-07-14
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://www.teses.usp.br/teses/disponiveis/55/55134/tde-14112017-091805/
url http://www.teses.usp.br/teses/disponiveis/55/55134/tde-14112017-091805/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Liberar o conteúdo para acesso público.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Liberar o conteúdo para acesso público.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digitais de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digitais de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1815256976590897152