Assistência bibliográfica durante a escrita de textos científicos : uma abordagem com modelos de linguagem pré-treinados

Santana, Demetrius Silva de

Bibliographic details
Main author:	Santana, Demetrius Silva de
Publication date:	2020
Document type:	Bachelor's thesis
Language:	por
Source title:	Repositório Institucional da UFS
Full text:	http://ri.ufs.br/jspui/handle/riufs/13484
Abstract:	Scientific output includes the production of textual artifacts, such as scientific papers and research projects. Scientific writing, in turn, presents its own challenges. One of them is coping with information overload in the literature, which makes it harder to contextualize a new document with the most relevant information from its research area. To address this difficulty, this work investigates ways of embedding sentences and paragraphs of scientific texts, with the purpose of recommending content during scientific writing. Through automated crawling of the databases of the publishers Springer and Elsevier, a local mirror of research papers was built, from which a corpus of 40 thousand papers from Computer Science journals was obtained. Pre-trained language models were used to obtain embeddings of words or wordpieces for a sample of 1605 papers. Using these representations, aggregation methods for generating sentence and paragraph embeddings were investigated. Embedding strategies for these textual elements were evaluated in two respects. First, in their capacity to reflect the grouping of paragraphs within sections of already published papers. Second, they were kept frozen while training a bidirectional long short-term memory (BiLSTM) neural network on the task of classifying whether a text fragment and an abstract belonged to the same paper. In both settings, better performance was obtained with embeddings from Bidirectional Encoder Representations from Transformers (BERT), specifically with a version pre-trained on a scientific corpus (SciBERT). Also in both settings, sentence embeddings obtained by taking the mean of pre-trained word embeddings weighted by word frequency performed worse than the simple mean of the vectors after removing stopwords. Paragraph embeddings obtained by encoding a sequence of sentences with a BiLSTM outperformed the simple mean of sentence vectors and, when applied to the introduction of a scientific paper in the test set, were able to retrieve the abstract of that same paper within the 5% most likely ones, on average. A qualitative demonstration of content recommendation that integrates the results of both analyses is presented. Considering the strategies studied, automated bibliographical assistance during the production of scientific texts proved feasible, with potential for improvement through more optimized hierarchical BiLSTMs.
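The two sentence-aggregation strategies compared in the abstract, and the "within the 5% most likely ones" ranking measure, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the toy 3-dimensional word vectors, the stopword list, the weighting formula a / (a + p(word)) (the abstract says only that the mean was weighted by word frequency), and the use of cosine similarity for scoring (the thesis scores candidates with a trained BiLSTM classifier).

```python
import math
from collections import Counter

# Toy 3-dimensional "pre-trained" word vectors (hypothetical values).
WORD_VECTORS = {
    "the": [0.9, 0.1, 0.0],
    "of": [0.8, 0.2, 0.0],
    "neural": [0.1, 0.9, 0.2],
    "network": [0.2, 0.8, 0.3],
    "training": [0.0, 0.7, 0.6],
}
STOPWORDS = {"the", "of"}  # hypothetical stopword list


def mean_vector(vectors, weights=None):
    """Weighted mean of equal-length vectors (uniform weights by default)."""
    if not vectors:
        raise ValueError("no vectors to aggregate")
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) / total
            for i in range(dim)]


def embed_without_stopwords(tokens):
    """Simple mean of word vectors after dropping stopwords."""
    kept = [WORD_VECTORS[t] for t in tokens
            if t in WORD_VECTORS and t not in STOPWORDS]
    return mean_vector(kept)


def embed_frequency_weighted(tokens, unigram_prob, a=1e-3):
    """Mean of word vectors, each weighted by a / (a + p(word)).

    The formula is an assumed stand-in for the frequency weighting
    mentioned in the abstract; frequent words get smaller weights.
    """
    known = [t for t in tokens if t in WORD_VECTORS]
    weights = [a / (a + unigram_prob[t]) for t in known]
    return mean_vector([WORD_VECTORS[t] for t in known], weights)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def percentile_rank(query, candidates, true_index):
    """Fraction of candidates scored strictly above the true one.

    0.0 means the true candidate is ranked first; a value below 0.05
    corresponds to "within the 5% most likely ones".
    """
    scores = [cosine(query, c) for c in candidates]
    better = sum(1 for s in scores if s > scores[true_index])
    return better / len(candidates)


tokens = "the training of the neural network".split()
counts = Counter(tokens)
probs = {word: n / len(tokens) for word, n in counts.items()}

plain = embed_without_stopwords(tokens)
weighted = embed_frequency_weighted(tokens, probs)
rank = percentile_rank(plain, [WORD_VECTORS["the"], plain, WORD_VECTORS["of"]],
                       true_index=1)
```

In this sketch the stopword-filtered mean discards "the" and "of" entirely, whereas the frequency-weighted mean only shrinks their contribution, so some of their signal remains in the sentence vector. Averaging `percentile_rank` over many test queries gives a mean rank of the kind the abstract reports.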

Item metadata

id	UFS-2_1fa1a9eeacc25a8a2f295403703e4654
oai_identifier_str	oai:ufs.br:riufs/13484
network_acronym_str	UFS-2
network_name_str	Repositório Institucional da UFS
repository_id_str
dc.title.pt_BR.fl_str_mv	Assistência bibliográfica durante a escrita de textos científicos : uma abordagem com modelos de linguagem pré-treinados
title	Assistência bibliográfica durante a escrita de textos científicos : uma abordagem com modelos de linguagem pré-treinados
spellingShingle	Assistência bibliográfica durante a escrita de textos científicos : uma abordagem com modelos de linguagem pré-treinados Santana, Demetrius Silva de Engenharia de computação Ensino de engenharia de computação Processamento de linguagem natural Redação científica Mineração de texto Vetorização de palavras Recomendação de conteúdo Assistência bibliográfica Natural language processing Scientific writing Text mining Word embeddings Content recommendation Bibliographical assistance CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO::LOGICAS E SEMANTICA DE PROGRAMAS
title_short	Assistência bibliográfica durante a escrita de textos científicos : uma abordagem com modelos de linguagem pré-treinados
title_full	Assistência bibliográfica durante a escrita de textos científicos : uma abordagem com modelos de linguagem pré-treinados
title_fullStr	Assistência bibliográfica durante a escrita de textos científicos : uma abordagem com modelos de linguagem pré-treinados
title_full_unstemmed	Assistência bibliográfica durante a escrita de textos científicos : uma abordagem com modelos de linguagem pré-treinados
title_sort	Assistência bibliográfica durante a escrita de textos científicos : uma abordagem com modelos de linguagem pré-treinados
author	Santana, Demetrius Silva de
author_facet	Santana, Demetrius Silva de
author_role	author
dc.contributor.author.fl_str_mv	Santana, Demetrius Silva de
dc.contributor.advisor1.fl_str_mv	Macedo, Hendrik Teixeira
dc.contributor.advisor-co1.fl_str_mv	Santos, Flávio Arthur Oliveira
contributor_str_mv	Macedo, Hendrik Teixeira Santos, Flávio Arthur Oliveira
dc.subject.por.fl_str_mv	Engenharia de computação Ensino de engenharia de computação Processamento de linguagem natural Redação científica Mineração de texto Vetorização de palavras Recomendação de conteúdo Assistência bibliográfica
topic	Engenharia de computação Ensino de engenharia de computação Processamento de linguagem natural Redação científica Mineração de texto Vetorização de palavras Recomendação de conteúdo Assistência bibliográfica Natural language processing Scientific writing Text mining Word embeddings Content recommendation Bibliographical assistance CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO::LOGICAS E SEMANTICA DE PROGRAMAS
dc.subject.eng.fl_str_mv	Natural language processing Scientific writing Text mining Word embeddings Content recommendation Bibliographical assistance
dc.subject.cnpq.fl_str_mv	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO::LOGICAS E SEMANTICA DE PROGRAMAS
description	Scientific output includes the production of textual artifacts, such as scientific papers and research projects. Scientific writing, in turn, presents its own challenges. One of them is coping with information overload in the literature, which makes it harder to contextualize a new document with the most relevant information from its research area. To address this difficulty, this work investigates ways of embedding sentences and paragraphs of scientific texts, with the purpose of recommending content during scientific writing. Through automated crawling of the databases of the publishers Springer and Elsevier, a local mirror of research papers was built, from which a corpus of 40 thousand papers from Computer Science journals was obtained. Pre-trained language models were used to obtain embeddings of words or wordpieces for a sample of 1605 papers. Using these representations, aggregation methods for generating sentence and paragraph embeddings were investigated. Embedding strategies for these textual elements were evaluated in two respects. First, in their capacity to reflect the grouping of paragraphs within sections of already published papers. Second, they were kept frozen while training a bidirectional long short-term memory (BiLSTM) neural network on the task of classifying whether a text fragment and an abstract belonged to the same paper. In both settings, better performance was obtained with embeddings from Bidirectional Encoder Representations from Transformers (BERT), specifically with a version pre-trained on a scientific corpus (SciBERT). Also in both settings, sentence embeddings obtained by taking the mean of pre-trained word embeddings weighted by word frequency performed worse than the simple mean of the vectors after removing stopwords. Paragraph embeddings obtained by encoding a sequence of sentences with a BiLSTM outperformed the simple mean of sentence vectors and, when applied to the introduction of a scientific paper in the test set, were able to retrieve the abstract of that same paper within the 5% most likely ones, on average. A qualitative demonstration of content recommendation that integrates the results of both analyses is presented. Considering the strategies studied, automated bibliographical assistance during the production of scientific texts proved feasible, with potential for improvement through more optimized hierarchical BiLSTMs.
publishDate	2020
dc.date.accessioned.fl_str_mv	2020-06-03T00:49:11Z
dc.date.available.fl_str_mv	2020-06-03T00:49:11Z
dc.date.issued.fl_str_mv	2020-02-19
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/bachelorThesis
format	bachelorThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	Santana, Demetrius Silva de. Assistência bibliográfica durante a escrita de textos científicos : uma abordagem com modelos de linguagem pré-treinados. São Cristóvão, SE, 2019. Monografia (graduação em Engenharia da Computação) – Curso de Engenharia de Computação, Departamento de Computação, Centro de Ciências Exatas e Tecnologia, Universidade Federal de Sergipe, São Cristóvão, 2019
dc.identifier.uri.fl_str_mv	http://ri.ufs.br/jspui/handle/riufs/13484
identifier_str_mv	Santana, Demetrius Silva de. Assistência bibliográfica durante a escrita de textos científicos : uma abordagem com modelos de linguagem pré-treinados. São Cristóvão, SE, 2019. Monografia (graduação em Engenharia da Computação) – Curso de Engenharia de Computação, Departamento de Computação, Centro de Ciências Exatas e Tecnologia, Universidade Federal de Sergipe, São Cristóvão, 2019
url	http://ri.ufs.br/jspui/handle/riufs/13484
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.initials.fl_str_mv	Universidade Federal de Sergipe
dc.publisher.department.fl_str_mv	DCOMP - Departamento de Computação – Engenharia de Computação – São Cristóvão - Presencial
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFS instname:Universidade Federal de Sergipe (UFS) instacron:UFS
instname_str	Universidade Federal de Sergipe (UFS)
instacron_str	UFS
institution	UFS
reponame_str	Repositório Institucional da UFS
collection	Repositório Institucional da UFS
bitstream.url.fl_str_mv	https://ri.ufs.br/jspui/bitstream/riufs/13484/2/Demetrius_Silva_Santana.pdf https://ri.ufs.br/jspui/bitstream/riufs/13484/3/Demetrius_Silva_Santana.pdf.txt https://ri.ufs.br/jspui/bitstream/riufs/13484/4/Demetrius_Silva_Santana.pdf.jpg https://ri.ufs.br/jspui/bitstream/riufs/13484/1/license.txt
bitstream.checksum.fl_str_mv	becc883c13f248e595888a66582dd020 1d636032604254968d831623d50a865b 10c0658f25321ccdd83ac14116c6165d 098cbbf65c2c15e1fb2e49c5d306a44c
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFS - Universidade Federal de Sergipe (UFS)
repository.mail.fl_str_mv	repositorio@academico.ufs.br
_version_	1802110731924013056
