A Data Augmentation approach to Automated Readability Assessment

Bibliographic details
Main author: Cunha de Menezes, Luiza
Publication date: 2023
Other authors: Paes, Aline; Finatto, Maria José Bocorny
Document type: Article
Language: por (Portuguese)
Source title: Domínios de Lingu@gem
Full text: https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318
Abstract: Studies on how to measure text readability date back to the last century. Nonetheless, there is no consensus on which metrics are best. Natural Language Processing (NLP) tools may support this task, but they depend on a large number of training samples, and that is a bottleneck to their advancement. The main goal of this paper is to analyze the impact of two data augmentation (DA) methods on the readability classification task in Brazilian Portuguese (BP), as a way to mitigate this bottleneck. To that end, we worked with a paired and classified corpus created by linguists. The corpus covers science topics, and each text is available in its original (complex) version and in a simplified version. The corpus was augmented with two agnostic DA techniques, synonym replacement and back-translation, and we evaluated 75 models with different techniques and combinations of input features. For models trained on the corpus without DA, the best hit rate was 94.0%. Combining the NILC-Metrix metrics with contextualized word embeddings raised this result to 95.2%. Compared with other work on BP, the proposed methodology improved the hit rate on a domain distinct from the training domain. Our results show that models trained with DA can perform as well as or better than those trained without augmentation and, at the same time, generalize better when applied to other domains.
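The synonym replacement step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the synonym table, the function names (`synonym_replacement`, `augment`), the `rate` parameter, and the choice to copy the readability label onto each augmented text are all assumptions made for illustration. A real pipeline would draw synonyms from a lexical resource for Brazilian Portuguese rather than from a hand-written dictionary.

```python
import random

# Hypothetical toy synonym table (assumption); not taken from the paper.
SYNONYMS = {
    "rápido": ["veloz", "ligeiro"],
    "difícil": ["complicado", "árduo"],
    "texto": ["documento"],
}


def synonym_replacement(sentence, synonyms, rate=0.5, rng=None):
    """Return a copy of `sentence` with some known words swapped for synonyms.

    Each whitespace-separated token found in `synonyms` is replaced with
    probability `rate`; all other tokens are kept unchanged.
    """
    if rng is None:
        rng = random.Random(0)
    out = []
    for token in sentence.split():
        candidates = synonyms.get(token.lower())
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(token)
    return " ".join(out)


def augment(corpus, synonyms, rate=0.5, seed=0):
    """Extend a list of (text, label) pairs with one augmented variant each.

    Assumption: the augmented text keeps the label of its source text.
    Variants identical to their source are skipped.
    """
    rng = random.Random(seed)
    augmented = list(corpus)
    for text, label in corpus:
        variant = synonym_replacement(text, synonyms, rate, rng)
        if variant != text:
            augmented.append((variant, label))
    return augmented


if __name__ == "__main__":
    corpus = [("o texto é difícil", "original"), ("o texto é fácil", "simplified")]
    for text, label in augment(corpus, SYNONYMS, rate=1.0):
        print(label, "|", text)
```

Whether a synonym swap really preserves the readability level of a text is itself an assumption of this sketch; it is the kind of effect the evaluation described in the abstract is meant to measure.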
OAI identifier: oai:ojs.www.seer.ufu.br:article/68318
Alternative title (Portuguese): Abordagem baseada em Aumento de Dados para Avaliação Automática de Leiturabilidade
Keywords: Natural Language Processing; Synonym Replacement; Back-translation; Data Augmentation; Automatic Readability Assessment
Date issued: 2023-04-05
Version: published version
DOI: 10.14393/DLv17a2023-21
Related files: https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318/35955; https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318/37460
File formats: application/pdf; text/xml
Rights: Copyright (c) 2023 Luiza Cunha de Menezes, Aline Paes, Maria José Bocorny Finatto
License: http://creativecommons.org/licenses/by-nc-nd/4.0
Access: open access
Publisher: PPUFU
Source: Domínios de Lingu@gem; Vol. 17 (2023); e1721
ISSN: 1980-5799
Institution: Universidade Federal de Uberlândia (UFU)
Repository: Domínios de Lingu@gem - Universidade Federal de Uberlândia (UFU)
Repository contact: revistadominios@ileel.ufu.br