A Data Augmentation approach to Automated Readability Assessment
Main author: | Cunha de Menezes, Luiza |
---|---|
Publication date: | 2023 |
Other authors: | Paes, Aline; Finatto, Maria José Bocorny |
Document type: | Article |
Language: | por |
Source title: | Domínios de Lingu@gem |
Full text: | https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318 |
Abstract: | Studies on how to measure text readability date back to the last century. Nonetheless, there is no consensus on which metrics are best. Natural Language Processing (NLP) tools may support this task, but they depend on a large number of training samples, which is a bottleneck to their advancement. The main goal of this paper is to analyze the impact of two data augmentation (DA) methods on the readability classification task in Brazilian Portuguese (BP), as a way to mitigate this bottleneck. To that end, we worked with a paired and classified corpus created by linguists. The corpus covers science topics, and each text has an original and a simplified version. As for the methodology, we considered two agnostic DA techniques, synonym replacement and back-translation, and evaluated 75 models with different techniques and combinations of input features. For models trained on the corpus without DA, the best accuracy (hit rate) was 94.0%. Combining the NILC-Metrix metrics with contextualized word embeddings raised the result to 95.2%. Compared with other work on BP, the proposed methodology improved accuracy on a domain distinct from the training domain. Our results show that models trained with DA can perform as well as or better than those trained without augmentation while generalizing better to other domains. |
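
As a rough illustration of the synonym-replacement technique named in the abstract, the sketch below swaps a few words for synonyms and keeps the original readability label. It is not the authors' implementation: the `SYNONYMS` table, the function names `synonym_replacement` and `augment`, and the `n_swaps` parameter are all hypothetical, and a real setup would presumably draw synonyms from a Portuguese lexical resource rather than a hand-written dictionary.

```python
import random

# Hypothetical toy synonym table (an assumption for illustration only);
# a real pipeline would use a Portuguese thesaurus or similar resource.
SYNONYMS = {
    "medir": ["mensurar", "avaliar"],
    "texto": ["documento"],
    "grande": ["amplo", "vasto"],
}


def synonym_replacement(sentence, synonyms, n_swaps=1, seed=0):
    """Return a copy of `sentence` with up to `n_swaps` words replaced by a synonym."""
    rng = random.Random(seed)
    tokens = sentence.split()
    # Positions whose (lowercased) token has at least one known synonym.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n_swaps]:
        tokens[i] = rng.choice(synonyms[tokens[i].lower()])
    return " ".join(tokens)


def augment(corpus, synonyms, n_swaps=1):
    """Append one augmented variant per (text, label) pair, keeping the label."""
    augmented = list(corpus)
    for idx, (text, label) in enumerate(corpus):
        new_text = synonym_replacement(text, synonyms, n_swaps, seed=idx)
        if new_text != text:  # skip texts with no replaceable word
            augmented.append((new_text, label))
    return augmented


if __name__ == "__main__":
    corpus = [("medir um texto grande", "complex"), ("sem candidatos", "simple")]
    for text, label in augment(corpus, SYNONYMS, n_swaps=2):
        print(label, "|", text)
```

Back-translation follows the same pattern, with the replacement step swapped for a round trip through a translation system; that part is omitted here because it depends on an external model or service.
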
id |
UFU-12_db3ae752871d2556b8abd35b7ac1a4a1 |
---|---|
oai_identifier_str |
oai:ojs.www.seer.ufu.br:article/68318 |
network_acronym_str |
UFU-12 |
network_name_str |
Domínios de Lingu@gem |
repository_id_str |
|
dc.title.none.fl_str_mv |
A Data Augmentation approach to Automated Readability Assessment / Abordagem baseada em Aumento de Dados para Avaliação Automática de Leiturabilidade |
title |
A Data Augmentation approach to Automated Readability Assessment |
author |
Cunha de Menezes, Luiza |
author_facet |
Cunha de Menezes, Luiza; Paes, Aline; Finatto, Maria José Bocorny |
author_role |
author |
author2 |
Paes, Aline; Finatto, Maria José Bocorny |
author2_role |
author; author |
dc.contributor.author.fl_str_mv |
Cunha de Menezes, Luiza; Paes, Aline; Finatto, Maria José Bocorny |
dc.subject.por.fl_str_mv |
Natural Language Processing; Synonym Replacement; Back-translation; Data Augmentation; Automatic Readability Assessment |
publishDate |
2023 |
dc.date.none.fl_str_mv |
2023-04-05 |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318; 10.14393/DLv17a2023-21 |
url |
https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318 |
identifier_str_mv |
10.14393/DLv17a2023-21 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.relation.none.fl_str_mv |
https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318/35955; https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318/37460 |
dc.rights.driver.fl_str_mv |
Copyright (c) 2023 Luiza Cunha de Menezes, Aline Paes, Maria José Bocorny Finatto; http://creativecommons.org/licenses/by-nc-nd/4.0; info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Copyright (c) 2023 Luiza Cunha de Menezes, Aline Paes, Maria José Bocorny Finatto; http://creativecommons.org/licenses/by-nc-nd/4.0 |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf; text/xml |
dc.publisher.none.fl_str_mv |
PPUFU |
dc.source.none.fl_str_mv |
Domínios de Lingu@gem; Vol. 17 (2023): Domínios de Lingu@gem; e1721; 1980-5799; reponame:Domínios de Lingu@gem; instname:Universidade Federal de Uberlândia (UFU); instacron:UFU |
instname_str |
Universidade Federal de Uberlândia (UFU) |
instacron_str |
UFU |
institution |
UFU |
reponame_str |
Domínios de Lingu@gem |
collection |
Domínios de Lingu@gem |
repository.name.fl_str_mv |
Domínios de Lingu@gem - Universidade Federal de Uberlândia (UFU) |
repository.mail.fl_str_mv |
revistadominios@ileel.ufu.br |
_version_ |
1797067712521830400 |