A Data Augmentation approach to Automated Readability Assessment

Cunha de Menezes, Luiza; Paes, Aline; Finatto, Maria José Bocorny

Bibliographic details
Main author:	Cunha de Menezes, Luiza
Publication date:	2023
Other authors:	Paes, Aline; Finatto, Maria José Bocorny
Document type:	Article
Language:	por
Source title:	Domínios de Lingu@gem
Full text:	https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318
Abstract:	Studies on how to measure text readability date back to the last century. Nonetheless, there is still no consensus on which metrics are best. Natural Language Processing (NLP) tools can support this task, but they depend on a large number of training samples, which is a bottleneck to progress in the field. The main goal of this paper is to analyze the impact of selected data augmentation (DA) methods on the readability classification task in Brazilian Portuguese (BP), in order to mitigate this bottleneck. To this end, we worked with a paired, labeled corpus created by linguists. The corpus covers science topics, and each text appears in an original and a simplified version. Methodologically, we considered two agnostic DA techniques, synonym replacement and back-translation, and evaluated 75 models with different techniques and combinations of input features. For models trained on the corpus without DA, the best hit rate was 94.0%. Combining the NILC-Metrix metrics with contextualized word embeddings raised the hit rate to 95.2%. Compared with other work on BP, the proposed methodology improved the hit rate in a domain distinct from the training domain. Our results indicate that models trained with DA perform as well as or better than those trained without augmentation and, at the same time, generalize better when applied to other domains.
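
The synonym replacement step named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which lexical resource, replacement rate, or tokenization was used, so the `SYNONYMS` table, the `rate` and `seed` parameters, and the function names `synonym_replacement` and `augment` are all hypothetical. The sketch only shows the general idea of producing a reworded copy of each text while keeping its readability label. Back-translation is not covered here because it requires an external translation model.

```python
import random

# Hypothetical toy synonym table (assumption). A real setup would draw on a
# Portuguese thesaurus or similar lexical resource.
SYNONYMS = {
    "grande": ["amplo", "extenso"],
    "rápido": ["veloz", "ligeiro"],
    "mostrar": ["exibir", "apresentar"],
}


def synonym_replacement(text, synonyms, rate=0.5, seed=0):
    """Return a copy of `text` with some words swapped for a synonym.

    Tokens are split on whitespace only, so a word with attached punctuation
    will not match the table; this keeps the sketch short.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    out = []
    for token in text.split():
        options = synonyms.get(token.lower())
        # Replace only words that have synonyms, with probability `rate`.
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(token)
    return " ".join(out)


def augment(corpus, synonyms, rate=0.5, seed=0):
    """Add one augmented copy per (text, label) pair, keeping the label."""
    augmented = list(corpus)
    for i, (text, label) in enumerate(corpus):
        new_text = synonym_replacement(text, synonyms, rate, seed + i)
        if new_text != text:  # skip texts in which nothing was replaced
            augmented.append((new_text, label))
    return augmented


corpus = [("um texto grande", "complex"), ("sem troca", "simple")]
augmented_corpus = augment(corpus, SYNONYMS, rate=1.0)
```

In this sketch the augmented copy inherits the label of its source text, which assumes that swapping synonyms does not change the readability class; whether that holds for a given corpus would need to be checked.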

Item metadata

id	UFU-12_db3ae752871d2556b8abd35b7ac1a4a1
oai_identifier_str	oai:ojs.www.seer.ufu.br:article/68318
network_acronym_str	UFU-12
network_name_str	Domínios de Lingu@gem
repository_id_str
dc.title.none.fl_str_mv	A Data Augmentation approach to Automated Readability Assessment / Abordagem baseada em Aumento de Dados para Avaliação Automática de Leiturabilidade
title	A Data Augmentation approach to Automated Readability Assessment
author	Cunha de Menezes, Luiza
author_facet	Cunha de Menezes, Luiza; Paes, Aline; Finatto, Maria José Bocorny
author_role	author
author2	Paes, Aline; Finatto, Maria José Bocorny
author2_role	author author
dc.contributor.author.fl_str_mv	Cunha de Menezes, Luiza; Paes, Aline; Finatto, Maria José Bocorny
dc.subject.por.fl_str_mv	Processamento de Linguagem Natural; Substituição por Sinônimo; Retrotradução; Aumento de Dados; Avaliação Automática de Leiturabilidade; Natural Language Processing; Synonym Replacement; Back-translation; Data Augmentation; Automatic Readability Assessment
topic	Processamento de Linguagem Natural; Substituição por Sinônimo; Retrotradução; Aumento de Dados; Avaliação Automática de Leiturabilidade; Natural Language Processing; Synonym Replacement; Back-translation; Data Augmentation; Automatic Readability Assessment
publishDate	2023
dc.date.none.fl_str_mv	2023-04-05
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318 10.14393/DLv17a2023-21
url	https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318
identifier_str_mv	10.14393/DLv17a2023-21
dc.language.iso.fl_str_mv	por
language	por
dc.relation.none.fl_str_mv	https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318/35955 https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/68318/37460
dc.rights.driver.fl_str_mv	Copyright (c) 2023 Luiza Cunha de Menezes, Aline Paes, Maria José Bocorny Finatto http://creativecommons.org/licenses/by-nc-nd/4.0 info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Copyright (c) 2023 Luiza Cunha de Menezes, Aline Paes, Maria José Bocorny Finatto http://creativecommons.org/licenses/by-nc-nd/4.0
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf text/xml
dc.publisher.none.fl_str_mv	PPUFU
publisher.none.fl_str_mv	PPUFU
dc.source.none.fl_str_mv	Domínios de Lingu@gem; Vol. 17 (2023): Domínios de Lingu@gem; e1721 1980-5799 reponame:Domínios de Lingu@gem instname:Universidade Federal de Uberlândia (UFU) instacron:UFU
instname_str	Universidade Federal de Uberlândia (UFU)
instacron_str	UFU
institution	UFU
reponame_str	Domínios de Lingu@gem
collection	Domínios de Lingu@gem
repository.name.fl_str_mv	Domínios de Lingu@gem - Universidade Federal de Uberlândia (UFU)
repository.mail.fl_str_mv	revistadominios@ileel.ufu.br
_version_	1797067712521830400
