Evaluating features for rhetorical structure classification in scientific abstracts

Iriguti, Alessandra Harumi; Feltrim, Valéria Delisandra

Bibliographic details
Main author:	Iriguti, Alessandra Harumi
Publication date:	2019
Other authors:	Feltrim, Valéria Delisandra
Document type:	Article
Language:	Portuguese (por)
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	https://doi.org/10.21814/lm.11.1.273
Abstract:	Rhetorical structure classification is an NLP task that aims to identify the rhetorical components of a discourse and their relationships. In this work, we aimed to automatically identify the sentence-level propositions that make up the rhetorical structure of scientific abstracts. Specifically, the purpose was to evaluate the impact of different feature sets on the implementation of rhetorical classifiers for scientific abstracts written in Portuguese. To this end, we used superficial features (extracted as TF-IDF values and selected with the χ² test), morphosyntactic features (implemented by the AZPort classifier), and features extracted from pre-trained word embedding models (Word2Vec, Wang2Vec, and GloVe). These feature sets, as well as their combinations, were used to train the following supervised learning classifiers: Support Vector Machines, Naive Bayes, K-Nearest Neighbors, Decision Trees, and Conditional Random Fields (CRF). The classifiers were trained and tested through cross-validation on three corpora composed of abstracts of theses and dissertations. The best result, an F1 of 94%, was obtained by the CRF classifier with the following feature combinations: (i) 100-dimensional Wang2Vec Skip-gram embeddings with the AZPort features; (ii) TF-IDF, AZPort, and embeddings extracted with the Word2Vec Skip-gram and GloVe models of dimensions 1000 and 300, respectively. From the results, we concluded that the AZPort features were fundamental to the performance of the CRF and that combining them with word embeddings proved useful.
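
The superficial-feature step named in the abstract (TF-IDF values filtered by a χ² test) can be illustrated with a minimal, standard-library Python sketch. This is not the authors' implementation: the toy English sentences, the label names, and every function name below are hypothetical, the TF-IDF formula is just one common variant, and the χ² statistic is computed from a plain 2x2 presence/absence table, which may differ from the formulation used in the study.

```python
import math
from collections import Counter

# Toy, hypothetical data: tokenized sentences with made-up rhetorical labels.
SENTENCES = [
    (["previous", "work", "has", "studied", "this", "problem"], "BACKGROUND"),
    (["this", "problem", "remains", "open"], "BACKGROUND"),
    (["we", "aim", "to", "evaluate", "features"], "PURPOSE"),
    (["we", "aim", "to", "classify", "sentences"], "PURPOSE"),
    (["results", "show", "high", "f1"], "RESULT"),
    (["results", "show", "that", "features", "help"], "RESULT"),
]


def tfidf(docs):
    """Return one {term: tf * log(N / df)} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def chi2_score(term, label, data):
    """Chi-squared statistic of the 2x2 table (term present?) x (has label?)."""
    a = sum(1 for toks, y in data if term in toks and y == label)
    b = sum(1 for toks, y in data if term in toks and y != label)
    c = sum(1 for toks, y in data if term not in toks and y == label)
    d = sum(1 for toks, y in data if term not in toks and y != label)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_terms(data, k):
    """Keep the k terms with the highest chi-squared score over any label."""
    labels = {y for _, y in data}
    vocab = sorted({t for toks, _ in data for t in toks})
    scored = {t: max(chi2_score(t, y, data) for y in labels) for t in vocab}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(vocab, key=lambda t: (-scored[t], t))[:k]
```

In this sketch, `select_terms(SENTENCES, 3)` picks the terms most strongly associated with a single label, and only those terms would then be kept in the vectors returned by `tfidf` before they are passed to a classifier.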

Item metadata

id	RCAP_c580acae6f1174428515c8938f1fb727
oai_identifier_str	oai:linguamatica.com:article/273
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
spelling	Evaluating features for rhetorical structure classification in scientific abstracts; Avaliando atributos para a classificação de estrutura retórica em resumos científicos; natural language processing; rhetorical structure classification; scientific abstracts in Portuguese. Rhetorical structure classification is an NLP task that seeks to identify the rhetorical components of a discourse and their relationships. In this work, we sought to automatically identify the sentence-level categories that make up the rhetorical structure of scientific abstracts. Specifically, the goal was to evaluate the impact of different feature sets on the implementation of rhetorical classifiers for scientific abstracts written in Portuguese. To this end, we used superficial features (extracted as TF-IDF values and selected with the chi-squared test), morphosyntactic features (implemented by the AZPort classifier), and features extracted from word embedding models (Word2Vec, Wang2Vec, and GloVe, all pre-trained). These feature sets, as well as their combinations, were used to train classifiers with the following supervised learning algorithms: Support Vector Machines, Naive Bayes, K-Nearest Neighbors, Decision Trees, and Conditional Random Fields (CRF). The classifiers were evaluated through cross-validation on three corpora composed of abstracts of theses and dissertations. The best result, an F1 of 94%, was obtained by the CRF classifier with the following feature combinations: (i) 100-dimensional Wang2Vec Skip-gram embeddings with the AZPort features; (ii) 300-dimensional Wang2Vec Skip-gram and GloVe embeddings with the AZPort features; (iii) TF-IDF, AZPort, and embeddings extracted with the Wang2Vec Skip-gram models of dimensions 100 and 300 and the GloVe model of dimension 300. From the results obtained, we conclude that the features from the AZPort classifier were fundamental to the good performance of the CRF classifier, while the combination with word embeddings proved valid for improving the results.
dc.title.none.fl_str_mv	Evaluating features for rhetorical structure classification in scientific abstracts / Avaliando atributos para a classificação de estrutura retórica em resumos científicos
title	Evaluating features for rhetorical structure classification in scientific abstracts
spellingShingle	Evaluating features for rhetorical structure classification in scientific abstracts; Iriguti, Alessandra Harumi; natural language processing; rhetorical structure classification; scientific abstracts in Portuguese
title_short	Evaluating features for rhetorical structure classification in scientific abstracts
title_full	Evaluating features for rhetorical structure classification in scientific abstracts
title_fullStr	Evaluating features for rhetorical structure classification in scientific abstracts
title_full_unstemmed	Evaluating features for rhetorical structure classification in scientific abstracts
title_sort	Evaluating features for rhetorical structure classification in scientific abstracts
author	Iriguti, Alessandra Harumi
author_facet	Iriguti, Alessandra Harumi; Feltrim, Valéria Delisandra
author_role	author
author2	Feltrim, Valéria Delisandra
author2_role	author
dc.contributor.author.fl_str_mv	Iriguti, Alessandra Harumi; Feltrim, Valéria Delisandra
dc.subject.por.fl_str_mv	natural language processing; rhetorical structure classification; scientific abstracts in Portuguese
topic	natural language processing; rhetorical structure classification; scientific abstracts in Portuguese
description	Rhetorical structure classification is an NLP task that aims to identify the rhetorical components of a discourse and their relationships. In this work, we aimed to automatically identify the sentence-level propositions that make up the rhetorical structure of scientific abstracts. Specifically, the purpose was to evaluate the impact of different feature sets on the implementation of rhetorical classifiers for scientific abstracts written in Portuguese. To this end, we used superficial features (extracted as TF-IDF values and selected with the χ² test), morphosyntactic features (implemented by the AZPort classifier), and features extracted from pre-trained word embedding models (Word2Vec, Wang2Vec, and GloVe). These feature sets, as well as their combinations, were used to train the following supervised learning classifiers: Support Vector Machines, Naive Bayes, K-Nearest Neighbors, Decision Trees, and Conditional Random Fields (CRF). The classifiers were trained and tested through cross-validation on three corpora composed of abstracts of theses and dissertations. The best result, an F1 of 94%, was obtained by the CRF classifier with the following feature combinations: (i) 100-dimensional Wang2Vec Skip-gram embeddings with the AZPort features; (ii) TF-IDF, AZPort, and embeddings extracted with the Word2Vec Skip-gram and GloVe models of dimensions 1000 and 300, respectively. From the results, we concluded that the AZPort features were fundamental to the performance of the CRF and that combining them with word embeddings proved useful.
publishDate	2019
dc.date.none.fl_str_mv	2019-07-20
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://doi.org/10.21814/lm.11.1.273
url	https://doi.org/10.21814/lm.11.1.273
dc.language.iso.fl_str_mv	por
language	por
dc.relation.none.fl_str_mv	https://linguamatica.com/index.php/linguamatica/article/view/273; https://linguamatica.com/index.php/linguamatica/article/view/273/451
dc.rights.driver.fl_str_mv	Copyright (c) 2019 Alessandra Harumi Iriguti, Valéria Delisandra Feltrim; http://creativecommons.org/licenses/by/4.0; info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Copyright (c) 2019 Alessandra Harumi Iriguti, Valéria Delisandra Feltrim; http://creativecommons.org/licenses/by/4.0
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Universidade do Minho e Universidade de Vigo
publisher.none.fl_str_mv	Universidade do Minho e Universidade de Vigo
dc.source.none.fl_str_mv	Linguamática; Vol. 11 No. 1; 41-53; 1647-0818; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799133553991090176
