An investigation of linguistic problems in automatic multi-document summaries
Main author: | Dias, Márcio de Souza |
---|---|
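Publication date: | 2021 |
Other authors: | Di Felippo, Ariani; Rassi, Amanda Pontes; Cardoso, Paula Christina Figueira; Nóbrega, Fernando Antônio Asevedo; Pardo, Thiago Alexandre Salgueiro |
Document type: | Article |
Language: | eng |
Source title: | Repositório Institucional da UFLA |
Full text: | http://repositorio.ufla.br/jspui/handle/1/50347 |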
Abstract: | Automatic summaries commonly present diverse linguistic problems that affect textual quality and thus their understanding by users. Few studies have tried to characterize such problems and their relation to the performance of summarization systems. In this paper, we investigated the problems in multi-document extracts (i.e., summaries produced by concatenating several sentences taken exactly as they appear in the source texts) generated by systems for Brazilian Portuguese that have different approaches (i.e., superficial and deep) and performances (i.e., baseline and state-of-the-art methods). To that end, we first reviewed the main characterization studies, resulting in a typology of linguistic problems more suitable for multi-document summarization. Then, we manually annotated a corpus of automatic multi-document extracts in Portuguese based on the typology, which showed that some linguistic problems are significantly more recurrent than others. Thus, this corpus annotation may support research on the detection and correction of linguistic problems for summary improvement, allowing the production of automatic summaries that are not only informative (i.e., they convey the content of the source material), but also linguistically well structured. |
id |
UFLA_64be6458626eee6da9d56d1af9cdd4b4 |
---|---|
oai_identifier_str |
oai:localhost:1/50347 |
network_acronym_str |
UFLA |
network_name_str |
Repositório Institucional da UFLA |
repository_id_str |
|
dc.title.none.fl_str_mv |
An investigation of linguistic problems in automatic multi-document summaries / Uma investigação de problemas linguísticos em sumários automáticos multidocumento |
title |
An investigation of linguistic problems in automatic multi-document summaries |
title_short |
An investigation of linguistic problems in automatic multi-document summaries |
title_full |
An investigation of linguistic problems in automatic multi-document summaries |
title_fullStr |
An investigation of linguistic problems in automatic multi-document summaries |
title_full_unstemmed |
An investigation of linguistic problems in automatic multi-document summaries |
title_sort |
An investigation of linguistic problems in automatic multi-document summaries |
author |
Dias, Márcio de Souza |
author_facet |
Dias, Márcio de Souza; Di Felippo, Ariani; Rassi, Amanda Pontes; Cardoso, Paula Christina Figueira; Nóbrega, Fernando Antônio Asevedo; Pardo, Thiago Alexandre Salgueiro |
author_role |
author |
author2 |
Di Felippo, Ariani; Rassi, Amanda Pontes; Cardoso, Paula Christina Figueira; Nóbrega, Fernando Antônio Asevedo; Pardo, Thiago Alexandre Salgueiro |
author2_role |
author author author author author |
dc.contributor.author.fl_str_mv |
Dias, Márcio de Souza; Di Felippo, Ariani; Rassi, Amanda Pontes; Cardoso, Paula Christina Figueira; Nóbrega, Fernando Antônio Asevedo; Pardo, Thiago Alexandre Salgueiro |
dc.subject.por.fl_str_mv |
Automatic summarization; Multi-document summary; Linguistic problem; Corpus annotation; Sumarização automática; Sumário multidocumento; Problema linguístico; Anotação de corpus |
topic |
Automatic summarization; Multi-document summary; Linguistic problem; Corpus annotation; Sumarização automática; Sumário multidocumento; Problema linguístico; Anotação de corpus |
description |
Automatic summaries commonly present diverse linguistic problems that affect textual quality and thus their understanding by users. Few studies have tried to characterize such problems and their relation to the performance of summarization systems. In this paper, we investigated the problems in multi-document extracts (i.e., summaries produced by concatenating several sentences taken exactly as they appear in the source texts) generated by systems for Brazilian Portuguese that have different approaches (i.e., superficial and deep) and performances (i.e., baseline and state-of-the-art methods). To that end, we first reviewed the main characterization studies, resulting in a typology of linguistic problems more suitable for multi-document summarization. Then, we manually annotated a corpus of automatic multi-document extracts in Portuguese based on the typology, which showed that some linguistic problems are significantly more recurrent than others. Thus, this corpus annotation may support research on the detection and correction of linguistic problems for summary improvement, allowing the production of automatic summaries that are not only informative (i.e., they convey the content of the source material), but also linguistically well structured. |
publishDate |
2021 |
dc.date.none.fl_str_mv |
2021 2022-06-27T12:44:35Z 2022-06-27T12:44:35Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
DIAS, M. de S. et al. An investigation of linguistic problems in automatic multi-document summaries. Revista de Estudos da Linguagem, Belo Horizonte, v. 29, n. 2, p. 859-907, 2021. DOI: 10.17851/2237-2083.29.2.859-907. http://repositorio.ufla.br/jspui/handle/1/50347 |
identifier_str_mv |
DIAS, M. de S. et al. An investigation of linguistic problems in automatic multi-document summaries. Revista de Estudos da Linguagem, Belo Horizonte, v. 29, n. 2, p. 859-907, 2021. DOI: 10.17851/2237-2083.29.2.859-907. |
url |
http://repositorio.ufla.br/jspui/handle/1/50347 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
Attribution 4.0 International http://creativecommons.org/licenses/by/4.0/ info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Attribution 4.0 International http://creativecommons.org/licenses/by/4.0/ |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais (UFMG), Faculdade de Letras (FALE) |
publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais (UFMG), Faculdade de Letras (FALE) |
dc.source.none.fl_str_mv |
Revista de Estudos da Linguagem reponame:Repositório Institucional da UFLA instname:Universidade Federal de Lavras (UFLA) instacron:UFLA |
instname_str |
Universidade Federal de Lavras (UFLA) |
instacron_str |
UFLA |
institution |
UFLA |
reponame_str |
Repositório Institucional da UFLA |
collection |
Repositório Institucional da UFLA |
repository.name.fl_str_mv |
Repositório Institucional da UFLA - Universidade Federal de Lavras (UFLA) |
repository.mail.fl_str_mv |
nivaldo@ufla.br || repositorio.biblioteca@ufla.br |
_version_ |
1815439299363995648 |