An investigation of linguistic problems in automatic multi-document summaries

Bibliographic details
Main author: Dias, Márcio de Souza
Publication date: 2021
Other authors: Di Felippo, Ariani; Rassi, Amanda Pontes; Cardoso, Paula Christina Figueira; Nóbrega, Fernando Antônio Asevedo; Pardo, Thiago Alexandre Salgueiro
Document type: Article
Language: eng
Source title: Repositório Institucional da UFLA
Full text: http://repositorio.ufla.br/jspui/handle/1/50347
Abstract: Automatic summaries commonly present diverse linguistic problems that affect textual quality and thus their understanding by users. Few studies have tried to characterize such problems and their relation to the performance of summarization systems. In this paper, we investigated the problems in multi-document extracts (i.e., summaries produced by concatenating several sentences taken exactly as they appear in the source texts) generated by systems for Brazilian Portuguese that have different approaches (i.e., superficial and deep) and performances (i.e., baseline and state-of-the-art methods). To that end, we first reviewed the main characterization studies, resulting in a typology of linguistic problems more suitable for multi-document summarization. Then, we manually annotated a corpus of automatic multi-document extracts in Portuguese based on the typology, which showed that some linguistic problems are significantly more recurrent than others. Thus, this corpus annotation may support research on the detection and correction of linguistic problems for summary improvement, allowing the production of automatic summaries that are not only informative (i.e., they convey the content of the source material), but also linguistically well structured.
id UFLA_64be6458626eee6da9d56d1af9cdd4b4
oai_identifier_str oai:localhost:1/50347
dc.title.none.fl_str_mv An investigation of linguistic problems in automatic multi-document summaries
Uma investigação de problemas linguísticos em sumários automáticos multidocumento
dc.contributor.author.fl_str_mv Dias, Márcio de Souza
Di Felippo, Ariani
Rassi, Amanda Pontes
Cardoso, Paula Christina Figueira
Nóbrega, Fernando Antônio Asevedo
Pardo, Thiago Alexandre Salgueiro
dc.subject.por.fl_str_mv Automatic summarization
Multi-document summary
Linguistic problem
Corpus annotation
Sumarização automática
Sumário multidocumento
Problema linguístico
Anotação de corpus
publishDate 2021
dc.date.none.fl_str_mv 2021
2022-06-27T12:44:35Z
2022-06-27T12:44:35Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.identifier.uri.fl_str_mv DIAS, M. de S. et al. An investigation of linguistic problems in automatic multi-document summaries. Revista de Estudos da Linguagem, Belo Horizonte, v. 29, n. 2, p. 859-907, 2021. DOI: 10.17851/2237-2083.29.2.859-907.
http://repositorio.ufla.br/jspui/handle/1/50347
dc.language.iso.fl_str_mv eng
dc.rights.driver.fl_str_mv Attribution 4.0 International
http://creativecommons.org/licenses/by/4.0/
info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais (UFMG), Faculdade de Letras (FALE)
dc.source.none.fl_str_mv Revista de Estudos da Linguagem
reponame:Repositório Institucional da UFLA
instname:Universidade Federal de Lavras (UFLA)
instacron:UFLA
repository.mail.fl_str_mv nivaldo@ufla.br || repositorio.biblioteca@ufla.br