Investigação de estratégias de sumarização humana multidocumento
Main author: | Camargo, Renata Tironi de |
---|---|
Publication date: | 2013 |
Document type: | Master's thesis |
Language: | por |
Source title: | Repositório Institucional da UFSCAR |
Full text: | https://repositorio.ufscar.br/handle/ufscar/5781 |
Abstract: | Multi-document human summarization (MHS), the manual production of a summary from a collection of texts from different sources on the same subject, is a little-explored linguistic task. Given that single-document summaries comprise information with recurrent features that can reveal summarization strategies, we investigated multi-document summaries in order to identify MHS strategies. To identify these strategies, the source-text sentences of the CSTNews corpus (CARDOSO et al., 2011) were manually aligned to their human summaries. The corpus has 50 clusters of news texts and their multi-document summaries in Portuguese. The alignment thus revealed the origin of the information selected to compose the summaries. To determine whether the selected information shows recurrent features, the aligned and non-aligned sentences were semi-automatically characterized using a set of linguistic attributes identified in related work. These attributes encode the content selection strategies of single-document summarization and the clues about MHS. Through manual analysis of the characterizations of the aligned and non-aligned sentences, we found that the selected sentences commonly share certain attributes, such as sentence location in the text and redundancy. This observation was confirmed by a set of formal rules learned by a Machine Learning (ML) algorithm from the same characterizations; these rules thus encode MHS strategies. When the rules were learned and tested on CSTNews, the precision was 71.25%. To assess the relevance of the rules, we performed 2 intrinsic evaluations: (i) verification of the occurrence of the same strategies in another corpus, and (ii) comparison of the quality of summaries produced by the MHS strategies with the quality of summaries produced by different strategies. In evaluation (i), performed automatically by ML, the rules learned from CSTNews were tested on a different news corpus and achieved a precision of 70%, very close to the precision obtained on the training corpus (CSTNews). In evaluation (ii), the quality, manually assessed by 10 computational linguists, was judged better than that of the comparison summaries. Besides describing features of multi-document summaries, this work has the potential to support multi-document automatic summarization and make it more linguistically motivated. That task consists of automatically generating multi-document summaries and has so far relied on adapting strategies identified in single-document summarization or on unconfirmed clues about MHS. Based on this work, content selection in multi-document automatic summarization methods can rely on strategies systematically identified in MHS. |
id |
SCAR_6637bc5db024ee254f3183eedccc6a2a |
---|---|
oai_identifier_str |
oai:repositorio.ufscar.br:ufscar/5781 |
network_acronym_str |
SCAR |
network_name_str |
Repositório Institucional da UFSCAR |
repository_id_str |
4322 |
dc.title.por.fl_str_mv |
Investigação de estratégias de sumarização humana multidocumento |
title |
Investigação de estratégias de sumarização humana multidocumento |
spellingShingle |
Investigação de estratégias de sumarização humana multidocumento Camargo, Renata Tironi de Linguística Sumarização automática Sumarização humana multidocumento Estratégias de seleção de conteúdo Multi-document human summarization Content selection strategy Multidocument automatic summarization LINGUISTICA, LETRAS E ARTES::LINGUISTICA |
author |
Camargo, Renata Tironi de |
author_facet |
Camargo, Renata Tironi de |
author_role |
author |
dc.contributor.authorlattes.por.fl_str_mv |
http://lattes.cnpq.br/4011327590298193 |
dc.contributor.author.fl_str_mv |
Camargo, Renata Tironi de |
dc.contributor.advisor1.fl_str_mv |
Di Felippo, Ariani |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/8648412103197455 |
dc.contributor.authorID.fl_str_mv |
d909624a-1764-42a3-b7dc-18aa87f808b6 |
contributor_str_mv |
Di Felippo, Ariani |
dc.subject.por.fl_str_mv |
Linguística; Sumarização automática; Sumarização humana multidocumento; Estratégias de seleção de conteúdo |
topic |
Linguística; Sumarização automática; Sumarização humana multidocumento; Estratégias de seleção de conteúdo; Multi-document human summarization; Content selection strategy; Multidocument automatic summarization; LINGUISTICA, LETRAS E ARTES::LINGUISTICA |
dc.subject.eng.fl_str_mv |
Multi-document human summarization; Content selection strategy; Multidocument automatic summarization |
dc.subject.cnpq.fl_str_mv |
LINGUISTICA, LETRAS E ARTES::LINGUISTICA |
publishDate |
2013 |
dc.date.available.fl_str_mv |
2013-11-28 2016-06-02T20:25:21Z |
dc.date.issued.fl_str_mv |
2013-08-30 |
dc.date.accessioned.fl_str_mv |
2016-06-02T20:25:21Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
CAMARGO, Renata Tironi de. Investigação de estratégias de sumarização humana multidocumento. 2013. 135 f. Dissertação (Mestrado em Ciências Humanas) - Universidade Federal de São Carlos, São Carlos, 2013. |
dc.identifier.uri.fl_str_mv |
https://repositorio.ufscar.br/handle/ufscar/5781 |
identifier_str_mv |
CAMARGO, Renata Tironi de. Investigação de estratégias de sumarização humana multidocumento. 2013. 135 f. Dissertação (Mestrado em Ciências Humanas) - Universidade Federal de São Carlos, São Carlos, 2013. |
url |
https://repositorio.ufscar.br/handle/ufscar/5781 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.relation.confidence.fl_str_mv |
-1 -1 |
dc.relation.authority.fl_str_mv |
26c5db60-6612-41e6-a8f9-f94fb475ca58 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Federal de São Carlos |
dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Linguística - PPGL |
dc.publisher.initials.fl_str_mv |
UFSCar |
dc.publisher.country.fl_str_mv |
BR |
publisher.none.fl_str_mv |
Universidade Federal de São Carlos |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR |
instname_str |
Universidade Federal de São Carlos (UFSCAR) |
instacron_str |
UFSCAR |
institution |
UFSCAR |
reponame_str |
Repositório Institucional da UFSCAR |
collection |
Repositório Institucional da UFSCAR |
bitstream.url.fl_str_mv |
https://repositorio.ufscar.br/bitstream/ufscar/5781/1/5583.pdf https://repositorio.ufscar.br/bitstream/ufscar/5781/2/5583.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/5781/3/5583.pdf.jpg |
bitstream.checksum.fl_str_mv |
9508776d3397fc5a516393218f88c50f d41d8cd98f00b204e9800998ecf8427e dfe5c59126815eed3984204d11255b46 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR) |
repository.mail.fl_str_mv |
|
_version_ |
1813715546541129728 |