Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue

Tosta, Fabricio Elder da Silva

Bibliographic details
Main author:	Tosta, Fabricio Elder da Silva
Publication date:	2014
Document type:	Master's dissertation
Language:	por
Source title:	Repositório Institucional da UFSCAR
Full text:	https://repositorio.ufscar.br/handle/ufscar/5796
Abstract:	Traditionally, Multilingual Multi-document Automatic Summarization (MMAS) is a computational application that, from a collection of source texts on the same topic in at least two languages, produces an informative and generic summary (extract) in one of these languages. The simplest methods machine-translate the source texts and then apply content selection strategies, based on shallow and/or deep linguistic knowledge, to the resulting monolingual collection. MMAS applications therefore need not only to identify the main information of the collection while avoiding redundancy, but also to handle the problems caused by full machine translation (MT) of the source texts. Looking for alternatives to this traditional scenario, we investigated two methods (Methods 1 and 2) that, by relying on deep linguistic knowledge at the lexical-conceptual level, avoid full MT of the source texts and generate informative and cohesive/coherent summaries. In both methods, content selection starts by scoring and ranking the original sentences according to the frequency of occurrence in the collection of the concepts expressed by their common nouns. In Method 1, only the highest-scoring, mutually non-redundant sentences in the user's language are selected to compose the extract, until the compression rate is reached. In Method 2, the best-ranked, mutually non-redundant original sentences are selected without privileging the user's language; selected sentences that are not in the user's language are automatically translated. To produce automatic summaries according to Methods 1 and 2 and subsequently evaluate them, the CM2News corpus was built. The corpus has 20 collections of news texts, each consisting of 1 original text in English and 1 original text in Portuguese on the same topic.
The common nouns of CM2News were identified through morphosyntactic annotation and then semi-automatically annotated with Princeton WordNet concepts using the MulSen graphic editor, which was developed specifically for the task. To produce extracts according to Method 1, only the best-ranked sentences in Portuguese were selected until the compression rate was reached. To produce extracts according to Method 2, the best-ranked sentences were selected without privileging the user's language; any English sentences selected were automatically translated into Portuguese by the Bing translator. Methods 1 and 2 were evaluated intrinsically with respect to the linguistic quality and informativeness of the summaries. To evaluate linguistic quality, 15 computational linguists manually analyzed the grammaticality, non-redundancy, referential clarity, focus, and structure/coherence of the summaries; to evaluate informativeness, the summaries were automatically compared to reference summaries using ROUGE measures. In both evaluations, Method 1 performed better, which might be explained by the fact that its sentences were selected from a single source text. Furthermore, both methods based on lexical-conceptual knowledge outperformed simpler MMAS methods that rely on full MT of the source texts. Finally, besides the promising results on the application of lexical-conceptual knowledge, this work produced important resources and tools for MMAS, such as the CM2News corpus and the MulSen editor.
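
The content selection described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the `Sentence` class, the sum-of-frequencies score, the Jaccard-overlap redundancy test with its 0.5 threshold, the word-count budget standing in for the compression rate, and the `translate` callback are all assumptions introduced here for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Sentence:
    text: str
    lang: str            # language code, e.g. "pt" or "en"
    concepts: frozenset  # concept identifiers annotated on the sentence's common nouns


def rank(sentences: List[Sentence]) -> List[Sentence]:
    """Rank sentences by the collection-wide frequency of their concepts.

    Assumption: a sentence's score is the sum of the frequencies of its concepts.
    """
    freq = Counter(c for s in sentences for c in s.concepts)
    return sorted(sentences, key=lambda s: sum(freq[c] for c in s.concepts), reverse=True)


def redundant(candidate: Sentence, chosen: List[Sentence], threshold: float = 0.5) -> bool:
    """Assumption: redundancy is a Jaccard overlap of concept sets at or above `threshold`."""
    for s in chosen:
        union = candidate.concepts | s.concepts
        if union and len(candidate.concepts & s.concepts) / len(union) >= threshold:
            return True
    return False


def select(
    sentences: List[Sentence],
    budget_words: int,
    user_lang: str,
    restrict_to_user_lang: bool,
    translate: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """restrict_to_user_lang=True mimics Method 1; False mimics Method 2."""
    chosen: List[Sentence] = []
    used = 0
    for s in rank(sentences):
        if restrict_to_user_lang and s.lang != user_lang:
            continue  # Method 1: only sentences in the user's language
        if redundant(s, chosen):
            continue
        n_words = len(s.text.split())
        if used + n_words > budget_words:
            break  # stop once the (assumed) word budget would be exceeded
        chosen.append(s)
        used += n_words
    # Method 2: translate only the selected sentences that are not in the user's language
    return [
        translate(s.text) if (s.lang != user_lang and translate is not None) else s.text
        for s in chosen
    ]
```

In both modes, ranking uses concept frequencies from the whole bilingual collection; the only difference is whether sentences in the other language may be selected and then translated individually.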

Item metadata

id	SCAR_0c91503aaa5c135dc7aa1ac498d0a98a
oai_identifier_str	oai:repositorio.ufscar.br:ufscar/5796
network_acronym_str	SCAR
network_name_str	Repositório Institucional da UFSCAR
repository_id_str	4322
spelling	Financiadora de Estudos e Projetos; 6554.pdf (application/pdf, 2657931 bytes); 6554.pdf.txt (text/plain, 0 bytes); 6554.pdf.jpg (image/jpeg, 10529 bytes); https://repositorio.ufscar.br/oai/request; opendoar:4322; 2023-09-18T18:31:37
dc.title.por.fl_str_mv	Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue
title	Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue
spellingShingle	Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue Tosta, Fabricio Elder da Silva Linguística Sumarização automática Sumarização multidocumento multilíngue Conhecimento léxico-conceitual Estratégias de seleção de conteúdo Multilingual multi-document automatic summarization Lexical-conceptual knowledge Content selection LINGUISTICA, LETRAS E ARTES::LINGUISTICA
title_short	Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue
title_full	Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue
title_fullStr	Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue
title_full_unstemmed	Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue
title_sort	Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue
author	Tosta, Fabricio Elder da Silva
author_facet	Tosta, Fabricio Elder da Silva
author_role	author
dc.contributor.authorlattes.por.fl_str_mv	http://lattes.cnpq.br/0011930854854466
dc.contributor.author.fl_str_mv	Tosta, Fabricio Elder da Silva
dc.contributor.advisor1.fl_str_mv	Di Felippo, Ariani
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/8648412103197455
dc.contributor.authorID.fl_str_mv	5560c6dd-11c3-4c32-9116-997b793ac9fa
contributor_str_mv	Di Felippo, Ariani
dc.subject.por.fl_str_mv	Linguística Sumarização automática Sumarização multidocumento multilíngue Conhecimento léxico-conceitual Estratégias de seleção de conteúdo
topic	Linguística Sumarização automática Sumarização multidocumento multilíngue Conhecimento léxico-conceitual Estratégias de seleção de conteúdo Multilingual multi-document automatic summarization Lexical-conceptual knowledge Content selection LINGUISTICA, LETRAS E ARTES::LINGUISTICA
dc.subject.eng.fl_str_mv	Multilingual multi-document automatic summarization Lexical-conceptual knowledge Content selection
dc.subject.cnpq.fl_str_mv	LINGUISTICA, LETRAS E ARTES::LINGUISTICA
publishDate	2014
dc.date.issued.fl_str_mv	2014-02-27
dc.date.available.fl_str_mv	2015-03-11 2016-06-02T20:25:23Z
dc.date.accessioned.fl_str_mv	2016-06-02T20:25:23Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	TOSTA, Fabricio Elder da Silva. Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue. 2014. 119 f. Dissertação (Mestrado em Ciências Humanas) - Universidade Federal de São Carlos, São Carlos, 2014.
dc.identifier.uri.fl_str_mv	https://repositorio.ufscar.br/handle/ufscar/5796
identifier_str_mv	TOSTA, Fabricio Elder da Silva. Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue. 2014. 119 f. Dissertação (Mestrado em Ciências Humanas) - Universidade Federal de São Carlos, São Carlos, 2014.
url	https://repositorio.ufscar.br/handle/ufscar/5796
dc.language.iso.fl_str_mv	por
language	por
dc.relation.confidence.fl_str_mv	-1 -1
dc.relation.authority.fl_str_mv	26c5db60-6612-41e6-a8f9-f94fb475ca58
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Universidade Federal de São Carlos
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Linguística - PPGL
dc.publisher.initials.fl_str_mv	UFSCar
dc.publisher.country.fl_str_mv	BR
publisher.none.fl_str_mv	Universidade Federal de São Carlos
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR
instname_str	Universidade Federal de São Carlos (UFSCAR)
instacron_str	UFSCAR
institution	UFSCAR
reponame_str	Repositório Institucional da UFSCAR
collection	Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv	https://repositorio.ufscar.br/bitstream/ufscar/5796/1/6554.pdf https://repositorio.ufscar.br/bitstream/ufscar/5796/2/6554.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/5796/3/6554.pdf.jpg
bitstream.checksum.fl_str_mv	11403ad2acdeafd11148154c92757f20 d41d8cd98f00b204e9800998ecf8427e 794fa079bc26e1ccf4d04bc80944b92d
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
repository.mail.fl_str_mv
_version_	1813715546589364224
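
The MD5 values in `bitstream.checksum.fl_str_mv` can be used to check a downloaded copy of a bitstream. The sketch below uses only Python's standard `hashlib`; the helper names `md5_hex` and `file_matches` are hypothetical and not part of the repository software.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def file_matches(path: str, expected_md5: str, chunk_size: int = 65536) -> bool:
    """Hash a local file in chunks and compare it with a checksum from the record."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

The second checksum listed, `d41d8cd98f00b204e9800998ecf8427e`, is the MD5 digest of empty input, which suggests that the extracted-text bitstream (`6554.pdf.txt`) is empty.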
