Aplicação de conhecimento léxico-conceitual na Sumarização Automática Multidocumento

Luca, Rejeane Cassia de

Bibliographic details
Main author:	Luca, Rejeane Cassia de
Publication date:	2019
Document type:	Master's thesis (Dissertação)
Language:	Portuguese (por)
Source title:	Repositório Institucional da UFSCAR
Full text:	https://repositorio.ufscar.br/handle/ufscar/11163
Abstract:	Automatic Multi-document Summarization (MDS) aims to automatically create a single summary from a collection of texts on the same topic, offering an alternative way to deal with the massive amount of information on the web. Since such a summary is often an extract (i.e., a summary composed of unchanged excerpts from the source texts that convey the main idea of the collection), the most important sentences of the collection must be selected. Sentence selection methods may be superficial (linguistic or statistical), deep linguistic, or hybrid. Despite being less robust and more expensive, deep methods produce extracts that are both more informative and of higher linguistic quality. Considering the promising results of lexical-conceptual methods in incipient MDS research and in multilingual MDS, we investigated 4 methods for monolingual MDS in Portuguese, all of which base content selection on the frequency of lexical concepts in the cluster. We selected CSTNews, a reference multi-document corpus in Portuguese, in which the verbs and 10% of the most frequent nouns are annotated with their corresponding synsets from Princeton WordNet. Specifically, we selected 5 of the 50 clusters in CSTNews and extended the conceptual annotation to all nouns. We then applied 4 methods to the 5 clusters: (i) LCFSummN, based on the simple frequency of nominal concepts in the cluster; (ii) LCFSummN-V, based on the simple frequency of nominal and verbal concepts in the cluster; (iii) LCFSummN-pond, based on the weighted-average frequency of nominal concepts; and (iv) LCFSummN-V-pond, based on the weighted-average frequency of nominal and verbal concepts. We intrinsically evaluated the extracts generated by each method for linguistic quality and informativeness. Compared with a state-of-the-art deep MDS method for Portuguese, the lexical-conceptual methods perform well.
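The simple-frequency idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the record does not give the scoring formula, so the sketch assumes that a sentence's score is the sum of the cluster-wide frequencies of its annotated concepts and that sentences are picked greedily under a word budget. The function names, the dictionary layout, and the concept labels are all hypothetical, and the weighted-average variants are not attempted because their weights are not described here.

```python
from collections import Counter


def concept_frequencies(sentences):
    """Count how often each concept label occurs across the whole cluster."""
    return Counter(c for s in sentences for c in s["concepts"])


def score_sentence(sentence, freq):
    """Assumed scoring rule: sum of the cluster frequencies of the sentence's concepts."""
    return sum(freq[c] for c in sentence["concepts"])


def select_extract(sentences, max_words):
    """Greedily pick the highest-scoring sentences that fit in the word budget."""
    freq = concept_frequencies(sentences)
    # sorted() is stable, so tied sentences keep their original order.
    ranked = sorted(sentences, key=lambda s: score_sentence(s, freq), reverse=True)
    extract, used = [], 0
    for s in ranked:
        n_words = len(s["text"].split())
        if used + n_words <= max_words:
            extract.append(s["text"])
            used += n_words
    return extract


if __name__ == "__main__":
    # Toy cluster; the concept labels stand in for WordNet synsets (hypothetical).
    cluster = [
        {"text": "Floods hit the city on Monday", "concepts": ["flood.n", "city.n"]},
        {"text": "The mayor spoke", "concepts": ["mayor.n"]},
        {"text": "The city faced floods again", "concepts": ["city.n", "flood.n"]},
    ]
    for line in select_extract(cluster, max_words=11):
        print(line)
```

Under these assumptions, restricting the `concepts` lists to nouns corresponds to the nominal-only setting, while including verbal concepts corresponds to the noun-and-verb setting.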

Item metadata

id	SCAR_cc91bddb93632cf83723d8bfde1edea0
oai_identifier_str	oai:repositorio.ufscar.br:ufscar/11163
network_acronym_str	SCAR
network_name_str	Repositório Institucional da UFSCAR
repository_id_str	4322
spelling	In Automatic Multi-document Summarization (MDS; SAM in Portuguese), a single summary is automatically produced from a collection of texts from different sources on the same topic, in order to ease access to information. Such summaries are usually informative extracts (that is, summaries composed of unaltered excerpts from the source texts that convey the main idea of the collection), which requires selecting the most important sentences. To that end, one can use superficial linguistic (or statistical) knowledge, deep knowledge, or a hybrid of the two. Deep methods, although more expensive and less robust, produce extracts that are more informative and of higher linguistic quality. Given the promising results of deep lexical-conceptual knowledge in incipient research on MDS and in multilingual MDS, 4 distinct methods were investigated for monolingual MDS in Portuguese, all of which rely primarily on the frequency of lexical concepts in the collection for content selection. For this purpose, CSTNews was selected, a reference multi-document corpus for Portuguese in which all verbs and 10% of the most frequent nouns are annotated with concepts (synsets) from Princeton WordNet. To apply the lexical-conceptual methods, 5 of the 50 CSTNews collections were selected and their noun annotation was completed. Using the 5 collections with nouns and verbs annotated at the conceptual level, the 4 methods were tested: (i) LCFSummN, based on the frequency of nouns in the collection; (ii) LCFSummN-V, based on the combined frequency of nouns and verbs; (iii) LCFSummN-pond, based on the weighted average of noun frequency; and (iv) LCFSummN-V-pond, based on the weighted average of noun and verb frequency. The generated extracts were intrinsically evaluated for linguistic quality and informativeness. Compared with a state-of-the-art deep method for monolingual MDS in Portuguese, the results show that the lexical-conceptual methods perform well. No funding was received.
dc.title.por.fl_str_mv	Aplicação de conhecimento léxico-conceitual na Sumarização Automática Multidocumento
title	Aplicação de conhecimento léxico-conceitual na Sumarização Automática Multidocumento
spellingShingle	Aplicação de conhecimento léxico-conceitual na Sumarização Automática Multidocumento Luca, Rejeane Cassia de Sumarização Automática Multidocumento Conhecimento léxico-conceitual Processamento Automático de Linguas Naturais LINGUISTICA, LETRAS E ARTES::LINGUISTICA
title_short	Aplicação de conhecimento léxico-conceitual na Sumarização Automática Multidocumento
title_full	Aplicação de conhecimento léxico-conceitual na Sumarização Automática Multidocumento
title_fullStr	Aplicação de conhecimento léxico-conceitual na Sumarização Automática Multidocumento
title_full_unstemmed	Aplicação de conhecimento léxico-conceitual na Sumarização Automática Multidocumento
title_sort	Aplicação de conhecimento léxico-conceitual na Sumarização Automática Multidocumento
author	Luca, Rejeane Cassia de
author_facet	Luca, Rejeane Cassia de
author_role	author
dc.contributor.authorlattes.por.fl_str_mv	http://lattes.cnpq.br/1599276853975000
dc.contributor.author.fl_str_mv	Luca, Rejeane Cassia de
dc.contributor.advisor1.fl_str_mv	Di Felippo, Ariani
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/8648412103197455
dc.contributor.authorID.fl_str_mv	fadf5154-9d68-4164-a98a-646e1c6336fe
contributor_str_mv	Di Felippo, Ariani
dc.subject.por.fl_str_mv	Automatic Multi-document Summarization; Lexical-conceptual knowledge; Natural Language Processing
topic	Automatic Multi-document Summarization; Lexical-conceptual knowledge; Natural Language Processing; LINGUISTICA, LETRAS E ARTES::LINGUISTICA
dc.subject.cnpq.fl_str_mv	LINGUISTICA, LETRAS E ARTES::LINGUISTICA
publishDate	2019
dc.date.accessioned.fl_str_mv	2019-03-28T17:44:54Z
dc.date.available.fl_str_mv	2019-03-28T17:44:54Z
dc.date.issued.fl_str_mv	2019-02-28
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	LUCA, Rejeane Cassia de. Aplicação de conhecimento léxico-conceitual na Sumarização Automática Multidocumento. 2019. Dissertação (Mestrado em Linguística) – Universidade Federal de São Carlos, São Carlos, 2019. Disponível em: https://repositorio.ufscar.br/handle/ufscar/11163.
dc.identifier.uri.fl_str_mv	https://repositorio.ufscar.br/handle/ufscar/11163
identifier_str_mv	LUCA, Rejeane Cassia de. Aplicação de conhecimento léxico-conceitual na Sumarização Automática Multidocumento. 2019. Dissertação (Mestrado em Linguística) – Universidade Federal de São Carlos, São Carlos, 2019. Disponível em: https://repositorio.ufscar.br/handle/ufscar/11163.
url	https://repositorio.ufscar.br/handle/ufscar/11163
dc.language.iso.fl_str_mv	por
language	por
dc.relation.confidence.fl_str_mv	600 600
dc.relation.authority.fl_str_mv	26c5db60-6612-41e6-a8f9-f94fb475ca58
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de São Carlos Câmpus São Carlos
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Linguística - PPGL
dc.publisher.initials.fl_str_mv	UFSCar
publisher.none.fl_str_mv	Universidade Federal de São Carlos Câmpus São Carlos
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR
instname_str	Universidade Federal de São Carlos (UFSCAR)
instacron_str	UFSCAR
institution	UFSCAR
reponame_str	Repositório Institucional da UFSCAR
collection	Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv	https://repositorio.ufscar.br/bitstream/ufscar/11163/1/Rejeane%20C.%20de%20Luca%20-%20Disserta%c3%a7%c3%a3o%20Final.pdf https://repositorio.ufscar.br/bitstream/ufscar/11163/3/license.txt https://repositorio.ufscar.br/bitstream/ufscar/11163/4/Rejeane%20C.%20de%20Luca%20-%20Disserta%c3%a7%c3%a3o%20Final.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/11163/5/Rejeane%20C.%20de%20Luca%20-%20Disserta%c3%a7%c3%a3o%20Final.pdf.jpg
bitstream.checksum.fl_str_mv	74642d79e9571b2cab82fc117154c6b3 ae0398b6f8b235e40ad82cba6c50031d d195fb16a1461547ae25403d8d64c585 5d6372bd56bc31a85ef3c9ed642609ae
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
repository.mail.fl_str_mv
_version_	1813715601977245696
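
The MD5 values in bitstream.checksum.fl_str_mv can be used to check a downloaded file. The following is a minimal Python sketch using the standard hashlib module. It assumes the checksums are listed in the same order as the URLs in bitstream.url.fl_str_mv, so that the first value would belong to the PDF; the function names and the local file name in the usage note are hypothetical.

```python
import hashlib


def md5_hex(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 digest with the value listed in the record."""
    return md5_hex(path) == expected_hex.strip().lower()
```

For example, after saving the PDF locally under a name such as thesis.pdf, calling matches_checksum("thesis.pdf", "74642d79e9571b2cab82fc117154c6b3") should return True if the download is intact and the ordering assumption holds.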
