A solution to extractive summarization based on document type and a new measure for sentence similarity

Bibliographic details
Main author: MELLO, Rafael Ferreira Leite de
Publication date: 2015
Document type: Thesis
Language: por
Source title: Repositório Institucional da UFPE
Full text: https://repositorio.ufpe.br/handle/123456789/15257
Abstract: The Internet is an enormous and fast-growing digital repository encompassing billions of documents that vary in subject, quality, reliability, etc. It is increasingly difficult to find useful information in it. Thus, it is necessary to provide automatic techniques that allow users to save time and resources. Automatic text summarization techniques may offer a way out of this problem. Text summarization (TS) aims to automatically compress one or more documents to present their main ideas in less space. TS platforms receive one or more documents as input and generate a summary. In recent years, a variety of text summarization methods have been proposed. However, because document types differ (for example, news, blogs, and scientific articles), it has become difficult to build a general TS application that creates expressive summaries for each type. Another relevant, related problem is measuring the degree of similarity between sentences, which is used in applications such as text summarization, information retrieval, image retrieval, text categorization, and machine translation. Recent works report several efforts to evaluate sentence similarity by representing sentences as bag-of-words vectors or as trees of the syntactic information among words. However, most of these approaches do not take sentence meaning and word order into consideration. This thesis proposes: (i) a new text summarization solution that identifies the document type before performing the summarization; (ii) a new sentence similarity measure based on lexical, syntactic, and semantic evaluation to deal with the meaning and word order problems. Identifying the document type beforehand allows the summarization solution to select the methods that are most suitable for each type of text. This thesis also performs a detailed assessment of the most widely used text summarization methods to select those that create the most informative summaries for news, blogs, and scientific articles. The proposed sentence similarity measure is completely unsupervised and achieves results similar to those of human annotators on the dataset proposed by Li et al. The proposed measure was satisfactorily applied to evaluate the similarity between summaries and to eliminate redundancy in multi-document summarization.
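The abstract notes that bag-of-words representations ignore word order. The following is a minimal sketch of that point only, not the measure proposed in the thesis: it combines a Jaccard token overlap with a simple pairwise word-order agreement score. The function names, the tokenizer, and the `order_weight` default of 0.3 are all assumptions made for illustration, and the semantic component described in the abstract is left out because it would require an external lexical resource.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def lexical_similarity(a: list[str], b: list[str]) -> float:
    """Jaccard overlap between the two token sets (ignores word order)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def word_order_similarity(a: list[str], b: list[str]) -> float:
    """Fraction of shared-word pairs that keep the same relative order."""
    set_b = set(b)
    # Distinct shared words, in order of first appearance in `a`.
    shared = [w for w in dict.fromkeys(a) if w in set_b]
    if len(shared) < 2:
        return 1.0 if shared else 0.0
    # Position of the first occurrence of each shared word.
    pos_a = {w: a.index(w) for w in shared}
    pos_b = {w: b.index(w) for w in shared}
    pairs = 0
    agree = 0
    for i in range(len(shared)):
        for j in range(i + 1, len(shared)):
            x, y = shared[i], shared[j]
            pairs += 1
            if (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]):
                agree += 1
    return agree / pairs


def sentence_similarity(s1: str, s2: str, order_weight: float = 0.3) -> float:
    """Weighted mix of lexical overlap and word-order agreement.

    `order_weight` is a hypothetical parameter, not a value from the thesis.
    """
    a, b = tokenize(s1), tokenize(s2)
    lexical = lexical_similarity(a, b)
    order = word_order_similarity(a, b)
    return (1 - order_weight) * lexical + order_weight * order


if __name__ == "__main__":
    # Same bag of words, different meaning: the order term lowers the score.
    print(sentence_similarity("The dog bit the man", "The man bit the dog"))
    print(sentence_similarity("The dog bit the man", "The dog bit the man"))
```

With a pure bag-of-words overlap, the two sentences in the first call would be indistinguishable; the word-order term is what separates them in this sketch.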
id UFPE_6d4f6cafdc6d7ab899b953dd54a0aa35
oai_identifier_str oai:repositorio.ufpe.br:123456789/15257
network_acronym_str UFPE
network_name_str Repositório Institucional da UFPE
repository_id_str 2221
spelling Abstract (translated from Portuguese): The number of text documents has grown considerably, driven mainly by the rapid growth of the Internet. There are thousands of news articles, e-books, scientific articles, blogs, etc. Automatic techniques are therefore needed to extract information from this large mass of data. Text summarization can be used to address this problem. Text summarization (TS) creates compressed versions of one or more text documents. In other words, TS platforms receive one or more documents as input and generate a summary of them. In recent years, a large number of summarization techniques have been proposed. However, given the large number of document types (for example, news, blogs, and scientific articles), it is difficult to find a technique generic enough to create summaries efficiently for all types. In addition, another widely studied topic in text mining is the analysis of similarity between sentences. This similarity can be used in applications such as text summarization, information retrieval, image retrieval, text categorization, and translation. In general, the proposed techniques are based on word vectors or syntactic trees, which leaves two problems unaddressed: meaning and word order. This thesis proposes: (i) a new text summarization solution that identifies the document type before performing the summarization; (ii) a new sentence similarity measure based on lexical, syntactic, and semantic analysis. Identifying the document type allows the summarization solution to select the best methods for each type of text. This thesis also carries out a detailed study of summarization methods to select those that create the most informative summaries in the contexts of news, blogs, and scientific articles. The sentence similarity measure is completely unsupervised and achieves results similar to those of human annotators on the dataset proposed by Li et al. The proposed measure was also satisfactorily applied to evaluate the similarity between summaries and to eliminate redundancy in multi-document summarization.
dc.title.pt_BR.fl_str_mv A solution to extractive summarization based on document type and a new measure for sentence similarity
title A solution to extractive summarization based on document type and a new measure for sentence similarity
spellingShingle A solution to extractive summarization based on document type and a new measure for sentence similarity
MELLO, Rafael Ferreira Leite de
Ciência da computação
Inteligência artificial
Mineração de texto
Processamento de linguagem natural
title_short A solution to extractive summarization based on document type and a new measure for sentence similarity
title_full A solution to extractive summarization based on document type and a new measure for sentence similarity
title_fullStr A solution to extractive summarization based on document type and a new measure for sentence similarity
title_full_unstemmed A solution to extractive summarization based on document type and a new measure for sentence similarity
title_sort A solution to extractive summarization based on document type and a new measure for sentence similarity
author MELLO, Rafael Ferreira Leite de
author_facet MELLO, Rafael Ferreira Leite de
author_role author
dc.contributor.authorLattes.pt_BR.fl_str_mv http://lattes.cnpq.br/6190254569597745
dc.contributor.advisorLattes.pt_BR.fl_str_mv http://lattes.cnpq.br/6195215666638965
dc.contributor.author.fl_str_mv MELLO, Rafael Ferreira Leite de
dc.contributor.advisor1.fl_str_mv FREITAS, Frederico Gonçalves de
dc.contributor.advisor-co1.fl_str_mv LINS, Rafael Dueire
contributor_str_mv FREITAS, Frederico Gonçalves de
LINS, Rafael Dueire
dc.subject.por.fl_str_mv Ciência da computação
Inteligência artificial
Mineração de texto
Processamento de linguagem natural
topic Ciência da computação
Inteligência artificial
Mineração de texto
Processamento de linguagem natural
description The Internet is an enormous and fast-growing digital repository encompassing billions of documents that vary in subject, quality, reliability, etc. It is increasingly difficult to find useful information in it. Thus, it is necessary to provide automatic techniques that allow users to save time and resources. Automatic text summarization techniques may offer a way out of this problem. Text summarization (TS) aims to automatically compress one or more documents to present their main ideas in less space. TS platforms receive one or more documents as input and generate a summary. In recent years, a variety of text summarization methods have been proposed. However, because document types differ (for example, news, blogs, and scientific articles), it has become difficult to build a general TS application that creates expressive summaries for each type. Another relevant, related problem is measuring the degree of similarity between sentences, which is used in applications such as text summarization, information retrieval, image retrieval, text categorization, and machine translation. Recent works report several efforts to evaluate sentence similarity by representing sentences as bag-of-words vectors or as trees of the syntactic information among words. However, most of these approaches do not take sentence meaning and word order into consideration. This thesis proposes: (i) a new text summarization solution that identifies the document type before performing the summarization; (ii) a new sentence similarity measure based on lexical, syntactic, and semantic evaluation to deal with the meaning and word order problems. Identifying the document type beforehand allows the summarization solution to select the methods that are most suitable for each type of text. This thesis also performs a detailed assessment of the most widely used text summarization methods to select those that create the most informative summaries for news, blogs, and scientific articles. The proposed sentence similarity measure is completely unsupervised and achieves results similar to those of human annotators on the dataset proposed by Li et al. The proposed measure was satisfactorily applied to evaluate the similarity between summaries and to eliminate redundancy in multi-document summarization.
publishDate 2015
dc.date.issued.fl_str_mv 2015-03-20
dc.date.accessioned.fl_str_mv 2016-02-19T18:25:04Z
dc.date.available.fl_str_mv 2016-02-19T18:25:04Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://repositorio.ufpe.br/handle/123456789/15257
url https://repositorio.ufpe.br/handle/123456789/15257
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv Attribution-NonCommercial-NoDerivs 3.0 Brazil
http://creativecommons.org/licenses/by-nc-nd/3.0/br/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Attribution-NonCommercial-NoDerivs 3.0 Brazil
http://creativecommons.org/licenses/by-nc-nd/3.0/br/
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv UNIVERSIDADE FEDERAL DE PERNAMBUCO
dc.publisher.program.fl_str_mv Programa de Pos Graduacao em Ciencia da Computacao
dc.publisher.initials.fl_str_mv UFPE
dc.publisher.country.fl_str_mv Brasil
publisher.none.fl_str_mv UNIVERSIDADE FEDERAL DE PERNAMBUCO
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFPE
instname:Universidade Federal de Pernambuco (UFPE)
instacron:UFPE
instname_str Universidade Federal de Pernambuco (UFPE)
instacron_str UFPE
institution UFPE
reponame_str Repositório Institucional da UFPE
collection Repositório Institucional da UFPE
bitstream.url.fl_str_mv https://repositorio.ufpe.br/bitstream/123456789/15257/5/TESE%20Rafael%20Ferreira%20Leite%20de%20Mello.pdf.jpg
https://repositorio.ufpe.br/bitstream/123456789/15257/1/TESE%20Rafael%20Ferreira%20Leite%20de%20Mello.pdf
https://repositorio.ufpe.br/bitstream/123456789/15257/2/license_rdf
https://repositorio.ufpe.br/bitstream/123456789/15257/3/license.txt
https://repositorio.ufpe.br/bitstream/123456789/15257/4/TESE%20Rafael%20Ferreira%20Leite%20de%20Mello.pdf.txt
bitstream.checksum.fl_str_mv a150a111f3c8cc25df2048b1d03c1f96
4d54a6ef5e3c40f8bce57e3cc957a8f4
66e71c371cc565284e70f40736c94386
4b8a02c7f2818eaf00dcf2260dd5eb08
b45aad5f56a48929697aa0c0ff1e89f7
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFPE - Universidade Federal de Pernambuco (UFPE)
repository.mail.fl_str_mv attena@ufpe.br
_version_ 1802310865810096128