Extracting structured information from text to augment knowledge bases
Main author: | SILVA, Johny Moreira da |
---|---|
Publication date: | 2019 |
Document type: | Dissertation (master's thesis) |
Language: | eng |
Source title: | Repositório Institucional da UFPE |
dARK ID: | ark:/64986/001300000dg5t |
Full text: | https://repositorio.ufpe.br/handle/123456789/34145 |
Abstract: | Knowledge graphs (or knowledge bases) allow data to be organized and explored, making it easier for machines to understand and use data semantically. Traditional strategies for knowledge base construction have mostly relied on manual effort or on automatic extraction from structured and semi-structured data. Given the large amount of unstructured information on the Web, new approaches to knowledge base construction and maintenance try to leverage this information to improve the quality and coverage of knowledge graphs. In this work, focusing on the completeness problem of existing knowledge bases, we are interested in extracting missing attributes of entities from unstructured text. For this study, in particular, we use the infoboxes of entities in Wikipedia articles as instances of the knowledge graph and their respective text as the source of unstructured data. More specifically, given Wikipedia articles of entities in a particular domain, the structured information of the entity’s attributes in the infobox is used by a distant supervision strategy to identify sentences that mention those attributes in the text. These labeled sentences are used to train a sequence-based neural network (Bidirectional Long Short-Term Memory or Convolutional Neural Network), which then extracts the attributes from unseen articles. We compared our strategy with two traditional approaches to this problem, Kylin and iPopulator. Our distant supervision model produced a considerable number of positive and negative training examples, yielding representative training examples when compared with the other two systems. Our extraction pipeline also performed better at filling the proposed schema. Overall, the extraction pipeline proposed in this work outperforms the baseline models with an average increase of 0.29 points in F-Score, a significant difference in performance. 
In this work we have proposed a modification of the Distant Supervision paradigm for automatic labeling of training examples and an extraction pipeline that fills a given schema with better performance than the analyzed baseline systems. |
id |
UFPE_5151abc041f938188f1f13cafe61cb80 |
---|---|
oai_identifier_str |
oai:repositorio.ufpe.br:123456789/34145 |
network_acronym_str |
UFPE |
network_name_str |
Repositório Institucional da UFPE |
repository_id_str |
2221 |
spelling |
SILVA, Johny Moreira da; http://lattes.cnpq.br/0022427692093493; http://lattes.cnpq.br/7113249247656195; BARBOSA, Luciano de Andrade; 2019-10-03T18:22:52Z; 2019-10-03T18:22:52Z; 2019-02-25; https://repositorio.ufpe.br/handle/123456789/34145; ark:/64986/001300000dg5t; FACEPE; eng; Universidade Federal de Pernambuco; Programa de Pós-Graduação em Ciência da Computação; UFPE; Brazil; Attribution-NonCommercial-NoDerivs 3.0 Brazil; http://creativecommons.org/licenses/by-nc-nd/3.0/br/; info:eu-repo/semantics/openAccess; Databases; Natural language processing; Extracting structured information from text to augment knowledge bases; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; master's degree; reponame:Repositório Institucional da UFPE; instname:Universidade Federal de Pernambuco (UFPE); instacron:UFPE; THUMBNAIL: DISSERTAÇÃO Johny Moreira da Silva.pdf.jpg (Generated Thumbnail, image/jpeg, 1260); ORIGINAL: DISSERTAÇÃO Johny Moreira da Silva.pdf (application/pdf, 4114482); CC-LICENSE: license_rdf (application/rdf+xml; charset=utf-8, 811); LICENSE: license.txt (text/plain; charset=utf-8, 2310); TEXT: DISSERTAÇÃO Johny Moreira da Silva.pdf.txt (Extracted text, text/plain, 210075); 123456789/34145; 2019-10-25 08:12:26.975; oai:repositorio.ufpe.br:123456789/34145; Repositório Institucional; PUB; https://repositorio.ufpe.br/oai/request; attena@ufpe.br; opendoar:2221; 2019-10-25T11:12:26; Repositório Institucional da UFPE - Universidade Federal de Pernambuco (UFPE); false |
dc.title.pt_BR.fl_str_mv |
Extracting structured information from text to augment knowledge bases |
title |
Extracting structured information from text to augment knowledge bases |
spellingShingle |
Extracting structured information from text to augment knowledge bases; SILVA, Johny Moreira da; Databases; Natural language processing |
title_short |
Extracting structured information from text to augment knowledge bases |
title_full |
Extracting structured information from text to augment knowledge bases |
title_fullStr |
Extracting structured information from text to augment knowledge bases |
title_full_unstemmed |
Extracting structured information from text to augment knowledge bases |
title_sort |
Extracting structured information from text to augment knowledge bases |
author |
SILVA, Johny Moreira da |
author_facet |
SILVA, Johny Moreira da |
author_role |
author |
dc.contributor.authorLattes.pt_BR.fl_str_mv |
http://lattes.cnpq.br/0022427692093493 |
dc.contributor.advisorLattes.pt_BR.fl_str_mv |
http://lattes.cnpq.br/7113249247656195 |
dc.contributor.author.fl_str_mv |
SILVA, Johny Moreira da |
dc.contributor.advisor1.fl_str_mv |
BARBOSA, Luciano de Andrade |
contributor_str_mv |
BARBOSA, Luciano de Andrade |
dc.subject.por.fl_str_mv |
Databases; Natural language processing |
topic |
Databases; Natural language processing |
description |
Knowledge graphs (or knowledge bases) allow data to be organized and explored, making it easier for machines to understand and use data semantically. Traditional strategies for knowledge base construction have mostly relied on manual effort or on automatic extraction from structured and semi-structured data. Given the large amount of unstructured information on the Web, new approaches to knowledge base construction and maintenance try to leverage this information to improve the quality and coverage of knowledge graphs. In this work, focusing on the completeness problem of existing knowledge bases, we are interested in extracting missing attributes of entities from unstructured text. For this study, in particular, we use the infoboxes of entities in Wikipedia articles as instances of the knowledge graph and their respective text as the source of unstructured data. More specifically, given Wikipedia articles of entities in a particular domain, the structured information of the entity’s attributes in the infobox is used by a distant supervision strategy to identify sentences that mention those attributes in the text. These labeled sentences are used to train a sequence-based neural network (Bidirectional Long Short-Term Memory or Convolutional Neural Network), which then extracts the attributes from unseen articles. We compared our strategy with two traditional approaches to this problem, Kylin and iPopulator. Our distant supervision model produced a considerable number of positive and negative training examples, yielding representative training examples when compared with the other two systems. Our extraction pipeline also performed better at filling the proposed schema. Overall, the extraction pipeline proposed in this work outperforms the baseline models with an average increase of 0.29 points in F-Score, a significant difference in performance. 
In this work we have proposed a modification of the Distant Supervision paradigm for automatic labeling of training examples and an extraction pipeline that fills a given schema with better performance than the analyzed baseline systems. |
publishDate |
2019 |
dc.date.accessioned.fl_str_mv |
2019-10-03T18:22:52Z |
dc.date.available.fl_str_mv |
2019-10-03T18:22:52Z |
dc.date.issued.fl_str_mv |
2019-02-25 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://repositorio.ufpe.br/handle/123456789/34145 |
dc.identifier.dark.fl_str_mv |
ark:/64986/001300000dg5t |
url |
https://repositorio.ufpe.br/handle/123456789/34145 |
identifier_str_mv |
ark:/64986/001300000dg5t |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de Pernambuco |
dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Ciência da Computação |
dc.publisher.initials.fl_str_mv |
UFPE |
dc.publisher.country.fl_str_mv |
Brazil |
publisher.none.fl_str_mv |
Universidade Federal de Pernambuco |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFPE instname:Universidade Federal de Pernambuco (UFPE) instacron:UFPE |
instname_str |
Universidade Federal de Pernambuco (UFPE) |
instacron_str |
UFPE |
institution |
UFPE |
reponame_str |
Repositório Institucional da UFPE |
collection |
Repositório Institucional da UFPE |
bitstream.url.fl_str_mv |
https://repositorio.ufpe.br/bitstream/123456789/34145/5/DISSERTA%c3%87%c3%83O%20Johny%20Moreira%20da%20Silva.pdf.jpg https://repositorio.ufpe.br/bitstream/123456789/34145/1/DISSERTA%c3%87%c3%83O%20Johny%20Moreira%20da%20Silva.pdf https://repositorio.ufpe.br/bitstream/123456789/34145/2/license_rdf https://repositorio.ufpe.br/bitstream/123456789/34145/3/license.txt https://repositorio.ufpe.br/bitstream/123456789/34145/4/DISSERTA%c3%87%c3%83O%20Johny%20Moreira%20da%20Silva.pdf.txt |
bitstream.checksum.fl_str_mv |
f2ff31a5f7ee5e0bcb6419a1c4050d28 54d9d4064308022bf06cdb027299287f e39d27027a6cc9cb039ad269a5db8e34 bd573a5ca8288eb7272482765f819534 f49e33f847e1809d24eed1d66ed0e138 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFPE - Universidade Federal de Pernambuco (UFPE) |
repository.mail.fl_str_mv |
attena@ufpe.br |
_version_ |
1815172798863114240 |
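
The distant supervision step described in the abstract, where infobox attribute values are used to find the sentences that mention them, can be illustrated with a minimal sketch. This is not the thesis's actual method (the record says only that it modifies the Distant Supervision paradigm, without giving details). The sketch assumes plain case-insensitive substring matching and a naive sentence splitter, and the function names, the toy infobox, and the toy article are all hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentences(infobox, article_text):
    """Pair each sentence with the infobox attributes whose value it mentions.

    Returns (sentence, attribute) pairs; attribute is None for sentences
    that mention no attribute value (negative training examples).
    Assumption: a case-insensitive substring match counts as a mention.
    """
    labeled = []
    for sentence in split_sentences(article_text):
        lowered = sentence.lower()
        matches = [attr for attr, value in infobox.items()
                   if value.lower() in lowered]
        if matches:
            labeled.extend((sentence, attr) for attr in matches)
        else:
            labeled.append((sentence, None))
    return labeled


# Hypothetical toy data, invented purely for illustration.
infobox = {"birth_place": "Recife", "occupation": "engineer"}
article = ("Ana Lima was born in Recife. "
           "She worked as an engineer for a decade. "
           "She enjoys chess.")
examples = label_sentences(infobox, article)
```

Here the first two sentences become positive examples for `birth_place` and `occupation`, and the third becomes a negative example. In the pipeline the abstract describes, such labeled sentences would then train the sequence-based extractor.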