Embedded representations for item descriptions in unsupervised tasks

Pedro Paulo Valadares Brum

Bibliographic details
Main author:	Pedro Paulo Valadares Brum
Publication date:	2021
Document type:	Master's thesis
Language:	eng
Source title:	Repositório Institucional da UFMG
Full text:	http://hdl.handle.net/1843/42299
Abstract:	Most machine learning algorithms require a fixed-size vector as input. This makes text representation a challenging area of Natural Language Processing (NLP), and its results depend heavily on the target application. In NLP tasks, this fixed-size vector usually represents a sentence or a paragraph. However, building text representations that capture semantic and context-specific information is not a simple task. In this work, we propose a methodology to solve a real-world problem: identifying unique objects of public procurement in the databases of the Federal Public Ministry of Minas Gerais. This scenario poses challenges beyond those commonly known in the text representation area, since we want to group descriptions of products or services. These descriptions generally do not follow the grammatical structure of a Portuguese sentence: they consist mostly of nouns, adjectives, and quantities, the latter giving either the number of items purchased or contracted or the unit of measure that describes the item. Within the proposed framework, we emphasize the text representation problem for unsupervised algorithms. We propose a simple information extraction strategy to improve the quality of sentence vectors, focusing on specific terms such as numbers and nouns, and present a modification of the siamese BERT network (Sentence-BERT) that can be used in an unsupervised way to generate embeddings carrying semantic and syntactic information from the descriptions. We also identify numerical terms and measurement units as the two main components in this context, and show that a simple method of standardizing numbers has a significant effect on the results. Experimental results show that the proposed framework improves on state-of-the-art methods.
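
The abstract reports that standardizing numbers and measurement units has a significant effect on the results, but it does not describe the procedure itself. The following is only a minimal sketch of what such a standardization step might look like, not the method used in the thesis. The alias table `UNIT_ALIASES`, the function name `standardize`, and the example descriptions are all hypothetical; the sketch also assumes a comma or dot is a decimal separator and that no thousands separators occur.

```python
import re

# Hypothetical alias table: the abstract does not list the units or
# spellings that were actually handled.
UNIT_ALIASES = {
    "kg": "kg", "kgs": "kg", "quilo": "kg", "quilos": "kg",
    "g": "g", "gr": "g", "grama": "g", "gramas": "g",
    "l": "l", "lt": "l", "litro": "l", "litros": "l",
    "ml": "ml",
}

# A number (optionally with a decimal part), optionally followed by a
# word made of ASCII letters that may be a unit of measure.
NUMBER_UNIT = re.compile(r"(\d+(?:[.,]\d+)?)(?:\s*([a-z]+)\b)?")


def standardize(description: str) -> str:
    """Lowercase a description and rewrite numbers and known units
    into one canonical form (assumption: ',' and '.' are decimal marks)."""
    text = description.lower()

    def rewrite(match: re.Match) -> str:
        value = float(match.group(1).replace(",", "."))
        # Drop leading zeros and a trailing ".0" on whole numbers.
        number = str(int(value)) if value.is_integer() else str(value)
        unit = match.group(2)
        if unit is None:
            return number
        # Unknown words after a number are kept unchanged.
        return f"{number} {UNIT_ALIASES.get(unit, unit)}"

    text = NUMBER_UNIT.sub(rewrite, text)
    # Collapse repeated whitespace left over from the source text.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    for raw in ["ARROZ TIPO 1 5KG", "Arroz tipo 1 5 quilos",
                "Água mineral 1,5 Litros"]:
        print(standardize(raw))
```

With this sketch, two descriptions that differ only in how the quantity is written ("5KG" versus "5 quilos") map to the same string, which is the kind of effect a clustering step downstream would benefit from.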

Item metadata

id	UFMG_50952c9af38c640413746ae9c2fd0273
oai_identifier_str	oai:repositorio.ufmg.br:1843/42299
network_acronym_str	UFMG
network_name_str	Repositório Institucional da UFMG
repository_id_str
spelling	Funding agency: CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior. Files: Pedro_Brum_dissertacao.pdf (application/pdf, 3418455 bytes); license.txt (text/plain; charset=utf-8, 2118 bytes), the non-exclusive distribution license of the Repositório Institucional da UFMG.
dc.title.pt_BR.fl_str_mv	Embedded representations for item descriptions in unsupervised tasks
dc.title.alternative.pt_BR.fl_str_mv	Representações vetoriais para descrições de itens em tarefas não supervisionadas
title	Embedded representations for item descriptions in unsupervised tasks
spellingShingle	Embedded representations for item descriptions in unsupervised tasks; Pedro Paulo Valadares Brum; Text representation; Text clustering; Word embeddings; Representação de texto; Agrupamento de texto; Vetores de palavras; Computação – Teses; Representação documentária – Teses; Agrupamento de texto – Teses; Processamento da linguagem natural (Computação) - Teses
title_short	Embedded representations for item descriptions in unsupervised tasks
title_full	Embedded representations for item descriptions in unsupervised tasks
title_fullStr	Embedded representations for item descriptions in unsupervised tasks
title_full_unstemmed	Embedded representations for item descriptions in unsupervised tasks
title_sort	Embedded representations for item descriptions in unsupervised tasks
author	Pedro Paulo Valadares Brum
author_facet	Pedro Paulo Valadares Brum
author_role	author
dc.contributor.advisor1.fl_str_mv	Gisele Lobo Pappa
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/5936682335701497
dc.contributor.advisor-co1.fl_str_mv	Anisio Mendes Lacerda
dc.contributor.referee1.fl_str_mv	Rodrygo Luis Teodoro Santos
dc.contributor.referee2.fl_str_mv	Solange Oliveira Rezende
dc.contributor.authorLattes.fl_str_mv	http://lattes.cnpq.br/7996389934990654
dc.contributor.author.fl_str_mv	Pedro Paulo Valadares Brum
contributor_str_mv	Gisele Lobo Pappa; Anisio Mendes Lacerda; Rodrygo Luis Teodoro Santos; Solange Oliveira Rezende
dc.subject.por.fl_str_mv	Text representation; Text clustering; Word embeddings; Representação de texto; Agrupamento de texto; Vetores de palavras
topic	Text representation; Text clustering; Word embeddings; Representação de texto; Agrupamento de texto; Vetores de palavras; Computação – Teses; Representação documentária – Teses; Agrupamento de texto – Teses; Processamento da linguagem natural (Computação) - Teses
dc.subject.other.pt_BR.fl_str_mv	Computação – Teses; Representação documentária – Teses; Agrupamento de texto – Teses; Processamento da linguagem natural (Computação) - Teses
publishDate	2021
dc.date.issued.fl_str_mv	2021-09-14
dc.date.accessioned.fl_str_mv	2022-06-06T22:50:13Z
dc.date.available.fl_str_mv	2022-06-06T22:50:13Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/1843/42299
url	http://hdl.handle.net/1843/42299
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv	UFMG
dc.publisher.country.fl_str_mv	Brasil
dc.publisher.department.fl_str_mv	ICEX - INSTITUTO DE CIÊNCIAS EXATAS
publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG
instname_str	Universidade Federal de Minas Gerais (UFMG)
instacron_str	UFMG
institution	UFMG
reponame_str	Repositório Institucional da UFMG
collection	Repositório Institucional da UFMG
bitstream.url.fl_str_mv	https://repositorio.ufmg.br/bitstream/1843/42299/1/Pedro_Brum_dissertacao.pdf https://repositorio.ufmg.br/bitstream/1843/42299/2/license.txt
bitstream.checksum.fl_str_mv	3b60ac831f5a79c3692b22ca7457fc79 cda590c95a0b51b4d15f60c9642ca272
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv
_version_	1803589528199364608
