Embedded representations for item descriptions in unsupervised tasks
Autor(a) principal: | |
---|---|
Data de Publicação: | 2021 |
Tipo de documento: | Dissertação |
Idioma: | eng |
Título da fonte: | Repositório Institucional da UFMG |
Texto Completo: | http://hdl.handle.net/1843/42299 |
Resumo: | Most machine learning algorithms require a fixed-size vector as input. This makes the area of text representation a challenging one in Natural Language Processing (NLP) tasks, and its results are highly dependent on the target application. For NLP tasks, this fixed-size vector usually represents a sentence or a paragraph. However, building text representations capable of capturing semantic and context-specific information is not a simple task. In this work, we propose a methodology to solve a real-world problem: the identification of unique objects from public procurement stored in the databases of the Federal Public Ministry of Minas Gerais. These scenarios pose challenges that go beyond those commonly known in the text representation area, as we want to group descriptions of products or services. These descriptions in general do not follow the grammatical structure of a sentence in the Portuguese language, as they are mostly formed by nouns, adjectives, and quantities, the latter describing the quantity of items purchased/contracted or the unit of measure that describes the item. Within the proposed framework, we emphasize the text representation problem for unsupervised algorithms. We propose a simple information extraction strategy to improve the quality of sentence vectors, focusing on specific terms such as numbers and nouns, and present a modification of the BERT siamese network, which can be used in an unsupervised way to generate embeddings that carry semantic and syntactic information from descriptions. We also identify numerical terms and measurement units as the two main components in this context, and show that a simple method of standardizing numbers has a significant effect on the results. Experimental results show improvements from the proposed framework in relation to state-of-the-art methods. |
id |
UFMG_50952c9af38c640413746ae9c2fd0273 |
---|---|
oai_identifier_str |
oai:repositorio.ufmg.br:1843/42299 |
network_acronym_str |
UFMG |
network_name_str |
Repositório Institucional da UFMG |
repository_id_str |
|
spelling |
Gisele Lobo Pappahttp://lattes.cnpq.br/5936682335701497Anisio Mendes LacerdaRodrygo Luis Teodoro SantosSolange Oliveira Rezendehttp://lattes.cnpq.br/7996389934990654Pedro Paulo Valadares Brum2022-06-06T22:50:13Z2022-06-06T22:50:13Z2021-09-14http://hdl.handle.net/1843/42299Most machine learning algorithms require a fixed-size vector as input. This makes the area of text representation a challenging one in Natural Language Processing (NLP) tasks, and its results are highly dependent on the target application. For NLP tasks, this fixed-size vector usually represents a sentence or a paragraph. However, building text representations capable of capturing semantic and context-specific information is not a simple task. In this work, we propose a methodology to solve a real-world problem: the identification of unique objects from public procurement stored in the databases of the Federal Public Ministry of Minas Gerais. These scenarios pose challenges that go beyond those commonly known in the text representation area, as we want to group descriptions of products or services. These descriptions in general do not follow the grammatical structure of a sentence in the Portuguese language, as they are mostly formed by nouns, adjectives, and quantities, the latter describing the quantity of items purchased/contracted or the unit of measure that describes the item. Within the proposed framework, we emphasize the text representation problem for unsupervised algorithms. We propose a simple information extraction strategy to improve the quality of sentence vectors, focusing on specific terms such as numbers and nouns, and present a modification of the BERT siamese network, which can be used in an unsupervised way to generate embeddings that carry semantic and syntactic information from descriptions. We also identify numerical terms and measurement units as the two main components in this context, and show that a simple method of standardizing numbers has a significant effect on the results. Experimental results show improvements from the proposed framework in relation to state-of-the-art methods.maioria dos algoritmos de aprendizado de máquina exige como entrada um vetor de tamanho fixo. Isso torna a área de representação de texto uma área desafiadora de pesquisa em Processamento de Linguagem Natural (NLP), e seus resultados são altamente dependentes da aplicação em questão. Para tarefas de NLP, esse vetor de tamanho fixo geralmente representa uma frase ou um parágrafo. No entanto, construir representações de sentença capazes de capturar as informações semânticas e específicas de um contexto não é uma tarefa fácil. Neste trabalho propomos uma metodologia para resolver um problema real: a identificação de objetos únicos de licitação em bases de dados do Ministério Público Federal de Minas Gerais. Esse cenário traz desafios que vão além dos comumente conhecidos na área de representação de texto, uma vez que queremos agrupar descrições de produtos ou serviços. Essas descrições no geral não seguem a estrutura gramatical de uma sentença na língua portuguesa, já que são formadas em sua maioria por substantivos, adjetivos, e quantidades, essas últimas descrevendo a quantidade de itens comprada/contratada ou a unidade de medida que descreve o item. Dentro do arcabouço proposto, damos ênfase ao problema de representação de texto para algoritmos não-supervisionados. Propomos uma estratégia simples de extração de informações para melhorar a qualidade dos vetores de sentenças, com foco em termos específicos como números e substantivos, e apresentamos uma modificação do Sentence-BERT, que pode ser usada de forma não-supervisionada para geração de embeddings que carregam informações semânticas e sintáticas das descrições. Também identificamos termos numéricos e unidades de medida como os dois componentes principais neste contexto, e mostramos que um método simples de padronização de números tem um efeito significativo nos resultados. Resultados experimentais mostram ganhos do arcabouço proposto em relação a métodos estado-da-arte.CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível SuperiorengUniversidade Federal de Minas GeraisPrograma de Pós-Graduação em Ciência da ComputaçãoUFMGBrasilICEX - INSTITUTO DE CIÊNCIAS EXATASComputação – TesesRepresentação documentária – TesesAgrupamento de texto – TesesProcessamento da linguagem natural (Computação) - TesesText representationText clusteringWord embeddingsRepresentação de textoAgrupamento de textoVetores de palavrasEmbedded representations for item descriptions in unsupervised tasksRepresentações vetoriais para descrições de itens em tarefas não supervisionadasinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisinfo:eu-repo/semantics/openAccessreponame:Repositório Institucional da UFMGinstname:Universidade Federal de Minas Gerais (UFMG)instacron:UFMGORIGINALPedro_Brum_dissertacao.pdfPedro_Brum_dissertacao.pdfapplication/pdf3418455https://repositorio.ufmg.br/bitstream/1843/42299/1/Pedro_Brum_dissertacao.pdf3b60ac831f5a79c3692b22ca7457fc79MD51LICENSElicense.txtlicense.txttext/plain; charset=utf-82118https://repositorio.ufmg.br/bitstream/1843/42299/2/license.txtcda590c95a0b51b4d15f60c9642ca272MD521843/422992022-06-06 19:50:14.044oai:repositorio.ufmg.br:1843/42299TElDRU7Dh0EgREUgRElTVFJJQlVJw4fDg08gTsODTy1FWENMVVNJVkEgRE8gUkVQT1NJVMOTUklPIElOU1RJVFVDSU9OQUwgREEgVUZNRwoKQ29tIGEgYXByZXNlbnRhw6fDo28gZGVzdGEgbGljZW7Dp2EsIHZvY8OqIChvIGF1dG9yIChlcykgb3UgbyB0aXR1bGFyIGRvcyBkaXJlaXRvcyBkZSBhdXRvcikgY29uY2VkZSBhbyBSZXBvc2l0w7NyaW8gSW5zdGl0dWNpb25hbCBkYSBVRk1HIChSSS1VRk1HKSBvIGRpcmVpdG8gbsOjbyBleGNsdXNpdm8gZSBpcnJldm9nw6F2ZWwgZGUgcmVwcm9kdXppciBlL291IGRpc3RyaWJ1aXIgYSBzdWEgcHVibGljYcOnw6NvIChpbmNsdWluZG8gbyByZXN1bW8pIHBvciB0b2RvIG8gbXVuZG8gbm8gZm9ybWF0byBpbXByZXNzbyBlIGVsZXRyw7RuaWNvIGUgZW0gcXVhbHF1ZXIgbWVpbywgaW5jbHVpbmRvIG9zIGZvcm1hdG9zIMOhdWRpbyBvdSB2w61kZW8uCgpWb2PDqiBkZWNsYXJhIHF1ZSBjb25oZWNlIGEgcG9sw610aWNhIGRlIGNvcHlyaWdodCBkYSBlZGl0b3JhIGRvIHNldSBkb2N1bWVudG8gZSBxdWUgY29uaGVjZSBlIGFjZWl0YSBhcyBEaXJldHJpemVzIGRvIFJJLVVGTUcuCgpWb2PDqiBjb25jb3JkYSBxdWUgbyBSZXBvc2l0w7NyaW8gSW5zdGl0dWNpb25hbCBkYSBVRk1HIHBvZGUsIHNlbSBhbHRlcmFyIG8gY29udGXDumRvLCB0cmFuc3BvciBhIHN1YSBwdWJsaWNhw6fDo28gcGFyYSBxdWFscXVlciBtZWlvIG91IGZvcm1hdG8gcGFyYSBmaW5zIGRlIHByZXNlcnZhw6fDo28uCgpWb2PDqiB0YW1iw6ltIGNvbmNvcmRhIHF1ZSBvIFJlcG9zaXTDs3JpbyBJbnN0aXR1Y2lvbmFsIGRhIFVGTUcgcG9kZSBtYW50ZXIgbWFpcyBkZSB1bWEgY8OzcGlhIGRlIHN1YSBwdWJsaWNhw6fDo28gcGFyYSBmaW5zIGRlIHNlZ3VyYW7Dp2EsIGJhY2stdXAgZSBwcmVzZXJ2YcOnw6NvLgoKVm9jw6ogZGVjbGFyYSBxdWUgYSBzdWEgcHVibGljYcOnw6NvIMOpIG9yaWdpbmFsIGUgcXVlIHZvY8OqIHRlbSBvIHBvZGVyIGRlIGNvbmNlZGVyIG9zIGRpcmVpdG9zIGNvbnRpZG9zIG5lc3RhIGxpY2Vuw6dhLiBWb2PDqiB0YW1iw6ltIGRlY2xhcmEgcXVlIG8gZGVww7NzaXRvIGRlIHN1YSBwdWJsaWNhw6fDo28gbsOjbywgcXVlIHNlamEgZGUgc2V1IGNvbmhlY2ltZW50bywgaW5mcmluZ2UgZGlyZWl0b3MgYXV0b3JhaXMgZGUgbmluZ3XDqW0uCgpDYXNvIGEgc3VhIHB1YmxpY2HDp8OjbyBjb250ZW5oYSBtYXRlcmlhbCBxdWUgdm9jw6ogbsOjbyBwb3NzdWkgYSB0aXR1bGFyaWRhZGUgZG9zIGRpcmVpdG9zIGF1dG9yYWlzLCB2b2PDqiBkZWNsYXJhIHF1ZSBvYnRldmUgYSBwZXJtaXNzw6NvIGlycmVzdHJpdGEgZG8gZGV0ZW50b3IgZG9zIGRpcmVpdG9zIGF1dG9yYWlzIHBhcmEgY29uY2VkZXIgYW8gUmVwb3NpdMOzcmlvIEluc3RpdHVjaW9uYWwgZGEgVUZNRyBvcyBkaXJlaXRvcyBhcHJlc2VudGFkb3MgbmVzdGEgbGljZW7Dp2EsIGUgcXVlIGVzc2UgbWF0ZXJpYWwgZGUgcHJvcHJpZWRhZGUgZGUgdGVyY2Vpcm9zIGVzdMOhIGNsYXJhbWVudGUgaWRlbnRpZmljYWRvIGUgcmVjb25oZWNpZG8gbm8gdGV4dG8gb3Ugbm8gY29udGXDumRvIGRhIHB1YmxpY2HDp8OjbyBvcmEgZGVwb3NpdGFkYS4KCkNBU08gQSBQVUJMSUNBw4fDg08gT1JBIERFUE9TSVRBREEgVEVOSEEgU0lETyBSRVNVTFRBRE8gREUgVU0gUEFUUk9Dw41OSU8gT1UgQVBPSU8gREUgVU1BIEFHw4pOQ0lBIERFIEZPTUVOVE8gT1UgT1VUUk8gT1JHQU5JU01PLCBWT0PDiiBERUNMQVJBIFFVRSBSRVNQRUlUT1UgVE9ET1MgRSBRVUFJU1FVRVIgRElSRUlUT1MgREUgUkVWSVPDg08gQ09NTyBUQU1Cw4lNIEFTIERFTUFJUyBPQlJJR0HDh8OVRVMgRVhJR0lEQVMgUE9SIENPTlRSQVRPIE9VIEFDT1JETy4KCk8gUmVwb3NpdMOzcmlvIEluc3RpdHVjaW9uYWwgZGEgVUZNRyBzZSBjb21wcm9tZXRlIGEgaWRlbnRpZmljYXIgY2xhcmFtZW50ZSBvIHNldSBub21lKHMpIG91IG8ocykgbm9tZXMocykgZG8ocykgZGV0ZW50b3IoZXMpIGRvcyBkaXJlaXRvcyBhdXRvcmFpcyBkYSBwdWJsaWNhw6fDo28sIGUgbsOjbyBmYXLDoSBxdWFscXVlciBhbHRlcmHDp8OjbywgYWzDqW0gZGFxdWVsYXMgY29uY2VkaWRhcyBwb3IgZXN0YSBsaWNlbsOnYS4KRepositório de PublicaçõesPUBhttps://repositorio.ufmg.br/oaiopendoar:2022-06-06T22:50:14Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)false |
dc.title.pt_BR.fl_str_mv |
Embedded representations for item descriptions in unsupervised tasks |
dc.title.alternative.pt_BR.fl_str_mv |
Representações vetoriais para descrições de itens em tarefas não supervisionadas |
title |
Embedded representations for item descriptions in unsupervised tasks |
spellingShingle |
Embedded representations for item descriptions in unsupervised tasks Pedro Paulo Valadares Brum Text representation Text clustering Word embeddings Representação de texto Agrupamento de texto Vetores de palavras Computação – Teses Representação documentária – Teses Agrupamento de texto – Teses Processamento da linguagem natural (Computação) - Teses |
title_short |
Embedded representations for item descriptions in unsupervised tasks |
title_full |
Embedded representations for item descriptions in unsupervised tasks |
title_fullStr |
Embedded representations for item descriptions in unsupervised tasks |
title_full_unstemmed |
Embedded representations for item descriptions in unsupervised tasks |
title_sort |
Embedded representations for item descriptions in unsupervised tasks |
author |
Pedro Paulo Valadares Brum |
author_facet |
Pedro Paulo Valadares Brum |
author_role |
author |
dc.contributor.advisor1.fl_str_mv |
Gisele Lobo Pappa |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/5936682335701497 |
dc.contributor.advisor-co1.fl_str_mv |
Anisio Mendes Lacerda |
dc.contributor.referee1.fl_str_mv |
Rodrygo Luis Teodoro Santos |
dc.contributor.referee2.fl_str_mv |
Solange Oliveira Rezende |
dc.contributor.authorLattes.fl_str_mv |
http://lattes.cnpq.br/7996389934990654 |
dc.contributor.author.fl_str_mv |
Pedro Paulo Valadares Brum |
contributor_str_mv |
Gisele Lobo Pappa Anisio Mendes Lacerda Rodrygo Luis Teodoro Santos Solange Oliveira Rezende |
dc.subject.por.fl_str_mv |
Text representation Text clustering Word embeddings Representação de texto Agrupamento de texto Vetores de palavras |
topic |
Text representation Text clustering Word embeddings Representação de texto Agrupamento de texto Vetores de palavras Computação – Teses Representação documentária – Teses Agrupamento de texto – Teses Processamento da linguagem natural (Computação) - Teses |
dc.subject.other.pt_BR.fl_str_mv |
Computação – Teses Representação documentária – Teses Agrupamento de texto – Teses Processamento da linguagem natural (Computação) - Teses |
description |
Most machine learning algorithms require a fixed-size vector as input. This makes the area of text representation a challenging one in Natural Language Processing (NLP) tasks, and its results are highly dependent on the target application. For NLP tasks, this fixed-size vector usually represents a sentence or a paragraph. However, building text representations capable of capturing semantic and context-specific information is not a simple task. In this work, we propose a methodology to solve a real-world problem: the identification of unique objects from public procurement stored in the databases of the Federal Public Ministry of Minas Gerais. These scenarios pose challenges that go beyond those commonly known in the text representation area, as we want to group descriptions of products or services. These descriptions in general do not follow the grammatical structure of a sentence in the Portuguese language, as they are mostly formed by nouns, adjectives, and quantities, the latter describing the quantity of items purchased/contracted or the unit of measure that describes the item. Within the proposed framework, we emphasize the text representation problem for unsupervised algorithms. We propose a simple information extraction strategy to improve the quality of sentence vectors, focusing on specific terms such as numbers and nouns, and present a modification of the BERT siamese network, which can be used in an unsupervised way to generate embeddings that carry semantic and syntactic information from descriptions. We also identify numerical terms and measurement units as the two main components in this context, and show that a simple method of standardizing numbers has a significant effect on the results. Experimental results show improvements from the proposed framework in relation to state-of-the-art methods. |
publishDate |
2021 |
dc.date.issued.fl_str_mv |
2021-09-14 |
dc.date.accessioned.fl_str_mv |
2022-06-06T22:50:13Z |
dc.date.available.fl_str_mv |
2022-06-06T22:50:13Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/1843/42299 |
url |
http://hdl.handle.net/1843/42299 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais |
dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Ciência da Computação |
dc.publisher.initials.fl_str_mv |
UFMG |
dc.publisher.country.fl_str_mv |
Brasil |
dc.publisher.department.fl_str_mv |
ICEX - INSTITUTO DE CIÊNCIAS EXATAS |
publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG |
instname_str |
Universidade Federal de Minas Gerais (UFMG) |
instacron_str |
UFMG |
institution |
UFMG |
reponame_str |
Repositório Institucional da UFMG |
collection |
Repositório Institucional da UFMG |
bitstream.url.fl_str_mv |
https://repositorio.ufmg.br/bitstream/1843/42299/1/Pedro_Brum_dissertacao.pdf https://repositorio.ufmg.br/bitstream/1843/42299/2/license.txt |
bitstream.checksum.fl_str_mv |
3b60ac831f5a79c3692b22ca7457fc79 cda590c95a0b51b4d15f60c9642ca272 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG) |
repository.mail.fl_str_mv |
|
_version_ |
1803589528199364608 |