Evaluating the performance and improving the usability of parallel and distributed word embedding tools

Bibliographic details
Main author: Silva, Mateus Lyra da
Publication date: 2020
Document type: Dissertation
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da PUC_RS
Full text: http://tede2.pucrs.br/tede2/handle/tede/9245
Abstract: The representation of words as vectors, known as Word Embeddings (WE), has received considerable attention in the Natural Language Processing (NLP) field. WE models can express syntactic and semantic similarities, as well as relationships and contexts of words within a given corpus. Although the most popular implementations of WE algorithms scale poorly, newer approaches apply High-Performance Computing (HPC) techniques. This creates an opportunity to analyze the main differences among the existing implementations in terms of performance and scalability metrics. In this dissertation, we present an interdisciplinary study that addresses the resource utilization and performance of known WE algorithms found in the literature. To improve scalability and usability, we propose an integration for local and remote execution environments that contains a set of the most optimized versions. With these optimizations, it is possible to achieve an average performance gain of 15x on multicores and 105x on multinodes compared to the original version. There is also a large reduction in memory footprint compared to the most popular Python versions. Because appropriate use of HPC environments may require expert knowledge, we also propose a parameter tuning model that uses a Multilayer Perceptron (MLP) neural network and the Simulated Annealing (SA) algorithm to suggest the best parameter setup for the available computational resources, which may help non-expert users work with HPC environments.
id P_RS_39accfd7c6870fefe031dcd763572fb2
oai_identifier_str oai:tede2.pucrs.br:tede/9245
network_acronym_str P_RS
network_name_str Biblioteca Digital de Teses e Dissertações da PUC_RS
repository_id_str
spelling Submitted by PPG Ciência da Computação (ppgcc@pucrs.br) on 2020-07-29T17:35:26Z. No. of bitstreams: 1. Dissertacao_homolog.pdf: 8822751 bytes, checksum: f5bebcc4f366a19c4cec808bd2e531ff (MD5).
Approved for entry into archive by Lucas Martins Kern (lucas.kern@pucrs.br) on 2020-08-28T14:30:54Z (GMT).
Made available in DSpace on 2020-08-28T14:36:04Z (GMT). Previous issue date: 2020-03-30.
Sponsorship: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES
Rights: The work has no restrictions on publication.
dc.title.por.fl_str_mv Evaluating the performance and improving the usability of parallel and distributed word embedding tools
title Evaluating the performance and improving the usability of parallel and distributed word embedding tools
author Silva, Mateus Lyra da
author_facet Silva, Mateus Lyra da
author_role author
dc.contributor.advisor1.fl_str_mv De Rose, César Augusto Fonticielha
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/6703453792017497
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/8584495387617430
dc.contributor.author.fl_str_mv Silva, Mateus Lyra da
contributor_str_mv De Rose, César Augusto Fonticielha
dc.subject.por.fl_str_mv Word2vec
HPC
Memória distribuída
Multicomputadores
MPI
OpenMP
topic Word2vec
HPC
Memória distribuída
Multicomputadores
MPI
OpenMP
Shared memory
Multicomputers
CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO
dc.subject.eng.fl_str_mv Word2vec
HPC
Shared memory
Multicomputers
MPI
OpenMP
dc.subject.cnpq.fl_str_mv CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO
publishDate 2020
dc.date.accessioned.fl_str_mv 2020-08-28T14:36:04Z
dc.date.issued.fl_str_mv 2020-03-30
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://tede2.pucrs.br/tede2/handle/tede/9245
url http://tede2.pucrs.br/tede2/handle/tede/9245
dc.language.iso.fl_str_mv eng
language eng
dc.relation.program.fl_str_mv -4570527706994352458
dc.relation.confidence.fl_str_mv 500
500
600
dc.relation.cnpq.fl_str_mv -862078257083325301
dc.relation.sponsorship.fl_str_mv 3590462550136975366
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Pontifícia Universidade Católica do Rio Grande do Sul
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv PUCRS
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv Escola Politécnica
publisher.none.fl_str_mv Pontifícia Universidade Católica do Rio Grande do Sul
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da PUC_RS
instname:Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
instacron:PUC_RS
instname_str Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
instacron_str PUC_RS
institution PUC_RS
reponame_str Biblioteca Digital de Teses e Dissertações da PUC_RS
collection Biblioteca Digital de Teses e Dissertações da PUC_RS
bitstream.url.fl_str_mv http://tede2.pucrs.br/tede2/bitstream/tede/9245/4/Dissertacao_homolog.pdf.jpg
http://tede2.pucrs.br/tede2/bitstream/tede/9245/3/Dissertacao_homolog.pdf.txt
http://tede2.pucrs.br/tede2/bitstream/tede/9245/2/Dissertacao_homolog.pdf
http://tede2.pucrs.br/tede2/bitstream/tede/9245/1/license.txt
bitstream.checksum.fl_str_mv 3aea60dfa9984e96b6a82415ada9dc26
5d6080a290c8abb68be59dfc4f382049
f5bebcc4f366a19c4cec808bd2e531ff
220e11f2d3ba5354f917c7035aadef24
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da PUC_RS - Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
repository.mail.fl_str_mv biblioteca.central@pucrs.br
_version_ 1799765346456961024