Word Embedding Evaluation in Downstream Tasks and Semantic Analogies

Bibliographic details
Main author: Santos, Joaquim
Publication date: 2020
Other authors: Consoli, Bernardo; Vieira, Renata
Document type: Article
Language: English
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10174/29657
Abstract: Language models have long been a prolific area of study in Natural Language Processing (NLP). Among the newer and most widely used kinds of language models are Word Embeddings (WE): vector-space representations of a vocabulary, learned by an unsupervised neural network from the contexts in which words appear. WE have been widely adopted as features for processing textual data in downstream tasks across many areas of NLP. This paper evaluates newly released WE models for the Portuguese language, trained on a corpus of 4.9 billion tokens. The first evaluation is an intrinsic task in which the WE models had to correctly complete semantic and syntactic analogies. The second is an extrinsic evaluation in which the models were used in two downstream tasks: Named Entity Recognition and Semantic Similarity between Sentences. Our results show that a diverse and comprehensive corpus can often outperform a larger, less textually diverse one, and that passing the text in parts to the embedding-generating algorithm may cause a loss of quality.
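The intrinsic analogy evaluation the abstract refers to is commonly implemented with vector arithmetic (the 3CosAdd method popularized by word2vec): to answer "a is to b as c is to ?", find the vocabulary word whose vector is closest to b - a + c by cosine similarity. The following is a minimal sketch with hand-made toy vectors, not the paper's actual models or evaluation code:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' via 3CosAdd: argmax cos(v, b - a + c),
    excluding the three query words themselves."""
    target = emb[b] - emb[a] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# Toy embeddings chosen by hand so the expected relation holds;
# real evaluations use vectors trained on billions of tokens.
emb = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 1.0]),
    "queen": np.array([0.0, 2.0, 1.0]),
    "apple": np.array([0.5, 0.1, 0.9]),
}

print(solve_analogy(emb, "man", "king", "woman"))  # prints "queen"
```

Analogy test sets score a model by the fraction of such quadruples it answers correctly, which is how an intrinsic evaluation can compare models trained on corpora of different size and diversity.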
Record id: RCAP_7a5657c428b8ec5865e3a3578a0117f5
OAI identifier: oai:dspace.uevora.pt:10174/29657
Network acronym: RCAP
Network name: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Repository id: 7160
Keywords: Language models; Evaluation
Published: 2020-05-01 (deposited in repository 2021-04-01)
Version: published version
Citation: SANTOS, Joaquim; CONSOLI, Bernardo; VIEIRA, Renata. Word Embedding Evaluation in Downstream Tasks and Semantic Analogies. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). pp. 4828-4834.
Full text (ACL Anthology): https://www.aclweb.org/anthology/2020.lrec-1.594.pdf
Contact: renatav@uevora.pt
Rights: open access
Publisher: LREC/ELRA
Institution: Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação