Word Embedding Evaluation in Downstream Tasks and Semantic Analogies
| Main author | Santos, Joaquim |
---|---|
| Publication date | 2020 |
| Other authors | Consoli, Bernardo; Vieira, Renata |
| Document type | Article |
| Language | English |
| Keywords | Language models; Evaluation |
| Publisher | LREC/ELRA |
| Rights | Open access |
| Source | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
| Full text | http://hdl.handle.net/10174/29657 |
| Citation | SANTOS, Joaquim; CONSOLI, Bernardo; VIEIRA, Renata. Word Embedding Evaluation in Downstream Tasks and Semantic Analogies. In: Proceedings of the 12th Language Resources and Evaluation Conference. 2020. p. 4828-4834. https://www.aclweb.org/anthology/2020.lrec-1.594.pdf |
Abstract: Language models have long been a prolific area of study in Natural Language Processing (NLP). Among the newer and most widely used kinds of language model are word embeddings (WE): vector-space representations of a vocabulary, learned by an unsupervised neural network from the contexts in which words appear. WE are widely used as features in downstream tasks across many areas of NLP. This paper presents an evaluation of newly released WE models for Portuguese, trained on a corpus of 4.9 billion tokens. The first evaluation is an intrinsic task, in which the models must correctly complete semantic and syntactic analogies. The second is an extrinsic evaluation, in which the models are applied to two downstream tasks: Named Entity Recognition and Semantic Similarity between Sentences. Our results show that a diverse, comprehensive corpus can often outperform a larger but less textually diverse one, and that feeding the text to the WE training algorithm in separate parts may degrade embedding quality.
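The two evaluation settings in the abstract can be illustrated with a short sketch. The paper does not state its tooling, so the snippet below is a minimal, hypothetical example using gensim: the model file name, the Portuguese word pairs, and the sentence-averaging baseline are illustrative assumptions, not the authors' actual setup. It shows an intrinsic analogy query (the kind of semantic relation the models had to complete) and a simple embedding-based sentence similarity, in the flavor of the Semantic Similarity task.

```python
# Minimal sketch of the two evaluation styles described in the abstract.
# ASSUMPTIONS: gensim as tooling, the file name "pt_embeddings.txt", and the
# example words/sentences are all illustrative; the paper does not specify them.
import numpy as np
from gensim.models import KeyedVectors

# Load pretrained Portuguese vectors in word2vec text format (hypothetical file).
wv = KeyedVectors.load_word2vec_format("pt_embeddings.txt", binary=False)

# --- Intrinsic task: semantic/syntactic analogies ---
# vector("rei") - vector("homem") + vector("mulher") should land near "rainha"
# (king - man + woman ~ queen) if the model captures the relation.
print(wv.most_similar(positive=["rei", "mulher"], negative=["homem"], topn=1))

# --- Extrinsic-style task: semantic similarity between sentences ---
def sentence_vector(tokens):
    """Average the vectors of in-vocabulary tokens (a common baseline)."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    """Cosine similarity between two vectors, 0.0 if either is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

s1 = "o gato dorme no sofá".split()
s2 = "um felino descansa no sofá".split()
print(cosine(sentence_vector(s1), sentence_vector(s2)))  # higher = more similar
```

In a setup like this, a model trained on the more diverse corpus would be expected to rank "rainha" higher in the analogy query and to separate related from unrelated sentence pairs more cleanly, which is the kind of difference the paper's evaluations measure.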