Word Embedding Evaluation in Downstream Tasks and Semantic Analogies
| Main author | Santos, Joaquim |
---|---|
| Publication date | 2020 |
| Other authors | Consoli, Bernardo; Vieira, Renata |
| Document type | Article |
| Language | English |
| Keywords | Language models; Evaluation |
| Publisher | LREC/ELRA |
| Rights | Open access |
| Source | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
| Full text | http://hdl.handle.net/10174/29657 |
| Citation | SANTOS, Joaquim; CONSOLI, Bernardo; VIEIRA, Renata. Word Embedding Evaluation in Downstream Tasks and Semantic Analogies. In: Proceedings of the 12th Language Resources and Evaluation Conference. 2020. p. 4828-4834. https://www.aclweb.org/anthology/2020.lrec-1.594.pdf |
Abstract: Language models have long been a prolific area of study in Natural Language Processing (NLP). Among the newer and most widely used kinds of language model are word embeddings (WE): vector-space representations of a vocabulary, learned by an unsupervised neural network from the contexts in which words appear. WE are widely used as features in downstream tasks across many areas of NLP. This paper presents an evaluation of newly released WE models for Portuguese, trained on a corpus of 4.9 billion tokens. The first evaluation is an intrinsic task, in which the models must correctly complete semantic and syntactic analogies. The second is an extrinsic evaluation, in which the models are applied to two downstream tasks: Named Entity Recognition and Semantic Similarity between Sentences. Our results show that a diverse, comprehensive corpus can often outperform a larger but less textually diverse one, and that feeding the text to the WE training algorithm in separate parts may degrade embedding quality.
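The two evaluation settings in the abstract can be illustrated with a short sketch. The paper does not state its tooling, so the snippet below is a minimal, hypothetical example using gensim: the model file name, the Portuguese word pairs, and the sentence-averaging baseline are illustrative assumptions, not the authors' actual setup. It shows an intrinsic analogy query (the kind of semantic relation the models had to complete) and a simple embedding-based sentence similarity, in the flavor of the Semantic Similarity task.

```python
# Minimal sketch of the two evaluation styles described in the abstract.
# ASSUMPTIONS: gensim as tooling, the file name "pt_embeddings.txt", and the
# example words/sentences are all illustrative; the paper does not specify them.
import numpy as np
from gensim.models import KeyedVectors

# Load pretrained Portuguese vectors in word2vec text format (hypothetical file).
wv = KeyedVectors.load_word2vec_format("pt_embeddings.txt", binary=False)

# --- Intrinsic task: semantic/syntactic analogies ---
# vector("rei") - vector("homem") + vector("mulher") should land near "rainha"
# (king - man + woman ~ queen) if the model captures the relation.
print(wv.most_similar(positive=["rei", "mulher"], negative=["homem"], topn=1))

# --- Extrinsic-style task: semantic similarity between sentences ---
def sentence_vector(tokens):
    """Average the vectors of in-vocabulary tokens (a common baseline)."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    """Cosine similarity between two vectors, 0.0 if either is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

s1 = "o gato dorme no sofá".split()
s2 = "um felino descansa no sofá".split()
print(cosine(sentence_vector(s1), sentence_vector(s2)))  # higher = more similar
```

In a setup like this, a model trained on the more diverse corpus would be expected to rank "rainha" higher in the analogy query and to separate related from unrelated sentence pairs more cleanly, which is the kind of difference the paper's evaluations measure.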