Improving word embeddings in Portuguese: increasing accuracy while reducing the size of the corpus

Pinto, José Pedro; Viana, Paula; Teixeira, Inês; Andrade, Maria


Bibliographic details
Main author:	Pinto, José Pedro
Publication date:	2022
Other authors:	Viana, Paula; Teixeira, Inês; Andrade, Maria
Document type:	Article
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10400.22/21674
Abstract:	The subjectiveness of multimedia content description has a strong negative impact on tag-based information retrieval. In our work, we propose enhancing available descriptions by adding semantically related tags. To this end, we use a word embedding technique based on the Word2Vec neural network, parameterized and trained on a new dataset built by scraping and pre-processing a large number of news stories from online newspapers. Our target language is Portuguese, one of the most widely spoken languages worldwide. The results achieved significantly outperform similar existing solutions developed in the scope of different languages, including Portuguese. Contributions also include an online application and API available for external use. Although the presented work has been designed to enhance multimedia content annotation, it can be used in several other application areas.
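
The tag-enhancement idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vectors below are invented toy values standing in for trained Word2Vec embeddings, and the names TOY_EMBEDDINGS, cosine and expand_tags are hypothetical. The sketch only shows the general technique of adding the nearest neighbours of each tag, ranked by cosine similarity.

```python
import math

# Toy 3-dimensional vectors standing in for trained Word2Vec embeddings.
# Words and values are invented for illustration; they are not from the paper.
TOY_EMBEDDINGS = {
    "futebol": [0.9, 0.1, 0.0],
    "golo": [0.8, 0.2, 0.1],
    "estádio": [0.7, 0.3, 0.0],
    "eleição": [0.0, 0.9, 0.2],
    "parlamento": [0.1, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_tags(tags, embeddings, top_n=2):
    """Return the original tags plus up to top_n nearest neighbours per tag."""
    expanded = list(tags)
    for tag in tags:
        if tag not in embeddings:
            continue  # tags outside the vocabulary are kept but not expanded
        scored = sorted(
            (
                (cosine(embeddings[tag], vec), word)
                for word, vec in embeddings.items()
                if word != tag and word not in expanded
            ),
            reverse=True,
        )
        expanded.extend(word for _, word in scored[:top_n])
    return expanded


print(expand_tags(["futebol"], TOY_EMBEDDINGS))
```

With real embeddings, the dictionary would be replaced by vectors from a trained model, and a similarity threshold might be preferable to a fixed top_n; both choices are assumptions of this sketch.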

Item metadata

id	RCAP_80dc382a34c04412dcdce934302a4326
oai_identifier_str	oai:recipp.ipp.pt:10400.22/21674
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
spelling	Natural language processing | Machine learning | Multimedia systems | Context awareness | Word2Vec | This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project LA/P/0063/2020. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. | PeerJ | Repositório Científico do Instituto Politécnico do Porto | Pinto, José Pedro | Viana, Paula | Teixeira, Inês | Andrade, Maria | 2023-01-19T12:03:19Z | 2022-07-18 | 2022-07-18T00:00:00Z | info:eu-repo/semantics/publishedVersion | info:eu-repo/semantics/article | application/pdf | http://hdl.handle.net/10400.22/21674 | eng | 10.7717/peerj-cs.964 | info:eu-repo/semantics/openAccess | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) | instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação | instacron:RCAAP | 2023-03-13T13:17:55Z | oai:recipp.ipp.pt:10400.22/21674 | Portal Agregador | ONG | https://www.rcaap.pt/oai/openaire | opendoar:7160 | 2024-03-19T17:41:42.691735 | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação | false
dc.title.none.fl_str_mv	Improving word embeddings in Portuguese: increasing accuracy while reducing the size of the corpus
author	Pinto, José Pedro
author_facet	Pinto, José Pedro; Viana, Paula; Teixeira, Inês; Andrade, Maria
author_role	author
author2	Viana, Paula; Teixeira, Inês; Andrade, Maria
author2_role	author author author
dc.contributor.none.fl_str_mv	Repositório Científico do Instituto Politécnico do Porto
dc.contributor.author.fl_str_mv	Pinto, José Pedro; Viana, Paula; Teixeira, Inês; Andrade, Maria
dc.subject.por.fl_str_mv	Natural language processing; Machine learning; Multimedia systems; Context awareness; Word2Vec
topic	Natural language processing; Machine learning; Multimedia systems; Context awareness; Word2Vec
publishDate	2022
dc.date.none.fl_str_mv	2022-07-18 2022-07-18T00:00:00Z 2023-01-19T12:03:19Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10400.22/21674
url	http://hdl.handle.net/10400.22/21674
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	10.7717/peerj-cs.964
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	PeerJ
publisher.none.fl_str_mv	PeerJ
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799131504540909568

