Topic modelling: a consistent framework for comparative studies and its practical application
Main author: | Amaro, Ana Margarida Rocha |
---|---|
Publication date: | 2022 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10362/144705 |
Summary: | Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics |
id |
RCAP_ae238f25a75ca6bce7b11e7567396a9f |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/144705 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Topic modelling: a consistent framework for comparative studies and its practical application Natural Language Processing Top2Vec Topic Coherence Topic Modelling Unsupervised Learning Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics. This research was part of the DSAIPA/DS/0116/2019 project, supported by a grant of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”). Topic Modelling (TM) is an unsupervised learning method for finding latent semantic structure in a set of documents, grouping them according to their semantic content. Although several TM algorithms have been proposed in the literature, they are commonly not validated against the same datasets and evaluation metrics. At the same time, existing surveys rely on a small number of algorithms or, given the pace of advances in the field, exclude models that have been presented with state-of-the-art results. Consequently, in this work we aim to present a more complete comparative study of the performance of different TM techniques, evaluated on three datasets from different contexts: the 20 Newsgroups dataset, the Yahoo! Q&A dataset, and the BIG Patent dataset. The experiments, evaluated primarily through the Context Vectors (CV) Topic Coherence, indicate that Top2Vec is the best-performing model across all datasets. Given these results, an exploratory analysis was conducted on a newly introduced dataset containing abstracts of articles funded by Central Banks and other international organizations. This analysis is intended to provide an informative outlook on the organizations’ diverse topics of interest and their evolution over the period under study.
In short, the major contribution of this work is to offer an updated survey of state-of-the-art TM approaches while demonstrating their practical usability in a new context and exploring the insights obtained. Bação, Fernando José Ferreira Lucas RUN Amaro, Ana Margarida Rocha 2022-10-14T13:08:36Z 2022-10-03 2022-10-03T00:00:00Z info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/masterThesis application/pdf http://hdl.handle.net/10362/144705 TID:203076524 eng info:eu-repo/semantics/embargoedAccess reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP 2024-03-11T05:24:32Z oai:run.unl.pt:10362/144705 Portal Agregador ONG https://www.rcaap.pt/oai/openaire opendoar:7160 2024-03-20T03:51:42.518940 Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação false |
dc.title.none.fl_str_mv |
Topic modelling: a consistent framework for comparative studies and its practical application |
title |
Topic modelling: a consistent framework for comparative studies and its practical application |
spellingShingle |
Topic modelling: a consistent framework for comparative studies and its practical application Amaro, Ana Margarida Rocha Natural Language Processing Top2Vec Topic Coherence Topic Modelling Unsupervised Learning |
author |
Amaro, Ana Margarida Rocha |
author_facet |
Amaro, Ana Margarida Rocha |
author_role |
author |
dc.contributor.none.fl_str_mv |
Bação, Fernando José Ferreira Lucas RUN |
dc.contributor.author.fl_str_mv |
Amaro, Ana Margarida Rocha |
dc.subject.por.fl_str_mv |
Natural Language Processing Top2Vec Topic Coherence Topic Modelling Unsupervised Learning |
topic |
Natural Language Processing Top2Vec Topic Coherence Topic Modelling Unsupervised Learning |
description |
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-10-14T13:08:36Z 2022-10-03 2022-10-03T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/144705 TID:203076524 |
url |
http://hdl.handle.net/10362/144705 |
identifier_str_mv |
TID:203076524 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/embargoedAccess |
eu_rights_str_mv |
embargoedAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
_version_ |
1799138109876600832 |