Topic modelling: a consistent framework for comparative studies and its practical application

Bibliographic details
Main author: Amaro, Ana Margarida Rocha
Publication date: 2022
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10362/144705
Abstract: Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics
id RCAP_ae238f25a75ca6bce7b11e7567396a9f
oai_identifier_str oai:run.unl.pt:10362/144705
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling Topic modelling: a consistent framework for comparative studies and its practical application
Natural Language Processing
Top2Vec
Topic Coherence
Topic Modelling
Unsupervised Learning
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics
This research was part of the DSAIPA/DS/0116/2019 project, supported by a grant of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”).
Topic Modelling (TM) is an unsupervised learning method for finding latent semantic structure in a set of documents, grouping them according to their semantic content. Although several TM algorithms have been proposed in the literature, they are commonly not validated against the same datasets and evaluation metrics. At the same time, current surveys rely on a small number of algorithms or, given the pace of advances in the field, exclude models that have been presented with state-of-the-art results. Consequently, in this work we aim to present a more complete comparative study of the performance of different TM techniques, evaluated on three datasets from different contexts: the 20 Newsgroup dataset, the Yahoo! Q&A dataset, and the BIG Patent dataset. The experiments, evaluated primarily through the Context Vectors (CV) Topic Coherence, indicate that Top2Vec is the best-performing model across all datasets. Given these results, an exploratory analysis was conducted on a newly introduced dataset containing abstracts of articles funded by Central Banks and other international organizations. This endeavour is intended to provide an informative outlook on the organizations' diverse topics of interest and their evolution over the period under study. In short, the major contribution of this work is to offer an updated survey of state-of-the-art TM approaches while demonstrating their practical usability in a new context and exploring the insights obtained.
Bação, Fernando José Ferreira Lucas
RUN
Amaro, Ana Margarida Rocha
2022-10-14T13:08:36Z
2022-10-03
2022-10-03T00:00:00Z
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/masterThesis
application/pdf
http://hdl.handle.net/10362/144705
TID:203076524
eng
info:eu-repo/semantics/embargoedAccess
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
2024-03-11T05:24:32Z
oai:run.unl.pt:10362/144705
Portal Agregador
ONG
https://www.rcaap.pt/oai/openaire
opendoar:7160
2024-03-20T03:51:42.518940
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
false
dc.title.none.fl_str_mv Topic modelling: a consistent framework for comparative studies and its practical application
title Topic modelling: a consistent framework for comparative studies and its practical application
spellingShingle Topic modelling: a consistent framework for comparative studies and its practical application
Amaro, Ana Margarida Rocha
Natural Language Processing
Top2Vec
Topic Coherence
Topic Modelling
Unsupervised Learning
author Amaro, Ana Margarida Rocha
author_facet Amaro, Ana Margarida Rocha
author_role author
dc.contributor.none.fl_str_mv Bação, Fernando José Ferreira Lucas
RUN
dc.contributor.author.fl_str_mv Amaro, Ana Margarida Rocha
dc.subject.por.fl_str_mv Natural Language Processing
Top2Vec
Topic Coherence
Topic Modelling
Unsupervised Learning
topic Natural Language Processing
Top2Vec
Topic Coherence
Topic Modelling
Unsupervised Learning
description Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics
publishDate 2022
dc.date.none.fl_str_mv 2022-10-14T13:08:36Z
2022-10-03
2022-10-03T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/144705
TID:203076524
url http://hdl.handle.net/10362/144705
identifier_str_mv TID:203076524
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/embargoedAccess
eu_rights_str_mv embargoedAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799138109876600832