Topic modeling: Summarize and organize data corpus using machine learning algorithms

Souza, Marcos de; Souza, Renato Rocha

Bibliographic details
Main author:	Souza, Marcos de
Publication date:	2020
Other authors:	Souza, Renato Rocha
Document type:	Article
Language:	por
Source title:	Múltiplos Olhares em Ciência da Informação
Full text:	https://periodicos.ufmg.br/index.php/moci/article/view/19138
Abstract:	The research compares the results and performance of two machine learning models, Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), when applied to topic modeling of documents from formal channels of scientific communication. The corpus consists of 2,006 scientific articles and expanded abstracts from the XIII to the XVII National Meeting of Research in Information Science (ENANCIB). The empirical steps comprise collecting the data to build the corpus, followed by cleaning, manipulating, combining, normalizing, treating, and transforming the corpus data so that it can be fed to the machine learning models. The models summarized and organized the corpus into topics, each made up of terms and weights. The LSI model showed greater variety among the terms and weights in each topic, unlike the LDA model, which showed greater similarity in its results, thus making it easier for the domain specialist to propose names for the topics.

Item metadata

id	UFMG-20_436550f3d8117f62f1e968d99845cf68
oai_identifier_str	oai:periodicos.ufmg.br:article/19138
network_acronym_str	UFMG-20
network_name_str	Múltiplos Olhares em Ciência da Informação
repository_id_str
dc.title.none.fl_str_mv	Topic modeling: Summarize and organize data corpus using machine learning algorithms / Modelagem de tópicos: Resumir e organizar corpus de dados por meio de algoritmos de aprendizagem de máquina
title	Topic modeling: Summarize and organize data corpus using machine learning algorithms
author	Souza, Marcos de
author_facet	Souza, Marcos de; Souza, Renato Rocha
author_role	author
author2	Souza, Renato Rocha
author2_role	author
dc.contributor.author.fl_str_mv	Souza, Marcos de; Souza, Renato Rocha
dc.subject.por.fl_str_mv	Modelagem de tópicos; Aprendizagem de máquina; Alocação de Dirichlet Latente; Indexação semântica latente; Modeling topics; Machine learning; Latent Dirichlet allocation; Latent semantic indexing
topic	Modelagem de tópicos; Aprendizagem de máquina; Alocação de Dirichlet Latente; Indexação semântica latente; Modeling topics; Machine learning; Latent Dirichlet allocation; Latent semantic indexing
description	The research compares the results and performance of two machine learning models, Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), when applied to topic modeling of documents from formal channels of scientific communication. The corpus consists of 2,006 scientific articles and expanded abstracts from the XIII to the XVII National Meeting of Research in Information Science (ENANCIB). The empirical steps comprise collecting the data to build the corpus, followed by cleaning, manipulating, combining, normalizing, treating, and transforming the corpus data so that it can be fed to the machine learning models. The models summarized and organized the corpus into topics, each made up of terms and weights. The LSI model showed greater variety among the terms and weights in each topic, unlike the LDA model, which showed greater similarity in its results, thus making it easier for the domain specialist to propose names for the topics.
publishDate	2020
dc.date.none.fl_str_mv	2020-01-31
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://periodicos.ufmg.br/index.php/moci/article/view/19138
url	https://periodicos.ufmg.br/index.php/moci/article/view/19138
dc.language.iso.fl_str_mv	por
language	por
dc.relation.none.fl_str_mv	https://periodicos.ufmg.br/index.php/moci/article/view/19138/16257
dc.rights.driver.fl_str_mv	Copyright (c) 2020 Múltiplos Olhares em Ciência da Informação info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Copyright (c) 2020 Múltiplos Olhares em Ciência da Informação
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Universidade Federal de Minas Gerais (UFMG)
publisher.none.fl_str_mv	Universidade Federal de Minas Gerais (UFMG)
dc.source.none.fl_str_mv	Múltiplos Olhares em Ciência da Informação ; Vol. 9 No. 2 (2019): PPGGOG - Discentes Múltiplos Olhares em Ciência da Informação - ISSN 2237-6658; Vol. 9 Núm. 2 (2019): PPGGOG - Discentes Múltiplos Olhares em Ciência da Informação - ISSN 2237-6658; Vol. 9 No 2 (2019): PPGGOG - Discentes Múltiplos Olhares em Ciência da Informação; v. 9 n. 2 (2019): PPGGOG - Discentes 2237-6658 reponame:Múltiplos Olhares em Ciência da Informação instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG
instname_str	Universidade Federal de Minas Gerais (UFMG)
instacron_str	UFMG
institution	UFMG
reponame_str	Múltiplos Olhares em Ciência da Informação
collection	Múltiplos Olhares em Ciência da Informação
repository.name.fl_str_mv	Múltiplos Olhares em Ciência da Informação - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv	moci@eci.ufmg.br
_version_	1796797464256184320