Investigação do processo de stemming na lingua portuguesa

Alvares, Reinaldo Viana

Bibliographic details
Main author:	Alvares, Reinaldo Viana
Publication date:	2008
Document type:	Master's thesis
Language:	por
Source title:	Repositório Institucional da Universidade Federal Fluminense (RIUFF)
Full text:	https://app.uff.br/riuff/handle/1/17898
Abstract:	Information retrieval is a routine task for humans, yet it is difficult to automate. The difficulty arises because the quality of the results is often tied to the user's degree of satisfaction, which is hard to measure. In general, this quality is evaluated against a set of queries run on a text collection, together with their relevant answers. Two evaluation measures are commonly used in this process: precision, which is the proportion of relevant retrieved items out of all retrieved items; and recall, which is the proportion of relevant retrieved items out of all relevant items in the collection. One of the challenges is to find efficient ways to represent documents so as to avoid ambiguity. One way to address this problem is to obtain a single representation for words that refer to the same concept. This task can be defined as stemming. The stemming process often depends on the morphological structure of the target language. For Portuguese, few solutions were found that meet the demand for such algorithms. The morphological complexity of Portuguese and the scarcity of stemming solutions for it motivated the research presented in this work. This work presents a new model for the stemming process, applicable to Portuguese, based on a statistical study of a collection of words extracted from the Brazilian Web. To evaluate the model, a stemmer is implemented and compared with a solution from the literature that was developed specifically for Portuguese. The main contributions of this work are the systematic model for the stemming process and the stemmer designed and implemented specifically for Portuguese.
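
The two evaluation measures defined in the abstract can be illustrated with a minimal sketch. The function name `precision_recall` and the document identifiers are hypothetical, chosen only for illustration; this is not code from the dissertation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the items the system returned
    relevant:  identifiers of the items judged relevant in the collection
    Both names are illustrative assumptions, not taken from the dissertation.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant items that were actually retrieved

    # Precision: relevant retrieved items / all retrieved items.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: relevant retrieved items / all relevant items in the collection.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 items retrieved, 3 relevant items exist, 2 overlap.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up example, two of the four retrieved items are relevant (precision 0.5), and two of the three relevant items were retrieved (recall about 0.67). Conflating word variants through stemming typically changes which items are retrieved, which is why both measures are used to compare stemmers.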

Item metadata

id	UFF-2_8c42f07ec44877a9028ea06730a083f1
oai_identifier_str	oai:app.uff.br:1/17898
network_acronym_str	UFF-2
network_name_str	Repositório Institucional da Universidade Federal Fluminense (RIUFF)
repository_id_str	2120
dc.title.none.fl_str_mv	Investigação do processo de stemming na lingua portuguesa Stemming process investigation for the portuguese language
title	Investigação do processo de stemming na lingua portuguesa
title_short	Investigação do processo de stemming na lingua portuguesa
title_full	Investigação do processo de stemming na lingua portuguesa
title_fullStr	Investigação do processo de stemming na lingua portuguesa
title_full_unstemmed	Investigação do processo de stemming na lingua portuguesa
title_sort	Investigação do processo de stemming na lingua portuguesa
author	Alvares, Reinaldo Viana
author_facet	Alvares, Reinaldo Viana
author_role	author
dc.contributor.none.fl_str_mv	Garcia, Ana Cristina Bicharra CPF:31237899422 http://lattes.cnpq.br/4879977915136752 Rezende, Solange Oliveira CPF:29523433222 http://lattes.cnpq.br Soto, Miguel Pari CPF:22264323422 http://lattes.cnpq.br/1534009365844020
dc.contributor.author.fl_str_mv	Alvares, Reinaldo Viana
dc.subject.por.fl_str_mv	Computer science; Algorithm; Information retrieval; Data mining process; Data retrieval (Computing); Text mining; Databases; KDD; Artificial intelligence; Stemming algorithms; Natural language processing; CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO::COMPUTABILIDADE E MODELOS DE COMPUTACAO
topic	Computer science; Algorithm; Information retrieval; Data mining process; Data retrieval (Computing); Text mining; Databases; KDD; Artificial intelligence; Stemming algorithms; Natural language processing; CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO::COMPUTABILIDADE E MODELOS DE COMPUTACAO
description	Information retrieval is a routine task for humans, yet it is difficult to automate. The difficulty arises because the quality of the results is often tied to the user's degree of satisfaction, which is hard to measure. In general, this quality is evaluated against a set of queries run on a text collection, together with their relevant answers. Two evaluation measures are commonly used in this process: precision, which is the proportion of relevant retrieved items out of all retrieved items; and recall, which is the proportion of relevant retrieved items out of all relevant items in the collection. One of the challenges is to find efficient ways to represent documents so as to avoid ambiguity. One way to address this problem is to obtain a single representation for words that refer to the same concept. This task can be defined as stemming. The stemming process often depends on the morphological structure of the target language. For Portuguese, few solutions were found that meet the demand for such algorithms. The morphological complexity of Portuguese and the scarcity of stemming solutions for it motivated the research presented in this work. This work presents a new model for the stemming process, applicable to Portuguese, based on a statistical study of a collection of words extracted from the Brazilian Web. To evaluate the model, a stemmer is implemented and compared with a solution from the literature that was developed specifically for Portuguese. The main contributions of this work are the systematic model for the stemming process and the stemmer designed and implemented specifically for Portuguese.
publishDate	2008
dc.date.none.fl_str_mv	2008-06-16 2021-03-10T20:43:03Z 2021-03-10T20:43:03Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://app.uff.br/riuff/handle/1/17898
url	https://app.uff.br/riuff/handle/1/17898
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	CC-BY-SA info:eu-repo/semantics/openAccess
rights_invalid_str_mv	CC-BY-SA
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Programa de Pós-Graduação em Computação Computação
publisher.none.fl_str_mv	Programa de Pós-Graduação em Computação Computação
dc.source.none.fl_str_mv	reponame:Repositório Institucional da Universidade Federal Fluminense (RIUFF) instname:Universidade Federal Fluminense (UFF) instacron:UFF
instname_str	Universidade Federal Fluminense (UFF)
instacron_str	UFF
institution	UFF
reponame_str	Repositório Institucional da Universidade Federal Fluminense (RIUFF)
collection	Repositório Institucional da Universidade Federal Fluminense (RIUFF)
repository.name.fl_str_mv	Repositório Institucional da Universidade Federal Fluminense (RIUFF) - Universidade Federal Fluminense (UFF)
repository.mail.fl_str_mv	riuff@id.uff.br
_version_	1807838908058173440
