Newsminer: um sistema de data warehouse baseado em texto de notícias

Bibliographic details
Main author: Nogueira, Rodrigo Ramos
Publication date: 2017
Document type: Master's thesis
Language: Portuguese (por)
Source title: Repositório Institucional da UFSCAR
Full text: https://repositorio.ufscar.br/handle/ufscar/9138
Abstract: Applications that mine data and text from the Web have been the subject of recent research. In every case, data mining tasks need clean, consistent, and integrated data to obtain the best results, which makes Data Warehouse environments a valuable source of data for such applications. Data Warehouse technology has evolved to retrieve and process data from the Web. In particular, news websites are rich sources of text that can compose a linguistic corpus. By inserting the corpus into a Data Warehouse environment, applications can take advantage of the flexibility that a multidimensional model and OLAP operations provide. The benefits include navigation through the data, selection of the part of the data considered relevant, analysis at different levels of abstraction, and aggregation, disaggregation, rotation, and filtering over any set of data. This work presents Newsminer, a Data Warehouse environment that provides a consistent, clean set of texts in the form of a multidimensional corpus for consumption by external applications and users. The proposal includes an architecture that integrates three parts: the gathering of news texts in near real time; a semantic enrichment module in the ETL stage, which adds semantic properties to the collected data, such as news category and POS-tagging annotation; and access to data cubes for consumption by applications and users. Two experiments were performed. The first selected the best news category classifier for the semantic enrichment module; statistical analysis of the results indicated that the Perceptron classifier achieved the best F-measure, with good computational time. The second collected data to evaluate real-time news preprocessing; for the data set collected, the results indicated that online processing time is achievable.
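The OLAP operations named in the abstract (aggregation, analysis at different levels of abstraction, filtering) can be illustrated with a minimal Python sketch. It is not Newsminer's implementation: the fact rows, the dimension names (`category`, `source`, `date`), the `tokens` measure, and the helper functions are all hypothetical, chosen only to show what rolling up and slicing a small multidimensional corpus might look like.

```python
from collections import defaultdict

# Hypothetical fact rows: one row per news item. The dimensions and the
# "tokens" measure are illustrative and do not come from the thesis.
FACTS = [
    {"category": "sports", "source": "site-a", "date": "2017-05-01", "tokens": 320},
    {"category": "sports", "source": "site-b", "date": "2017-05-02", "tokens": 410},
    {"category": "politics", "source": "site-a", "date": "2017-05-01", "tokens": 530},
    {"category": "politics", "source": "site-a", "date": "2017-06-03", "tokens": 275},
]


def roll_up(rows, dims, measure="tokens"):
    """Aggregate rows over the given dimensions (document count and measure sum)."""
    cube = defaultdict(lambda: {"docs": 0, measure: 0})
    for row in rows:
        key = tuple(row[d] for d in dims)
        cube[key]["docs"] += 1
        cube[key][measure] += row[measure]
    return dict(cube)


def slice_rows(rows, **fixed):
    """Filter: keep only rows whose dimension values match the fixed ones."""
    return [r for r in rows if all(r[d] == v for d, v in fixed.items())]


def with_month(rows):
    """Derive a coarser 'month' level from the 'date' dimension."""
    return [{**r, "month": r["date"][:7]} for r in rows]


# Aggregation at the category level.
by_category = roll_up(FACTS, ["category"])
# A different level of abstraction: category by month instead of by day.
by_category_month = roll_up(with_month(FACTS), ["category", "month"])
# Filtering on one source before aggregating.
site_a_by_category = roll_up(slice_rows(FACTS, source="site-a"), ["category"])
```

In a real data warehouse these operations would typically be expressed as queries against fact and dimension tables; the in-memory version above only mirrors their semantics.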
id SCAR_074bd4eb424766402c9a6da9c56181ff
oai_identifier_str oai:repositorio.ufscar.br:ufscar/9138
network_acronym_str SCAR
network_name_str Repositório Institucional da UFSCAR
repository_id_str 4322
spelling Nogueira, Rodrigo Ramos
Gonzalez, Sahudy Montenegro
http://lattes.cnpq.br/9826346918182685
http://lattes.cnpq.br/0327974399448757
ad93a7ca-079d-4fd6-b116-627e17b4c358
2017-10-09T14:14:24Z
2017-10-09T14:14:24Z
2017-05-12
NOGUEIRA, Rodrigo Ramos. Newsminer: um sistema de data warehouse baseado em texto de notícias. 2017. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, Sorocaba, 2017. Disponível em: https://repositorio.ufscar.br/handle/ufscar/9138.
https://repositorio.ufscar.br/handle/ufscar/9138
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
OB800972
por
Universidade Federal de São Carlos
Câmpus Sorocaba
Programa de Pós-Graduação em Ciência da Computação - PPGCC-So
UFSCar
Mineração de dados (Computação)
Sites da Web
Corpora multidimensional
Enriquecimento semântico
Categorização de notícias
OLAP
Multidimensional corpora
Data mining
Web sites
Data Warehouse
News websites
Semantic enrichment
News categorization
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Newsminer: um sistema de data warehouse baseado em texto de notícias
Newsminer: a data warehouse system based on news websites
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/masterThesis
Online
600
600
650ef0c9-17ab-462d-9df3-e6221084fe8c
info:eu-repo/semantics/openAccess
reponame:Repositório Institucional da UFSCAR
instname:Universidade Federal de São Carlos (UFSCAR)
instacron:UFSCAR
ORIGINAL
NOGUEIRA_Rodrigo_2017.pdf
NOGUEIRA_Rodrigo_2017.pdf
application/pdf
5427774
https://repositorio.ufscar.br/bitstream/ufscar/9138/1/NOGUEIRA_Rodrigo_2017.pdf
db8155583bf1bffe3ceb4c01bf26f66f
MD5
1
LICENSE
license.txt
license.txt
text/plain; charset=utf-8
1957
https://repositorio.ufscar.br/bitstream/ufscar/9138/2/license.txt
ae0398b6f8b235e40ad82cba6c50031d
MD5
2
TEXT
NOGUEIRA_Rodrigo_2017.pdf.txt
NOGUEIRA_Rodrigo_2017.pdf.txt
Extracted text
text/plain
142558
https://repositorio.ufscar.br/bitstream/ufscar/9138/3/NOGUEIRA_Rodrigo_2017.pdf.txt
31c70e32eb759203a79f6c0621f57f9c
MD5
3
THUMBNAIL
NOGUEIRA_Rodrigo_2017.pdf.jpg
NOGUEIRA_Rodrigo_2017.pdf.jpg
IM Thumbnail
image/jpeg
5848
https://repositorio.ufscar.br/bitstream/ufscar/9138/4/NOGUEIRA_Rodrigo_2017.pdf.jpg
a13e96016c83f1b7edc17d79cff09d71
MD5
4
ufscar/9138
2023-09-18 18:31:26.558
oai:repositorio.ufscar.br:ufscar/9138
Repositório Institucional
PUB
https://repositorio.ufscar.br/oai/request
opendoar:4322
2023-09-18T18:31:26
Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
false
dc.title.por.fl_str_mv Newsminer: um sistema de data warehouse baseado em texto de notícias
dc.title.alternative.eng.fl_str_mv Newsminer: a data warehouse system based on news websites
title Newsminer: um sistema de data warehouse baseado em texto de notícias
spellingShingle Newsminer: um sistema de data warehouse baseado em texto de notícias
Nogueira, Rodrigo Ramos
Mineração de dados (Computação)
Sites da Web
Corpora multidimensional
Enriquecimento semântico
Categorização de notícias
OLAP
Multidimensional corpora
Data mining
Web sites
Data Warehouse
News websites
Semantic enrichment
News categorization
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
title_short Newsminer: um sistema de data warehouse baseado em texto de notícias
title_full Newsminer: um sistema de data warehouse baseado em texto de notícias
title_fullStr Newsminer: um sistema de data warehouse baseado em texto de notícias
title_full_unstemmed Newsminer: um sistema de data warehouse baseado em texto de notícias
title_sort Newsminer: um sistema de data warehouse baseado em texto de notícias
author Nogueira, Rodrigo Ramos
author_facet Nogueira, Rodrigo Ramos
author_role author
dc.contributor.authorlattes.por.fl_str_mv http://lattes.cnpq.br/0327974399448757
dc.contributor.author.fl_str_mv Nogueira, Rodrigo Ramos
dc.contributor.advisor1.fl_str_mv Gonzalez, Sahudy Montenegro
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/9826346918182685
dc.contributor.authorID.fl_str_mv ad93a7ca-079d-4fd6-b116-627e17b4c358
contributor_str_mv Gonzalez, Sahudy Montenegro
dc.subject.por.fl_str_mv Mineração de dados (Computação)
Sites da Web
Corpora multidimensional
Enriquecimento semântico
Categorização de notícias
OLAP
Multidimensional corpora
topic Mineração de dados (Computação)
Sites da Web
Corpora multidimensional
Enriquecimento semântico
Categorização de notícias
OLAP
Multidimensional corpora
Data mining
Web sites
Data Warehouse
News websites
Semantic enrichment
News categorization
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
dc.subject.eng.fl_str_mv Data mining
Web sites
Data Warehouse
News websites
Semantic enrichment
News categorization
dc.subject.cnpq.fl_str_mv CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
description Applications that mine data and text from the Web have been the subject of recent research. In every case, data mining tasks need clean, consistent, and integrated data to obtain the best results, which makes Data Warehouse environments a valuable source of data for such applications. Data Warehouse technology has evolved to retrieve and process data from the Web. In particular, news websites are rich sources of text that can compose a linguistic corpus. By inserting the corpus into a Data Warehouse environment, applications can take advantage of the flexibility that a multidimensional model and OLAP operations provide. The benefits include navigation through the data, selection of the part of the data considered relevant, analysis at different levels of abstraction, and aggregation, disaggregation, rotation, and filtering over any set of data. This work presents Newsminer, a Data Warehouse environment that provides a consistent, clean set of texts in the form of a multidimensional corpus for consumption by external applications and users. The proposal includes an architecture that integrates three parts: the gathering of news texts in near real time; a semantic enrichment module in the ETL stage, which adds semantic properties to the collected data, such as news category and POS-tagging annotation; and access to data cubes for consumption by applications and users. Two experiments were performed. The first selected the best news category classifier for the semantic enrichment module; statistical analysis of the results indicated that the Perceptron classifier achieved the best F-measure, with good computational time. The second collected data to evaluate real-time news preprocessing; for the data set collected, the results indicated that online processing time is achievable.
publishDate 2017
dc.date.accessioned.fl_str_mv 2017-10-09T14:14:24Z
dc.date.available.fl_str_mv 2017-10-09T14:14:24Z
dc.date.issued.fl_str_mv 2017-05-12
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv NOGUEIRA, Rodrigo Ramos. Newsminer: um sistema de data warehouse baseado em texto de notícias. 2017. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, Sorocaba, 2017. Disponível em: https://repositorio.ufscar.br/handle/ufscar/9138.
dc.identifier.uri.fl_str_mv https://repositorio.ufscar.br/handle/ufscar/9138
identifier_str_mv NOGUEIRA, Rodrigo Ramos. Newsminer: um sistema de data warehouse baseado em texto de notícias. 2017. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, Sorocaba, 2017. Disponível em: https://repositorio.ufscar.br/handle/ufscar/9138.
url https://repositorio.ufscar.br/handle/ufscar/9138
dc.language.iso.fl_str_mv por
language por
dc.relation.confidence.fl_str_mv 600
600
dc.relation.authority.fl_str_mv 650ef0c9-17ab-462d-9df3-e6221084fe8c
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de São Carlos
Câmpus Sorocaba
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Ciência da Computação - PPGCC-So
dc.publisher.initials.fl_str_mv UFSCar
publisher.none.fl_str_mv Universidade Federal de São Carlos
Câmpus Sorocaba
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFSCAR
instname:Universidade Federal de São Carlos (UFSCAR)
instacron:UFSCAR
instname_str Universidade Federal de São Carlos (UFSCAR)
instacron_str UFSCAR
institution UFSCAR
reponame_str Repositório Institucional da UFSCAR
collection Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv https://repositorio.ufscar.br/bitstream/ufscar/9138/1/NOGUEIRA_Rodrigo_2017.pdf
https://repositorio.ufscar.br/bitstream/ufscar/9138/2/license.txt
https://repositorio.ufscar.br/bitstream/ufscar/9138/3/NOGUEIRA_Rodrigo_2017.pdf.txt
https://repositorio.ufscar.br/bitstream/ufscar/9138/4/NOGUEIRA_Rodrigo_2017.pdf.jpg
bitstream.checksum.fl_str_mv db8155583bf1bffe3ceb4c01bf26f66f
ae0398b6f8b235e40ad82cba6c50031d
31c70e32eb759203a79f6c0621f57f9c
a13e96016c83f1b7edc17d79cff09d71
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
repository.mail.fl_str_mv
_version_ 1802136330869669888
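
The record lists an MD5 checksum for each bitstream, so a downloaded file can be checked against it. The sketch below is one possible way to do that in Python with the standard `hashlib` module. It assumes the checksums are listed in the same order as the bitstream URLs; the function names and the local file path passed to them are hypothetical.

```python
import hashlib

# Checksums copied from bitstream.checksum.fl_str_mv, paired with file names
# on the assumption that they follow the order of bitstream.url.fl_str_mv.
EXPECTED_MD5 = {
    "NOGUEIRA_Rodrigo_2017.pdf": "db8155583bf1bffe3ceb4c01bf26f66f",
    "license.txt": "ae0398b6f8b235e40ad82cba6c50031d",
    "NOGUEIRA_Rodrigo_2017.pdf.txt": "31c70e32eb759203a79f6c0621f57f9c",
    "NOGUEIRA_Rodrigo_2017.pdf.jpg": "a13e96016c83f1b7edc17d79cff09d71",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, name):
    """Compare a local file's MD5 with the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, after saving the thesis PDF locally, `matches_record("NOGUEIRA_Rodrigo_2017.pdf", "NOGUEIRA_Rodrigo_2017.pdf")` would report whether the download matches the listed checksum. MD5 is adequate for detecting accidental corruption but not for defending against deliberate tampering.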