A Thorough exploitation of distance-based meta-features for Automated text classification

Sergio Daniel Carvalho Canuto

Bibliographic details
Main author:	Sergio Daniel Carvalho Canuto
Publication date:	2019
Document type:	Thesis
Language:	eng
Source title:	Repositório Institucional da UFMG
Full text:	http://hdl.handle.net/1843/34071
Abstract:	Automated Text Classification (ATC) has become important for a variety of tasks, such as categorizing news, organizing digital libraries, building web directories, analyzing the sentiment of user-generated content, and detecting spam. Given a set of training documents classified into one or more predefined categories, the task of ATC is to automatically learn how to classify new (unclassified) documents, using a combination of features of these documents that associates them with categories. Because the ATC problem arises in many different applications, diverse machine learning algorithms have been proposed to deal with it. Although the classification algorithm itself plays an important role in ATC, the features that represent documents may be equally important in determining effectiveness. In particular, representing documents in a feature space is a prerequisite for ATC, since classification algorithms are designed to discover discriminative patterns in these features. In this sense, a relevant challenge lies in efficiently manipulating the feature space to address ATC from a data engineering viewpoint. In this context, we address the problem of automatically learning to classify texts by exploiting information derived from meta-features, i.e., features engineered from the original (bag-of-words) representation. In particular, the exploited meta-features rely on distance measures to summarize complex relationships between documents and present discriminative information for classification. We propose not only new meta-features that provide discriminative evidence for classification, but also new mechanisms to analyze and select meta-features using multi-objective strategies.
These strategies can reduce the number of meta-features while maximizing classification effectiveness, taking into account the adequacy of the selected meta-features to a particular dataset or classification method. Moreover, we provide additional contributions to improve the efficiency and effectiveness of meta-features. Specifically, we propose: (i) the use of commodity GPUs to reduce the computational time needed to generate meta-features; (ii) the use of supervised learning to enrich distance relationships with labeled information; and (iii) the design of new meta-features specific to the sentiment analysis context. Our experimental results on five traditional benchmarks for topic classification show that, with appropriate selection techniques, our distance-based meta-features can achieve remarkable classification results compared with the original feature space and with other recently proposed distance-based meta-features. We further explain our results by identifying and discussing meta-features that, when combined, provide core information for classifying documents. Our improvements to the core meta-features, which use labeled information to enrich distance relationships, provide additional gains over our best results on topic datasets. We also evaluate meta-features on nineteen sentiment analysis datasets. In this context, our proposals for sentiment classification produced remarkable results compared with previous meta-features that do not take sentiment analysis idiosyncrasies into account.
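The idea of a distance-based meta-feature can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not say which distance measures or neighborhoods are used, so the choice of cosine similarity, the per-class k-nearest-neighbor layout, and the names `bow`, `cosine`, and `meta_features` are all assumptions made for illustration only.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term-frequency vector (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def meta_features(doc, train_docs, train_labels, k=2):
    """Assumed layout: for each class (in sorted order), the k largest
    similarities between `doc` and the training documents of that class,
    zero-padded when a class has fewer than k documents."""
    vec = bow(doc)
    feats = []
    for label in sorted(set(train_labels)):
        sims = sorted(
            (cosine(vec, bow(d))
             for d, y in zip(train_docs, train_labels) if y == label),
            reverse=True,
        )[:k]
        sims += [0.0] * (k - len(sims))
        feats.extend(sims)
    return feats


# Toy training set, invented for this example.
train_docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Four meta-features: two similarities for "ham", then two for "spam".
features = meta_features("buy cheap pills", train_docs, train_labels, k=2)
```

The resulting vector has k entries per class instead of one entry per vocabulary term, so a downstream classifier sees a much smaller space that summarizes how close the document lies to each category.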

Item metadata

id	UFMG_2421b23aaedb7461499385a2ff4b6065
oai_identifier_str	oai:repositorio.ufmg.br:1843/34071
network_acronym_str	UFMG
network_name_str	Repositório Institucional da UFMG
repository_id_str
spelling	CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
dc.title.pt_BR.fl_str_mv	A Thorough exploitation of distance-based meta-features for Automated text classification
author	Sergio Daniel Carvalho Canuto
author_facet	Sergio Daniel Carvalho Canuto
author_role	author
dc.contributor.advisor1.fl_str_mv	Marcos André Gonçalves
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/3457219624656691
dc.contributor.advisor-co1.fl_str_mv	Thierson Couto Rosa
dc.contributor.referee1.fl_str_mv	Gisele Lobo Pappa
dc.contributor.referee2.fl_str_mv	Rodrygo Luis Teodoro Santos
dc.contributor.referee3.fl_str_mv	Pável Pereira Calado
dc.contributor.referee4.fl_str_mv	Alexandre Plastino de Carvalho
dc.contributor.authorLattes.fl_str_mv	http://lattes.cnpq.br/5172447060300953
dc.contributor.author.fl_str_mv	Sergio Daniel Carvalho Canuto
contributor_str_mv	Marcos André Gonçalves Thierson Couto Rosa Gisele Lobo Pappa Rodrygo Luis Teodoro Santos Pável Pereira Calado Alexandre Plastino de Carvalho
dc.subject.por.fl_str_mv	Supervised classification; Text classification; Meta-features; Machine learning
dc.subject.other.pt_BR.fl_str_mv	Computing – Theses; Supervised learning; Meta-features; Machine learning
publishDate	2019
dc.date.issued.fl_str_mv	2019-11-22
dc.date.accessioned.fl_str_mv	2020-08-28T19:39:14Z
dc.date.available.fl_str_mv	2020-08-28T19:39:14Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/1843/34071
url	http://hdl.handle.net/1843/34071
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv	UFMG
dc.publisher.country.fl_str_mv	Brasil
publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG
instname_str	Universidade Federal de Minas Gerais (UFMG)
instacron_str	UFMG
institution	UFMG
reponame_str	Repositório Institucional da UFMG
collection	Repositório Institucional da UFMG
bitstream.url.fl_str_mv	https://repositorio.ufmg.br/bitstream/1843/34071/1/tese_com_ficha%20%283%29%20%282%29.pdf https://repositorio.ufmg.br/bitstream/1843/34071/2/license.txt
bitstream.checksum.fl_str_mv	dfd6af9eff4d85e239d86c55597d0456 34badce4be7e31e3adb4575ae96af679
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv
_version_	1803589382368657408
