Extended pre-processing pipeline for text classification: on the role of meta-features, sparsification and selective sampling

Bibliographic details
Main author: Washington Luiz Miranda da Cunha
Publication date: 2019
Document type: Master's thesis
Language: eng
Source title: Repositório Institucional da UFMG
Full text: http://hdl.handle.net/1843/33474
Abstract: Text classification pipelines are sequences of tasks that must be performed to classify documents into a set of predefined categories. The pre-processing phase (before training) of these pipelines involves different ways of transforming and manipulating the documents for the next (learning) phase. In this dissertation, we introduce three new steps into the pre-processing phase of text classification pipelines to improve effectiveness while reducing the associated costs. The first step, the generation of distance-based meta-features (MFs), aims to reduce the dimensionality of the original term-document matrix while producing a potentially more informative space that explicitly exploits discriminative labeled information. The second step, sparsification, makes the MF representation less dense to reduce training costs. The third step, selective sampling (SS), removes rows (documents) from the matrix obtained in the previous step by carefully selecting the “best” documents for the learning phase. Our experiments show that the proposed extended pre-processing pipeline can achieve significant gains in effectiveness compared to the original TF-IDF representation (up to 52%) and to embedding-based representations (up to 46%), at a much lower cost (up to 9.7x faster on some datasets). Another main contribution is a thorough and rigorous evaluation of the trade-offs between cost and effectiveness associated with introducing these new steps into the pipeline.
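As a rough illustration of the three steps named in the abstract, the sketch below chains them on a toy term-document matrix. It assumes cosine similarity to class centroids as the distance-based meta-feature, a fixed cut-off for sparsification, and a smallest-margin rule for selective sampling. These choices, the toy data, and every function name are hypothetical and are not taken from the dissertation, whose actual definitions may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def class_centroids(X, y):
    """Mean vector of each class, computed from the labeled documents."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sorted(sums)}

def meta_features(X, centroids):
    """Step 1 (assumed form): one similarity per class, so each row shrinks
    from |vocabulary| columns to |classes| columns."""
    return [[cosine(row, centroids[c]) for c in centroids] for row in X]

def sparsify(M, threshold):
    """Step 2 (assumed rule): zero out similarities below the threshold."""
    return [[v if v >= threshold else 0.0 for v in row] for row in M]

def select_rows(M, keep):
    """Step 3 (assumed rule): keep the `keep` rows whose two largest values
    are closest together. Requires at least two columns (classes)."""
    def margin(row):
        top = sorted(row, reverse=True)
        return top[0] - top[1]
    ranked = sorted(range(len(M)), key=lambda i: margin(M[i]))
    return sorted(ranked[:keep])

# Toy term-document matrix: 4 documents (rows) over 4 terms (columns).
X = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]
y = ["a", "a", "b", "b"]

centroids = class_centroids(X, y)
mf = meta_features(X, centroids)        # 4 x 2 instead of 4 x 4
sparse = sparsify(mf, threshold=0.3)    # weak similarities become 0.0
selected = select_rows(sparse, keep=2)  # indices of the retained documents
```

In this sketch each step shrinks or thins the matrix passed to the learner: the first reduces columns, the second increases the number of zeros, and the third reduces rows.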
id UFMG_6323c0ff7483944f6b8de4285d52745d
oai_identifier_str oai:repositorio.ufmg.br:1843/33474
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
Funding agency: CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
dc.title.pt_BR.fl_str_mv Extended pre-processing pipeline for text classification: on the role of meta-features, sparsification and selective sampling
author Washington Luiz Miranda da Cunha
author_facet Washington Luiz Miranda da Cunha
author_role author
dc.contributor.advisor1.fl_str_mv Marcos André Gonçalves
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/3457219624656691
dc.contributor.advisor2.fl_str_mv Leonardo Chaves Dutra da Rocha
dc.contributor.advisor2Lattes.fl_str_mv http://lattes.cnpq.br/8074447921818504
dc.contributor.referee1.fl_str_mv Jussara Marques de Almeida Gonçalves
dc.contributor.referee2.fl_str_mv Anisio Mendes Lacerda
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/6927963916587716
dc.contributor.author.fl_str_mv Washington Luiz Miranda da Cunha
dc.subject.por.fl_str_mv Text classification pipelines
Pre-processing
Meta-features
Sparsification
Selective sampling
dc.subject.other.pt_BR.fl_str_mv Computing - Theses
Machine learning - Theses
Text classification pipelines - Theses
Data pre-processing - Theses
publishDate 2019
dc.date.issued.fl_str_mv 2019-11-08
dc.date.accessioned.fl_str_mv 2020-05-15T17:50:19Z
dc.date.available.fl_str_mv 2020-05-15T17:50:19Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1843/33474
url http://hdl.handle.net/1843/33474
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv UFMG
dc.publisher.country.fl_str_mv Brazil
dc.publisher.department.fl_str_mv ICEX - INSTITUTO DE CIÊNCIAS EXATAS
publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
instname_str Universidade Federal de Minas Gerais (UFMG)
instacron_str UFMG
institution UFMG
reponame_str Repositório Institucional da UFMG
collection Repositório Institucional da UFMG
bitstream.url.fl_str_mv https://repositorio.ufmg.br/bitstream/1843/33474/1/dissertacao_washingtonCunha_vfinal.pdf
https://repositorio.ufmg.br/bitstream/1843/33474/2/license.txt
bitstream.checksum.fl_str_mv 8f3c831a6b439e0cea70cb6adf8eda7f
34badce4be7e31e3adb4575ae96af679
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv
_version_ 1803589201109712896