Text classification using compression-based dissimilarity measures

Bibliographic details
Main author: Coutinho, David Pereira
Publication date: 2015
Other authors: Figueiredo, Mário A. T.
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10400.21/6144
Abstract: Arguably, the most difficult task in text classification is to choose an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g., nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that they approach, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering.
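The pipeline outlined in the abstract (a compression-based dissimilarity, a dissimilarity-based representation, then a classical classifier) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a naive cross-parsing phrase count as the dissimilarity, uses every training text as a prototype, and classifies with a plain 1-nearest-neighbor rule. The function names and the toy review snippets are hypothetical.

```python
import math


def cross_parse_count(z: str, x: str) -> int:
    """Count the phrases obtained when z is parsed, left to right, into the
    longest substrings that occur somewhere in x. A symbol of z that never
    occurs in x forms a phrase of length 1. Naive search, for illustration."""
    count, i, n = 0, 0, len(z)
    while i < n:
        length = 1
        # Grow the candidate phrase while it still occurs in x.
        while i + length <= n and z[i:i + length] in x:
            length += 1
        # The loop stops one past the longest match (or at 1 if no match).
        i += max(length - 1, 1)
        count += 1
    return count


def dissimilarity(z: str, x: str) -> float:
    """Phrases per symbol of z when parsed against x (lower = more similar).
    This normalization is an assumption of the sketch."""
    return cross_parse_count(z, x) / max(len(z), 1)


def embed(text: str, prototypes: list[str]) -> list[float]:
    """Dissimilarity-based representation: one coordinate per prototype."""
    return [dissimilarity(text, p) for p in prototypes]


def nearest_neighbor(query: list[float],
                     train_vectors: list[list[float]],
                     train_labels: list[str]) -> str:
    """1-NN in the dissimilarity space, using Euclidean distance."""
    best = min(range(len(train_vectors)),
               key=lambda k: math.dist(query, train_vectors[k]))
    return train_labels[best]


# Hypothetical toy data; no pre-processing or feature engineering is applied.
train_texts = [
    "a wonderful, moving film",
    "wonderful acting and a moving story",
    "a dull, boring film",
    "boring plot and dull acting",
]
train_labels = ["pos", "pos", "neg", "neg"]
prototypes = train_texts  # assumption: every training text is a prototype
train_vectors = [embed(t, prototypes) for t in train_texts]

if __name__ == "__main__":
    query = embed("a moving and wonderful story", prototypes)
    print(nearest_neighbor(query, train_vectors, train_labels))
```

Any classifier that accepts fixed-length vectors, such as a support vector machine, could replace the 1-NN rule, since `embed` already yields one coordinate per prototype.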
id RCAP_7e02e1f0a46623094cf5a2740b432a89
oai_identifier_str oai:repositorio.ipl.pt:10400.21/6144
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Text classification using compression-based dissimilarity measures
author Coutinho, David Pereira
author_facet Coutinho, David Pereira
Figueiredo, Mário A. T.
author_role author
author2 Figueiredo, Mário A. T.
author2_role author
dc.contributor.none.fl_str_mv RCIPL
dc.contributor.author.fl_str_mv Coutinho, David Pereira
Figueiredo, Mário A. T.
dc.subject.por.fl_str_mv Text classification
Text similarity measures
Relative entropy
Ziv-Merhav method
Cross-parsing algorithm
publishDate 2015
dc.date.none.fl_str_mv 2015-08
2015-08-01T00:00:00Z
2016-05-03T10:47:14Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.21/6144
url http://hdl.handle.net/10400.21/6144
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv COUTINHO, David Pereira; FIGUEIREDO, Mário A. T. - Text classification using compression-based dissimilarity measures. International Journal of Pattern Recognition and Artificial Intelligence. ISSN 0218-0014. Vol. 23 (2015), pp. 1-18
0218-0014
10.1142/S0218001415530043
dc.rights.driver.fl_str_mv metadata only access
info:eu-repo/semantics/openAccess
rights_invalid_str_mv metadata only access
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv World Scientific Publications CO PTE LTD
publisher.none.fl_str_mv World Scientific Publications CO PTE LTD
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799133412018094080