Text classification using compression-based dissimilarity measures
Autor(a) principal: | |
---|---|
Data de Publicação: | 2015 |
Outros Autores: | |
Tipo de documento: | Artigo |
Idioma: | eng |
Título da fonte: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Texto Completo: | http://hdl.handle.net/10400.21/6144 |
Resumo: | Arguably, the most difficult task in text classification is to choose an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering, by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that it approximates, sometimes even outperforms previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering. |
id |
RCAP_7e02e1f0a46623094cf5a2740b432a89 |
---|---|
oai_identifier_str |
oai:repositorio.ipl.pt:10400.21/6144 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Text classification using compression-based dissimilarity measuresText classificationText similarity measuresRelative entropyZiv-Merhav methodCross-parsing algorithmArguably, the most difficult task in text classification is to choose an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering, by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that it approximates, sometimes even outperforms previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering.World Scientific Publications CO PTE LTDRCIPLCoutinho, David PereiraFigueiredo, Mário A. T.2016-05-03T10:47:14Z2015-082015-08-01T00:00:00Zinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/articleapplication/pdfhttp://hdl.handle.net/10400.21/6144engCOUTINHO, David Pereira; FIGUEIREDO, Mário A. T. - Text classification using compression-based dissimilarity measures. International Journal of Pattern Recognition and Artificial Intelligence. ISSN 0218-0014. Vol. 23 (2015), pp. 1-180218-001410.1142/S0218001415530043metadata only accessinfo:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2023-08-03T09:50:33Zoai:repositorio.ipl.pt:10400.21/6144Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-19T20:15:19.960472Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse |
dc.title.none.fl_str_mv |
Text classification using compression-based dissimilarity measures |
title |
Text classification using compression-based dissimilarity measures |
spellingShingle |
Text classification using compression-based dissimilarity measures Coutinho, David Pereira Text classification Text similarity measures Relative entropy Ziv-Merhav method Cross-parsing algorithm |
title_short |
Text classification using compression-based dissimilarity measures |
title_full |
Text classification using compression-based dissimilarity measures |
title_fullStr |
Text classification using compression-based dissimilarity measures |
title_full_unstemmed |
Text classification using compression-based dissimilarity measures |
title_sort |
Text classification using compression-based dissimilarity measures |
author |
Coutinho, David Pereira |
author_facet |
Coutinho, David Pereira Figueiredo, Mário A. T. |
author_role |
author |
author2 |
Figueiredo, Mário A. T. |
author2_role |
author |
dc.contributor.none.fl_str_mv |
RCIPL |
dc.contributor.author.fl_str_mv |
Coutinho, David Pereira Figueiredo, Mário A. T. |
dc.subject.por.fl_str_mv |
Text classification Text similarity measures Relative entropy Ziv-Merhav method Cross-parsing algorithm |
topic |
Text classification Text similarity measures Relative entropy Ziv-Merhav method Cross-parsing algorithm |
description |
Arguably, the most difficult task in text classification is to choose an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering, by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that it approximates, sometimes even outperforms previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering. |
publishDate |
2015 |
dc.date.none.fl_str_mv |
2015-08 2015-08-01T00:00:00Z 2016-05-03T10:47:14Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10400.21/6144 |
url |
http://hdl.handle.net/10400.21/6144 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
COUTINHO, David Pereira; FIGUEIREDO, Mário A. T. - Text classification using compression-based dissimilarity measures. International Journal of Pattern Recognition and Artificial Intelligence. ISSN 0218-0014. Vol. 23 (2015), pp. 1-18 0218-0014 10.1142/S0218001415530043 |
dc.rights.driver.fl_str_mv |
metadata only access info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
metadata only access |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
World Scientific Publications CO PTE LTD |
publisher.none.fl_str_mv |
World Scientific Publications CO PTE LTD |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799133412018094080 |