COMPRESSED LEARNING FOR TEXT CATEGORIZATION

Bibliographic details
Main author: Ferreira, Artur
Publication date: 2013
Other authors: Figueiredo, Mario
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: https://doi.org/10.34629/ipl.isel.i-ETC.3
Abstract: In text classification based on the bag-of-words (BoW) or similar representations, we usually have a large number of features, many of which are irrelevant (or even detrimental) for classification tasks. Recent results show that compressed learning (CL), i.e., learning in a domain of reduced dimensionality obtained by random projections (RP), is possible, and theoretical bounds on the test set error rate have been shown. In this work, we assess the performance of CL, based on RP of BoW representations for text classification. Our experimental results show that CL significantly reduces the number of features and the training time, while simultaneously improving the classification accuracy. Rather than the mild decrease in accuracy upper bounded by the theory, we actually find an increase in accuracy. Our approach is further compared against two techniques, namely the unsupervised random subspaces method and the supervised Fisher index. The CL approach is suited for unsupervised or semi-supervised learning, without any modification, since it does not use the class labels.
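Illustrative sketch (not taken from the article): the three dimensionality-reduction routes named in the abstract can be contrasted on a toy corpus. Everything below is an assumption made for illustration: the four example documents, the Gaussian N(0, 1/k) projection matrix, the target dimension k = 3, and the two-class Fisher score (squared mean difference over summed variances), which is one common form and may differ from the definition the authors use. The classifier stage (the keywords mention support vector machines) is left out.

```python
import math
import random


def bag_of_words(docs, vocab):
    """Count how often each vocabulary term occurs in each document."""
    return [[doc.split().count(term) for term in vocab] for doc in docs]


def random_projection_matrix(k, d, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries (assumed RP construction)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional BoW vector to k dimensions; uses no class labels."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def random_subspace(d, k, rng):
    """Unsupervised baseline: keep k of the d original features at random."""
    return sorted(rng.sample(range(d), k))


def fisher_index(X, y, eps=1e-12):
    """Supervised baseline: per-feature two-class Fisher score (assumed form)."""
    scores = []
    for j in range(len(X[0])):
        groups = {0: [], 1: []}
        for row, label in zip(X, y):
            groups[label].append(row[j])
        means = {c: sum(v) / len(v) for c, v in groups.items()}
        variances = {
            c: sum((t - means[c]) ** 2 for t in v) / len(v)
            for c, v in groups.items()
        }
        scores.append(
            (means[0] - means[1]) ** 2 / (variances[0] + variances[1] + eps)
        )
    return scores


# Hypothetical toy corpus: two "sports" and two "finance" documents.
docs = [
    "goal match team goal",
    "team match win",
    "stock market price",
    "market price stock stock",
]
labels = [0, 0, 1, 1]
vocab = sorted({w for doc in docs for w in doc.split()})
X = bag_of_words(docs, vocab)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
k = 3

# Compressed learning: every new feature mixes all original features.
R = random_projection_matrix(k, len(vocab), rng)
X_rp = [project(R, x) for x in X]

# Random subspaces: a random subset of the original features.
kept = random_subspace(len(vocab), k, rng)
X_rs = [[x[j] for j in kept] for x in X]

# Fisher index: rank features using the labels and keep the top k.
scores = fisher_index(X, labels)
top = sorted(range(len(vocab)), key=lambda j: scores[j], reverse=True)[:k]
X_fi = [[x[j] for j in top] for x in X]
```

Only `fisher_index` reads `labels`; the projection and the random subspace are computed from the features alone, which is the property the abstract points to when it calls CL suitable for unsupervised or semi-supervised settings.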
id RCAP_8ac0e759f37ff52d93cf694a48d5bc7b
oai_identifier_str oai:i-ETC.journals.isel.pt:article/3
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv COMPRESSED LEARNING FOR TEXT CATEGORIZATION
author Ferreira, Artur
author_facet Ferreira, Artur
Figueiredo, Mario
author_role author
author2 Figueiredo, Mario
author2_role author
dc.contributor.author.fl_str_mv Ferreira, Artur
Figueiredo, Mario
dc.subject.por.fl_str_mv Computers; Machine Learning
random projections, random subspaces, compressed learning, text classification, support vector machines
publishDate 2013
dc.date.none.fl_str_mv 2013-06-26T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://doi.org/10.34629/ipl.isel.i-ETC.3
oai:i-ETC.journals.isel.pt:article/3
url https://doi.org/10.34629/ipl.isel.i-ETC.3
identifier_str_mv oai:i-ETC.journals.isel.pt:article/3
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv http://journals.isel.pt/index.php/i-ETC/article/view/3
https://doi.org/10.34629/ipl.isel.i-ETC.3
http://journals.isel.pt/index.php/i-ETC/article/view/3/3
dc.rights.driver.fl_str_mv Copyright (c) 2013 i-ETC : ISEL Academic Journal of Electronics Telecommunications and Computers
http://creativecommons.org/licenses/by-nc/4.0
info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv ISEL - High Institute of Engineering of Lisbon
dc.source.none.fl_str_mv i-ETC : ISEL Academic Journal of Electronics Telecommunications and Computers; Vol 2, No 1 (2013): The CETC2011 Issue; ID-1
2182-4010
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP