SDRS: a new lossless dimensionality reduction for text corpora

Detalhes bibliográficos
Autor(a) principal: De Mendizabal, I. V.
Data de Publicação: 2020
Outros Autores: Basto-Fernandes, V., Ezpeleta, E, Méndez, J. R., Zurutuza, U.
Tipo de documento: Artigo
Idioma: eng
Título da fonte: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Texto Completo: http://hdl.handle.net/10071/20406
Resumo: In recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches by means of token-based representations of textual contents. After introducing multiple performance enhancements, the impact has been virtually irrelevant. Recent studies have introduced synset-based content representations as a reliable way to improve classification, as well as different forms to take advantage of semantic information to address problems, such as dimensionality reduction. These preliminary solutions present some limitations and enforce simplifications that must be gradually redefined in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (lossless). Synset-features can be semantically grouped by taking advantage of taxonomic relations (mainly hypernyms) provided by BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single features “anti-impotence drug”, “drug” or “chemical substance” depending on the generalization of 1, 2 or 3 levels). In order to decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA) and particularly, of the Non-dominated Sorting Genetic Algorithm (NSGA-II). We have compared the performance achieved by a Naïve Bayes classifier, using both token-based and synset-based dataset representations, with and without executing dimensional reductions. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to a better performance of Naïve Bayes classifiers.
id RCAP_bd0f2cced72222f15baea142ae8c16c4
oai_identifier_str oai:repositorio.iscte-iul.pt:10071/20406
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling SDRS: a new lossless dimensionality reduction for text corporaSpam filteringToken-based representationSynset-based representationSemantic-based feature reductionMulti-objective evolutionary algorithmsIn recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches by means of token-based representations of textual contents. After introducing multiple performance enhancements, the impact has been virtually irrelevant. Recent studies have introduced synset-based content representations as a reliable way to improve classification, as well as different forms to take advantage of semantic information to address problems, such as dimensionality reduction. These preliminary solutions present some limitations and enforce simplifications that must be gradually redefined in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (lossless). Synset-features can be semantically grouped by taking advantage of taxonomic relations (mainly hypernyms) provided by BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single features “anti-impotence drug”, “drug” or “chemical substance” depending on the generalization of 1, 2 or 3 levels). In order to decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA) and particularly, of the Non-dominated Sorting Genetic Algorithm (NSGA-II). We have compared the performance achieved by a Naïve Bayes classifier, using both token-based and synset-based dataset representations, with and without executing dimensional reductions. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to a better performance of Naïve Bayes classifiers.Elsevier2023-03-21T00:00:00Z2020-01-01T00:00:00Z20202020-04-22T15:53:39Zinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/articleapplication/pdfhttp://hdl.handle.net/10071/20406eng0306-457310.1016/j.ipm.2020.102249De Mendizabal, I. V.Basto-Fernandes, V.Ezpeleta, E,Méndez, J. R.Zurutuza, U.info:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2023-11-09T17:47:28Zoai:repositorio.iscte-iul.pt:10071/20406Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-19T22:23:00.933110Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse
dc.title.none.fl_str_mv SDRS: a new lossless dimensionality reduction for text corpora
title SDRS: a new lossless dimensionality reduction for text corpora
spellingShingle SDRS: a new lossless dimensionality reduction for text corpora
De Mendizabal, I. V.
Spam filtering
Token-based representation
Synset-based representation
Semantic-based feature reduction
Multi-objective evolutionary algorithms
title_short SDRS: a new lossless dimensionality reduction for text corpora
title_full SDRS: a new lossless dimensionality reduction for text corpora
title_fullStr SDRS: a new lossless dimensionality reduction for text corpora
title_full_unstemmed SDRS: a new lossless dimensionality reduction for text corpora
title_sort SDRS: a new lossless dimensionality reduction for text corpora
author De Mendizabal, I. V.
author_facet De Mendizabal, I. V.
Basto-Fernandes, V.
Ezpeleta, E,
Méndez, J. R.
Zurutuza, U.
author_role author
author2 Basto-Fernandes, V.
Ezpeleta, E,
Méndez, J. R.
Zurutuza, U.
author2_role author
author
author
author
dc.contributor.author.fl_str_mv De Mendizabal, I. V.
Basto-Fernandes, V.
Ezpeleta, E,
Méndez, J. R.
Zurutuza, U.
dc.subject.por.fl_str_mv Spam filtering
Token-based representation
Synset-based representation
Semantic-based feature reduction
Multi-objective evolutionary algorithms
topic Spam filtering
Token-based representation
Synset-based representation
Semantic-based feature reduction
Multi-objective evolutionary algorithms
description In recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches by means of token-based representations of textual contents. After introducing multiple performance enhancements, the impact has been virtually irrelevant. Recent studies have introduced synset-based content representations as a reliable way to improve classification, as well as different forms to take advantage of semantic information to address problems, such as dimensionality reduction. These preliminary solutions present some limitations and enforce simplifications that must be gradually redefined in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (lossless). Synset-features can be semantically grouped by taking advantage of taxonomic relations (mainly hypernyms) provided by BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single features “anti-impotence drug”, “drug” or “chemical substance” depending on the generalization of 1, 2 or 3 levels). In order to decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA) and particularly, of the Non-dominated Sorting Genetic Algorithm (NSGA-II). We have compared the performance achieved by a Naïve Bayes classifier, using both token-based and synset-based dataset representations, with and without executing dimensional reductions. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to a better performance of Naïve Bayes classifiers.
publishDate 2020
dc.date.none.fl_str_mv 2020-01-01T00:00:00Z
2020
2020-04-22T15:53:39Z
2023-03-21T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10071/20406
url http://hdl.handle.net/10071/20406
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 0306-4573
10.1016/j.ipm.2020.102249
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Elsevier
publisher.none.fl_str_mv Elsevier
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799134791859175424