Normalização textual e indexação semântica aplicadas da filtragem de SMS spam (Text normalization and semantic indexing to enhance SMS spam filtering)

Bibliographic details
Main author: Silva, Tiago Pasqualini da
Publication date: 2016
Document type: Master's thesis
Language: Portuguese (por)
Source title: Repositório Institucional da UFSCAR
Full text: https://repositorio.ufscar.br/handle/ufscar/8811
Abstract: The rapid popularization of smartphones has contributed to the growth of SMS as an alternative means of communication. The increasing number of users, along with the trust they inherently place in their devices, makes SMS an attractive environment for spammers. Reports indicate that the volume of mobile phone spam is increasing dramatically year by year. SMS spam is a challenging problem for traditional filtering methods, since such messages are usually short and rife with slang, idioms, symbols, and acronyms that make even tokenization difficult. In this scenario, this thesis proposes and evaluates a method to normalize and expand short, noisy SMS messages in order to obtain better attributes and improve classification performance. The proposed text processing approach is based on lexicographic and semantic dictionaries combined with state-of-the-art techniques for semantic analysis and context detection. It normalizes terms and creates new attributes to modify and expand the original text samples, with the aim of mitigating factors that can degrade the algorithms' performance, such as redundancies and inconsistencies. The approach was validated on a public, real, non-encoded dataset using several established machine learning methods. The experiments were designed to ensure statistically sound results, which indicate that the proposed text processing techniques can in fact improve SMS spam filtering.
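The normalize-then-expand idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation. The `SLANG` and `CONCEPTS` tables and all function names are hypothetical stand-ins for the lexicographic and semantic dictionaries the abstract mentions, and the context-detection step is omitted entirely.

```python
import re

# Hypothetical toy lexicon mapping SMS slang to standard forms (assumption;
# the thesis relies on real lexicographic dictionaries).
SLANG = {"u": "you", "ur": "your", "2": "to", "txt": "text",
         "pls": "please", "msg": "message"}

# Hypothetical toy concept map standing in for a semantic dictionary.
CONCEPTS = {"win": ["gain", "prize"], "free": ["gratis"], "call": ["phone"]}


def tokenize(message):
    """Lowercase the message and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", message.lower())


def normalize(tokens):
    """Replace each slang token with its dictionary form, if one is known."""
    return [SLANG.get(token, token) for token in tokens]


def expand(tokens):
    """Append related concepts as extra attributes after the original tokens."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(CONCEPTS.get(token, []))
    return expanded


def preprocess(message):
    """Tokenize, normalize, then expand a raw SMS message."""
    return expand(normalize(tokenize(message)))


if __name__ == "__main__":
    print(preprocess("Pls txt 2 WIN ur FREE prize!"))
```

The resulting token list would then feed a bag-of-words classifier. A fuller version would need a disambiguation step so that only concepts matching the message context are added.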
id SCAR_3fefc04e11bbcef9a900386ce9d85b15
oai_identifier_str oai:repositorio.ufscar.br:ufscar/8811
network_acronym_str SCAR
network_name_str Repositório Institucional da UFSCAR
repository_id_str 4322
dc.title.por.fl_str_mv Normalização textual e indexação semântica aplicadas da filtragem de SMS spam
dc.title.alternative.eng.fl_str_mv Text normalization and semantic indexing to enhance SMS spam filtering
author Silva, Tiago Pasqualini da
author_facet Silva, Tiago Pasqualini da
author_role author
dc.contributor.authorlattes.por.fl_str_mv http://lattes.cnpq.br/4030198351353056
dc.contributor.author.fl_str_mv Silva, Tiago Pasqualini da
dc.contributor.advisor1.fl_str_mv Almeida, Tiago Agostinho de
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/5368680512020633
dc.contributor.authorID.fl_str_mv 965803df-e76e-4694-be65-66986e456d19
contributor_str_mv Almeida, Tiago Agostinho de
dc.subject.por.fl_str_mv Smartphones
Aplicativos móveis
Processamento de linguagem natural (Computação)
Filtragem de SMS spam
Aprendizado de máquina
Categorização de texto
dc.subject.eng.fl_str_mv Mobile apps
Natural language processing (Computer science)
SMS spam filtering
Text categorization
Machine learning
dc.subject.cnpq.fl_str_mv CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::METODOLOGIA E TECNICAS DA COMPUTACAO
publishDate 2016
dc.date.issued.fl_str_mv 2016-07-01
dc.date.accessioned.fl_str_mv 2017-06-01T17:49:38Z
dc.date.available.fl_str_mv 2017-06-01T17:49:38Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv SILVA, Tiago Pasqualini da. Normalização textual e indexação semântica aplicadas da filtragem de SMS spam. 2016. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, Sorocaba, 2016. Disponível em: https://repositorio.ufscar.br/handle/ufscar/8811.
dc.identifier.uri.fl_str_mv https://repositorio.ufscar.br/handle/ufscar/8811
dc.language.iso.fl_str_mv por
language por
dc.relation.confidence.fl_str_mv 600
dc.relation.authority.fl_str_mv 5de967ad-743c-4f36-972b-79dd683c0e9d
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de São Carlos
Câmpus Sorocaba
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Ciência da Computação - PPGCC-So
dc.publisher.initials.fl_str_mv UFSCar
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFSCAR
instname:Universidade Federal de São Carlos (UFSCAR)
instacron:UFSCAR
instname_str Universidade Federal de São Carlos (UFSCAR)
instacron_str UFSCAR
institution UFSCAR
reponame_str Repositório Institucional da UFSCAR
collection Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv https://repositorio.ufscar.br/bitstream/ufscar/8811/1/SILVA_Tiago_2016.pdf
https://repositorio.ufscar.br/bitstream/ufscar/8811/2/license.txt
https://repositorio.ufscar.br/bitstream/ufscar/8811/3/SILVA_Tiago_2016.pdf.txt
https://repositorio.ufscar.br/bitstream/ufscar/8811/4/SILVA_Tiago_2016.pdf.jpg
bitstream.checksum.fl_str_mv 7774c3913aa556cc48c0669f686cd3b5
ae0398b6f8b235e40ad82cba6c50031d
34c28c8d63cc2a1d8ea05dfa01f52895
f26647f02cc65fd29849b43a8e3714d5
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
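The MD5 values listed above can be used to check that a downloaded bitstream arrived intact. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes the file has already been saved locally.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 digest with the value published in the record."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_checksum("SILVA_Tiago_2016.pdf", "7774c3913aa556cc48c0669f686cd3b5")` would compare a local copy of the thesis PDF against the first checksum in the record. MD5 is adequate for detecting accidental corruption but not deliberate tampering.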
repository.name.fl_str_mv Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
_version_ 1802136324245815296