Balanceamento de dados com base em oversampling em dados transformados

Bibliographic details
Main author: Maione, Camila
Publication date: 2020
Document type: Thesis
Language: por
Source title: Repositório Institucional da UFG
dARK ID: ark:/38995/00130000048pb
Full text: http://repositorio.bc.ufg.br/tede/handle/tede/10943
Abstract: Introduction: The efficiency and reliability of data analyses depend heavily on the quality of the analyzed data. The process of preparing databases to make them cleaner, more representative and of higher quality is called data preprocessing, and data balancing is performed during this stage. Data balancing matters because several classification models commonly employed in enterprise and academic projects are designed to work with balanced data sets, and several factors that hinder classification performance are associated with data imbalance. Objective: A new approach to data balancing is proposed, based on data transformation combined with resampling of the transformed data. The approach transforms the input variables of the original data set into new ones, which alters the position of the data samples in the feature space and, consequently, the choices that SMOTE-based resampling algorithms make about the initial samples, their nearest neighbours and where to place the generated synthetic samples. Methods: An initial implementation based on Principal Component Analysis (PCA) and SMOTE, called PCA-SMOTE, is presented. To test the quality of the balancing performed by PCA-SMOTE, twelve test data sets were balanced with PCA-SMOTE and with three other popular data balancing methods, and the performance of three classification models trained on the balanced sets was assessed and compared. Results: Several classification models trained on data sets balanced with the proposed method achieved higher or similar performance measures compared with the same models trained on data sets balanced with the other evaluated algorithms, such as Borderline-SMOTE, Safe-Level-SMOTE and ADASYN. Conclusion: The satisfactory results obtained demonstrate the potential of the proposed algorithm to improve the learning of classifiers on imbalanced data sets.
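The idea summarized in the abstract (transform the input variables, then oversample in the transformed space) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the `n_components` and `k` parameters, the binary-class setting and the choice to return the balanced data in PCA space are all assumptions made for illustration. Note that keeping every principal component would only rotate the data and leave Euclidean nearest neighbours unchanged, so the sketch assumes the projection is truncated to fewer components.

```python
import numpy as np


def pca_transform(X, n_components):
    """Center X and project it onto its first n_components principal axes."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def smote(X_min, n_new, k=5, rng=None):
    """Basic SMOTE: interpolate between a minority sample and one of its
    k nearest minority neighbours (Euclidean distance)."""
    rng = np.random.default_rng(rng)
    if len(X_min) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    k = min(k, len(X_min) - 1)
    # Pairwise distances between minority samples; a sample is not its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


def pca_smote(X, y, minority_label, n_components, k=5, rng=None):
    """Hypothetical PCA-then-SMOTE pipeline for a binary problem.

    Returns the data set in the transformed (PCA) space, with enough
    synthetic minority samples appended to match the majority count.
    """
    Z = pca_transform(X, n_components)
    Z_min = Z[y == minority_label]
    n_new = max(0, int((y != minority_label).sum()) - len(Z_min))
    Z_new = smote(Z_min, n_new, k=k, rng=rng)
    Z_bal = np.vstack([Z, Z_new])
    y_bal = np.concatenate([y, np.full(n_new, minority_label)])
    return Z_bal, y_bal
```

A call such as `pca_smote(X, y, minority_label=1, n_components=2)` returns the projected samples followed by the synthetic ones. Whether the thesis maps synthetic samples back to the original variables is not stated in the abstract, so the sketch leaves them in the transformed space.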
id UFG-2_54b23e40faceab60d6a4515c992b723b
oai_identifier_str oai:repositorio.bc.ufg.br:tede/10943
network_acronym_str UFG-2
network_name_str Repositório Institucional da UFG
repository_id_str
spelling Submitted by Onia Arantes Albuquerque (onia.ufg@gmail.com) on 2020-11-25T13:30:57Z. Approved for entry into archive by Luciana Ferreira (lucgeral@gmail.com) on 2020-11-26T11:54:36Z (GMT). Made available in DSpace on 2020-11-26T11:54:36Z (GMT). No. of bitstreams: 2. Tese - Camila Maione - 2020.pdf: 3971430 bytes, checksum: 772603443763c250c13977717736fd41 (MD5). license_rdf: 805 bytes, checksum: 4460e5956bc1d1639be9ae6146a50347 (MD5). Previous issue date: 2020-08-17. Sponsor: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES.
dc.title.pt_BR.fl_str_mv Balanceamento de dados com base em oversampling em dados transformados
dc.title.alternative.eng.fl_str_mv Data balancing based on oversampling on transformed data
title Balanceamento de dados com base em oversampling em dados transformados
spellingShingle Balanceamento de dados com base em oversampling em dados transformados
Maione, Camila
Mineração de dados
Classificação de dados
Aprendizagem de máquina
Balanceamento de dados
Transformação de dados
Pré-processamento de dados
Data mining
Data classification
Machine learning
Imbalanced data
Data transformation
Data preprocessing
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
title_short Balanceamento de dados com base em oversampling em dados transformados
title_full Balanceamento de dados com base em oversampling em dados transformados
title_fullStr Balanceamento de dados com base em oversampling em dados transformados
title_full_unstemmed Balanceamento de dados com base em oversampling em dados transformados
title_sort Balanceamento de dados com base em oversampling em dados transformados
author Maione, Camila
author_facet Maione, Camila
author_role author
dc.contributor.advisor1.fl_str_mv Barbosa, Rommel Melgaço
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/6228227125338610
dc.contributor.referee1.fl_str_mv Barbosa, Rommel Melgaço
dc.contributor.referee2.fl_str_mv Leitão Júnior, Plínio
dc.contributor.referee3.fl_str_mv Costa, Ronaldo Martins da
dc.contributor.referee4.fl_str_mv Costa, Ana Paula Cabral Seixas
dc.contributor.referee5.fl_str_mv Lozano, Kátia Kelvis Cassiano
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/3960167225526655
dc.contributor.author.fl_str_mv Maione, Camila
contributor_str_mv Barbosa, Rommel Melgaço
Barbosa, Rommel Melgaço
Leitão Júnior, Plínio
Costa, Ronaldo Martins da
Costa, Ana Paula Cabral Seixas
Lozano, Kátia Kelvis Cassiano
dc.subject.por.fl_str_mv Mineração de dados
Classificação de dados
Aprendizagem de máquina
Balanceamento de dados
Transformação de dados
Pré-processamento de dados
topic Mineração de dados
Classificação de dados
Aprendizagem de máquina
Balanceamento de dados
Transformação de dados
Pré-processamento de dados
Data mining
Data classification
Machine learning
Imbalanced data
Data transformation
Data preprocessing
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
dc.subject.eng.fl_str_mv Data mining
Data classification
Machine learning
Imbalanced data
Data transformation
Data preprocessing
dc.subject.cnpq.fl_str_mv CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
description Introduction: The efficiency and reliability of data analyses depend heavily on the quality of the analyzed data. The process of preparing databases to make them cleaner, more representative and of higher quality is called data preprocessing, and data balancing is performed during this stage. Data balancing matters because several classification models commonly employed in enterprise and academic projects are designed to work with balanced data sets, and several factors that hinder classification performance are associated with data imbalance. Objective: A new approach to data balancing is proposed, based on data transformation combined with resampling of the transformed data. The approach transforms the input variables of the original data set into new ones, which alters the position of the data samples in the feature space and, consequently, the choices that SMOTE-based resampling algorithms make about the initial samples, their nearest neighbours and where to place the generated synthetic samples. Methods: An initial implementation based on Principal Component Analysis (PCA) and SMOTE, called PCA-SMOTE, is presented. To test the quality of the balancing performed by PCA-SMOTE, twelve test data sets were balanced with PCA-SMOTE and with three other popular data balancing methods, and the performance of three classification models trained on the balanced sets was assessed and compared. Results: Several classification models trained on data sets balanced with the proposed method achieved higher or similar performance measures compared with the same models trained on data sets balanced with the other evaluated algorithms, such as Borderline-SMOTE, Safe-Level-SMOTE and ADASYN. Conclusion: The satisfactory results obtained demonstrate the potential of the proposed algorithm to improve the learning of classifiers on imbalanced data sets.
publishDate 2020
dc.date.accessioned.fl_str_mv 2020-11-26T11:54:36Z
dc.date.available.fl_str_mv 2020-11-26T11:54:36Z
dc.date.issued.fl_str_mv 2020-08-17
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv MAIONE, C. Balanceamento de dados com base em oversampling em dados transformados. 2020. 135 f. Tese (Doutorado em Ciência da Computação em Rede) - Universidade Federal de Goiás, Goiânia, 2020.
dc.identifier.uri.fl_str_mv http://repositorio.bc.ufg.br/tede/handle/tede/10943
dc.identifier.dark.fl_str_mv ark:/38995/00130000048pb
identifier_str_mv MAIONE, C. Balanceamento de dados com base em oversampling em dados transformados. 2020. 135 f. Tese (Doutorado em Ciência da Computação em Rede) - Universidade Federal de Goiás, Goiânia, 2020.
ark:/38995/00130000048pb
url http://repositorio.bc.ufg.br/tede/handle/tede/10943
dc.language.iso.fl_str_mv por
language por
dc.relation.program.fl_str_mv 20
dc.relation.confidence.fl_str_mv 500
500
500
500
dc.relation.department.fl_str_mv 26
dc.relation.cnpq.fl_str_mv 184
dc.relation.sponsorship.fl_str_mv 1
dc.rights.driver.fl_str_mv Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Goiás
dc.publisher.program.fl_str_mv Programa de Pós-graduação em Ciência da Computação em Rede UFG/UFMS (INF)
dc.publisher.initials.fl_str_mv UFG
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv Instituto de Informática - INF (RG)
publisher.none.fl_str_mv Universidade Federal de Goiás
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFG
instname:Universidade Federal de Goiás (UFG)
instacron:UFG
instname_str Universidade Federal de Goiás (UFG)
instacron_str UFG
institution UFG
reponame_str Repositório Institucional da UFG
collection Repositório Institucional da UFG
bitstream.url.fl_str_mv http://repositorio.bc.ufg.br/tede/bitstreams/68e77b47-0fd9-4804-8db7-1ea60def3bb5/download
http://repositorio.bc.ufg.br/tede/bitstreams/9d3174f5-ba5b-429d-9f71-a87d98b7b59d/download
http://repositorio.bc.ufg.br/tede/bitstreams/bbf37246-8354-4971-802b-db9af68a6012/download
bitstream.checksum.fl_str_mv 772603443763c250c13977717736fd41
8a4605be74aa9ea9d79846c1fba20a33
4460e5956bc1d1639be9ae6146a50347
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFG - Universidade Federal de Goiás (UFG)
repository.mail.fl_str_mv tasesdissertacoes.bc@ufg.br
_version_ 1815172555192926208