Balanceamento de dados com base em oversampling em dados transformados
Main author: | Maione, Camila |
---|---|
Publication date: | 2020 |
Document type: | Thesis |
Language: | por |
Source title: | Repositório Institucional da UFG |
dARK ID: | ark:/38995/00130000048pb |
Full text: | http://repositorio.bc.ufg.br/tede/handle/tede/10943 |
Abstract: | Introduction: The efficiency and reliability of data analyses depend heavily on the quality of the analyzed data. Data preprocessing is the fundamental process of preparing databases to make them cleaner, more representative, and of higher quality; data balancing is performed during this process. Data balancing matters because several classification models commonly employed in enterprise and academic projects are designed to work with balanced data sets, and several factors that hinder classification performance are associated with data imbalance. Objective: A new approach to data balancing is proposed, based on data transformation combined with resampling of the transformed data. The approach transforms the original data set by mapping its input variables into new ones, thereby altering the positions of the data samples in the feature space and, consequently, the choices that SMOTE-based resampling algorithms make about the initial samples, their nearest neighbours, and where to place the generated synthetic samples. Methods: An initial implementation based on Principal Component Analysis (PCA) and SMOTE, called PCA-SMOTE, is presented. To test the quality of the balancing performed by PCA-SMOTE, twelve test data sets were balanced using PCA-SMOTE and three other popular data balancing methods, and the performance of three classification models trained on these balanced sets was assessed and compared. Results: Several classification models trained on data sets balanced with the proposed method showed higher or similar performance measures compared with the same models trained on data sets balanced with the other evaluated algorithms, such as Borderline-SMOTE, Safe-Level-SMOTE and ADASYN. Conclusion: The satisfactory results obtained demonstrate the potential of the proposed algorithm to improve the learning of classifiers on imbalanced data sets. |
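The record does not include the thesis's implementation, so the following is only a minimal sketch of the idea the abstract describes: project the data with PCA, then run SMOTE-style interpolation among minority samples in the projected space. All function names (`pca_fit`, `project`, `smote`, `pca_smote`) are hypothetical. The sketch assumes numeric features, a binary problem, at least two minority samples, and that the balanced set is kept in component space; none of these choices is confirmed by the source.

```python
import math
import random


def pca_fit(rows, n_components):
    """Fit PCA by Jacobi eigen-decomposition of the sample covariance matrix.

    Returns (mean, components); components holds the n_components unit-length
    eigenvectors with the largest eigenvalues.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    a = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
         for i in range(d)]
    v = [[float(i == j) for j in range(d)] for i in range(d)]  # eigenvectors in columns
    for _ in range(50):  # Jacobi sweeps; small matrices converge quickly
        if sum(a[i][j] ** 2 for i in range(d) for j in range(d) if i != j) < 1e-20:
            break
        for p in range(d - 1):
            for q in range(p + 1, d):
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):  # right-multiply by the rotation (columns p, q)
                    for row in m:
                        row[p], row[q] = c * row[p] - s * row[q], s * row[p] + c * row[q]
                # left-multiply a by the transposed rotation (rows p, q)
                a[p], a[q] = ([c * x - s * y for x, y in zip(a[p], a[q])],
                              [s * x + c * y for x, y in zip(a[p], a[q])])
    order = sorted(range(d), key=lambda j: a[j][j], reverse=True)[:n_components]
    return mean, [[v[i][j] for i in range(d)] for j in order]


def project(x, mean, components):
    """Map one sample onto the principal components."""
    return [sum((xi - mi) * ci for xi, mi, ci in zip(x, mean, comp))
            for comp in components]


def smote(minority, n_new, k, rng):
    """Create n_new points by interpolating towards one of the k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not base),
                            key=lambda m: math.dist(base, m))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


def pca_smote(X, y, minority_label, n_components, k=5, seed=0):
    """Project X with PCA, then oversample the minority class until classes are equal."""
    rng = random.Random(seed)
    mean, components = pca_fit(X, n_components)
    Z = [project(x, mean, components) for x in X]
    minority = [z for z, label in zip(Z, y) if label == minority_label]
    n_new = (len(y) - len(minority)) - len(minority)  # assumption: two classes only
    return Z + smote(minority, n_new, k, rng), list(y) + [minority_label] * n_new
```

Calling `pca_smote(X, y, minority_label=1, n_components=2)` returns the projected samples followed by the synthetic ones, together with the extended label list. Because neighbours are searched after the projection, dropping components can change which minority samples count as nearest, which is the effect the abstract attributes to transforming the data before resampling.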
id |
UFG-2_54b23e40faceab60d6a4515c992b723b |
---|---|
oai_identifier_str |
oai:repositorio.bc.ufg.br:tede/10943 |
network_acronym_str |
UFG-2 |
network_name_str |
Repositório Institucional da UFG |
repository_id_str |
|
spelling |
Submitted by Onia Arantes Albuquerque (onia.ufg@gmail.com) on 2020-11-25T13:30:57Z. Approved for entry into archive by Luciana Ferreira (lucgeral@gmail.com) on 2020-11-26T11:54:36Z (GMT). Made available in DSpace on 2020-11-26T11:54:36Z (GMT). No. of bitstreams: 2. Tese - Camila Maione - 2020.pdf: 3971430 bytes, checksum: 772603443763c250c13977717736fd41 (MD5). license_rdf: 805 bytes, checksum: 4460e5956bc1d1639be9ae6146a50347 (MD5). Previous issue date: 2020-08-17. Sponsor: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES. |
dc.title.pt_BR.fl_str_mv |
Balanceamento de dados com base em oversampling em dados transformados |
dc.title.alternative.eng.fl_str_mv |
Data balancing based on oversampling on transformed data |
title |
Balanceamento de dados com base em oversampling em dados transformados |
author |
Maione, Camila |
author_facet |
Maione, Camila |
author_role |
author |
dc.contributor.advisor1.fl_str_mv |
Barbosa, Rommel Melgaço |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/6228227125338610 |
dc.contributor.referee1.fl_str_mv |
Barbosa, Rommel Melgaço |
dc.contributor.referee2.fl_str_mv |
Leitão Júnior, Plínio |
dc.contributor.referee3.fl_str_mv |
Costa, Ronaldo Martins da |
dc.contributor.referee4.fl_str_mv |
Costa, Ana Paula Cabral Seixas |
dc.contributor.referee5.fl_str_mv |
Lozano, Kátia Kelvis Cassiano |
dc.contributor.authorLattes.fl_str_mv |
http://lattes.cnpq.br/3960167225526655 |
dc.contributor.author.fl_str_mv |
Maione, Camila |
contributor_str_mv |
Barbosa, Rommel Melgaço; Barbosa, Rommel Melgaço; Leitão Júnior, Plínio; Costa, Ronaldo Martins da; Costa, Ana Paula Cabral Seixas; Lozano, Kátia Kelvis Cassiano |
dc.subject.por.fl_str_mv |
Mineração de dados; Classificação de dados; Aprendizagem de máquina; Balanceamento de dados; Transformação de dados; Pré-processamento de dados |
topic |
Mineração de dados; Classificação de dados; Aprendizagem de máquina; Balanceamento de dados; Transformação de dados; Pré-processamento de dados; Data mining; Data classification; Machine learning; Imbalanced data; Data transformation; Data preprocessing; CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
dc.subject.eng.fl_str_mv |
Data mining; Data classification; Machine learning; Imbalanced data; Data transformation; Data preprocessing |
dc.subject.cnpq.fl_str_mv |
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
publishDate |
2020 |
dc.date.accessioned.fl_str_mv |
2020-11-26T11:54:36Z |
dc.date.available.fl_str_mv |
2020-11-26T11:54:36Z |
dc.date.issued.fl_str_mv |
2020-08-17 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
MAIONE, C. Balanceamento de dados com base em oversampling em dados transformados. 2020. 135 f. Tese (Doutorado em Ciência da Computação em Rede) - Universidade Federal de Goiás, Goiânia, 2020. |
dc.identifier.uri.fl_str_mv |
http://repositorio.bc.ufg.br/tede/handle/tede/10943 |
dc.identifier.dark.fl_str_mv |
ark:/38995/00130000048pb |
identifier_str_mv |
MAIONE, C. Balanceamento de dados com base em oversampling em dados transformados. 2020. 135 f. Tese (Doutorado em Ciência da Computação em Rede) - Universidade Federal de Goiás, Goiânia, 2020. ark:/38995/00130000048pb |
url |
http://repositorio.bc.ufg.br/tede/handle/tede/10943 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.relation.program.fl_str_mv |
20 |
dc.relation.confidence.fl_str_mv |
500 500 500 500 |
dc.relation.department.fl_str_mv |
26 |
dc.relation.cnpq.fl_str_mv |
184 |
dc.relation.sponsorship.fl_str_mv |
1 |
dc.rights.driver.fl_str_mv |
Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de Goiás |
dc.publisher.program.fl_str_mv |
Programa de Pós-graduação em Ciência da Computação em Rede UFG/UFMS (INF) |
dc.publisher.initials.fl_str_mv |
UFG |
dc.publisher.country.fl_str_mv |
Brasil |
dc.publisher.department.fl_str_mv |
Instituto de Informática - INF (RG) |
publisher.none.fl_str_mv |
Universidade Federal de Goiás |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFG instname:Universidade Federal de Goiás (UFG) instacron:UFG |
instname_str |
Universidade Federal de Goiás (UFG) |
instacron_str |
UFG |
institution |
UFG |
reponame_str |
Repositório Institucional da UFG |
collection |
Repositório Institucional da UFG |
bitstream.url.fl_str_mv |
http://repositorio.bc.ufg.br/tede/bitstreams/68e77b47-0fd9-4804-8db7-1ea60def3bb5/download http://repositorio.bc.ufg.br/tede/bitstreams/9d3174f5-ba5b-429d-9f71-a87d98b7b59d/download http://repositorio.bc.ufg.br/tede/bitstreams/bbf37246-8354-4971-802b-db9af68a6012/download |
bitstream.checksum.fl_str_mv |
772603443763c250c13977717736fd41 8a4605be74aa9ea9d79846c1fba20a33 4460e5956bc1d1639be9ae6146a50347 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFG - Universidade Federal de Goiás (UFG) |
repository.mail.fl_str_mv |
tasesdissertacoes.bc@ufg.br |
_version_ |
1815172555192926208 |