Comparative study on the performance of different classification algorithms, combined with pre- and post-processing techniques to handle imbalanced data, in the diagnosis of adult patients with familial hypercholesterolemia

Detalhes bibliográficos
Autor(a) principal: Albuquerque, João
Data de Publicação: 2022
Outros Autores: Medeiros, Ana Margarida, Alves, Ana Catarina, Bourbon, Mafalda, Antunes, Marília
Tipo de documento: Artigo
Idioma: eng
Título da fonte: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Texto Completo: http://hdl.handle.net/10400.18/8384
Resumo: Familial Hypercholesterolemia (FH) is an inherited disorder of cholesterol metabolism. Current criteria for FH diagnosis, like Simon Broome (SB) criteria, lead to high false positive rates. The aim of this work was to explore alternative classification procedures for FH diagnosis, based on different biological and biochemical indicators. For this purpose, logistic regression (LR), naive Bayes classifier (NB), random forest (RF) and extreme gradient boosting (XGB) algorithms were combined with Synthetic Minority Oversampling Technique (SMOTE), or threshold adjustment by maximizing Youden index (YI), and compared. Data was tested through a 10 x 10 repeated k-fold cross validation design. The LR model presented an overall better performance, as assessed by the areas under the receiver operating characteristics (AUROC) and precision-recall (AUPRC) curves, and several operating characteristics (OC), regardless of the strategy to cope with class imbalance. When adopting either data processing technique, significantly higher accuracy (Acc), G-mean and F-1 score values were found for all classification algorithms, compared to SB criteria (p < 0.01), revealing a more balanced predictive ability for both classes, and higher effectiveness in classifying FH patients. Adjustment of the cut-off values through pre or post-processing methods revealed a considerable gain in sensitivity (Sens) values (p < 0.01). Although the performance of pre and post-processing strategies was similar, SMOTE does not cause model's parameters to loose interpretability. These results suggest a LR model combined with SMOTE can be an optimal approach to be used as a widespread screening tool.
id RCAP_60e5348cb7656a575031a5f6d6fd3430
oai_identifier_str oai:repositorio.insa.pt:10400.18/8384
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling Comparative study on the performance of different classification algorithms, combined with pre- and post-processing techniques to handle imbalanced data, in the diagnosis of adult patients with familial hypercholesterolemiaFamilial HypercholesterolemiaDiagnosisFH DiagnosisFH studyPortuguese FH studyDoenças Cardio e Cérebro-vascularesPortugalFamilial Hypercholesterolemia (FH) is an inherited disorder of cholesterol metabolism. Current criteria for FH diagnosis, like Simon Broome (SB) criteria, lead to high false positive rates. The aim of this work was to explore alternative classification procedures for FH diagnosis, based on different biological and biochemical indicators. For this purpose, logistic regression (LR), naive Bayes classifier (NB), random forest (RF) and extreme gradient boosting (XGB) algorithms were combined with Synthetic Minority Oversampling Technique (SMOTE), or threshold adjustment by maximizing Youden index (YI), and compared. Data was tested through a 10 x 10 repeated k-fold cross validation design. The LR model presented an overall better performance, as assessed by the areas under the receiver operating characteristics (AUROC) and precision-recall (AUPRC) curves, and several operating characteristics (OC), regardless of the strategy to cope with class imbalance. When adopting either data processing technique, significantly higher accuracy (Acc), G-mean and F-1 score values were found for all classification algorithms, compared to SB criteria (p < 0.01), revealing a more balanced predictive ability for both classes, and higher effectiveness in classifying FH patients. Adjustment of the cut-off values through pre or post-processing methods revealed a considerable gain in sensitivity (Sens) values (p < 0.01). Although the performance of pre and post-processing strategies was similar, SMOTE does not cause model's parameters to loose interpretability. These results suggest a LR model combined with SMOTE can be an optimal approach to be used as a widespread screening tool.The current work was supported by the programme Norte2020 (operação NORTE-08-5369-FSE-000018), awarded to JA, and by Fundação para a Ciência e Tecnologia (FCT), under the projects UID/MAT/00006/2019 and PTDC/SAU-SER/29180/2017.Public Library of ScienceRepositório Científico do Instituto Nacional de SaúdeAlbuquerque, JoãoMedeiros, Ana MargaridaAlves, Ana CatarinaBourbon, MafaldaAntunes, Marília2022-12-05T15:04:41Z2022-06-242022-06-24T00:00:00Zinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/articleapplication/pdfhttp://hdl.handle.net/10400.18/8384engPLoS One. 2022 Jun 24;17(6):e0269713. doi: 10.1371/journal.pone.0269713. eCollection 20221932-620310.1371/journal.pone.0269713info:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2023-07-20T15:42:31Zoai:repositorio.insa.pt:10400.18/8384Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-19T18:43:00.323412Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse
dc.title.none.fl_str_mv Comparative study on the performance of different classification algorithms, combined with pre- and post-processing techniques to handle imbalanced data, in the diagnosis of adult patients with familial hypercholesterolemia
title Comparative study on the performance of different classification algorithms, combined with pre- and post-processing techniques to handle imbalanced data, in the diagnosis of adult patients with familial hypercholesterolemia
spellingShingle Comparative study on the performance of different classification algorithms, combined with pre- and post-processing techniques to handle imbalanced data, in the diagnosis of adult patients with familial hypercholesterolemia
Albuquerque, João
Familial Hypercholesterolemia
Diagnosis
FH Diagnosis
FH study
Portuguese FH study
Doenças Cardio e Cérebro-vasculares
Portugal
title_short Comparative study on the performance of different classification algorithms, combined with pre- and post-processing techniques to handle imbalanced data, in the diagnosis of adult patients with familial hypercholesterolemia
title_full Comparative study on the performance of different classification algorithms, combined with pre- and post-processing techniques to handle imbalanced data, in the diagnosis of adult patients with familial hypercholesterolemia
title_fullStr Comparative study on the performance of different classification algorithms, combined with pre- and post-processing techniques to handle imbalanced data, in the diagnosis of adult patients with familial hypercholesterolemia
title_full_unstemmed Comparative study on the performance of different classification algorithms, combined with pre- and post-processing techniques to handle imbalanced data, in the diagnosis of adult patients with familial hypercholesterolemia
title_sort Comparative study on the performance of different classification algorithms, combined with pre- and post-processing techniques to handle imbalanced data, in the diagnosis of adult patients with familial hypercholesterolemia
author Albuquerque, João
author_facet Albuquerque, João
Medeiros, Ana Margarida
Alves, Ana Catarina
Bourbon, Mafalda
Antunes, Marília
author_role author
author2 Medeiros, Ana Margarida
Alves, Ana Catarina
Bourbon, Mafalda
Antunes, Marília
author2_role author
author
author
author
dc.contributor.none.fl_str_mv Repositório Científico do Instituto Nacional de Saúde
dc.contributor.author.fl_str_mv Albuquerque, João
Medeiros, Ana Margarida
Alves, Ana Catarina
Bourbon, Mafalda
Antunes, Marília
dc.subject.por.fl_str_mv Familial Hypercholesterolemia
Diagnosis
FH Diagnosis
FH study
Portuguese FH study
Doenças Cardio e Cérebro-vasculares
Portugal
topic Familial Hypercholesterolemia
Diagnosis
FH Diagnosis
FH study
Portuguese FH study
Doenças Cardio e Cérebro-vasculares
Portugal
description Familial Hypercholesterolemia (FH) is an inherited disorder of cholesterol metabolism. Current criteria for FH diagnosis, like Simon Broome (SB) criteria, lead to high false positive rates. The aim of this work was to explore alternative classification procedures for FH diagnosis, based on different biological and biochemical indicators. For this purpose, logistic regression (LR), naive Bayes classifier (NB), random forest (RF) and extreme gradient boosting (XGB) algorithms were combined with Synthetic Minority Oversampling Technique (SMOTE), or threshold adjustment by maximizing Youden index (YI), and compared. Data was tested through a 10 x 10 repeated k-fold cross validation design. The LR model presented an overall better performance, as assessed by the areas under the receiver operating characteristics (AUROC) and precision-recall (AUPRC) curves, and several operating characteristics (OC), regardless of the strategy to cope with class imbalance. When adopting either data processing technique, significantly higher accuracy (Acc), G-mean and F-1 score values were found for all classification algorithms, compared to SB criteria (p < 0.01), revealing a more balanced predictive ability for both classes, and higher effectiveness in classifying FH patients. Adjustment of the cut-off values through pre or post-processing methods revealed a considerable gain in sensitivity (Sens) values (p < 0.01). Although the performance of pre and post-processing strategies was similar, SMOTE does not cause model's parameters to loose interpretability. These results suggest a LR model combined with SMOTE can be an optimal approach to be used as a widespread screening tool.
publishDate 2022
dc.date.none.fl_str_mv 2022-12-05T15:04:41Z
2022-06-24
2022-06-24T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.18/8384
url http://hdl.handle.net/10400.18/8384
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv PLoS One. 2022 Jun 24;17(6):e0269713. doi: 10.1371/journal.pone.0269713. eCollection 2022
1932-6203
10.1371/journal.pone.0269713
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Public Library of Science
publisher.none.fl_str_mv Public Library of Science
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799132175880159232