Data mining electronic health records of type 2 diabetes uncontrolled patients towards clustering LDL-cholesterol patterns

Bibliographic details
Main author: Petrovici, Mihai Daniel
Publication date: 2020
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10400.1/15204
Abstract: Cardiovascular diseases (CVD) are the leading cause of death worldwide; they constitute a risk factor for patients with diabetes and are simultaneously a consequence of dyslipidemia. Effective lipid management in patients with diabetes is still largely unattained and requires better awareness from both patients and healthcare professionals. To better understand the influence of clinical parameters on Low Density Lipoprotein (LDL)-cholesterol patterns in patients with uncontrolled type 2 diabetes, data mining techniques were applied to the Electronic Health Records (EHR) provided by APDP (Associação Protetora de Diabetes Portugal). The database content was first analyzed to assess data integrity and to avoid using corrupted values or misleading information from the EHR. The statistical distribution of each clinical parameter reported in the database was then examined to characterize its individual behavior and to enable a statistically coherent identification of the cohort to be used when modeling LDL. As a first approach, linear modeling of LDL was considered, using both ordinary least squares and stepwise approaches. Then, non-linear modeling of LDL was tested on the same populations used for linear modeling, to assess the most accurate and practical LDL model. The provided EHR included 32577 medical appointments held by 1767 patients between January 2008 and February 2018. More than 10 clinical features were studied, leading to the decision to limit the case-study population to patients who had at least 5 Medical Appointments (MA) during the decade. Of all MAs, 32% and 63% reported LDL and Glycated Hemoglobin (HbA1c) measurements, respectively, but some MAs did not report both simultaneously. Six linear models, relating different sets of 6 clinical parameters, were tested. Linear model 3, involving LDL, Total Cholesterol, HDL, Triglyceride, HbA1c and Platelets, was the elected linear model, with a Root Mean Square Error (RMSE) of 0.07.
The model where Platelets are substituted by Proteinuria presents an RMSE of just 0.054 but used only 38 case studies. Neural network-based modeling strategies were tested as an alternative to linear models, using the Multi-Objective Genetic Algorithm (MOGA). After data preprocessing, MOGA was run twice using different threshold values. Six models were developed considering different combinations of clinical parameters. For each model, the population was divided into 3 groups: 60% was used to train the network, 20% to test the model and the remaining 20% to validate it. Using the populations employed by each MOGA run, the stepwise algorithm was used to identify the relevance of each clinical parameter in the model and to create another linear model using this parameter set. The MOGA model with the best training performance was model 4, while model 2 performed best in validation, with an RMSE of 0.057. However, linear model 5, created using the parameter selection identified by the MOGA, presented an RMSE of 0.054 during validation when total cholesterol, HDL, triglyceride, HbA1c, microalbuminuria, creatinine, MDRD, sex and age are used in the composition of the LDL linear model. Therefore, we can conclude that LDL can be modeled by a linear model using 6 or 10 clinical variables with very low mean square error.
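The evaluation protocol described in the abstract (a 60/20/20 split, an ordinary least squares fit, and RMSE as the score) can be illustrated with a minimal sketch. It is not the thesis code: the data are synthetic, the two predictors `x1` and `x2` merely stand in for clinical parameters, the coefficients are arbitrary, and all function names are hypothetical. It assumes rows are shuffled at random before splitting, which the abstract does not state.

```python
import math
import random


def make_synthetic_rows(n, seed=0):
    """Return n rows (x1, x2, y); y is a noisy linear function of x1 and x2.

    Purely synthetic stand-in data: the predictors and coefficients are made up.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        y = 0.5 + 0.9 * x1 - 0.8 * x2 + rng.gauss(0.0, 0.05)
        rows.append((x1, x2, y))
    return rows


def split_60_20_20(rows, seed=0):
    """Shuffle the rows and split them into 60% train, 20% test, 20% validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 60 // 100
    n_test = len(rows) * 20 // 100
    return rows[:n_train], rows[n_train:n_train + n_test], rows[n_train + n_test:]


def fit_ols(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    A = [[1.0] + list(x) for x in X]  # prepend the intercept column
    k = len(A[0])
    # Augmented matrix [A^T A | A^T y]
    M = [
        [sum(a[i] * a[j] for a in A) for j in range(k)]
        + [sum(a[i] * t for a, t in zip(A, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * p for v, p in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]  # [intercept, w1, w2, ...]


def predict(w, x):
    """Evaluate the linear model with weights w at the feature vector x."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def rmse(w, X, y):
    """Root mean square error of the model w on the pairs (X, y)."""
    return math.sqrt(sum((predict(w, x) - t) ** 2 for x, t in zip(X, y)) / len(y))


if __name__ == "__main__":
    train, test, val = split_60_20_20(make_synthetic_rows(200, seed=1), seed=1)
    w = fit_ols([r[:-1] for r in train], [r[-1] for r in train])
    print("weights:", w)
    print("validation RMSE:", rmse(w, [r[:-1] for r in val], [r[-1] for r in val]))
```

Scoring on the held-out validation rows, as in the last line, is what makes RMSE values comparable between models fitted on the same population.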
id RCAP_0b630143c2ef9a91bc694416238a3485
oai_identifier_str oai:sapientia.ualg.pt:10400.1/15204
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling Cardiovascular diseases (CVD) remain the leading cause of death worldwide and constitute a risk factor for people with diabetes, who are in turn more prone to developing CVD. However, although recent guidelines cover CVD risk, effective lipid control is far from being achieved. In addition, lipid self-management, together with the management of therapeutic decisions, is not always given adequate priority by either patients or healthcare professionals. To better understand the influence of clinical parameters on low-density lipoprotein (LDL) cholesterol in patients with type 2 diabetes whose lipid management is suspected to be unstable, electronic health records (EHR) provided by APDP (Associação Protetora de Diabetes Portugal) were used for a study based on data mining techniques.(…)
dc.title.none.fl_str_mv Data mining electronic health records of type 2 diabetes uncontrolled patients towards clustering LDL-cholesterol patterns
author Petrovici, Mihai Daniel
author_facet Petrovici, Mihai Daniel
author_role author
dc.contributor.none.fl_str_mv Ruano, M. Graça
Ribeiro, Rogério José Tavares
Sapientia
dc.contributor.author.fl_str_mv Petrovici, Mihai Daniel
dc.subject.por.fl_str_mv Low-density lipoprotein
Diabetes
Data mining
Model
Scientific Domain/Area::Engineering and Technology::Other Engineering and Technologies
publishDate 2020
dc.date.none.fl_str_mv 2020-06-26
2020-06-26T00:00:00Z
2021-03-09T14:59:21Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.1/15204
TID:202663418
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
_version_ 1799133301705801728