Data mining electronic health records of type 2 diabetes uncontrolled patients towards clustering LDL-cholesterol patterns
Main author: | Petrovici, Mihai Daniel |
---|---|
Publication date: | 2020 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10400.1/15204 |
Abstract: | Cardiovascular diseases (CVD) are the leading cause of death worldwide, constituting a risk factor for patients with diabetes and simultaneously a consequence of dyslipidemia. Effective lipid management of patients with diabetes is still largely unattained, requiring better awareness from both patients and healthcare professionals. To better understand the influence of clinical parameters on low-density lipoprotein (LDL) cholesterol patterns of patients with uncontrolled type 2 diabetes, the electronic health records (EHR) provided by APDP (Associação Protetora de Diabetes Portugal) were subjected to data mining techniques. The database content was first analyzed to assess data integrity and to avoid using corrupted values or misleading information from the EHR. The statistical distribution of each clinical parameter reported in the database was examined to identify its individual behavior and to enable a statistically coherent identification of the cohort to be used when modeling LDL. As a first approach, linear modeling of LDL was considered, using both ordinary least squares and stepwise approaches. Then, non-linear modeling of LDL was tested, using the same populations employed in the linear modeling, to assess the most accurate and practical LDL model. The provided EHR included 32577 medical appointments held by 1767 patients between January 2008 and February 2018. More than 10 clinical features were studied, leading to the decision to limit the case-study population to patients who had at least 5 medical appointments (MA) during the decade. Of all MAs, 32% reported LDL and 63% reported glycated hemoglobin (HbA1c) measurements, but some MAs did not report both simultaneously. Six linear models, relating different sets of 6 clinical parameters, were tested. Linear model 3, involving LDL, total cholesterol, HDL, triglycerides, HbA1c and platelets, is the selected linear model, with a root mean square error (RMSE) of 0.07. The model in which platelets are replaced by proteinuria presents an RMSE of just 0.054 but used only 38 case studies. Neural network-based modeling strategies were tested as an alternative to linear models, using the Multi-Objective Genetic Algorithm (MOGA). After data preprocessing, MOGA was run twice using different threshold values. Six models were developed considering different combinations of clinical parameters. For each model, the population was divided into 3 groups: 60% was used to train the network, 20% to test the model and the remaining 20% to validate it. Using the populations employed by each MOGA run, the stepwise algorithm was used to identify the relevance of each clinical parameter in the model and to create another linear model using this parameter set. The MOGA model with the best training performance was model 4, while model 2 performed best in validation, with an RMSE of 0.057. However, linear model 5, created using the parameter selection identified by MOGA, presented an RMSE of 0.054 during validation when total cholesterol, HDL, triglycerides, HbA1c, microalbuminuria, creatinine, MDRD, sex and age are used in the composition of the LDL linear model. Therefore, we can conclude that LDL can be modeled by a linear model using 6 or 10 clinical variables with very low mean square error. |
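
The evaluation protocol summarized in the abstract (a 60/20/20 split, an ordinary least squares fit, and RMSE as the error measure) can be illustrated with a minimal Python sketch. This is not the thesis code: the data are synthetic, the coefficients used to generate them are invented for the demonstration, and every name (`split_60_20_20`, `fit_ols`, `rmse`, and so on) is hypothetical. The record does not state the units or scaling behind the reported RMSE values, so the numbers produced here are not comparable to them.

```python
import math
import random


def split_60_20_20(rows, seed=0):
    """Shuffle rows and split them into train/test/validation parts (60/20/20)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 60 // 100
    n_test = len(rows) * 20 // 100
    return rows[:n_train], rows[n_train:n_train + n_test], rows[n_train + n_test:]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(xs, ys):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(x) for x in xs]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, x):
    """Evaluate the fitted linear model: intercept plus weighted predictors."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def rmse(coef, rows):
    """Root mean square error of the model over (predictors, target) rows."""
    squared = [(predict(coef, x) - y) ** 2 for x, y in rows]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic stand-in data (NOT the APDP records): three predictors and a target
# generated from an invented linear rule plus Gaussian noise.
rng = random.Random(42)
records = []
for _ in range(200):
    tc, hdl, tg = rng.uniform(3, 7), rng.uniform(0.8, 2.0), rng.uniform(0.5, 3.0)
    ldl = 0.5 + tc - hdl - 0.4 * tg + rng.gauss(0, 0.05)
    records.append(((tc, hdl, tg), ldl))

train, test, valid = split_60_20_20(records)
coef = fit_ols([x for x, _ in train], [y for _, y in train])
test_rmse = rmse(coef, test)
valid_rmse = rmse(coef, valid)
print(len(train), len(test), len(valid))  # 120 40 40
print(round(test_rmse, 3), round(valid_rmse, 3))
```

Fitting on the training part only and reporting RMSE on the held-out parts mirrors the distinction the abstract draws between training performance and validation performance. The normal-equations solver is chosen here only to keep the sketch free of third-party dependencies.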
id |
RCAP_0b630143c2ef9a91bc694416238a3485 |
---|---|
oai_identifier_str |
oai:sapientia.ualg.pt:10400.1/15204 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Cardiovascular diseases (CVD) remain the leading cause of death worldwide and constitute a risk factor for people with diabetes, who are in turn more prone to developing CVD. However, although recent guidelines address CVD risk, effective lipid control is far from being achieved. Moreover, lipid self-management, together with the management of therapeutic decisions, is not always given adequate priority by either patients or healthcare professionals. To better understand the influence of clinical parameters on low-density lipoprotein (LDL) cholesterol in type 2 diabetes patients whose lipid management is suspected to be unstable, electronic health records (EHR) provided by APDP (Associação Protetora de Diabetes Portugal) were used to carry out a study based on data mining techniques. (…) |
dc.title.none.fl_str_mv |
Data mining electronic health records of type 2 diabetes uncontrolled patients towards clustering LDL-cholesterol patterns |
title |
Data mining electronic health records of type 2 diabetes uncontrolled patients towards clustering LDL-cholesterol patterns |
author |
Petrovici, Mihai Daniel |
author_facet |
Petrovici, Mihai Daniel |
author_role |
author |
dc.contributor.none.fl_str_mv |
Ruano, M. Graça; Ribeiro, Rogério José Tavares; Sapientia |
dc.contributor.author.fl_str_mv |
Petrovici, Mihai Daniel |
dc.subject.por.fl_str_mv |
Low-density lipoprotein; Diabetes; Data mining; Model; Scientific Domain/Area::Engineering and Technology::Other Engineering and Technologies |
topic |
Low-density lipoprotein; Diabetes; Data mining; Model; Scientific Domain/Area::Engineering and Technology::Other Engineering and Technologies |
publishDate |
2020 |
dc.date.none.fl_str_mv |
2020-06-26 2020-06-26T00:00:00Z 2021-03-09T14:59:21Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10400.1/15204 TID:202663418 |
url |
http://hdl.handle.net/10400.1/15204 |
identifier_str_mv |
TID:202663418 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799133301705801728 |