Delivering Reliable AI to Clinical Contexts: Addressing the Challenge of Missing Data
Main author: | Curioso, Isabel de Almeida |
---|---|
Publication date: | 2022 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10362/157139 |
Abstract: | Clinical data are essential in the medical domain, ensuring quality of care and improving decision-making. However, their heterogeneous and incomplete nature leads to a ubiquity of data quality problems, particularly missing values. Inevitable challenges arise in delivering reliable Decision Support Systems (DSSs), as missing data yield negative effects on the learning process of Machine Learning models. The interest in developing missing value imputation strategies has been growing, in an endeavour to overcome this issue. This dissertation aimed to study missing data and their relationships with observed values, and to later employ that information in a technique that addresses the predicaments posed by incomplete datasets in real-world scenarios. Moreover, the concept of correlation was explored within the context of missing value imputation, a promising but rather overlooked approach in biomedical research. First, a comprehensive correlational study was performed, which considered key aspects of missing data analysis. Afterwards, the gathered knowledge was leveraged to create three novel correlation-based imputation techniques. These were validated not only on datasets with controlled, synthetic missingness, but also on real-world medical datasets. Their performance was evaluated against competing imputation methods, both traditional and state-of-the-art. The contributions of this dissertation encompass a systematic view of theoretical concepts regarding the analysis and handling of missing values. Additionally, an extensive literature review concerning missing data imputation was conducted, which comprised a comparative study of ten methods under diverse missingness conditions. The proposed techniques exhibited results similar to those of their competitors, and sometimes superior, in terms of imputation precision and classification performance, evaluated through the Mean Absolute Error and the Area Under the Receiver Operating Characteristic curve, respectively. Therefore, this dissertation corroborates the potential of correlation to improve the robustness of DSSs to missing values, and provides answers to current flaws shared by correlation-based imputation strategies in real-world medical problems. |
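
The abstract does not describe how the three proposed techniques work, so the following is not the dissertation's method. It is only a minimal Python sketch of the general idea the abstract names: inject controlled synthetic missingness, impute each missing value from a correlated feature, and score the result with the Mean Absolute Error against a mean-imputation baseline. The function names (`inject_mcar`, `mean_impute`, `correlation_impute`, `mae`), the completely-at-random missingness, the single-predictor linear fit, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def inject_mcar(X, rate, rng):
    """Return a copy of X with roughly `rate` of entries set to NaN (assumed MCAR)."""
    X_missing = X.copy()
    mask = rng.random(X.shape) < rate
    X_missing[mask] = np.nan
    return X_missing, mask


def mean_impute(X_missing):
    """Baseline: replace each NaN with the observed mean of its column."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)


def correlation_impute(X_missing):
    """Fill each NaN from the most correlated column that is observed in that row.

    Hypothetical illustration: a one-predictor linear fit on pairwise-complete
    rows, falling back to the column mean when no predictor is available.
    """
    X_filled = mean_impute(X_missing)
    observed = ~np.isnan(X_missing)
    n_cols = X_missing.shape[1]
    for j in range(n_cols):
        # Rank the other columns by absolute Pearson correlation with column j.
        candidates = []
        for k in range(n_cols):
            both = observed[:, j] & observed[:, k]
            if k == j or both.sum() < 3:
                continue
            xk, xj = X_missing[both, k], X_missing[both, j]
            if xk.std() == 0 or xj.std() == 0:
                continue
            r = np.corrcoef(xk, xj)[0, 1]
            slope, intercept = np.polyfit(xk, xj, 1)
            candidates.append((abs(r), k, slope, intercept))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # For each missing entry, use the best-ranked predictor present in the row.
        for i in np.flatnonzero(~observed[:, j]):
            for _, k, slope, intercept in candidates:
                if observed[i, k]:
                    X_filled[i, j] = slope * X_missing[i, k] + intercept
                    break
    return X_filled


def mae(X_true, X_imputed, mask):
    """Mean Absolute Error restricted to the entries that were masked out."""
    return float(np.mean(np.abs(X_true[mask] - X_imputed[mask])))


rng = np.random.default_rng(0)
# Synthetic data: three strongly correlated features plus one pure-noise feature.
base = rng.normal(size=(500, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),
    2.0 * base + 0.1 * rng.normal(size=(500, 1)),
    -base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])
X_missing, mask = inject_mcar(X, rate=0.2, rng=rng)

mae_mean = mae(X, mean_impute(X_missing), mask)
mae_corr = mae(X, correlation_impute(X_missing), mask)
print(f"MAE, mean imputation:        {mae_mean:.3f}")
print(f"MAE, correlation imputation: {mae_corr:.3f}")
```

On data built this way the correlation-based fill should score a lower MAE than the mean baseline, because three of the four columns are nearly linear functions of one another. That says nothing about real clinical datasets, where missingness is rarely completely at random and the abstract's second metric (AUROC of a downstream classifier) would also need to be checked.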
id |
RCAP_d091bf73f766d11141af9235ca32dbc6 |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/157139 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Delivering Reliable AI to Clinical Contexts: Addressing the Challenge of Missing Data |
author |
Curioso, Isabel de Almeida |
author_facet |
Curioso, Isabel de Almeida |
author_role |
author |
dc.contributor.none.fl_str_mv |
Gamboa, Hugo; RUN |
dc.contributor.author.fl_str_mv |
Curioso, Isabel de Almeida |
dc.subject.por.fl_str_mv |
Missing Data; Missing Data Imputation; Correlation; Machine Learning; Decision Support System; Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática |
topic |
Missing Data; Missing Data Imputation; Correlation; Machine Learning; Decision Support System; Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática |
description |
Clinical data are essential in the medical domain, ensuring quality of care and improving decision-making. However, their heterogeneous and incomplete nature leads to a ubiquity of data quality problems, particularly missing values. Inevitable challenges arise in delivering reliable Decision Support Systems (DSSs), as missing data yield negative effects on the learning process of Machine Learning models. The interest in developing missing value imputation strategies has been growing, in an endeavour to overcome this issue. This dissertation aimed to study missing data and their relationships with observed values, and to later employ that information in a technique that addresses the predicaments posed by incomplete datasets in real-world scenarios. Moreover, the concept of correlation was explored within the context of missing value imputation, a promising but rather overlooked approach in biomedical research. First, a comprehensive correlational study was performed, which considered key aspects of missing data analysis. Afterwards, the gathered knowledge was leveraged to create three novel correlation-based imputation techniques. These were validated not only on datasets with controlled, synthetic missingness, but also on real-world medical datasets. Their performance was evaluated against competing imputation methods, both traditional and state-of-the-art. The contributions of this dissertation encompass a systematic view of theoretical concepts regarding the analysis and handling of missing values. Additionally, an extensive literature review concerning missing data imputation was conducted, which comprised a comparative study of ten methods under diverse missingness conditions. The proposed techniques exhibited results similar to those of their competitors, and sometimes superior, in terms of imputation precision and classification performance, evaluated through the Mean Absolute Error and the Area Under the Receiver Operating Characteristic curve, respectively. Therefore, this dissertation corroborates the potential of correlation to improve the robustness of DSSs to missing values, and provides answers to current flaws shared by correlation-based imputation strategies in real-world medical problems. |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-12; 2022-12-01T00:00:00Z; 2023-09-01T13:41:30Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/157139 |
url |
http://hdl.handle.net/10362/157139 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799138150804619264 |