Clinical Language Models for Information Extraction and Predictive Tasks from Clinical Notes: Uncovering the Potential of Unstructured Clinical Data

Bibliographic details
Main author: Lopes, Ana Filipa Gonçalves
Publication date: 2023
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10362/163653
Abstract: The exponential growth of electronic health records has resulted in an unprecedented volume of unstructured clinical data. Harnessing this information requires advanced natural language processing techniques and holds immense potential for healthcare improvement. This dissertation explores the potential of clinical language models for extracting and organizing information from clinical text, and aims to measure the impact of such information on a real clinical task: the prediction of complications following Cardiothoracic Surgery. Three information extraction models were developed by fine-tuning clinical language models on an ICD-9 code classification task. The ClinicalBERT and BioGPT-based models achieved a Mean Average Precision at 10 (MAP@10) of around 0.437, outperforming a fine-tuned model from the literature. These models were later applied to extract ICD-9 codes from translated Portuguese clinical notes. The retrieved variables were shown to benefit machine learning models trained to predict post-surgery complications: when the information extracted from clinical text was added, accuracy improved by up to 30% relative to a baseline model trained without it, reaching values of around 0.880. The results of this research establish clinical language models as powerful tools for extracting clinically relevant information from free-text medical reports, and pave the way for integrating these systems into clinical decision support systems that meet high standards of performance and interpretability.
id RCAP_22bbc1b210fc0b4dcb09399867bf81fc
oai_identifier_str oai:run.unl.pt:10362/163653
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
author Lopes, Ana Filipa Gonçalves
author_facet Lopes, Ana Filipa Gonçalves
author_role author
dc.contributor.none.fl_str_mv Gamboa, Hugo
RUN
dc.contributor.author.fl_str_mv Lopes, Ana Filipa Gonçalves
dc.subject.por.fl_str_mv Natural Language Processing
Machine Learning
Clinical Language Models
Electronic Health Record Clinical Notes
ICD-9 code extraction
Cardiothoracic Surgery
Domain/Scientific Area::Engineering and Technology::Other Engineering and Technologies
publishDate 2023
dc.date.none.fl_str_mv 2023-11
2023-11-01T00:00:00Z
2024-02-16T11:51:06Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/163653
url http://hdl.handle.net/10362/163653
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799138174464688128