Clinical Language Models for Information Extraction and Predictive Tasks from Clinical Notes: Uncovering the Potential of Unstructured Clinical Data
Main author: | Lopes, Ana Filipa Gonçalves |
---|---|
Publication date: | 2023 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10362/163653 |
Abstract: | The exponential growth of electronic health records has resulted in an unprecedented volume of unstructured clinical data. Harnessing this information requires advanced natural language processing techniques and holds considerable potential for healthcare improvement. This dissertation explores the capabilities of clinical language models for extracting and organizing information from clinical text, and aims to measure the impact of such information on a real clinical task: predicting complications following Cardiothoracic Surgery. Three information extraction models were developed by fine-tuning clinical language models on an ICD-9 code classification task. The ClinicalBERT- and BioGPT-based models achieved a Mean Average Precision at 10 (MAP@10) of around 0.437, outperforming a fine-tuned model from the literature. These models were later applied to extract ICD-9 codes from translated Portuguese clinical notes. The extracted variables were shown to benefit machine learning models trained to predict post-surgery complications: when the information extracted from clinical text was added, accuracy improved by up to 30% relative to a baseline model trained without it, reaching values of around 0.880. The results of this research establish clinical language models as powerful tools for extracting clinically relevant information from free-text medical reports, and pave the way for integrating these systems into clinical decision support systems that meet high standards of performance and interpretability. |
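
The Mean Average Precision at 10 reported in the abstract can be illustrated with a short sketch. The record does not say how the dissertation computed the metric, so the normalisation used here (dividing by min(k, number of relevant codes)) is an assumption based on one common convention. The function names and the toy ICD-9 codes are hypothetical and are not taken from the thesis.

```python
from typing import Sequence, Set


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """AP@k for one clinical note.

    ranked   -- predicted codes, best first (assumed to contain no duplicates)
    relevant -- gold-standard codes assigned to the note
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, code in enumerate(ranked[:k], start=1):
        if code in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumed convention: normalise by the best achievable number of hits.
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(
    all_ranked: Sequence[Sequence[str]],
    all_relevant: Sequence[Set[str]],
    k: int = 10,
) -> float:
    """MAP@k: the mean of AP@k over all notes."""
    scores = [average_precision_at_k(r, g, k) for r, g in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with two notes (illustrative codes only).
predictions = [["401.9", "250.00", "414.01"], ["427.31", "428.0"]]
gold = [{"401.9", "414.01"}, {"428.0"}]
map_at_10 = mean_average_precision_at_k(predictions, gold, k=10)
```

Under this convention a score of 1.0 means every relevant code appears at the top of the ranking, so a value near 0.437 indicates that relevant codes are often, but not always, ranked early in the top ten.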
id |
RCAP_22bbc1b210fc0b4dcb09399867bf81fc |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/163653 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Clinical Language Models for Information Extraction and Predictive Tasks from Clinical Notes: Uncovering the Potential of Unstructured Clinical Data |
author |
Lopes, Ana Filipa Gonçalves |
author_facet |
Lopes, Ana Filipa Gonçalves |
author_role |
author |
dc.contributor.none.fl_str_mv |
Gamboa, Hugo RUN |
dc.contributor.author.fl_str_mv |
Lopes, Ana Filipa Gonçalves |
dc.subject.por.fl_str_mv |
Natural Language Processing; Machine Learning; Clinical Language Models; Electronic Health Record Clinical Notes; ICD-9 code extraction; Cardiothoracic Surgery; Domain/Scientific Area::Engineering and Technology::Other Engineering and Technologies |
publishDate |
2023 |
dc.date.none.fl_str_mv |
2023-11 2023-11-01T00:00:00Z 2024-02-16T11:51:06Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/163653 |
url |
http://hdl.handle.net/10362/163653 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799138174464688128 |