Automatic text processing of clinical narratives

Detalhes bibliográficos
Autor(a) principal: Faustino, Jorge Filipe Balseiro
Data de Publicação: 2022
Tipo de documento: Dissertação
Idioma: eng
Título da fonte: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Texto Completo: http://hdl.handle.net/10773/38432
Resumo: The informatization of medical systems and the subsequent move towards the usage of Electronic Health Records (EHR) over the paper format by medical professionals allowed for safer and more e cient healthcare. Additionally, EHR can also be used as a data source for observational studies around the world. However, it is estimated that 70-80% of all clinical data is in the form of unstructured free text and regarding the data that is structured, not all of it follows the same standards, making it di cult to use on the mentioned observational studies. This dissertation aims to tackle those two adversities using natural language processing for the task of extracting concepts from free text and, afterwards, use a common data model to harmonize the data. The developed system employs an annotator, namely cTAKES, to extract the concepts from free text. The extracted concepts are then normalized using text preprocessing, word embeddings, MetaMap and UMLS Metathesaurus lookup. Finally, the normalized concepts are converted to the OMOP Common Data Model and stored in a database. In order to test the developed system, the i2b2 2010 data set was used. The di erent components of the system were tested and evaluated separately, with the concept extraction component achieving a precision, recall and F-score of 77.12%, 70.29% and 73.55%, respectively. The normalization component was evaluated by completing the N2C2 2019 challenge track 3, where it achieved a 77.5% accuracy. Finally, during the OMOP CDM conversion component, it was observed that 7.92% of the concepts were lost during the process. In conclusion, even though the developed system still has margin for improvements, it proves to be a viable method of automatically processing clinical narratives.
id RCAP_2df0b7d8c376ec09efb440882fb377b7
oai_identifier_str oai:ria.ua.pt:10773/38432
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling Automatic text processing of clinical narrativesConcept recognitionInformation retrievalNatural language processingText miningThe informatization of medical systems and the subsequent move towards the usage of Electronic Health Records (EHR) over the paper format by medical professionals allowed for safer and more e cient healthcare. Additionally, EHR can also be used as a data source for observational studies around the world. However, it is estimated that 70-80% of all clinical data is in the form of unstructured free text and regarding the data that is structured, not all of it follows the same standards, making it di cult to use on the mentioned observational studies. This dissertation aims to tackle those two adversities using natural language processing for the task of extracting concepts from free text and, afterwards, use a common data model to harmonize the data. The developed system employs an annotator, namely cTAKES, to extract the concepts from free text. The extracted concepts are then normalized using text preprocessing, word embeddings, MetaMap and UMLS Metathesaurus lookup. Finally, the normalized concepts are converted to the OMOP Common Data Model and stored in a database. In order to test the developed system, the i2b2 2010 data set was used. The di erent components of the system were tested and evaluated separately, with the concept extraction component achieving a precision, recall and F-score of 77.12%, 70.29% and 73.55%, respectively. The normalization component was evaluated by completing the N2C2 2019 challenge track 3, where it achieved a 77.5% accuracy. Finally, during the OMOP CDM conversion component, it was observed that 7.92% of the concepts were lost during the process. In conclusion, even though the developed system still has margin for improvements, it proves to be a viable method of automatically processing clinical narratives.A informatização dos sistemas médicos e a subsequente tendência por parte de profissionais de saúde a substituir registos em formato de papel por registos eletrónicos de saúde, permitiu que os serviços de saúde se tornassem mais seguros e eficientes. Além disso, estes registos eletrónicos apresentam também o benefício de poderem ser utilizados como fonte de dados para estudos observacionais. No entanto, estima-se que 70-80% de todos os dados clínicos se encontrem na forma de texto livre não-estruturado e os dados que estão estruturados não seguem todos os mesmos padrões, dificultando o seu potencial uso nos estudos observacionais. Esta dissertação pretende solucionar essas duas adversidades através do uso de processamento de linguagem natural para a tarefa de extrair conceitos de texto livre e, de seguida, usar um modelo comum de dados para os harmonizar. O sistema desenvolvido utiliza um anotador, especificamente o cTAKES, para extrair conceitos de texto livre. Os conceitos extraídos são, então, normalizados através de técnicas de pré-processamento de texto, Word Embeddings, MetaMap e um sistema de procura no Metathesaurus do UMLS. Por fim, os conceitos normalizados são convertidos para o modelo comum de dados da OMOP e guardados numa base de dados. Para testar o sistema desenvolvido usou-se o conjunto de dados i2b2 de 2010. As diferentes partes do sistema foram testadas e avaliadas individualmente sendo que na extração dos conceitos obteve-se uma precisão, recall e F-score de 77.12%, 70.29% e 73.55%, respetivamente. A normalização foi avaliada através do desafio N2C2 2019-track 3 onde se obteve uma exatidão de 77.5%. Na conversão para o modelo comum de dados OMOP observou-se que durante a conversão perderam-se 7.92% dos conceitos. Concluiu-se que, embora o sistema desenvolvido ainda tenha margem para melhorias, este demonstrou-se como um método viável de processamento automático do texto de narrativas clínicas.2023-07-07T15:39:59Z2022-12-12T00:00:00Z2022-12-12info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttp://hdl.handle.net/10773/38432engFaustino, Jorge Filipe Balseiroinfo:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2024-02-22T12:14:19Zoai:ria.ua.pt:10773/38432Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-20T03:08:36.868478Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse
dc.title.none.fl_str_mv Automatic text processing of clinical narratives
title Automatic text processing of clinical narratives
spellingShingle Automatic text processing of clinical narratives
Faustino, Jorge Filipe Balseiro
Concept recognition
Information retrieval
Natural language processing
Text mining
title_short Automatic text processing of clinical narratives
title_full Automatic text processing of clinical narratives
title_fullStr Automatic text processing of clinical narratives
title_full_unstemmed Automatic text processing of clinical narratives
title_sort Automatic text processing of clinical narratives
author Faustino, Jorge Filipe Balseiro
author_facet Faustino, Jorge Filipe Balseiro
author_role author
dc.contributor.author.fl_str_mv Faustino, Jorge Filipe Balseiro
dc.subject.por.fl_str_mv Concept recognition
Information retrieval
Natural language processing
Text mining
topic Concept recognition
Information retrieval
Natural language processing
Text mining
description The informatization of medical systems and the subsequent move towards the usage of Electronic Health Records (EHR) over the paper format by medical professionals allowed for safer and more e cient healthcare. Additionally, EHR can also be used as a data source for observational studies around the world. However, it is estimated that 70-80% of all clinical data is in the form of unstructured free text and regarding the data that is structured, not all of it follows the same standards, making it di cult to use on the mentioned observational studies. This dissertation aims to tackle those two adversities using natural language processing for the task of extracting concepts from free text and, afterwards, use a common data model to harmonize the data. The developed system employs an annotator, namely cTAKES, to extract the concepts from free text. The extracted concepts are then normalized using text preprocessing, word embeddings, MetaMap and UMLS Metathesaurus lookup. Finally, the normalized concepts are converted to the OMOP Common Data Model and stored in a database. In order to test the developed system, the i2b2 2010 data set was used. The di erent components of the system were tested and evaluated separately, with the concept extraction component achieving a precision, recall and F-score of 77.12%, 70.29% and 73.55%, respectively. The normalization component was evaluated by completing the N2C2 2019 challenge track 3, where it achieved a 77.5% accuracy. Finally, during the OMOP CDM conversion component, it was observed that 7.92% of the concepts were lost during the process. In conclusion, even though the developed system still has margin for improvements, it proves to be a viable method of automatically processing clinical narratives.
publishDate 2022
dc.date.none.fl_str_mv 2022-12-12T00:00:00Z
2022-12-12
2023-07-07T15:39:59Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10773/38432
url http://hdl.handle.net/10773/38432
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799137738332569600