Unsupervised natural language processing in the identification of patients with suspected COVID-19 infection

Bibliographic details
Main author: Pinto da Silva, Rildo
Publication date: 2023
Other authors: Tarossi Pollettini, Juliana, Pazin Filho, Antonio
Document type: Article
Language: por
eng
Source title: Cadernos de Saúde Pública
Full text: https://cadernos.ensp.fiocruz.br/ojs/index.php/csp/article/view/8440
Abstract: Patients with post-COVID-19 syndrome benefit from health promotion programs, and identifying them quickly is important for the cost-effective use of these programs. Traditional identification techniques perform poorly, especially in pandemics. A descriptive observational study was carried out on 105,008 prior authorizations paid by a private health care provider, applying an unsupervised natural language processing method (topic modeling) to identify patients suspected of being infected by COVID-19. Six models were generated: three using the BERTopic algorithm and three using Word2Vec. The BERTopic model automatically creates disease groups. In the Word2Vec model, manual analysis of the first 100 cases of each topic was necessary to define the topics related to COVID-19. The BERTopic model with more than 1,000 authorizations per topic and without word treatment selected more severe patients: an average cost per paid prior authorization of BRL 10,206 and a total expenditure of BRL 20.3 million (5.4%) in 1,987 prior authorizations (1.9%). It had 70% accuracy compared to human analysis, and 20% of cases were of potential interest, all subject to analysis for inclusion in a health promotion program. It showed a substantial loss of cases when compared to the traditional research model with structured language, and it identified other groups of diseases: orthopedic, mental, and cancer. The BERTopic model served as an exploratory method to be used in case labeling and subsequent application in supervised models. The automatic identification of other diseases raises ethical questions about the treatment of health information by machine learning.
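The figures reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not part of the authors' method: the constant names are hypothetical, the expenditure total is the rounded value of BRL 20.3 million, and the overall expenditure derived from the 5.4% share is an estimate that the abstract does not state.

```python
# Figures as reported in the abstract (the total is rounded to BRL 20.3 million).
TOTAL_AUTHORIZATIONS = 105_008
SELECTED_AUTHORIZATIONS = 1_987
SELECTED_EXPENDITURE_BRL = 20.3e6
REPORTED_EXPENDITURE_SHARE = 0.054
REPORTED_AVERAGE_COST_BRL = 10_206

# Fraction of all paid prior authorizations that fall in the selected topic.
authorization_share = SELECTED_AUTHORIZATIONS / TOTAL_AUTHORIZATIONS

# Average cost recomputed from the rounded total; it differs slightly from
# the reported average because the total is given to one decimal place.
average_cost = SELECTED_EXPENDITURE_BRL / SELECTED_AUTHORIZATIONS

# Overall expenditure implied by the rounded 5.4% share
# (an estimate, not a figure reported in the abstract).
implied_total_expenditure = SELECTED_EXPENDITURE_BRL / REPORTED_EXPENDITURE_SHARE

print(f"Share of authorizations: {authorization_share:.1%}")
print(f"Average cost from rounded total: BRL {average_cost:,.0f}")
print(f"Implied overall expenditure: BRL {implied_total_expenditure / 1e6:,.0f} million")
```

The recomputed share matches the reported 1.9%, and the recomputed average lies within about 0.1% of the reported BRL 10,206, which is consistent with rounding in the published total.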
id FIOCRUZ-5_0a3da7ce60cf36845180759351910a93
oai_identifier_str oai:ojs.teste-cadernos.ensp.fiocruz.br:article/8440
network_acronym_str FIOCRUZ-5
network_name_str Cadernos de Saúde Pública
repository_id_str
dc.title.none.fl_str_mv Unsupervised natural language processing in the identification of patients with suspected COVID-19 infection
Procesamiento del lenguaje natural no supervisado para identificar a los pacientes sospechosos de infección por COVID-19
Processamento de linguagem natural não supervisionado na identificação de pacientes suspeitos de infecção por COVID-19
title Unsupervised natural language processing in the identification of patients with suspected COVID-19 infection
author Pinto da Silva, Rildo
author_facet Pinto da Silva, Rildo
Tarossi Pollettini, Juliana
Pazin Filho, Antonio
author_role author
author2 Tarossi Pollettini, Juliana
Pazin Filho, Antonio
author2_role author
author
dc.contributor.author.fl_str_mv Pinto da Silva, Rildo
Tarossi Pollettini, Juliana
Pazin Filho, Antonio
dc.subject.por.fl_str_mv COVID-19; Processamento de Linguagem Natural; Atenção à Saúde; Critérios de Seleção de Pacientes; Instituições Privadas de Saúde
COVID-19; Natural Language Processing; Health Care; Selection Criteria; Proprietary Health Facilities
COVID-19; Procesamiento de Lenguaje Natural; Atención a la Salud; Criterios de Selección de Pacientes; Instituciones Privadas de Salud
topic COVID-19; Processamento de Linguagem Natural; Atenção à Saúde; Critérios de Seleção de Pacientes; Instituições Privadas de Saúde
COVID-19; Natural Language Processing; Health Care; Selection Criteria; Proprietary Health Facilities
COVID-19; Procesamiento de Lenguaje Natural; Atención a la Salud; Criterios de Selección de Pacientes; Instituciones Privadas de Salud
publishDate 2023
dc.date.none.fl_str_mv 2023-11-28
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://cadernos.ensp.fiocruz.br/ojs/index.php/csp/article/view/8440
url https://cadernos.ensp.fiocruz.br/ojs/index.php/csp/article/view/8440
dc.language.iso.fl_str_mv por
eng
language por
eng
dc.relation.none.fl_str_mv https://cadernos.ensp.fiocruz.br/ojs/index.php/csp/article/view/8440/18813
https://cadernos.ensp.fiocruz.br/ojs/index.php/csp/article/view/8440/18814
https://cadernos.ensp.fiocruz.br/ojs/index.php/csp/article/view/8440/18815
dc.rights.driver.fl_str_mv Copyright (c) 2023 Cadernos de Saúde Pública
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Copyright (c) 2023 Cadernos de Saúde Pública
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv text/xml
application/pdf
application/pdf
dc.publisher.none.fl_str_mv Reports in Public Health
Cadernos de Saúde Pública
publisher.none.fl_str_mv Reports in Public Health
Cadernos de Saúde Pública
dc.source.none.fl_str_mv Reports in Public Health; Vol. 39 No. 11 (2023): November
Cadernos de Saúde Pública; v. 39 n. 11 (2023): Novembro
1678-4464
0102-311X
reponame:Cadernos de Saúde Pública
instname:Fundação Oswaldo Cruz (FIOCRUZ)
instacron:FIOCRUZ
instname_str Fundação Oswaldo Cruz (FIOCRUZ)
instacron_str FIOCRUZ
institution FIOCRUZ
reponame_str Cadernos de Saúde Pública
collection Cadernos de Saúde Pública
repository.name.fl_str_mv Cadernos de Saúde Pública - Fundação Oswaldo Cruz (FIOCRUZ)
repository.mail.fl_str_mv cadernos@ensp.fiocruz.br||cadernos@ensp.fiocruz.br
_version_ 1798943399643971584