Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico

Bibliographic details
Main author: Duque, Juliana Lilian
Publication date: 2012
Document type: Master's thesis
Language: por
Source title: Repositório Institucional da UFSCAR
Full text: https://repositorio.ufscar.br/handle/ufscar/496
Abstract: The medical field currently holds a large amount of unstructured information (i.e., in textual format). Given this volume of data, doctors and specialists cannot manually analyze all the relevant literature, which calls for techniques that analyze the documents automatically. To identify relevant information, structure and store it in a database, and enable the later discovery of significant relationships, this dissertation proposes a paragraph-based process for extracting treatments from scientific papers in the biomedical domain. The hypothesis is that an initial search for sentences containing complication terms improves the identification and extraction of treatment terms, because treatments mainly occur in the same sentence as a complication or in nearby sentences in the same paragraph. The methodology employs three information extraction approaches: a machine learning-based approach, which classifies the sentences of interest that contain terms to be extracted; a dictionary-based approach, which uses terms validated by a domain expert; and a rule-based approach. The methodology was validated as a proof of concept on papers from the biomedical domain, specifically papers on Sickle Cell Anemia. The proof of concept covered sentence classification and the identification of relevant terms. Sentence classification accuracy was 79% for the complication classifier and 71% for the treatment classifier. These values are consistent with the results obtained by combining the Support Vector Machine learning algorithm with the Noise Removal and Class Balancing filter. In the identification of relevant terms, the proposed methodology achieved a higher F-measure (42%) than manual classification (31%) and than the partial process, i.e., the process without the complication classifier (36%). Even with low overall recall, 100% recall was obtained for the distinct treatment terms, so the extraction process was not affected and the hypothesis considered in this work was confirmed.
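The paragraph-based idea in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the term dictionaries, the function names, and the example text are hypothetical, and a simple dictionary match stands in for the Support Vector Machine complication classifier. The sketch only shows the filtering order: keep paragraphs that contain a complication sentence, then look up treatment terms in every sentence of those paragraphs.

```python
import re

# Hypothetical dictionaries for illustration only; the expert-validated
# term lists used in the dissertation are not part of this record.
COMPLICATION_TERMS = {"stroke", "acute chest syndrome", "pain crisis"}
TREATMENT_TERMS = {"hydroxyurea", "blood transfusion", "penicillin"}


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p.strip() for p in parts if p.strip()]


def find_terms(sentence, dictionary):
    """Dictionary-based lookup: return the dictionary terms present in the sentence."""
    lowered = sentence.lower()
    return {
        term
        for term in dictionary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }


def is_complication_sentence(sentence):
    """Stand-in for the complication classifier (an assumption, not the SVM)."""
    return bool(find_terms(sentence, COMPLICATION_TERMS))


def extract_treatments(document):
    """Collect treatment terms from paragraphs that contain a complication sentence."""
    found = set()
    for paragraph in document.split("\n\n"):
        sentences = split_sentences(paragraph)
        if not any(is_complication_sentence(s) for s in sentences):
            continue  # skip paragraphs without any complication sentence
        for sentence in sentences:
            found |= find_terms(sentence, TREATMENT_TERMS)
    return found


# Invented example text with two paragraphs.
example = (
    "Stroke is a severe complication in children. "
    "Chronic blood transfusion reduces its recurrence.\n\n"
    "Penicillin was discussed in a historical overview."
)
treatments = extract_treatments(example)
```

In this example only the first paragraph passes the complication filter, so the treatment term in the second paragraph is ignored. A real pipeline would replace `is_complication_sentence` with a trained classifier and add the rule-based step.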
id SCAR_4f2dfbee5687fa8ad6558fe4b2245f24
oai_identifier_str oai:repositorio.ufscar.br:ufscar/496
network_acronym_str SCAR
network_name_str Repositório Institucional da UFSCAR
repository_id_str 4322
dc.title.por.fl_str_mv Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico
title Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico
author Duque, Juliana Lilian
author_facet Duque, Juliana Lilian
author_role author
dc.contributor.authorlattes.por.fl_str_mv http://lattes.cnpq.br/2616679912003387
dc.contributor.author.fl_str_mv Duque, Juliana Lilian
dc.contributor.advisor1.fl_str_mv Ciferri, Ricardo Rodrigues
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/8382221522817502
dc.contributor.authorID.fl_str_mv d8648db2-c5d8-4600-a93f-50472ed122a5
contributor_str_mv Ciferri, Ricardo Rodrigues
dc.subject.por.fl_str_mv Inteligência artificial
Banco de dados
Mineração de textos
Reconhecimento de padrões
Extração de informação
Anemia falciforme
Tratamentos
Pré-Processamento
Domínio Biomédico
dc.subject.eng.fl_str_mv Information Extraction
Treatments
Text Mining
Preprocessing
Biomedical Domain
Sickle Cell Anemia
dc.subject.cnpq.fl_str_mv CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
publishDate 2012
dc.date.available.fl_str_mv 2012-05-16
2016-06-02T19:05:56Z
dc.date.issued.fl_str_mv 2012-02-24
dc.date.accessioned.fl_str_mv 2016-06-02T19:05:56Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv DUQUE, Juliana Lilian. Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico. 2012. 124 f. Dissertação (Mestrado em Ciências Exatas e da Terra) - Universidade Federal de São Carlos, São Carlos, 2012.
dc.identifier.uri.fl_str_mv https://repositorio.ufscar.br/handle/ufscar/496
identifier_str_mv DUQUE, Juliana Lilian. Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico. 2012. 124 f. Dissertação (Mestrado em Ciências Exatas e da Terra) - Universidade Federal de São Carlos, São Carlos, 2012.
url https://repositorio.ufscar.br/handle/ufscar/496
dc.language.iso.fl_str_mv por
language por
dc.relation.confidence.fl_str_mv -1
-1
dc.relation.authority.fl_str_mv 3b1d5172-8bf0-4d0b-8777-ab82599bbf09
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de São Carlos
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Ciência da Computação - PPGCC
dc.publisher.initials.fl_str_mv UFSCar
dc.publisher.country.fl_str_mv BR
publisher.none.fl_str_mv Universidade Federal de São Carlos
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFSCAR
instname:Universidade Federal de São Carlos (UFSCAR)
instacron:UFSCAR
instname_str Universidade Federal de São Carlos (UFSCAR)
instacron_str UFSCAR
institution UFSCAR
reponame_str Repositório Institucional da UFSCAR
collection Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv https://repositorio.ufscar.br/bitstream/ufscar/496/1/4310.pdf
https://repositorio.ufscar.br/bitstream/ufscar/496/2/4310.pdf.txt
https://repositorio.ufscar.br/bitstream/ufscar/496/3/4310.pdf.jpg
bitstream.checksum.fl_str_mv 6650fb70eee9b096860bcac6b5ed596c
d41d8cd98f00b204e9800998ecf8427e
02c165f9f8d492190d0649b82eb20abe
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
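The checksum fields above list MD5 digests for the three bitstreams. A minimal sketch of how a downloaded copy could be checked against them, assuming the file has already been saved locally; the function names and any local path are hypothetical. The second digest, d41d8cd98f00b204e9800998ecf8427e, is the MD5 digest of empty input, which suggests that the extracted-text bitstream contains no data.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of an in-memory byte string."""
    return hashlib.md5(data).hexdigest()


def file_matches_md5(path: str, expected_hex: str) -> bool:
    """Hash a file in chunks and compare the result with the expected digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


# The digest listed for 4310.pdf.txt equals the digest of zero bytes.
empty_digest = md5_hex(b"")
```

MD5 is adequate here for detecting accidental corruption during download; it is not a safeguard against deliberate tampering.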
repository.name.fl_str_mv Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
repository.mail.fl_str_mv
_version_ 1813715503346089984