Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico

Duque, Juliana Lilian

Bibliographic details
Main author:	Duque, Juliana Lilian
Publication date:	2012
Document type:	Master's thesis
Language:	Portuguese (por)
Source title:	Repositório Institucional da UFSCAR
Full text:	https://repositorio.ufscar.br/handle/ufscar/496
Abstract:	The medical field currently produces a large amount of unstructured information (i.e., in textual format). Given this volume of data, doctors and specialists cannot manually analyze all the relevant literature, so techniques for analyzing the documents automatically are required. To identify relevant information, structure and store it in a database, and enable the later discovery of significant relationships, this dissertation proposes a paragraph-based process for extracting treatments from scientific papers in the biomedical domain. The hypothesis is that an initial search for sentences containing complication terms improves the identification and extraction of treatment terms, because treatments mainly occur in the same sentence as a complication or in nearby sentences of the same paragraph. The methodology combines three information extraction approaches: a machine learning-based approach, which classifies the sentences of interest that contain terms to be extracted; a dictionary-based approach, which uses terms validated by a domain expert; and a rule-based approach. The methodology was validated as a proof of concept on papers from the biomedical domain, specifically papers on sickle cell anemia. The proof of concept covered sentence classification and the identification of relevant terms. Sentence classification accuracy was 79% for the complication classifier and 71% for the treatment classifier. These values correspond to the results obtained by combining the Support Vector Machine learning algorithm with the Noise Removal and Class Balancing filter.
In the identification of relevant terms, the proposed methodology achieved a higher F-measure (42%) than manual classification (31%) and than the partial process, i.e., the process without the complication classifier (36%). Although overall recall was low, 100% recall was obtained for the distinct treatment terms, so the extraction process was not affected, and the hypothesis of this work was confirmed.
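
The paragraph-based idea described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the term dictionaries are illustrative placeholders (the dissertation uses expert-validated terms), the keyword check in `extract_treatments` is a simple stand-in for the Support Vector Machine complication classifier, and the function names and the `window` parameter are hypothetical. The sketch also includes the balanced F-measure, assuming the reported F-measure is the usual harmonic mean of precision and recall.

```python
import re

# Illustrative placeholder dictionaries (assumption); the dissertation relies
# on terms validated by a domain expert, which are not reproduced here.
COMPLICATION_TERMS = {"stroke", "acute chest syndrome", "pain crisis"}
TREATMENT_TERMS = {"hydroxyurea", "blood transfusion", "penicillin"}


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def find_terms(sentence, terms):
    """Return the dictionary terms found in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return {t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", lowered)}


def extract_treatments(paragraph, window=1):
    """Collect treatment terms from complication sentences and their neighbours."""
    sentences = split_sentences(paragraph)
    # Stand-in for the complication classifier (assumption): a sentence is
    # "of interest" when it contains at least one complication term.
    hits = [i for i, s in enumerate(sentences) if find_terms(s, COMPLICATION_TERMS)]
    found = set()
    for i in hits:
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        for s in sentences[lo:hi]:
            found |= find_terms(s, TREATMENT_TERMS)
    return found


def f_measure(predicted, gold):
    """Balanced F-measure (F1) between a predicted and a gold set of terms."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

With `window=1`, only the complication sentence and its immediate neighbours in the same paragraph are searched, which mirrors the hypothesis that treatments appear close to complications. A paragraph with no complication sentence yields no treatment terms in this sketch.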

Item metadata

id	SCAR_4f2dfbee5687fa8ad6558fe4b2245f24
oai_identifier_str	oai:repositorio.ufscar.br:ufscar/496
network_acronym_str	SCAR
network_name_str	Repositório Institucional da UFSCAR
repository_id_str	4322
dc.title.por.fl_str_mv	Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico
author	Duque, Juliana Lilian
author_facet	Duque, Juliana Lilian
author_role	author
dc.contributor.authorlattes.por.fl_str_mv	http://lattes.cnpq.br/2616679912003387
dc.contributor.author.fl_str_mv	Duque, Juliana Lilian
dc.contributor.advisor1.fl_str_mv	Ciferri, Ricardo Rodrigues
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/8382221522817502
dc.contributor.authorID.fl_str_mv	d8648db2-c5d8-4600-a93f-50472ed122a5
contributor_str_mv	Ciferri, Ricardo Rodrigues
dc.subject.por.fl_str_mv	Inteligência artificial Banco de dados Mineração de textos Reconhecimento de padrões Extração de informação Anemia falciforme Tratamentos Pré-Processamento Domínio Biomédico
topic	Inteligência artificial Banco de dados Mineração de textos Reconhecimento de padrões Extração de informação Anemia falciforme Tratamentos Pré-Processamento Domínio Biomédico Information Extraction Treatments Text Mining Preprocessing Biomedical Domain Sickle Cell Anemia CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
dc.subject.eng.fl_str_mv	Information Extraction Treatments Text Mining Preprocessing Biomedical Domain Sickle Cell Anemia
dc.subject.cnpq.fl_str_mv	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
publishDate	2012
dc.date.available.fl_str_mv	2012-05-16 2016-06-02T19:05:56Z
dc.date.issued.fl_str_mv	2012-02-24
dc.date.accessioned.fl_str_mv	2016-06-02T19:05:56Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	DUQUE, Juliana Lilian. Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico. 2012. 124 f. Dissertação (Mestrado em Ciências Exatas e da Terra) - Universidade Federal de São Carlos, São Carlos, 2012.
dc.identifier.uri.fl_str_mv	https://repositorio.ufscar.br/handle/ufscar/496
identifier_str_mv	DUQUE, Juliana Lilian. Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico. 2012. 124 f. Dissertação (Mestrado em Ciências Exatas e da Terra) - Universidade Federal de São Carlos, São Carlos, 2012.
url	https://repositorio.ufscar.br/handle/ufscar/496
dc.language.iso.fl_str_mv	por
language	por
dc.relation.confidence.fl_str_mv	-1 -1
dc.relation.authority.fl_str_mv	3b1d5172-8bf0-4d0b-8777-ab82599bbf09
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Universidade Federal de São Carlos
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Ciência da Computação - PPGCC
dc.publisher.initials.fl_str_mv	UFSCar
dc.publisher.country.fl_str_mv	BR
publisher.none.fl_str_mv	Universidade Federal de São Carlos
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR
instname_str	Universidade Federal de São Carlos (UFSCAR)
instacron_str	UFSCAR
institution	UFSCAR
reponame_str	Repositório Institucional da UFSCAR
collection	Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv	https://repositorio.ufscar.br/bitstream/ufscar/496/1/4310.pdf https://repositorio.ufscar.br/bitstream/ufscar/496/2/4310.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/496/3/4310.pdf.jpg
bitstream.checksum.fl_str_mv	6650fb70eee9b096860bcac6b5ed596c d41d8cd98f00b204e9800998ecf8427e 02c165f9f8d492190d0649b82eb20abe
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
repository.mail.fl_str_mv
_version_	1813715503346089984
