Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico
Main author: | Duque, Juliana Lilian |
---|---|
Publication date: | 2012 |
Document type: | Master's thesis |
Language: | por |
Source title: | Repositório Institucional da UFSCAR |
Full text: | https://repositorio.ufscar.br/handle/ufscar/496 |
Abstract: | The medical field currently holds a large amount of unstructured (i.e., textual) information. The sheer volume of data makes it impossible for doctors and specialists to analyze all the relevant literature manually, which calls for techniques that analyze the documents automatically. To identify relevant information, structure and store it in a database, and enable the later discovery of significant relationships, this thesis proposes a paragraph-based process for extracting treatments from scientific papers in the biomedical domain. The hypothesis is that first searching for sentences that contain complication terms improves the identification and extraction of treatment terms, because treatments mostly occur in the same sentence as a complication or in nearby sentences of the same paragraph. The methodology combines three information extraction approaches: a machine learning-based approach, which classifies the sentences of interest that contain terms to be extracted; a dictionary-based approach, which uses terms validated by a domain expert; and a rule-based approach. The methodology was validated as a proof of concept on papers from the biomedical domain, specifically papers on sickle cell anemia. The proof of concept covered sentence classification and the identification of relevant terms. Sentence classification accuracy was 79% for the complication classifier and 71% for the treatment classifier. These values correspond to the results obtained by combining the Support Vector Machine learning algorithm with the Noise Removal and Class Balancing filter. In the identification of relevant terms, the proposed methodology achieved a higher F-measure (42%) than manual classification (31%) and than the partial process, i.e., the process without the complication classifier (36%). Although overall recall was low, recall reached 100% for distinct treatment terms, so the extraction process was not affected, and the working hypothesis of this study was confirmed. |
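
The paragraph-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the thesis classifies sentences with a Support Vector Machine and uses an expert-validated dictionary, whereas this sketch assumes a simple keyword check as a stand-in for the classifier. The term lists, the function names, and the `window` parameter are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical stand-ins: the thesis uses an SVM sentence classifier and an
# expert-validated dictionary; these tiny term lists are illustrative only.
COMPLICATION_TERMS = {"stroke", "pain crisis"}
TREATMENT_TERMS = {"hydroxyurea", "blood transfusion"}


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p.strip() for p in parts if p.strip()]


def has_term(sentence, terms):
    """Return True if any term occurs in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return any(term in lowered for term in terms)


def extract_treatments(paragraph, window=1):
    """Collect treatment terms near complication sentences in one paragraph.

    Step 1: find sentences that mention a complication (keyword stand-in
            for the complication classifier).
    Step 2: look up dictionary treatment terms in each such sentence and in
            up to `window` neighbouring sentences on either side.
    """
    sentences = split_sentences(paragraph)
    hits = [i for i, s in enumerate(sentences) if has_term(s, COMPLICATION_TERMS)]
    found = set()
    for i in hits:
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        for sentence in sentences[start:end]:
            lowered = sentence.lower()
            found.update(t for t in TREATMENT_TERMS if t in lowered)
    return sorted(found)
```

Calling `extract_treatments(text)` on a paragraph returns the sorted treatment terms found in or next to complication sentences; a paragraph with no complication sentence yields an empty list, which mirrors the hypothesis that the complication search narrows where treatments are looked for.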
id |
SCAR_4f2dfbee5687fa8ad6558fe4b2245f24 |
---|---|
oai_identifier_str |
oai:repositorio.ufscar.br:ufscar/496 |
network_acronym_str |
SCAR |
network_name_str |
Repositório Institucional da UFSCAR |
repository_id_str |
4322 |
dc.title.por.fl_str_mv |
Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico |
title |
Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico |
author |
Duque, Juliana Lilian |
author_facet |
Duque, Juliana Lilian |
author_role |
author |
dc.contributor.authorlattes.por.fl_str_mv |
http://lattes.cnpq.br/2616679912003387 |
dc.contributor.author.fl_str_mv |
Duque, Juliana Lilian |
dc.contributor.advisor1.fl_str_mv |
Ciferri, Ricardo Rodrigues |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/8382221522817502 |
dc.contributor.authorID.fl_str_mv |
d8648db2-c5d8-4600-a93f-50472ed122a5 |
contributor_str_mv |
Ciferri, Ricardo Rodrigues |
dc.subject.por.fl_str_mv |
Inteligência artificial Banco de dados Mineração de textos Reconhecimento de padrões Extração de informação Anemia falciforme Tratamentos Pré-Processamento Domínio Biomédico |
topic |
Inteligência artificial Banco de dados Mineração de textos Reconhecimento de padrões Extração de informação Anemia falciforme Tratamentos Pré-Processamento Domínio Biomédico Information Extraction Treatments Text Mining Preprocessing Biomedical Domain Sickle Cell Anemia CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
dc.subject.eng.fl_str_mv |
Information Extraction Treatments Text Mining Preprocessing Biomedical Domain Sickle Cell Anemia |
dc.subject.cnpq.fl_str_mv |
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
publishDate |
2012 |
dc.date.available.fl_str_mv |
2012-05-16 2016-06-02T19:05:56Z |
dc.date.issued.fl_str_mv |
2012-02-24 |
dc.date.accessioned.fl_str_mv |
2016-06-02T19:05:56Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
DUQUE, Juliana Lilian. Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico. 2012. 124 f. Dissertação (Mestrado em Ciências Exatas e da Terra) - Universidade Federal de São Carlos, São Carlos, 2012. |
dc.identifier.uri.fl_str_mv |
https://repositorio.ufscar.br/handle/ufscar/496 |
url |
https://repositorio.ufscar.br/handle/ufscar/496 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.relation.confidence.fl_str_mv |
-1 -1 |
dc.relation.authority.fl_str_mv |
3b1d5172-8bf0-4d0b-8777-ab82599bbf09 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Federal de São Carlos |
dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Ciência da Computação - PPGCC |
dc.publisher.initials.fl_str_mv |
UFSCar |
dc.publisher.country.fl_str_mv |
BR |
publisher.none.fl_str_mv |
Universidade Federal de São Carlos |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR |
instname_str |
Universidade Federal de São Carlos (UFSCAR) |
instacron_str |
UFSCAR |
institution |
UFSCAR |
reponame_str |
Repositório Institucional da UFSCAR |
collection |
Repositório Institucional da UFSCAR |
bitstream.url.fl_str_mv |
https://repositorio.ufscar.br/bitstream/ufscar/496/1/4310.pdf https://repositorio.ufscar.br/bitstream/ufscar/496/2/4310.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/496/3/4310.pdf.jpg |
bitstream.checksum.fl_str_mv |
6650fb70eee9b096860bcac6b5ed596c d41d8cd98f00b204e9800998ecf8427e 02c165f9f8d492190d0649b82eb20abe |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR) |
repository.mail.fl_str_mv |
|
_version_ |
1813715503346089984 |