Disclosing ambiguous gene aliases by automatic literature profiling

Bibliographic details
Main author: Coimbra, Roney Santos
Publication date: 2010
Other authors: Vanderwall, Dana E; Oliveira, Guilherme Corrêa
Document type: Article
Language: eng
Source title: Repositório Institucional da FIOCRUZ (ARCA)
Full text: https://www.arca.fiocruz.br/handle/icict/9378
Abstract: Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Centro de Excelência em Bioinformática. Belo Horizonte, MG, Brasil/Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Grupo de Genômica e Biologia Computacional. Belo Horizonte, MG, Brasil
id CRUZ_a447377a0914a50730ae8d27deaf5f6a
oai_identifier_str oai:www.arca.fiocruz.br:icict/9378
network_acronym_str CRUZ
network_name_str Repositório Institucional da FIOCRUZ (ARCA)
repository_id_str 2135
spelling Affiliations: Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Centro de Excelência em Bioinformática. Belo Horizonte, MG, Brasil/Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Grupo de Genômica e Biologia Computacional. Belo Horizonte, MG, Brasil; GlaxoSmithKline Moore Dr. Molecular Discovery Research. Research Triangle Park, NC, USA
Background: Retrieving pertinent information from the biological scientific literature requires cutting-edge text mining methods that can recognize the meaning of the very ambiguous names of biological entities. Aliases of a gene share a common vocabulary in their respective collections of PubMed abstracts. This may be true even when these aliases are not associated with the same subset of documents. This gene-specific vocabulary defines a unique fingerprint that can be used to disclose ambiguous aliases. The present work describes an original method for automatically assessing the ambiguity levels of gene aliases in large gene terminologies based exclusively on the content of their associated literature. The method can deal with the two major problems restricting the usage of current text mining tools: 1) different names associated with the same gene; and 2) one name associated with multiple genes, or even with non-gene entities. Importantly, this method does not require training examples.
Results: Aliases were considered “ambiguous” when their Jaccard distance to the respective official gene symbol was equal to or greater than the smallest distance between the official gene symbol and one of the three internal controls (randomly picked unrelated official gene symbols). Otherwise, they were assigned the status of “synonyms”. We evaluated the coherence of the results by comparing the frequencies of the official gene symbols in the text corpora retrieved with their respective “synonyms” or “ambiguous” aliases. Official gene symbols were mentioned in the abstract collections of 42% (70/165) of their respective synonyms. No official gene symbol occurred in the abstract collections of any of their respective ambiguous aliases. Overall, querying PubMed with official gene symbols and “synonym” aliases allowed a 3.6-fold increase in the number of unique documents retrieved.
Conclusions: These results confirm that this method is able to distinguish between synonyms and ambiguous gene aliases based exclusively on their vocabulary fingerprint. The approach we describe could be used to enhance the retrieval of relevant literature related to a gene.
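The decision rule stated in the abstract (an alias is "ambiguous" when its Jaccard distance to the official gene symbol is at least the smallest distance between the official symbol and any of three unrelated control symbols) can be illustrated with a minimal Python sketch. The Jaccard distance between two vocabularies A and B is 1 - |A ∩ B| / |A ∪ B|. The record does not say how the vocabularies are built from the PubMed abstracts, so the tokenizer below is an assumption, and all function names and the toy abstracts are hypothetical, not the authors' implementation or data.

```python
import re


def vocabulary(abstracts):
    """Collect the set of lowercase word tokens in a list of abstracts.

    Assumption: plain alphanumeric tokenization with no filtering or weighting.
    """
    words = set()
    for text in abstracts:
        words.update(re.findall(r"[a-z0-9]+", text.lower()))
    return words


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def classify_alias(alias_vocab, official_vocab, control_vocabs):
    """Apply the rule from the abstract: 'ambiguous' if the alias is at least
    as far from the official symbol as the nearest unrelated control."""
    threshold = min(jaccard_distance(official_vocab, c) for c in control_vocabs)
    distance = jaccard_distance(alias_vocab, official_vocab)
    return "ambiguous" if distance >= threshold else "synonym"


# Toy data (hypothetical): one "abstract" per name keeps the example short.
official = vocabulary(["tumor suppressor protein regulates apoptosis"])
close_alias = vocabulary(["tumor suppressor regulates apoptosis pathway"])
far_alias = vocabulary(["river sediment transport model"])
controls = [
    vocabulary(["kinase protein binds membrane"]),
    vocabulary(["ribosomal protein synthesis"]),
    vocabulary(["olfactory receptor neuron"]),
]

print(classify_alias(close_alias, official, controls))  # synonym
print(classify_alias(far_alias, official, controls))    # ambiguous
```

Using the minimum control distance as the threshold means no training examples are needed: the cut-off is derived per gene from the controls themselves.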
dc.title.pt_BR.fl_str_mv Disclosing ambiguous gene aliases by automatic literature profiling
title Disclosing ambiguous gene aliases by automatic literature profiling
spellingShingle Disclosing ambiguous gene aliases by automatic literature profiling
Coimbra, Roney Santos
data mining
scientific literature
abstracts
title_short Disclosing ambiguous gene aliases by automatic literature profiling
title_full Disclosing ambiguous gene aliases by automatic literature profiling
title_fullStr Disclosing ambiguous gene aliases by automatic literature profiling
title_full_unstemmed Disclosing ambiguous gene aliases by automatic literature profiling
title_sort Disclosing ambiguous gene aliases by automatic literature profiling
author Coimbra, Roney Santos
author_facet Coimbra, Roney Santos
Vanderwall, Dana E
Oliveira, Guilherme Corrêa
author_role author
author2 Vanderwall, Dana E
Oliveira, Guilherme Corrêa
author2_role author
author
dc.contributor.author.fl_str_mv Coimbra, Roney Santos
Vanderwall, Dana E
Oliveira, Guilherme Corrêa
dc.subject.en.pt_BR.fl_str_mv data mining
scientific literature
abstracts
topic data mining
scientific literature
abstracts
description Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Centro de Excelência em Bioinformática. Belo Horizonte, MG, Brasil/Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Grupo de Genômica e Biologia Computacional. Belo Horizonte, MG, Brasil
publishDate 2010
dc.date.issued.fl_str_mv 2010
dc.date.accessioned.fl_str_mv 2015-01-14T11:01:59Z
dc.date.available.fl_str_mv 2015-01-14T11:01:59Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.citation.fl_str_mv COIMBRA, Roney Santos; VANDERWALL, Dana E; OLIVEIRA, Guilherme Corrêa. Disclosing ambiguous gene aliases by automatic literature profiling. BMC Genomics, 11(suppl.5):s3, 2010.
dc.identifier.uri.fl_str_mv https://www.arca.fiocruz.br/handle/icict/9378
dc.identifier.issn.none.fl_str_mv 1471-2164
dc.identifier.doi.none.fl_str_mv 10.1186/1471-2164-11-S5-S3
identifier_str_mv COIMBRA, Roney Santos; VANDERWALL, Dana E; OLIVEIRA, Guilherme Corrêa. Disclosing ambiguous gene aliases by automatic literature profiling. BMC Genomics, 11(suppl.5):s3, 2010.
1471-2164
10.1186/1471-2164-11-S5-S3
url https://www.arca.fiocruz.br/handle/icict/9378
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv BioMed Central
publisher.none.fl_str_mv BioMed Central
dc.source.none.fl_str_mv reponame:Repositório Institucional da FIOCRUZ (ARCA)
instname:Fundação Oswaldo Cruz (FIOCRUZ)
instacron:FIOCRUZ
instname_str Fundação Oswaldo Cruz (FIOCRUZ)
instacron_str FIOCRUZ
institution FIOCRUZ
reponame_str Repositório Institucional da FIOCRUZ (ARCA)
collection Repositório Institucional da FIOCRUZ (ARCA)
bitstream.url.fl_str_mv https://www.arca.fiocruz.br/bitstream/icict/9378/1/license.txt
https://www.arca.fiocruz.br/bitstream/icict/9378/2/Disclosing%20ambiguous%20gene%20aliases%20by%20automatic.pdf
https://www.arca.fiocruz.br/bitstream/icict/9378/3/Disclosing%20ambiguous%20gene%20aliases%20by%20automatic.pdf.txt
bitstream.checksum.fl_str_mv 7d48279ffeed55da8dfe2f8e81f3b81f
ce54aa2c4ea49eb989f9e7308d827ce6
cef612ba271ba78030b89f8f4e7237a5
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ)
repository.mail.fl_str_mv repositorio.arca@fiocruz.br
_version_ 1813008942580629504