Validation of Automated Protein Annotation

Bibliographic details
Main author: Couto, Francisco M.
Publication date: 2005
Other authors: Silva, Mário J., Coutinho, Pedro M.
Document type: Report
Language: por
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10451/14256
Abstract: Given the large amount of data stored in biological databases, managing their uncertainty and incompleteness is a non-trivial problem. To cope with the large number of sequences being produced, a significant number of genes and proteins have been functionally characterized by automated tools. However, these tools have also produced a significant number of misannotations that are now present in the databases. This paper proposes a new approach for validating automated annotations, which uses the large amount of publicly available information to compare automated annotations with preexisting curated annotations. To test the proposed approach, we developed a novel unsupervised method for filtering misannotations provided by automated annotation systems. We evaluated our method using the automated annotations submitted to BioCreAtIvE, a joint evaluation of state-of-the-art text-mining systems in biology. The method scored each of these annotations, and those scoring below a certain threshold were discarded. The results show a small trade-off in recall for a large improvement in precision. For example, we were able to discard 44.6%, 66.8%, and 81% of the misannotations while maintaining 96.9%, 84.2%, and 47.8% of the correct annotations, respectively. Moreover, we were able to outperform each individual submission to BioCreAtIvE by proper adjustment of the threshold. These results show the effectiveness of our approach in assisting curators of large biological databases in the use of contemporary tools for automatic identification of annotations.
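The threshold step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not say how scores are computed, so the scores here are supplied by hand, and the names ScoredAnnotation, filter_by_threshold, evaluate, and the TOY data are all hypothetical. The sketch only shows how discarding annotations scored below a threshold trades retained correct annotations against discarded misannotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredAnnotation:
    protein: str      # hypothetical protein identifier
    term: str         # hypothetical annotation term
    score: float      # validation score (its computation is NOT shown here)
    is_correct: bool  # gold-standard label, used only for evaluation


def filter_by_threshold(annotations, threshold):
    """Keep annotations scored at or above the threshold; discard the rest."""
    return [a for a in annotations if a.score >= threshold]


def evaluate(annotations, kept):
    """Compare the kept subset against the full set of annotations."""
    total_correct = sum(a.is_correct for a in annotations)
    total_wrong = len(annotations) - total_correct
    kept_correct = sum(a.is_correct for a in kept)
    kept_wrong = len(kept) - kept_correct
    return {
        "precision_before": total_correct / len(annotations),
        "precision_after": kept_correct / len(kept) if kept else 0.0,
        "correct_retained": kept_correct / total_correct if total_correct else 0.0,
        "misannotations_discarded": (
            (total_wrong - kept_wrong) / total_wrong if total_wrong else 0.0
        ),
    }


# Made-up toy data, for illustration only (not from BioCreAtIvE).
TOY = [
    ScoredAnnotation("P1", "T1", 0.9, True),
    ScoredAnnotation("P1", "T2", 0.2, False),
    ScoredAnnotation("P2", "T3", 0.7, True),
    ScoredAnnotation("P2", "T4", 0.6, False),
    ScoredAnnotation("P3", "T5", 0.4, True),
    ScoredAnnotation("P3", "T6", 0.1, False),
]

if __name__ == "__main__":
    kept = filter_by_threshold(TOY, threshold=0.5)
    for name, value in evaluate(TOY, kept).items():
        print(f"{name}: {value:.3f}")
```

Raising the threshold discards more misannotations but also more correct annotations, which is the trade-off the abstract reports at its three operating points.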
id RCAP_8624aea229b2d94b47fe1fbfd8756c25
oai_identifier_str oai:repositorio.ul.pt:10451/14256
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.contributor.none.fl_str_mv Repositório da Universidade de Lisboa
dc.contributor.author.fl_str_mv Couto, Francisco M.
Silva, Mário J.
Coutinho, Pedro M.
dc.subject.por.fl_str_mv data mining
text mining
gene and protein annotation
publishDate 2005
dc.date.none.fl_str_mv 2005-12
2005-12-01T00:00:00Z
2009-02-10T13:11:48Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/report
format report
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10451/14256
url http://hdl.handle.net/10451/14256
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Department of Informatics, University of Lisbon
publisher.none.fl_str_mv Department of Informatics, University of Lisbon
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação