Validation of Automated Protein Annotation
Main author: | Couto, Francisco M. |
---|---|
Publication date: | 2005 |
Other authors: | Silva, Mário J.; Coutinho, Pedro M. |
Document type: | Report |
Language: | por |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10451/14256 |
Abstract: | Given the large amount of data stored in biological databases, managing their uncertainty and incompleteness is a non-trivial problem. To cope with the large number of sequences being produced, a significant number of genes and proteins have been functionally characterized by automated tools. However, these tools have also produced a significant number of misannotations that are now present in the databases. This paper proposes a new approach for validating automated annotations, which uses the large amount of publicly available information to compare automated annotations with preexisting curated annotations. To test the proposed approach, we developed a novel unsupervised method for filtering misannotations provided by automated annotation systems. We evaluated our method using the automated annotations submitted to BioCreAtIvE, a joint evaluation of state-of-the-art text-mining systems in Biology. The method scored each of these annotations, and those scoring below a certain threshold were discarded. The results show a small loss in recall in exchange for a large improvement in precision. For example, we were able to discard 44.6%, 66.8%, and 81% of the misannotations while maintaining 96.9%, 84.2%, and 47.8% of the correct annotations, respectively. Moreover, we were able to outperform each individual submission to BioCreAtIvE by proper adjustment of the threshold. These results show the effectiveness of our approach in assisting curators of large biological databases in the use of contemporary tools for automatic identification of annotations. |
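
The threshold filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not say how scores are computed, so the scores, the `ScoredAnnotation` type, the function names, and the toy data below are all hypothetical. The sketch only assumes that each annotation already carries a score (higher meaning more plausible) and shows how raising the threshold trades retained correct annotations for discarded misannotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredAnnotation:
    protein: str      # hypothetical protein identifier
    term: str         # hypothetical annotation term
    score: float      # validation score; higher = more plausible (assumption)
    is_correct: bool  # gold-standard label, used only for evaluation


def filter_by_threshold(annotations, threshold):
    """Discard annotations scoring below the threshold; keep the rest."""
    return [a for a in annotations if a.score >= threshold]


def evaluate(annotations, threshold):
    """Report precision after filtering, plus the fractions of correct
    annotations kept and of misannotations discarded."""
    kept = filter_by_threshold(annotations, threshold)
    total_correct = sum(a.is_correct for a in annotations)
    total_wrong = len(annotations) - total_correct
    kept_correct = sum(a.is_correct for a in kept)
    kept_wrong = len(kept) - kept_correct
    return {
        "precision": kept_correct / len(kept) if kept else 0.0,
        "correct_kept": kept_correct / total_correct if total_correct else 0.0,
        "wrong_discarded": (
            (total_wrong - kept_wrong) / total_wrong if total_wrong else 0.0
        ),
    }


# Toy data, invented purely for illustration.
TOY = [
    ScoredAnnotation("P1", "term_a", 0.9, True),
    ScoredAnnotation("P2", "term_b", 0.8, True),
    ScoredAnnotation("P3", "term_c", 0.7, False),
    ScoredAnnotation("P4", "term_d", 0.4, True),
    ScoredAnnotation("P5", "term_e", 0.3, False),
    ScoredAnnotation("P6", "term_f", 0.1, False),
]

if __name__ == "__main__":
    for t in (0.0, 0.35, 0.75):
        print(t, evaluate(TOY, t))
```

On this toy data, a low threshold keeps every correct annotation but also every misannotation, while a high threshold removes all misannotations at the cost of some correct ones. This is the same kind of trade-off the abstract reports for its three operating points.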
id |
RCAP_8624aea229b2d94b47fe1fbfd8756c25 |
---|---|
oai_identifier_str |
oai:repositorio.ul.pt:10451/14256 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Validation of Automated Protein Annotation |
author |
Couto, Francisco M. |
author_facet |
Couto, Francisco M. Silva, Mário J. Coutinho, Pedro M. |
author_role |
author |
author2 |
Silva, Mário J. Coutinho, Pedro M. |
author2_role |
author author |
dc.contributor.none.fl_str_mv |
Repositório da Universidade de Lisboa |
dc.contributor.author.fl_str_mv |
Couto, Francisco M. Silva, Mário J. Coutinho, Pedro M. |
dc.subject.por.fl_str_mv |
data mining text mining gene and protein annotation |
topic |
data mining text mining gene and protein annotation |
publishDate |
2005 |
dc.date.none.fl_str_mv |
2005-12 2005-12-01T00:00:00Z 2009-02-10T13:11:48Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/report |
format |
report |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10451/14256 |
url |
http://hdl.handle.net/10451/14256 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Department of Informatics, University of Lisbon |
publisher.none.fl_str_mv |
Department of Informatics, University of Lisbon |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1817553080396283904 |