A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature
Autor(a) principal: | |
---|---|
Data de Publicação: | 2011 |
Outros Autores: | , , , , , |
Tipo de documento: | Artigo |
Idioma: | eng |
Título da fonte: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Texto Completo: | http://hdl.handle.net/1822/22366 |
Resumo: | Background We participated, as Team 81, in the Article Classification and the Interaction Method subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of the BioCreative III Challenge. For the ACT, we pursued an extensive testing of available Named Entity Recognition and dictionary tools, and used the most promising ones to extend our Variable Trigonometric Threshold linear classifier. Our main goal was to exploit the power of available named entity recognition and dictionary tools to aid in the classification of documents relevant to Protein-Protein Interaction (PPI). For the IMT, we focused on obtaining evidence in support of the interaction methods used, rather than on tagging the document with the method identifiers. We experimented with a primarily statistical approach, as opposed to employing a deeper natural language processing strategy. In a nutshell, we exploited classifiers, simple pattern matching for potential PPI methods within sentences, and ranking of candidate matches using statistical considerations. Finally, we also studied the benefits of integrating the method extraction approach that we have used for the IMT into the ACT pipeline. Results For the ACT, our linear article classifier leads to a ranking and classification performance significantly higher than all the reported submissions to the challenge in terms of Area Under the Interpolated Precision and Recall Curve, Mathew’s Correlation Coefficient, and F-Score. We observe that the most useful Named Entity Recognition and Dictionary tools for classification of articles relevant to protein-protein interaction are: ABNER, NLPROT, OSCAR 3 and the PSI-MI ontology. For the IMT, our results are comparable to those of other systems, which took very different approaches. While the performance is not very high, we focus on providing evidence for potential interaction detection methods. A significant majority of the evidence sentences, as evaluated by independent annotators, are relevant to PPI detection methods. Conclusions For the ACT, we show that the use of named entity recognition tools leads to a substantial improvement in the ranking and classification of articles relevant to protein-protein interaction. Thus, we show that our substantially expanded linear classifier is a very competitive classifier in this domain. Moreover, this classifier produces interpretable surfaces that can be understood as “rules” for human understanding of the classification. We also provide evidence supporting certain named entity recognition tools as beneficial for protein-interaction article classification, or demonstrating that some of the tools are not beneficial for the task. In terms of the IMT task, in contrast to other participants, our approach focused on identifying sentences that are likely to bear evidence for the application of a PPI detection method, rather than on classifying a document as relevant to a method. As BioCreative III did not perform an evaluation of the evidence provided by the system, we have conducted a separate assessment, where multiple independent annotators manually evaluated the evidence produced by one of our runs. Preliminary results from this experiment are reported here and suggest that the majority of the evaluators agree that our tool is indeed effective in detecting relevant evidence for PPI detection methods. Regarding the integration of both tasks, we note that the time required for running each pipeline is realistic within a curation effort, and that we can, without compromising the quality of the output, reduce the time necessary to extract entities from text for the ACT pipeline by pre-selecting candidate relevant text using the IMT pipeline. |
id |
RCAP_97ad87af34d7d6f08834c1b110301107 |
---|---|
oai_identifier_str |
oai:repositorium.sdum.uminho.pt:1822/22366 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literatureScience & TechnologyBackground We participated, as Team 81, in the Article Classification and the Interaction Method subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of the BioCreative III Challenge. For the ACT, we pursued an extensive testing of available Named Entity Recognition and dictionary tools, and used the most promising ones to extend our Variable Trigonometric Threshold linear classifier. Our main goal was to exploit the power of available named entity recognition and dictionary tools to aid in the classification of documents relevant to Protein-Protein Interaction (PPI). For the IMT, we focused on obtaining evidence in support of the interaction methods used, rather than on tagging the document with the method identifiers. We experimented with a primarily statistical approach, as opposed to employing a deeper natural language processing strategy. In a nutshell, we exploited classifiers, simple pattern matching for potential PPI methods within sentences, and ranking of candidate matches using statistical considerations. Finally, we also studied the benefits of integrating the method extraction approach that we have used for the IMT into the ACT pipeline. Results For the ACT, our linear article classifier leads to a ranking and classification performance significantly higher than all the reported submissions to the challenge in terms of Area Under the Interpolated Precision and Recall Curve, Mathew’s Correlation Coefficient, and F-Score. We observe that the most useful Named Entity Recognition and Dictionary tools for classification of articles relevant to protein-protein interaction are: ABNER, NLPROT, OSCAR 3 and the PSI-MI ontology. For the IMT, our results are comparable to those of other systems, which took very different approaches. While the performance is not very high, we focus on providing evidence for potential interaction detection methods. A significant majority of the evidence sentences, as evaluated by independent annotators, are relevant to PPI detection methods. Conclusions For the ACT, we show that the use of named entity recognition tools leads to a substantial improvement in the ranking and classification of articles relevant to protein-protein interaction. Thus, we show that our substantially expanded linear classifier is a very competitive classifier in this domain. Moreover, this classifier produces interpretable surfaces that can be understood as “rules” for human understanding of the classification. We also provide evidence supporting certain named entity recognition tools as beneficial for protein-interaction article classification, or demonstrating that some of the tools are not beneficial for the task. In terms of the IMT task, in contrast to other participants, our approach focused on identifying sentences that are likely to bear evidence for the application of a PPI detection method, rather than on classifying a document as relevant to a method. As BioCreative III did not perform an evaluation of the evidence provided by the system, we have conducted a separate assessment, where multiple independent annotators manually evaluated the evidence produced by one of our runs. Preliminary results from this experiment are reported here and suggest that the majority of the evaluators agree that our tool is indeed effective in detecting relevant evidence for PPI detection methods. Regarding the integration of both tasks, we note that the time required for running each pipeline is realistic within a curation effort, and that we can, without compromising the quality of the output, reduce the time necessary to extract entities from text for the ACT pipeline by pre-selecting candidate relevant text using the IMT pipeline.We thank the annotators from Sharon Regan's lab and the department of Biology at Queen's University: Kyle Bender, Daniel Frank, Kyle Laursen, Brendan O'Leary and Hernan Del Vecchio. Their work as well as that of Andrew Wong's was supported by HS's NSERC Discovery and Discovery Accelerator awards #298292-08 and CFI New Opportunities Award 10437. Michael Conover and Azadeh Nematzadeh were supported with a grant from the FLAD Computational Biology Collaboratorium at the Instituto Gulbenkian de Ciencia in Oeiras, Portugal. We also thank support from these grants for travel, hosting and providing facilities used to conduct part of this research. We thank Artemy Kolchinsky for assistance in setting up the online server for the ACT.BioMed Central (BMC)Universidade do MinhoLourenço, AnáliaConover, M.Wong, A.Nematzadeh, A.Pan, F.Shatkay, HagitRocha, Luís M.20112011-01-01T00:00:00Zinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/articleapplication/pdfhttp://hdl.handle.net/1822/22366eng1471-210510.1186/1471-2105-12-S8-S1222151823info:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2023-07-21T12:03:09Zoai:repositorium.sdum.uminho.pt:1822/22366Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-19T18:53:14.607583Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse |
dc.title.none.fl_str_mv |
A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature |
title |
A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature |
spellingShingle |
A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature Lourenço, Anália Science & Technology |
title_short |
A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature |
title_full |
A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature |
title_fullStr |
A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature |
title_full_unstemmed |
A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature |
title_sort |
A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature |
author |
Lourenço, Anália |
author_facet |
Lourenço, Anália Conover, M. Wong, A. Nematzadeh, A. Pan, F. Shatkay, Hagit Rocha, Luís M. |
author_role |
author |
author2 |
Conover, M. Wong, A. Nematzadeh, A. Pan, F. Shatkay, Hagit Rocha, Luís M. |
author2_role |
author author author author author author |
dc.contributor.none.fl_str_mv |
Universidade do Minho |
dc.contributor.author.fl_str_mv |
Lourenço, Anália Conover, M. Wong, A. Nematzadeh, A. Pan, F. Shatkay, Hagit Rocha, Luís M. |
dc.subject.por.fl_str_mv |
Science & Technology |
topic |
Science & Technology |
description |
Background We participated, as Team 81, in the Article Classification and the Interaction Method subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of the BioCreative III Challenge. For the ACT, we pursued an extensive testing of available Named Entity Recognition and dictionary tools, and used the most promising ones to extend our Variable Trigonometric Threshold linear classifier. Our main goal was to exploit the power of available named entity recognition and dictionary tools to aid in the classification of documents relevant to Protein-Protein Interaction (PPI). For the IMT, we focused on obtaining evidence in support of the interaction methods used, rather than on tagging the document with the method identifiers. We experimented with a primarily statistical approach, as opposed to employing a deeper natural language processing strategy. In a nutshell, we exploited classifiers, simple pattern matching for potential PPI methods within sentences, and ranking of candidate matches using statistical considerations. Finally, we also studied the benefits of integrating the method extraction approach that we have used for the IMT into the ACT pipeline. Results For the ACT, our linear article classifier leads to a ranking and classification performance significantly higher than all the reported submissions to the challenge in terms of Area Under the Interpolated Precision and Recall Curve, Mathew’s Correlation Coefficient, and F-Score. We observe that the most useful Named Entity Recognition and Dictionary tools for classification of articles relevant to protein-protein interaction are: ABNER, NLPROT, OSCAR 3 and the PSI-MI ontology. For the IMT, our results are comparable to those of other systems, which took very different approaches. While the performance is not very high, we focus on providing evidence for potential interaction detection methods. A significant majority of the evidence sentences, as evaluated by independent annotators, are relevant to PPI detection methods. Conclusions For the ACT, we show that the use of named entity recognition tools leads to a substantial improvement in the ranking and classification of articles relevant to protein-protein interaction. Thus, we show that our substantially expanded linear classifier is a very competitive classifier in this domain. Moreover, this classifier produces interpretable surfaces that can be understood as “rules” for human understanding of the classification. We also provide evidence supporting certain named entity recognition tools as beneficial for protein-interaction article classification, or demonstrating that some of the tools are not beneficial for the task. In terms of the IMT task, in contrast to other participants, our approach focused on identifying sentences that are likely to bear evidence for the application of a PPI detection method, rather than on classifying a document as relevant to a method. As BioCreative III did not perform an evaluation of the evidence provided by the system, we have conducted a separate assessment, where multiple independent annotators manually evaluated the evidence produced by one of our runs. Preliminary results from this experiment are reported here and suggest that the majority of the evaluators agree that our tool is indeed effective in detecting relevant evidence for PPI detection methods. Regarding the integration of both tasks, we note that the time required for running each pipeline is realistic within a curation effort, and that we can, without compromising the quality of the output, reduce the time necessary to extract entities from text for the ACT pipeline by pre-selecting candidate relevant text using the IMT pipeline. |
publishDate |
2011 |
dc.date.none.fl_str_mv |
2011 2011-01-01T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/1822/22366 |
url |
http://hdl.handle.net/1822/22366 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
1471-2105 10.1186/1471-2105-12-S8-S12 22151823 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
BioMed Central (BMC) |
publisher.none.fl_str_mv |
BioMed Central (BMC) |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799132311677042688 |