Leveraging the partition selection bias to achieve a high-quality clustering of mass spectra

Bibliographic Details
Main Author: Silva, André R.F.
Publication Date: 2021
Other Authors: Lima, Diogo B., Kurt, Louise U., Dupré, Mathieu, Chamot-Rooke, Julia, Santos, Marlon D.M., Nicolau, Carolina Alves, Valente, Richard Hemmi, Barbosa, Valmir C., Carvalho, Paulo C.
Format: Article
Language: eng
Source: Repositório Institucional da FIOCRUZ (ARCA)
Download full: https://www.arca.fiocruz.br/handle/icict/51188
Summary: Fiocruz Paraná. Instituto Carlos Chagas. Laboratório de Proteômica Estrutural e Computacional. Curitiba, PR, Brasil.
id CRUZ_0bd31213bfc333c3b35c57c9f0dc3d80
oai_identifier_str oai:www.arca.fiocruz.br:icict/51188
network_acronym_str CRUZ
network_name_str Repositório Institucional da FIOCRUZ (ARCA)
repository_id_str 2135
spelling Silva, André R.F.; Lima, Diogo B.; Kurt, Louise U.; Dupré, Mathieu; Chamot-Rooke, Julia; Santos, Marlon D.M.; Nicolau, Carolina Alves; Valente, Richard Hemmi; Barbosa, Valmir C.; Carvalho, Paulo C.
Leveraging the partition selection bias to achieve a high-quality clustering of mass spectra. Journal of Proteomics, v. 245, 104282, p. 1 - 8, June 2021. Elsevier. ISSN 1874-3919. DOI 10.1016/j.jprot.2021.104282. https://www.arca.fiocruz.br/handle/icict/51188
Keywords: Agrupamento; Espectros de massa em tandem; Ferramenta de avaliação de partição; Clustering; Tandem mass spectra; Partition assessment tool
Affiliations (as listed in the record):
Fiocruz Paraná. Instituto Carlos Chagas. Laboratório de Proteômica Estrutural e Computacional. Curitiba, PR, Brasil.
Department of Chemical Biology, Leibniz – Forschungsinstitut für Molekulare Pharmakologie (FMP). Berlin, Germany.
Fiocruz Paraná. Instituto Carlos Chagas. Laboratório de Proteômica Estrutural e Computacional. Curitiba, PR, Brasil.
Mass Spectrometry for Biology Unit, CNRS USR 2000. Institut Pasteur, Paris, France.
Mass Spectrometry for Biology Unit, CNRS USR 2000. Institut Pasteur, Paris, France.
Fiocruz Paraná. Instituto Carlos Chagas. Laboratório de Proteômica Estrutural e Computacional. Curitiba, PR, Brasil.
Fundação Oswaldo Cruz. Instituto Oswaldo Cruz. Laboratório de Toxinologia. Rio de Janeiro, RJ, Brasil / Centre de Recherche en Cancérologie et Immunologie Nantes-Angers (CRCINA), Team SOAP, INSERM U1232. Nantes, France.
Fundação Oswaldo Cruz. Instituto Oswaldo Cruz. Laboratório de Toxinologia. Rio de Janeiro, RJ, Brasil.
Universidade Federal do Rio de Janeiro. Programa de Engenharia de Sistemas e Ciência da Computação. Rio de Janeiro, RJ, Brasil.
Fiocruz Paraná. Instituto Carlos Chagas. Laboratório de Proteômica Estrutural e Computacional. Curitiba, PR, Brasil.
Abstract: In proteomics, the identification of peptides from mass spectral data can be mathematically described as the partitioning of mass spectra into clusters (i.e., groups of spectra derived from the same peptide). The way partitions are validated is just as important, having evolved side by side with the clustering algorithms themselves and given rise to many partition assessment measures. An assessment measure is said to have a selection bias if, and only if, the probability that a randomly chosen partition scores a high value depends on the number of clusters in the partition. In the context of clustering mass spectra, this might mislead the validation process into favoring clustering algorithms that generate too many (or too few) spectral clusters, regardless of the underlying peptide sequence. A selection bias toward the number of peptides is desirable for proteomics as it estimates the number of peptides in a complex protein mixture. Here, we introduce an assessment measure that is purposely biased toward the number of peptide ion species. We also introduce a partition assessment framework for proteomics, called the Partition Assessment Tool, and demonstrate its importance by evaluating the performance of eight clustering algorithms on seven proteomics datasets while discussing the trade-offs involved.
Significance: Clustering algorithms are widely adopted in proteomics for several tasks, such as speeding up search engines, generating consensus mass spectra, and aiding in the classification of proteomic profiles. Choosing which algorithm best fits the task at hand is not simple, as each algorithm has advantages and disadvantages; furthermore, specifying clustering parameters is also a necessary and fundamental step. For example, one must decide whether to generate “pure clusters” or fewer clusters that admit some noise. With this as motivation, we verify the performance of several widely adopted algorithms on proteomic datasets and introduce a theoretical framework for drawing conclusions on which approach is suitable for the task at hand.
Rights: info:eu-repo/semantics/openAccess
Files: license.txt (text/plain; charset=utf-8, 2991 bytes); RichardHValente_CarolinaNicolaru_etal_IOC_2021.pdf (application/pdf, 1679791 bytes)
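Illustrative note on the abstract's definition of selection bias: the sketch below is not the measure introduced in the article and is not part of the Partition Assessment Tool. It uses cluster purity, a common textbook measure chosen here only as an assumption for illustration, and scores partitions drawn at random with at most k clusters. All names (purity, mean_random_purity, true_labels) are hypothetical. If the average score changes with k even though the partitions are random, the measure shows a selection bias in the sense described in the abstract.

```python
import random
from collections import Counter


def purity(true_labels, cluster_ids):
    """Fraction of items that belong to the majority class of their cluster."""
    by_cluster = {}
    for label, cid in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cid, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


def mean_random_purity(true_labels, k, trials=200, seed=0):
    """Average purity of random partitions with at most k non-empty clusters."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(true_labels)
    total = 0.0
    for _ in range(trials):
        # Each item is assigned to one of k clusters uniformly at random,
        # so the partition carries no information about the true labels.
        assignment = [rng.randrange(k) for _ in range(n)]
        total += purity(true_labels, assignment)
    return total / trials


# Toy ground truth: 200 "spectra" from 4 equally sized "peptides" (assumed data).
true_labels = [i % 4 for i in range(200)]

for k in (2, 10, 50, 200):
    print(k, round(mean_random_purity(true_labels, k), 3))
```

In this toy setting the average purity of random partitions grows as k grows, which is the kind of dependence on the number of clusters that the abstract warns can favor algorithms producing too many clusters.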
dc.title.pt_BR.fl_str_mv Leveraging the partition selection bias to achieve a high-quality clustering of mass spectra
title Leveraging the partition selection bias to achieve a high-quality clustering of mass spectra
author Silva, André R.F.
author_facet Silva, André R.F.
Lima, Diogo B.
Kurt, Louise U.
Dupré, Mathieu
Chamot-Rooke, Julia
Santos, Marlon D.M.
Nicolau, Carolina Alves
Valente, Richard Hemmi
Barbosa, Valmir C.
Carvalho, Paulo C.
author_role author
author2 Lima, Diogo B.
Kurt, Louise U.
Dupré, Mathieu
Chamot-Rooke, Julia
Santos, Marlon D.M.
Nicolau, Carolina Alves
Valente, Richard Hemmi
Barbosa, Valmir C.
Carvalho, Paulo C.
author2_role author
author
author
author
author
author
author
author
author
dc.contributor.author.fl_str_mv Silva, André R.F.
Lima, Diogo B.
Kurt, Louise U.
Dupré, Mathieu
Chamot-Rooke, Julia
Santos, Marlon D.M.
Nicolau, Carolina Alves
Valente, Richard Hemmi
Barbosa, Valmir C.
Carvalho, Paulo C.
dc.subject.other.pt_BR.fl_str_mv Agrupamento
Espectros de massa em tandem
Ferramenta de avaliação de partição
topic Agrupamento
Espectros de massa em tandem
Ferramenta de avaliação de partição
Clustering
Tandem mass spectra
Partition assessment tool
dc.subject.en.pt_BR.fl_str_mv Clustering
Tandem mass spectra
Partition assessment tool
description Fiocruz Paraná. Instituto Carlos Chagas. Laboratório de Proteômica Estrutural e Computacional. Curitiba, PR, Brasil.
publishDate 2021
dc.date.issued.fl_str_mv 2021
dc.date.accessioned.fl_str_mv 2022-02-14T20:01:45Z
dc.date.available.fl_str_mv 2022-02-14T20:01:45Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.citation.fl_str_mv SILVA, André R. F. et al. Leveraging the partition selection bias to achieve a high-quality clustering of mass spectra. Journal of Proteomics, v. 245, 104282, p. 1 - 8, June 2021.
dc.identifier.uri.fl_str_mv https://www.arca.fiocruz.br/handle/icict/51188
dc.identifier.issn.pt_BR.fl_str_mv 1874-3919
dc.identifier.doi.none.fl_str_mv 10.1016/j.jprot.2021.104282
url https://www.arca.fiocruz.br/handle/icict/51188
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Elsevier
publisher.none.fl_str_mv Elsevier
dc.source.none.fl_str_mv reponame:Repositório Institucional da FIOCRUZ (ARCA)
instname:Fundação Oswaldo Cruz (FIOCRUZ)
instacron:FIOCRUZ
instname_str Fundação Oswaldo Cruz (FIOCRUZ)
instacron_str FIOCRUZ
institution FIOCRUZ
reponame_str Repositório Institucional da FIOCRUZ (ARCA)
collection Repositório Institucional da FIOCRUZ (ARCA)
bitstream.url.fl_str_mv https://www.arca.fiocruz.br/bitstream/icict/51188/1/license.txt
https://www.arca.fiocruz.br/bitstream/icict/51188/2/RichardHValente_CarolinaNicolaru_etal_IOC_2021.pdf
bitstream.checksum.fl_str_mv 5a560609d32a3863062d77ff32785d58
e382cc815f1f9db6393229c953f4bf08
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ)
repository.mail.fl_str_mv repositorio.arca@fiocruz.br
_version_ 1798324653150699520