Parallel clustering of single cell transcriptomic data with split-merge sampling on Dirichlet process mixtures

Bibliographic details
Main author: Duan, Tiehang
Publication date: 2018
Other authors: Pinto, José P., Xie, Xiaohui
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10400.1/12472
Abstract: Motivation: With the development of droplet-based systems, massive single-cell transcriptome data have become available, enabling analysis of cellular and molecular processes at single-cell resolution, which is instrumental to understanding many biological processes. While state-of-the-art clustering methods have been applied to the data, they face three challenges: (1) clustering quality still needs to be improved; (2) most models need prior knowledge of the number of clusters, which is not always available; (3) there is a demand for faster computational speed. Results: We propose to tackle these challenges with Parallel Split Merge Sampling on Dirichlet Process Mixture Model (the Para-DPMM model). Unlike classic DPMM methods, which perform sampling on each single data point, the split-merge mechanism samples at the cluster level, which significantly improves convergence and optimality of the result. The model is highly parallelized and can utilize the computing power of high performance computing (HPC) clusters, enabling massive clustering on huge datasets. Experimental results show that the model outperforms current widely used models in both clustering quality and computational speed. Availability: Source code is publicly available at https://github.com/tiehangd/Para_DPMM/tree/master/Para_DPMM_package
id RCAP_fbe7c1e06cedf128869b9b43cad5f7d7
oai_identifier_str oai:sapientia.ualg.pt:10400.1/12472
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.contributor.none.fl_str_mv Sapientia
dc.date.none.fl_str_mv 2018-12-25
2018-12-25T00:00:00Z
2019-04-11T19:30:20Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.relation.none.fl_str_mv 1367-4803
10.1093/bioinformatics/bty702
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Oxford University Press
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.mail.fl_str_mv mluisa.alvim@gmail.com