Clustering genomic words in human DNA using peaks and trends of distributions

Tavares, Ana Helena; Raymaekers, Jakob; Rousseeuw, Peter J.; Brito, Paula; Afreixo, Vera


Bibliographic details
Main author:	Tavares, Ana Helena
Publication date:	2020
Other authors:	Raymaekers, Jakob; Rousseeuw, Peter J.; Brito, Paula; Afreixo, Vera
Document type:	Article
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10773/30267
Abstract:	In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the ‘trend’), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.
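
The decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the outlier-robust fitting method, so a running median stands in for the baseline fit, and a median-absolute-deviation threshold stands in for the rule that selects peaks. The function names `lag_distribution` and `split_trend_and_peaks` and the parameters `half_window` and `threshold` are hypothetical. The clustering step itself is omitted.

```python
from collections import Counter
from statistics import median


def lag_distribution(sequence, word):
    """Relative frequencies of the distances between consecutive occurrences of `word`.

    Entry k-1 of the result is the share of lags equal to k.
    """
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)  # step by 1 so overlapping hits count
    lags = [b - a for a, b in zip(positions, positions[1:])]
    if not lags:
        return []
    counts = Counter(lags)
    return [counts.get(k, 0) / len(lags) for k in range(1, max(counts) + 1)]


def split_trend_and_peaks(freqs, half_window=2, threshold=3.0):
    """Split a lag distribution into a baseline ('trend') and a sparse peak vector.

    Assumption: a running median replaces the paper's robust fit, and residuals
    larger than `threshold` times their median absolute deviation count as peaks.
    """
    if not freqs:
        return [], []
    n = len(freqs)
    trend = [
        median(freqs[max(0, i - half_window): i + half_window + 1])
        for i in range(n)
    ]
    residuals = [f - t for f, t in zip(freqs, trend)]
    mad = median(abs(r) for r in residuals) or 1e-12  # avoid a zero scale
    peaks = [r if r > threshold * mad else 0.0 for r in residuals]
    return trend, peaks


if __name__ == "__main__":
    freqs = lag_distribution("ACGACGTTACG", "ACG")
    trend, peaks = split_trend_and_peaks(freqs)
    print(freqs, trend, peaks)
```

Under these assumptions, each word is summarized by a smooth baseline and a mostly-zero peak vector, which are the two ingredients the abstract says the clustering procedure compares.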

Item metadata

id	RCAP_2435b0b255c5cbcb7c7f34f70a060d39
oai_identifier_str	oai:ria.ua.pt:10773/30267
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
dc.title.none.fl_str_mv	Clustering genomic words in human DNA using peaks and trends of distributions
title	Clustering genomic words in human DNA using peaks and trends of distributions
spellingShingle	Clustering genomic words in human DNA using peaks and trends of distributions; Tavares, Ana Helena; Classification; Pattern recognition; Robustness; Word distances
title_short	Clustering genomic words in human DNA using peaks and trends of distributions
title_full	Clustering genomic words in human DNA using peaks and trends of distributions
title_fullStr	Clustering genomic words in human DNA using peaks and trends of distributions
title_full_unstemmed	Clustering genomic words in human DNA using peaks and trends of distributions
title_sort	Clustering genomic words in human DNA using peaks and trends of distributions
author	Tavares, Ana Helena
author_facet	Tavares, Ana Helena; Raymaekers, Jakob; Rousseeuw, Peter J.; Brito, Paula; Afreixo, Vera
author_role	author
author2	Raymaekers, Jakob; Rousseeuw, Peter J.; Brito, Paula; Afreixo, Vera
author2_role	author author author author
dc.contributor.author.fl_str_mv	Tavares, Ana Helena; Raymaekers, Jakob; Rousseeuw, Peter J.; Brito, Paula; Afreixo, Vera
dc.subject.por.fl_str_mv	Classification; Pattern recognition; Robustness; Word distances
topic	Classification; Pattern recognition; Robustness; Word distances
description	In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the ‘trend’), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.
publishDate	2020
dc.date.none.fl_str_mv	2020-03-01T00:00:00Z; 2020-03; 2021-01-11T10:58:24Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10773/30267
url	http://hdl.handle.net/10773/30267
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	1862-5347; 10.1007/s11634-019-00362-x
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Springer
publisher.none.fl_str_mv	Springer
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799137679378481152
