Clustering genomic words in human DNA using peaks and trends of distributions

Bibliographic details
Main author: Tavares, Ana Helena
Publication date: 2020
Other authors: Raymaekers, Jakob; Rousseeuw, Peter J.; Brito, Paula; Afreixo, Vera
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10773/30267
Abstract: In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the ‘trend’), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.
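The decomposition described in the abstract (a robustly estimated baseline plus a sparse vector of detrended peaks) can be illustrated with a rough sketch. This is not the authors' fitting method: the running-median baseline, the MAD-based threshold, and the names `decompose`, `half_window` and `k` are all assumptions chosen only to show the idea of splitting a spiked lag histogram into a trend and a sparse peak vector.

```python
from statistics import median


def decompose(freqs, half_window=5, k=3.0):
    """Split a lag histogram into a smooth baseline and a sparse peak vector.

    Assumption: the baseline is a running median, and a bin counts as a peak
    when its residual deviates from the median residual by more than k robust
    standard deviations (1.4826 * MAD). The published method may differ.
    """
    n = len(freqs)
    baseline = []
    for i in range(n):
        # Window is truncated at both ends of the histogram.
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        baseline.append(median(freqs[lo:hi]))

    residuals = [f - b for f, b in zip(freqs, baseline)]
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    threshold = k * 1.4826 * mad

    # Keep only residuals that stand out; everything else becomes zero,
    # which makes the peak vector sparse.
    peaks = [r if abs(r - center) > threshold else 0.0 for r in residuals]
    return baseline, peaks


# Toy histogram: flat baseline with a single spike at lag index 20.
freqs = [10.0] * 40
freqs[20] = 50.0
baseline, peaks = decompose(freqs)
```

In this toy case the running median ignores the isolated spike, so the baseline stays flat and the peak vector is zero everywhere except at the spiked lag. Clustering could then be run on features taken from both parts, which is the general idea the abstract describes.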
id RCAP_2435b0b255c5cbcb7c7f34f70a060d39
oai_identifier_str oai:ria.ua.pt:10773/30267
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Clustering genomic words in human DNA using peaks and trends of distributions
dc.contributor.author.fl_str_mv Tavares, Ana Helena
Raymaekers, Jakob
Rousseeuw, Peter J.
Brito, Paula
Afreixo, Vera
dc.subject.por.fl_str_mv Classification
Pattern recognition
Robustness
Word distances
publishDate 2020
dc.date.none.fl_str_mv 2020-03-01T00:00:00Z
2020-03
2021-01-11T10:58:24Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10773/30267
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv 1862-5347
10.1007/s11634-019-00362-x
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Springer
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799137679378481152