Reconstruction and classification of unknown DNA sequences

Detalhes bibliográficos
Autor(a) principal: Lourenço, Alexandre Emanuel Monteiro
Data de Publicação: 2021
Tipo de documento: Dissertação
Idioma: eng
Título da fonte: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Texto Completo: http://hdl.handle.net/10773/33658
Resumo: The continuous advances in DNA sequencing technologies and techniques in metagenomics require reliable reconstruction and accurate classification methodologies for the diversity increase of the natural repository while contributing to the organisms' description and organization. However, after sequencing and de-novo assembly, one of the highest complex challenges comes from the DNA sequences that do not match or resemble any biological sequence from the literature. Three main reasons contribute to this exception: the organism sequence presents high divergence according to the known organisms from the literature, an irregularity has been created in the reconstruction process, or a new organism has been sequenced. The inability to efficiently classify these unknown sequences increases the sample constitution's uncertainty and becomes a wasted opportunity to discover new species since they are often discarded. In this context, the main objective of this thesis is the development and validation of a tool that provides an efficient computational solution to solve these three challenges based on an ensemble of experts, namely compression-based predictors, the distribution of sequence content, and normalized sequence lengths. The method uses both DNA and amino acid sequences and provides efficient classification beyond standard referential comparisons. Unusually, it classifies DNA sequences without resorting directly to the reference genomes but rather to features that the species biological sequences share. Specifically, it only makes use of features extracted individually from each genome without using sequence comparisons. RFSC was then created as a machine learning classification pipeline that relies on an ensemble of experts to provide efficient classification in metagenomic contexts. This pipeline was tested in synthetic and real data, both achieving precise and accurate results that, at the time of the development of this thesis, have not been reported in the state-of-the-art. Specifically, it has achieved an accuracy of approximately 97% in the domain/type classification.
id RCAP_42cd86d111a63ef52891dc0cf2b3cdbf
oai_identifier_str oai:ria.ua.pt:10773/33658
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling Reconstruction and classification of unknown DNA sequencesSequence classificationMachine learningData compressionExperts ensembleMetagenomicsThe continuous advances in DNA sequencing technologies and techniques in metagenomics require reliable reconstruction and accurate classification methodologies for the diversity increase of the natural repository while contributing to the organisms' description and organization. However, after sequencing and de-novo assembly, one of the highest complex challenges comes from the DNA sequences that do not match or resemble any biological sequence from the literature. Three main reasons contribute to this exception: the organism sequence presents high divergence according to the known organisms from the literature, an irregularity has been created in the reconstruction process, or a new organism has been sequenced. The inability to efficiently classify these unknown sequences increases the sample constitution's uncertainty and becomes a wasted opportunity to discover new species since they are often discarded. In this context, the main objective of this thesis is the development and validation of a tool that provides an efficient computational solution to solve these three challenges based on an ensemble of experts, namely compression-based predictors, the distribution of sequence content, and normalized sequence lengths. The method uses both DNA and amino acid sequences and provides efficient classification beyond standard referential comparisons. Unusually, it classifies DNA sequences without resorting directly to the reference genomes but rather to features that the species biological sequences share. Specifically, it only makes use of features extracted individually from each genome without using sequence comparisons. RFSC was then created as a machine learning classification pipeline that relies on an ensemble of experts to provide efficient classification in metagenomic contexts. This pipeline was tested in synthetic and real data, both achieving precise and accurate results that, at the time of the development of this thesis, have not been reported in the state-of-the-art. Specifically, it has achieved an accuracy of approximately 97% in the domain/type classification.Os contínuos avanços em tecnologias de sequenciação de ADN e técnicas em meta genómica requerem metodologias de reconstrução confiáveis e de classificação precisas para o aumento da diversidade do repositório natural, contribuindo, entretanto, para a descrição e organização dos organismos. No entanto, após a sequenciação e a montagem de-novo, um dos desafios mais complexos advém das sequências de ADN que não correspondem ou se assemelham a qualquer sequencia biológica da literatura. São três as principais razões que contribuem para essa exceção: uma irregularidade emergiu no processo de reconstrução, a sequência do organismo é altamente dissimilar dos organismos da literatura, ou um novo e diferente organismo foi reconstruído. A incapacidade de classificar com eficiência essas sequências desconhecidas aumenta a incerteza da constituição da amostra e desperdiça a oportunidade de descobrir novas espécies, uma vez que muitas vezes são descartadas. Neste contexto, o principal objetivo desta tese é fornecer uma solução computacional eficiente para resolver este desafio com base em um conjunto de especialistas, nomeadamente preditores baseados em compressão, a distribuição de conteúdo de sequência e comprimentos de sequência normalizados. O método usa sequências de ADN e de aminoácidos e fornece classificação eficiente além das comparações referenciais padrão. Excecionalmente, ele classifica as sequências de ADN sem recorrer diretamente a genomas de referência, mas sim às características que as sequências biológicas da espécie compartilham. Especificamente, ele usa apenas recursos extraídos individualmente de cada genoma sem usar comparações de sequência. Além disso, o pipeline é totalmente automático e permite a reconstrução sem referência de genomas a partir de reads FASTQ com a garantia adicional de armazenamento seguro de informações sensíveis. O RFSC é então um pipeline de classificação de aprendizagem automática que se baseia em um conjunto de especialistas para fornecer classificação eficiente em contextos meta genómicos. Este pipeline foi aplicado em dados sintéticos e reais, alcançando em ambos resultados precisos e exatos que, no momento do desenvolvimento desta dissertação, não foram relatados na literatura. Especificamente, esta ferramenta desenvolvida, alcançou uma precisão de aproximadamente 97% na classificação de domínio/tipo.2022-04-11T10:20:37Z2021-12-15T00:00:00Z2021-12-15info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttp://hdl.handle.net/10773/33658engLourenço, Alexandre Emanuel Monteiroinfo:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2024-02-22T12:04:46Zoai:ria.ua.pt:10773/33658Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-20T03:05:02.210992Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse
dc.title.none.fl_str_mv Reconstruction and classification of unknown DNA sequences
title Reconstruction and classification of unknown DNA sequences
spellingShingle Reconstruction and classification of unknown DNA sequences
Lourenço, Alexandre Emanuel Monteiro
Sequence classification
Machine learning
Data compression
Experts ensemble
Metagenomics
title_short Reconstruction and classification of unknown DNA sequences
title_full Reconstruction and classification of unknown DNA sequences
title_fullStr Reconstruction and classification of unknown DNA sequences
title_full_unstemmed Reconstruction and classification of unknown DNA sequences
title_sort Reconstruction and classification of unknown DNA sequences
author Lourenço, Alexandre Emanuel Monteiro
author_facet Lourenço, Alexandre Emanuel Monteiro
author_role author
dc.contributor.author.fl_str_mv Lourenço, Alexandre Emanuel Monteiro
dc.subject.por.fl_str_mv Sequence classification
Machine learning
Data compression
Experts ensemble
Metagenomics
topic Sequence classification
Machine learning
Data compression
Experts ensemble
Metagenomics
description The continuous advances in DNA sequencing technologies and techniques in metagenomics require reliable reconstruction and accurate classification methodologies for the diversity increase of the natural repository while contributing to the organisms' description and organization. However, after sequencing and de-novo assembly, one of the highest complex challenges comes from the DNA sequences that do not match or resemble any biological sequence from the literature. Three main reasons contribute to this exception: the organism sequence presents high divergence according to the known organisms from the literature, an irregularity has been created in the reconstruction process, or a new organism has been sequenced. The inability to efficiently classify these unknown sequences increases the sample constitution's uncertainty and becomes a wasted opportunity to discover new species since they are often discarded. In this context, the main objective of this thesis is the development and validation of a tool that provides an efficient computational solution to solve these three challenges based on an ensemble of experts, namely compression-based predictors, the distribution of sequence content, and normalized sequence lengths. The method uses both DNA and amino acid sequences and provides efficient classification beyond standard referential comparisons. Unusually, it classifies DNA sequences without resorting directly to the reference genomes but rather to features that the species biological sequences share. Specifically, it only makes use of features extracted individually from each genome without using sequence comparisons. RFSC was then created as a machine learning classification pipeline that relies on an ensemble of experts to provide efficient classification in metagenomic contexts. This pipeline was tested in synthetic and real data, both achieving precise and accurate results that, at the time of the development of this thesis, have not been reported in the state-of-the-art. Specifically, it has achieved an accuracy of approximately 97% in the domain/type classification.
publishDate 2021
dc.date.none.fl_str_mv 2021-12-15T00:00:00Z
2021-12-15
2022-04-11T10:20:37Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10773/33658
url http://hdl.handle.net/10773/33658
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799137705645309952