Accelerating the alignment phase of Minimap2 genome assembly algorithm Using GACT-X in a commercial Cloud FPGA machine.

Bibliographic details
Main author: Teng, Carolina
Publication date: 2022
Document type: Master's dissertation
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da USP
Full text: https://www.teses.usp.br/teses/disponiveis/3/3140/tde-05092022-084236/
Abstract: Genetic sequencing can provide crucial information in medicine and in biology studies. The technologies developed in the field are advancing rapidly, and the current third generation of genome sequencers offers significant improvements over the second generation. In parallel, sequencing throughput has been increasing at an exponential rate, which, coupled with falling prices, has produced a surge in the volume of genomic data to be processed. Transistor technology is reaching its fundamental limits and Moore's Law is becoming obsolete, so other alternatives are required to process this amount of data efficiently. Long reads from the third generation of sequencers are an emerging type of genetic data, with average lengths of thousands of nucleotides each. The state-of-the-art algorithm Minimap2 can assemble these reads into the genome that was sampled, but it is a computationally intensive process: for a genome the size of the human genome with sufficient coverage, running times can reach dozens of CPU hours. Hardware acceleration has been proposed to make Minimap2 more efficient, but so far only one of its main bottlenecks, the chaining step, has been successfully accelerated on FPGA. No efficient solution has been proposed for the alignment step, implemented as the ksw function. GACT-X is a Cloud FPGA design that performs a banded Smith-Waterman-Gotoh (SWG) alignment with fixed memory, suitable for any input size. GACT-X with tiles of size 4,000 can be 2x faster than ksw when aligning long sequences. Replacing the alignment function ksw in Minimap2 with GACT-X on a Cloud hybrid system can provide up to 1.41x speedup of the entire execution relative to the software-only counterpart, with comparable accuracy for data that have high similarity to the reference genome. This dissertation presents the relevant background information, the development stages and methods, the results achieved on three different datasets, and the proposed future work on this acceleration project.
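As an illustration of the banded SWG alignment mentioned in the abstract, the following is a minimal software sketch of a Smith-Waterman-Gotoh local alignment score restricted to a diagonal band. It is not the GACT-X design or the ksw function: the function name, the scoring parameters, and the fixed band are assumptions chosen for illustration, and the sketch does not model tiling, traceback, or any FPGA detail.

```python
def banded_swg_score(query, target, band=16, match=2, mismatch=-3,
                     gap_open=5, gap_extend=1):
    """Best local alignment score with affine gaps, computed only for
    cells with |i - j| <= band. A gap of length k costs
    gap_open + k * gap_extend (illustrative parameters, not from the thesis)."""
    neg_inf = float("-inf")
    n, m = len(query), len(target)
    prev_h = [0] * (m + 1)        # H: best score ending at (i-1, j)
    prev_f = [neg_inf] * (m + 1)  # F: best score ending with a vertical gap
    best = 0
    for i in range(1, n + 1):
        cur_h = [0] * (m + 1)     # out-of-band cells stay at 0 / -inf
        cur_f = [neg_inf] * (m + 1)
        e = neg_inf               # E: best score ending with a horizontal gap
        lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            # Open a new gap from H or extend an existing one.
            e = max(e, cur_h[j - 1] - gap_open) - gap_extend
            f = max(prev_f[j], prev_h[j] - gap_open) - gap_extend
            s = match if query[i - 1] == target[j - 1] else mismatch
            h = max(0, prev_h[j - 1] + s, e, f)
            cur_h[j], cur_f[j] = h, f
            best = max(best, h)
        prev_h, prev_f = cur_h, cur_f
    return best


if __name__ == "__main__":
    print(banded_swg_score("ACGTACGT", "ACGTTACGT", gap_open=2))
```

Only two rows of the matrix are kept at a time, so memory grows with the sequence length but not with its square. Narrowing `band` reduces the number of cells evaluated per row, at the cost of missing alignments that drift further from the main diagonal.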
id USP_b720fe3ef6f95677640a5cc28bb47a44
oai_identifier_str oai:teses.usp.br:tde-05092022-084236
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
dc.title.none.fl_str_mv Accelerating the alignment phase of Minimap2 genome assembly algorithm Using GACT-X in a commercial Cloud FPGA machine.
Acelerando a etapa de alinhamento do algoritmo de montagem de genoma Minimap2 usando GACT-X em uma máquina FPGA comercial na nuvem.
author Teng, Carolina
author_facet Teng, Carolina
author_role author
dc.contributor.none.fl_str_mv Fonseca, Fernando Josepetti
dc.contributor.author.fl_str_mv Teng, Carolina
dc.subject.por.fl_str_mv Acceleration
Algorithms
Bioinformatics
FPGA circuits
Cloud computing
Co-processors
Field programmable gate arrays
Genomics
Minimap2
Smith-Waterman-Gotoh
publishDate 2022
dc.date.none.fl_str_mv 2022-07-27
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/3/3140/tde-05092022-084236/
url https://www.teses.usp.br/teses/disponiveis/3/3140/tde-05092022-084236/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1809090419014762496