Burrows-Wheeler transform in secondary memory

Bibliographic details
Main author: Pereira, Sérgio Miguel Cachucho
Publication date: 2010
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10362/5308
Abstract: Master’s Thesis in Computer Engineering
id RCAP_86ede9c08a019732275a6b15dd55f721
oai_identifier_str oai:run.unl.pt:10362/5308
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling Burrows-wheeler transform in secondary memory
Suffix arrays
External sorting
Heap
Pattern matching
Indexes
Master’s Thesis in Computer Engineering
A suffix array is an index, a data structure that supports searching for sequences of characters. Such structures are of key importance for a large set of problems on character sequences. An especially important use of suffix arrays is computing the Burrows-Wheeler transform (BWT), which can be used to compress text; this procedure is the basis of the UNIX utility bzip2. The Burrows-Wheeler transform is also a key step in the construction of more sophisticated indexes. For large sequences of characters, such as DNA sequences of about 10 GB, it is not possible to compute the Burrows-Wheeler transform on an average computer without using secondary memory. In this dissertation we study the state-of-the-art algorithms for constructing the Burrows-Wheeler transform in secondary memory. Based on this research, we propose an algorithm and compare it against the previous ones to determine its relative performance. Our algorithm is based on the classical external Heapsort. The novelty lies in a heap designed specifically for suffix arrays, which we call the String Heap. The algorithm aims to be space-conscious while accounting for disk accesses dominating main-memory accesses. We divide our solution into two parts, splitting and merging suffix arrays; the latter is the main application of the String Heap. The merging part produces the BWT as a side effect of merging a set of partial suffix arrays of a text. We also compare its performance against the other algorithms. We also study a second version of the algorithm that accesses secondary memory in blocks.
Faculdade de Ciências e Tecnologia
Russo, Luís
RUN
Pereira, Sérgio Miguel Cachucho
2011-03-02T10:30:00Z
2010
2010-01-01T00:00:00Z
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/masterThesis
application/pdf
http://hdl.handle.net/10362/5308
eng
info:eu-repo/semantics/openAccess
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
2024-03-11T03:35:45Z
oai:run.unl.pt:10362/5308
Portal Agregador
ONG
https://www.rcaap.pt/oai/openaire
opendoar:7160
2024-03-20T03:16:11.777980
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
false
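The abstract in this record describes obtaining the BWT as a side effect of merging partial suffix arrays of a text. The Python sketch below illustrates only that general idea, entirely in main memory. It is not the thesis's String Heap or its external-memory algorithm: the function names (`suffix_array`, `bwt_from_suffix_array`, `bwt_by_merging`) are hypothetical, the partial suffix arrays are built by naive sorting, and the standard library's heap-based `heapq.merge` stands in for the merging heap. The text is assumed to end with a unique sentinel character `$` that sorts before every other character.

```python
import heapq


def suffix_array(text, start, stop):
    """Naive partial suffix array: starting positions in [start, stop),
    sorted by the full suffix text[i:]. Illustrative only, not efficient."""
    return sorted(range(start, stop), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """BWT character for suffix i is the character preceding it.
    For i == 0, text[-1] wraps around to the final sentinel."""
    return "".join(text[i - 1] for i in sa)


def bwt_by_merging(text, n_parts):
    """Split the positions into n_parts ranges, build one partial suffix
    array per range, then merge them with a heap and emit the BWT
    while merging (assumption: everything fits in main memory)."""
    n = len(text)
    bounds = [n * k // n_parts for k in range(n_parts + 1)]
    partials = [
        suffix_array(text, bounds[k], bounds[k + 1]) for k in range(n_parts)
    ]
    # heapq.merge keeps one candidate per partial array in a heap and
    # repeatedly yields the smallest suffix among them.
    merged = heapq.merge(*partials, key=lambda i: text[i:])
    return "".join(text[i - 1] for i in merged)


if __name__ == "__main__":
    text = "banana$"
    full = bwt_from_suffix_array(text, suffix_array(text, 0, len(text)))
    print(full)
    print(bwt_by_merging(text, 3) == full)
```

Because every partial array is sorted by the same key used in the merge, the merged order equals the full suffix array, so both routes yield the same transform. A secondary-memory version would have to avoid comparing whole suffixes held in memory, which is the problem the abstract attributes to the String Heap.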
dc.title.none.fl_str_mv Burrows-wheeler transform in secondary memory
title Burrows-wheeler transform in secondary memory
spellingShingle Burrows-wheeler transform in secondary memory
Pereira, Sérgio Miguel Cachucho
Suffix arrays
External sorting
Heap
Pattern matching
Indexes
title_short Burrows-wheeler transform in secondary memory
title_full Burrows-wheeler transform in secondary memory
title_fullStr Burrows-wheeler transform in secondary memory
title_full_unstemmed Burrows-wheeler transform in secondary memory
title_sort Burrows-wheeler transform in secondary memory
author Pereira, Sérgio Miguel Cachucho
author_facet Pereira, Sérgio Miguel Cachucho
author_role author
dc.contributor.none.fl_str_mv Russo, Luís
RUN
dc.contributor.author.fl_str_mv Pereira, Sérgio Miguel Cachucho
dc.subject.por.fl_str_mv Suffix arrays
External sorting
Heap
Pattern matching
Indexes
topic Suffix arrays
External sorting
Heap
Pattern matching
Indexes
description Master’s Thesis in Computer Engineering
publishDate 2010
dc.date.none.fl_str_mv 2010
2010-01-01T00:00:00Z
2011-03-02T10:30:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/5308
url http://hdl.handle.net/10362/5308
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Faculdade de Ciências e Tecnologia
publisher.none.fl_str_mv Faculdade de Ciências e Tecnologia
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799137812164902912