Burrows-Wheeler transform in secondary memory
Main author: | Pereira, Sérgio Miguel Cachucho |
---|---|
Publication date: | 2010 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
Full text: | http://hdl.handle.net/10362/5308 |
Abstract: | Master’s Thesis in Computer Engineering |
id |
RCAP_86ede9c08a019732275a6b15dd55f721 |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/5308 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository_id_str |
7160 |
spelling |
Burrows-Wheeler transform in secondary memory; Suffix arrays; External sorting; Heap; Pattern matching; Indexes; Master’s Thesis in Computer Engineering.
A suffix array is an index: a data structure that supports searching for sequences of characters. Such structures are of key importance for a large set of problems on character sequences. An especially important use of suffix arrays is computing the Burrows-Wheeler transform (BWT), which can be used for compressing text; this procedure is the basis of the UNIX utility bzip2. The BWT is also a key step in the construction of more sophisticated indexes. For large sequences of characters, such as DNA sequences of about 10 GB, it is not possible to compute the BWT on an average computer without using secondary memory.
In this dissertation we study the state-of-the-art algorithms for constructing the BWT in secondary memory. Based on this research, we propose an algorithm and compare it against the previous ones to determine its relative performance. Our algorithm is based on the classical external heapsort. The novelty lies in a heap designed specifically for suffix arrays, which we call the String Heap. The algorithm aims to be space-conscious while trying to cope with the fact that disk accesses dominate main-memory accesses. We divide our solution into two parts, splitting and merging suffix arrays; the latter is the main application of the String Heap. The merging part produces the BWT as a side effect of merging a set of partial suffix arrays of a text. We also study a second version of the algorithm that accesses secondary memory in blocks.
Faculdade de Ciências e Tecnologia; Russo, Luís; RUN; Pereira, Sérgio Miguel Cachucho; 2011-03-02T10:30:00Z; 2010; 2010-01-01T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; application/pdf; http://hdl.handle.net/10362/5308; eng; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2024-03-11T03:35:45Z; oai:run.unl.pt:10362/5308; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-20T03:16:11.777980; Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
dc.title.none.fl_str_mv |
Burrows-Wheeler transform in secondary memory |
title |
Burrows-Wheeler transform in secondary memory |
spellingShingle |
Burrows-Wheeler transform in secondary memory; Pereira, Sérgio Miguel Cachucho; Suffix arrays; External sorting; Heap; Pattern matching; Indexes |
title_short |
Burrows-Wheeler transform in secondary memory |
title_full |
Burrows-Wheeler transform in secondary memory |
title_fullStr |
Burrows-Wheeler transform in secondary memory |
title_full_unstemmed |
Burrows-Wheeler transform in secondary memory |
title_sort |
Burrows-Wheeler transform in secondary memory |
author |
Pereira, Sérgio Miguel Cachucho |
author_facet |
Pereira, Sérgio Miguel Cachucho |
author_role |
author |
dc.contributor.none.fl_str_mv |
Russo, Luís; RUN |
dc.contributor.author.fl_str_mv |
Pereira, Sérgio Miguel Cachucho |
dc.subject.por.fl_str_mv |
Suffix arrays; External sorting; Heap; Pattern matching; Indexes |
topic |
Suffix arrays; External sorting; Heap; Pattern matching; Indexes |
description |
Master’s Thesis in Computer Engineering |
publishDate |
2010 |
dc.date.none.fl_str_mv |
2010; 2010-01-01T00:00:00Z; 2011-03-02T10:30:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/5308 |
url |
http://hdl.handle.net/10362/5308 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Faculdade de Ciências e Tecnologia |
publisher.none.fl_str_mv |
Faculdade de Ciências e Tecnologia |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799137812164902912 |
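
The abstract in this record describes a two-part scheme: split the text's suffixes into partial suffix arrays, then merge them with a heap, emitting the Burrows-Wheeler transform as a side effect of the merge. The following is a minimal in-memory sketch of that idea only. It is not the thesis's String Heap and performs no disk I/O; it substitutes Python's standard `heapq.merge` and naive suffix comparison by slicing. The function names, the `block_size` parameter, and the `$` sentinel are assumptions made for illustration.

```python
import heapq


def partial_suffix_arrays(text, block_size):
    """Split suffix start positions into consecutive blocks and sort each block.

    Each returned list is a partial suffix array: the positions of one block,
    ordered by the suffixes of the full text that start there.
    (Hypothetical helper; comparison by slicing is quadratic and only
    suitable for small examples.)
    """
    n = len(text)
    return [
        sorted(range(start, min(start + block_size, n)), key=lambda i: text[i:])
        for start in range(0, n, block_size)
    ]


def bwt_by_merging(text, block_size=4, sentinel="$"):
    """Return the BWT of text + sentinel by heap-merging partial suffix arrays."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    text += sentinel  # assumed terminator, smaller than every other character used
    parts = partial_suffix_arrays(text, block_size)
    # heapq.merge performs a heap-based k-way merge of already sorted inputs.
    merged = heapq.merge(*parts, key=lambda i: text[i:])
    # The BWT character for suffix i is the character that precedes it;
    # for i == 0, text[-1] wraps around to the sentinel.
    return "".join(text[i - 1] for i in merged)
```

For example, `bwt_by_merging("banana")` returns `"annb$aa"`, and the result does not depend on `block_size`. In an external-memory setting each partial suffix array would live on disk and the heap would hold only the current head of each run, which is the role the abstract assigns to the String Heap.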