Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach

Bibliographic details
Main author: Araujo, José Deney
Publication date: 2022
Other authors: Silva, Juan Carlo Santos E; Martins, André Guilherme Costa; Sampaio, Vanderson; Castro, Daniel Barros de; Souza, Robson F de; Giddaluru, Jeevan; Ramos, Pablo Ivan P; Pita, Robespierre; Barreto, Mauricio L; Barral Netto, Manoel; Nakaya, Helder I
Document type: Article
Language: eng
Source title: Repositório Institucional da FIOCRUZ (ARCA)
Full text: https://www.arca.fiocruz.br/handle/icict/54852
Summary: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
id CRUZ_3ef553fb84495b5e563488ad3a837aea
oai_identifier_str oai:www.arca.fiocruz.br:icict/54852
network_acronym_str CRUZ
network_name_str Repositório Institucional da FIOCRUZ (ARCA)
repository_id_str 2135
spelling Araujo, José Deney
Silva, Juan Carlo Santos E
Martins, André Guilherme Costa
Sampaio, Vanderson
Castro, Daniel Barros de
Souza, Robson F de
Giddaluru, Jeevan
Ramos, Pablo Ivan P
Pita, Robespierre
Barreto, Mauricio L
Barral Netto, Manoel
Nakaya, Helder I
2022-09-23T12:44:57Z
2022-09-23T12:44:57Z
2022
ARAUJO, José Deney et al. Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach. PeerJ, p. 1-17, 2022.
2167-8359
https://www.arca.fiocruz.br/handle/icict/54852
10.7717/peerj.13507
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
Universidade de São Paulo. Departamento de Análises Clínicas e Toxicológicas. São Paulo, SP, Brasil.
Universidade de São Paulo. Departamento de Análises Clínicas e Toxicológicas. São Paulo, SP, Brasil.
Universidade de São Paulo. Departamento de Análises Clínicas e Toxicológicas. São Paulo, SP, Brasil / Plataforma Científica Pasteur USP. São Paulo, SP, Brasil.
Fundação de Medicina Tropical Dr. Heitor Vieira Dourado. Manaus, AM, Brasil / Instituto Todos pela Saúde. São Paulo, SP, Brasil.
Fundação de Vigilância em Saúde do Amazonas. Manaus, AM, Brasil.
Universidade de São Paulo. Departamento de Microbiologia. São Paulo, SP, Brasil.
Universidade de São Paulo. Departamento de Análises Clínicas e Toxicológicas. São Paulo, SP, Brasil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Salvador, BA, Brasil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Salvador, BA, Brasil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Salvador, BA, Brasil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Salvador, BA, Brasil.
Universidade de São Paulo. Departamento de Análises Clínicas e Toxicológicas. São Paulo, SP, Brasil / Plataforma Científica Pasteur USP. São Paulo, SP, Brasil / Instituto Todos pela Saúde. São Paulo, SP, Brasil / Hospital Israelita Albert Einstein. São Paulo, SP, Brasil.
Background: Public health research frequently requires the integration of information from different data sources. However, errors in the records and the high computational costs involved make linking large administrative databases using record linkage (RL) methodologies a major challenge.
Methods: We present Tucuxi-BLAST, a versatile tool for probabilistic RL that utilizes a DNA-encoded approach to encrypt, analyze and link massive administrative databases. Tucuxi-BLAST encodes the identification records into DNA. The BLASTn algorithm is then used to align the sequences between databases. We tested and benchmarked it on a simulated database containing records for 300 million individuals and also on four large administrative databases containing real data on Brazilian patients.
Results: Our method was able to overcome misspellings and typographical errors in administrative databases. In processing the RL of the largest simulated dataset (200k records), the state-of-the-art method took 5 days and 7 h to perform the RL, while Tucuxi-BLAST took only 23 h. When compared with five existing RL tools applied to a gold-standard dataset from real health-related databases, Tucuxi-BLAST had the highest accuracy and speed.
Conclusion: By repurposing genomic tools, Tucuxi-BLAST can improve data-driven medical research and provide a fast and accurate way to link individual information across several administrative databases.
eng
PeerJ
Epidemiology
Codificado DNA
Ligação de registro
Ferramentas genômicas
Epidemiologia
Explosão
DNA-encoded
Record linkage
Genomic tools
Epidemiology
Blast
Epidemiology
Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/article
info:eu-repo/semantics/openAccess
reponame:Repositório Institucional da FIOCRUZ (ARCA)
instname:Fundação Oswaldo Cruz (FIOCRUZ)
instacron:FIOCRUZ
LICENSE
license.txt
text/plain; charset=utf-8
2991
https://www.arca.fiocruz.br/bitstream/icict/54852/1/license.txt
5a560609d32a3863062d77ff32785d58
MD5
1
ORIGINAL
Araújo, José Deney Alves - Tucuxi-blast.pdf
application/pdf
1848009
https://www.arca.fiocruz.br/bitstream/icict/54852/2/Ara%c3%bajo%2c%20Jos%c3%a9%20Deney%20Alves%20-%20Tucuxi-blast.pdf
5c9e1608f49ccda79a89fe3901ca69e0
MD5
2
icict/54852
2023-03-15 14:32:50.029
oai:www.arca.fiocruz.br:icict/54852
Repositório Institucional
PUB
https://www.arca.fiocruz.br/oai/request
repositorio.arca@fiocruz.br
opendoar:2135
2023-03-15T17:32:50
Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ)
false
dc.title.en_US.fl_str_mv Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
title Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
spellingShingle Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
Araujo, José Deney
Epidemiology
Codificado DNA
Ligação de registro
Ferramentas genômicas
Epidemiologia
Explosão
DNA-encoded
Record linkage
Genomic tools
Epidemiology
Blast
Epidemiology
title_short Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
title_full Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
title_fullStr Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
title_full_unstemmed Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
title_sort Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
author Araujo, José Deney
author_facet Araujo, José Deney
Silva, Juan Carlo Santos E
Martins, André Guilherme Costa
Sampaio, Vanderson
Castro, Daniel Barros de
Souza, Robson F de
Giddaluru, Jeevan
Ramos, Pablo Ivan P
Pita, Robespierre
Barreto, Mauricio L
Barral Netto, Manoel
Nakaya, Helder I
author_role author
author2 Silva, Juan Carlo Santos E
Martins, André Guilherme Costa
Sampaio, Vanderson
Castro, Daniel Barros de
Souza, Robson F de
Giddaluru, Jeevan
Ramos, Pablo Ivan P
Pita, Robespierre
Barreto, Mauricio L
Barral Netto, Manoel
Nakaya, Helder I
author2_role author
author
author
author
author
author
author
author
author
author
author
dc.contributor.author.fl_str_mv Araujo, José Deney
Silva, Juan Carlo Santos E
Martins, André Guilherme Costa
Sampaio, Vanderson
Castro, Daniel Barros de
Souza, Robson F de
Giddaluru, Jeevan
Ramos, Pablo Ivan P
Pita, Robespierre
Barreto, Mauricio L
Barral Netto, Manoel
Nakaya, Helder I
dc.subject.mesh.en_US.fl_str_mv Epidemiology
topic Epidemiology
Codificado DNA
Ligação de registro
Ferramentas genômicas
Epidemiologia
Explosão
DNA-encoded
Record linkage
Genomic tools
Epidemiology
Blast
Epidemiology
dc.subject.other.en_US.fl_str_mv Codificado DNA
Ligação de registro
Ferramentas genômicas
Epidemiologia
Explosão
dc.subject.en.en_US.fl_str_mv DNA-encoded
Record linkage
Genomic tools
Epidemiology
Blast
dc.subject.decs.en_US.fl_str_mv Epidemiology
description Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
publishDate 2022
dc.date.accessioned.fl_str_mv 2022-09-23T12:44:57Z
dc.date.available.fl_str_mv 2022-09-23T12:44:57Z
dc.date.issued.fl_str_mv 2022
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.citation.fl_str_mv ARAUJO, José Deney et al. Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach. PeerJ, p. 1-17, 2022.
dc.identifier.uri.fl_str_mv https://www.arca.fiocruz.br/handle/icict/54852
dc.identifier.issn.en_US.fl_str_mv 2167-8359
dc.identifier.doi.none.fl_str_mv 10.7717/peerj.13507
identifier_str_mv ARAUJO, José Deney et al. Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach. PeerJ, p. 1-17, 2022.
2167-8359
10.7717/peerj.13507
url https://www.arca.fiocruz.br/handle/icict/54852
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv PeerJ
publisher.none.fl_str_mv PeerJ
dc.source.none.fl_str_mv reponame:Repositório Institucional da FIOCRUZ (ARCA)
instname:Fundação Oswaldo Cruz (FIOCRUZ)
instacron:FIOCRUZ
instname_str Fundação Oswaldo Cruz (FIOCRUZ)
instacron_str FIOCRUZ
institution FIOCRUZ
reponame_str Repositório Institucional da FIOCRUZ (ARCA)
collection Repositório Institucional da FIOCRUZ (ARCA)
bitstream.url.fl_str_mv https://www.arca.fiocruz.br/bitstream/icict/54852/1/license.txt
https://www.arca.fiocruz.br/bitstream/icict/54852/2/Ara%c3%bajo%2c%20Jos%c3%a9%20Deney%20Alves%20-%20Tucuxi-blast.pdf
bitstream.checksum.fl_str_mv 5a560609d32a3863062d77ff32785d58
5c9e1608f49ccda79a89fe3901ca69e0
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ)
repository.mail.fl_str_mv repositorio.arca@fiocruz.br
_version_ 1798324925923065856