Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
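Main author: | Araujo, José Deney |
---|---|
Publication date: | 2022 |
Other authors: | Silva, Juan Carlo Santos E; Martins, André Guilherme Costa; Sampaio, Vanderson; Castro, Daniel Barros de; Souza, Robson F de; Giddaluru, Jeevan; Ramos, Pablo Ivan P; Pita, Robespierre; Barreto, Mauricio L; Barral Netto, Manoel; Nakaya, Helder I |
Document type: | Article |
Language: | eng |
Source title: | Repositório Institucional da FIOCRUZ (ARCA) |
Full text: | https://www.arca.fiocruz.br/handle/icict/54852 |
Abstract: | Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). |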
id |
CRUZ_3ef553fb84495b5e563488ad3a837aea |
---|---|
oai_identifier_str |
oai:www.arca.fiocruz.br:icict/54852 |
network_acronym_str |
CRUZ |
network_name_str |
Repositório Institucional da FIOCRUZ (ARCA) |
repository_id_str |
2135 |
spelling |
Authors: Araujo, José Deney; Silva, Juan Carlo Santos E; Martins, André Guilherme Costa; Sampaio, Vanderson; Castro, Daniel Barros de; Souza, Robson F de; Giddaluru, Jeevan; Ramos, Pablo Ivan P; Pita, Robespierre; Barreto, Mauricio L; Barral Netto, Manoel; Nakaya, Helder I
Dates (accessioned; available; issued): 2022-09-23T12:44:57Z; 2022-09-23T12:44:57Z; 2022
Citation: ARAUJO, José Deney et al. Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach. PeerJ, p. 1-17, 2022.
ISSN: 2167-8359
URI: https://www.arca.fiocruz.br/handle/icict/54852
DOI: 10.7717/peerj.13507
Funding: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
Affiliations (in record order):
Universidade de São Paulo. Departamento de Análises Clínicas e Toxicológicas. São Paulo, SP, Brasil.
Universidade de São Paulo. Departamento de Análises Clínicas e Toxicológicas. São Paulo, SP, Brasil.
Universidade de São Paulo. Departamento de Análises Clínicas e Toxicológicas. São Paulo, SP, Brasil / Plataforma Científica Pasteur USP. São Paulo, SP, Brasil.
Fundação de Medicina Tropical Dr. Heitor Vieira Dourado. Manaus, AM, Brasil / Instituto Todos pela Saúde. São Paulo, SP, Brasil.
Fundação de Vigilância em Saúde do Amazonas. Manaus, AM, Brasil.
Universidade de São Paulo. Departamento de Microbiologia. São Paulo, SP, Brasil.
Universidade de São Paulo. Departamento de Análises Clínicas e Toxicológicas. São Paulo, SP, Brasil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Salvador, BA, Brasil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Salvador, BA, Brasil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Salvador, BA, Brasil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Salvador, BA, Brasil.
Universidade de São Paulo. Departamento de Análises Clínicas e Toxicológicas. São Paulo, SP, Brasil / Plataforma Científica Pasteur USP. São Paulo, SP, Brasil / Instituto Todos pela Saúde. São Paulo, SP, Brasil / Hospital Israelita Albert Einstein. São Paulo, SP, Brasil.
Abstract:
Background: Public health research frequently requires the integration of information from different data sources. However, errors in the records and the high computational costs involved make linking large administrative databases using record linkage (RL) methodologies a major challenge.
Methods: We present Tucuxi-BLAST, a versatile tool for probabilistic RL that utilizes a DNA-encoded approach to encrypt, analyze and link massive administrative databases. Tucuxi-BLAST encodes the identification records into DNA. The BLASTn algorithm is then used to align the sequences between databases. We tested and benchmarked on a simulated database containing records for 300 million individuals and also on four large administrative databases containing real data on Brazilian patients.
Results: Our method was able to overcome misspellings and typographical errors in administrative databases. In processing the RL of the largest simulated dataset (200k records), the state-of-the-art method took 5 days and 7 h to perform the RL, while Tucuxi-BLAST only took 23 h. When compared with five existing RL tools applied to a gold-standard dataset from real health-related databases, Tucuxi-BLAST had the highest accuracy and speed. By repurposing genomic tools, Tucuxi-BLAST can improve data-driven medical research and provide a fast and accurate way to link individual information across several administrative databases.
Language: eng
Publisher: PeerJ
Subjects: Epidemiology; Codificado NA; Ligação de registro; Ferramentas genômicas; Epidemiologia; Explosão; NA-encoded; Record linkage; Genomic tools; Epidemiology; Blast; Epidemiology
Title: Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
Type: info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; info:eu-repo/semantics/openAccess
Source: reponame:Repositório Institucional da FIOCRUZ (ARCA); instname:Fundação Oswaldo Cruz (FIOCRUZ); instacron:FIOCRUZ
Bitstream (LICENSE): license.txt; text/plain; charset=utf-8; 2991 bytes; https://www.arca.fiocruz.br/bitstream/icict/54852/1/license.txt; MD5 5a560609d32a3863062d77ff32785d58
Bitstream (ORIGINAL): Araújo, José Deney Alves - Tucuxi-blast.pdf; application/pdf; 1848009 bytes; https://www.arca.fiocruz.br/bitstream/icict/54852/2/Ara%c3%bajo%2c%20Jos%c3%a9%20Deney%20Alves%20-%20Tucuxi-blast.pdf; MD5 5c9e1608f49ccda79a89fe3901ca69e0
Handle: icict/54852
Last modified: 2023-03-15 14:32:50.029
OAI identifier: oai:www.arca.fiocruz.br:icict/54852
Repository: Repositório Institucional; PUB; https://www.arca.fiocruz.br/oai/request; repositorio.arca@fiocruz.br; opendoar:2135; 2023-03-15T17:32:50; Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ); false |
dc.title.en_US.fl_str_mv |
Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach |
title |
Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach |
spellingShingle |
Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach; Araujo, José Deney; Epidemiology; Codificado NA; Ligação de registro; Ferramentas genômicas; Epidemiologia; Explosão; NA-encoded; Record linkage; Genomic tools; Epidemiology; Blast; Epidemiology |
title_short |
Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach |
title_full |
Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach |
title_fullStr |
Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach |
title_full_unstemmed |
Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach |
title_sort |
Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach |
author |
Araujo, José Deney |
author_facet |
Araujo, José Deney; Silva, Juan Carlo Santos E; Martins, André Guilherme Costa; Sampaio, Vanderson; Castro, Daniel Barros de; Souza, Robson F de; Giddaluru, Jeevan; Ramos, Pablo Ivan P; Pita, Robespierre; Barreto, Mauricio L; Barral Netto, Manoel; Nakaya, Helder I |
author_role |
author |
author2 |
Silva, Juan Carlo Santos E; Martins, André Guilherme Costa; Sampaio, Vanderson; Castro, Daniel Barros de; Souza, Robson F de; Giddaluru, Jeevan; Ramos, Pablo Ivan P; Pita, Robespierre; Barreto, Mauricio L; Barral Netto, Manoel; Nakaya, Helder I |
author2_role |
author author author author author author author author author author author |
dc.contributor.author.fl_str_mv |
Araujo, José Deney; Silva, Juan Carlo Santos E; Martins, André Guilherme Costa; Sampaio, Vanderson; Castro, Daniel Barros de; Souza, Robson F de; Giddaluru, Jeevan; Ramos, Pablo Ivan P; Pita, Robespierre; Barreto, Mauricio L; Barral Netto, Manoel; Nakaya, Helder I |
dc.subject.mesh.en_US.fl_str_mv |
Epidemiology |
topic |
Epidemiology; Codificado NA; Ligação de registro; Ferramentas genômicas; Epidemiologia; Explosão; NA-encoded; Record linkage; Genomic tools; Epidemiology; Blast; Epidemiology |
dc.subject.other.en_US.fl_str_mv |
Codificado NA; Ligação de registro; Ferramentas genômicas; Epidemiologia; Explosão |
dc.subject.en.en_US.fl_str_mv |
NA-encoded; Record linkage; Genomic tools; Epidemiology; Blast |
dc.subject.decs.en_US.fl_str_mv |
Epidemiology |
description |
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). |
publishDate |
2022 |
dc.date.accessioned.fl_str_mv |
2022-09-23T12:44:57Z |
dc.date.available.fl_str_mv |
2022-09-23T12:44:57Z |
dc.date.issued.fl_str_mv |
2022 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
ARAUJO, José Deney et al. Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach. PeerJ, p. 1-17, 2022. |
dc.identifier.uri.fl_str_mv |
https://www.arca.fiocruz.br/handle/icict/54852 |
dc.identifier.issn.en_US.fl_str_mv |
2167-8359 |
dc.identifier.doi.none.fl_str_mv |
10.7717/peerj.13507 |
identifier_str_mv |
ARAUJO, José Deney et al. Tucuxi-blast: enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach. PeerJ, p. 1-17, 2022.; 2167-8359; 10.7717/peerj.13507 |
url |
https://www.arca.fiocruz.br/handle/icict/54852 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
PeerJ |
publisher.none.fl_str_mv |
PeerJ |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da FIOCRUZ (ARCA) instname:Fundação Oswaldo Cruz (FIOCRUZ) instacron:FIOCRUZ |
instname_str |
Fundação Oswaldo Cruz (FIOCRUZ) |
instacron_str |
FIOCRUZ |
institution |
FIOCRUZ |
reponame_str |
Repositório Institucional da FIOCRUZ (ARCA) |
collection |
Repositório Institucional da FIOCRUZ (ARCA) |
bitstream.url.fl_str_mv |
https://www.arca.fiocruz.br/bitstream/icict/54852/1/license.txt https://www.arca.fiocruz.br/bitstream/icict/54852/2/Ara%c3%bajo%2c%20Jos%c3%a9%20Deney%20Alves%20-%20Tucuxi-blast.pdf |
bitstream.checksum.fl_str_mv |
5a560609d32a3863062d77ff32785d58 5c9e1608f49ccda79a89fe3901ca69e0 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ) |
repository.mail.fl_str_mv |
repositorio.arca@fiocruz.br |
_version_ |
1798324925923065856 |
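
The Methods part of the abstract says that identification records are encoded into DNA and then aligned with BLASTn. This record does not give the encoding scheme, so the following is only an illustrative sketch under a stated assumption: each UTF-8 byte is mapped to four bases at 2 bits per base. The names `encode_dna`, `decode_dna` and `identity` are hypothetical and are not taken from Tucuxi-BLAST. BLASTn itself is not run here; a simple position-wise identity stands in for an alignment score.

```python
# Hypothetical 2-bits-per-base mapping. The record does not state the
# encoding that Tucuxi-BLAST actually uses; this is an assumption.
BASES = "ACGT"


def encode_dna(text: str) -> str:
    """Encode text as DNA: each UTF-8 byte of the upper-cased text becomes four bases."""
    out = []
    for byte in text.upper().encode("utf-8"):
        # Split the byte into four 2-bit groups, most significant first.
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def decode_dna(seq: str) -> str:
    """Invert encode_dna: every four bases are packed back into one byte."""
    data = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("utf-8")


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    clean = encode_dna("MARIA SILVA")
    typo = encode_dna("MARIA SILVS")  # one mistyped character
    print(clean)
    print(typo)
    print(f"identity: {identity(clean, typo):.3f}")
```

Under this mapping a single mistyped character changes at most four bases, so the two sequences stay highly similar, which is the property an alignment tool can exploit to tolerate typographical errors. In a real pipeline the encoded sequences would be written to FASTA files and compared with an aligner rather than with `identity`.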