On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort

Bibliographic details
Main author: Pita, Robespierre
Publication date: 2018
Other authors: Pinto, Clicia; Sena, Samila; Fiaccone, Rosemeire; Amorim, Leila D.; Reis, Sandra; Barreto, Maurício Lima; Denaxas, Spiros; Barreto, Marcos Ennes
Document type: Article
Language: eng
Source title: Repositório Institucional da FIOCRUZ (ARCA)
Full text: https://www.arca.fiocruz.br/handle/icict/26425
Summary: CNPq, FINEP, FAPESB, Bill and Melinda Gates Foundation (OPP1161996), and The Royal Society (NF160879) and also supported by the National Institute for Health Research (RP-PG-040710314), Wellcome Trust (086091/Z/08/Z), and the Farr Institute of Health Informatics Research.
id CRUZ_5672755c3e0fe8f1bb7654a33d05aa78
oai_identifier_str oai:www.arca.fiocruz.br:icict/26425
network_acronym_str CRUZ
network_name_str Repositório Institucional da FIOCRUZ (ARCA)
repository_id_str 2135
spelling Federal University of Bahia. Institute of Mathematics and Statistics. Computer Science Department. Salvador, BA, Brazil.
Federal University of Bahia. Institute of Mathematics and Statistics. Computer Science Department. Salvador, BA, Brazil.
Federal University of Bahia. Institute of Mathematics and Statistics. Department of Statistics. Salvador, BA, Brazil.
Federal University of Bahia. Institute of Mathematics and Statistics. Department of Statistics. Salvador, BA, Brazil.
Federal University of Bahia. Institute of Mathematics and Statistics. Department of Statistics. Salvador, BA, Brazil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Centro de Integração de Dados e Conhecimento para a Saúde. Salvador, BA, Brasil / Universidade de São Paulo. São Paulo, SP, Brasil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Centro de Integração de Dados e Conhecimento para a Saúde. Salvador, BA, Brasil / Universidade de São Paulo. São Paulo, SP, Brasil.
University College London. Institute of Health Informatics. London, WC, UK.
Federal University of Bahia. Institute of Mathematics and Statistics. Computer Science Department. Salvador, BA, Brazil.
Data linkage refers to the process of identifying and linking records that refer to the same entity across multiple heterogeneous data sources. This method has been widely utilized across scientific domains, including public health where records from clinical, administrative, and other surveillance databases are aggregated and used for research, decision making, and assessment of public policies. When a common set of unique identifiers does not exist across sources, probabilistic linkage approaches are used to link records using a combination of attributes. These methods require a careful choice of comparison attributes as well as similarity metrics and cutoff values to decide if a given pair of records matches or not and for assessing the accuracy of the results. In large, complex datasets, linking and assessing accuracy can be challenging due to the volume and complexity of the data, the absence of a gold standard, and the challenges associated with manually reviewing a very large number of record matches. In this paper, we present AtyImo, a hybrid probabilistic linkage tool optimized for high accuracy and scalability in massive data sets. We describe the implementation details around anonymization, blocking, deterministic and probabilistic linkage, and accuracy assessment. We present results from linking a large population-based cohort of 114 million individuals in Brazil to public health and administrative databases for research. In controlled and real scenarios, we observed high accuracy of results: 93%-97% true matches. In terms of scalability, we present AtyImo's ability to link the entire cohort in less than nine days using Spark and scaling up to 20 million records in less than 12 s over heterogeneous (CPU+GPU) architectures.
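The abstract describes deciding whether two records match from the similarity of several attributes and a set of cutoff values. A minimal Python sketch of that general idea follows. It is not AtyImo's implementation: this record does not say which similarity metric, attributes, weights, or cutoffs the tool uses, so the bigram Dice coefficient, the field names (name, mother_name, birth_date), the weights, and the two cutoffs below are all assumptions chosen for illustration.

```python
# Illustrative sketch only: metric, fields, weights and cutoffs are assumed,
# not taken from the paper described in this record. Requires Python 3.9+.

def bigrams(value: str) -> set[str]:
    """Return the set of overlapping two-character substrings of a string."""
    text = value.lower().strip()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over bigram sets: 2*|X & Y| / (|X| + |Y|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0  # nothing to compare (both strings shorter than 2 chars)
    return 2 * len(x & y) / (len(x) + len(y))


# Hypothetical comparison attributes and their relative weights (sum to 1).
WEIGHTS = {"name": 0.5, "mother_name": 0.25, "birth_date": 0.25}

# Hypothetical cutoffs: at or above UPPER is a match, below LOWER is a
# non-match, and anything in between is left for manual review.
UPPER, LOWER = 0.90, 0.70


def score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of per-attribute similarities, in [0, 1]."""
    return sum(w * dice(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def classify(rec_a: dict, rec_b: dict) -> str:
    """Map a pair of records to 'match', 'review' or 'non-match'."""
    s = score(rec_a, rec_b)
    if s >= UPPER:
        return "match"
    if s >= LOWER:
        return "review"
    return "non-match"
```

Calling classify on two dictionaries that carry the three assumed fields returns one of the three labels. Raising UPPER trades missed matches for fewer false matches, which is the kind of choice the abstract refers to when it mentions cutoff values.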
dc.title.pt_BR.fl_str_mv On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort
title On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort
spellingShingle On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort
Pita, Robespierre
Ligação de dados
Avaliação de precisão
Estudo de coorte
Data linkage
Accuracy assessment
Cohort study
author Pita, Robespierre
author_facet Pita, Robespierre
Pinto, Clicia
Sena, Samila
Fiaccone, Rosemeire
Amorim, Leila D.
Reis, Sandra
Barreto, Maurício Lima
Denaxas, Spiros
Barreto, Marcos Ennes
author_role author
author2 Pinto, Clicia
Sena, Samila
Fiaccone, Rosemeire
Amorim, Leila D.
Reis, Sandra
Barreto, Maurício Lima
Denaxas, Spiros
Barreto, Marcos Ennes
author2_role author
author
author
author
author
author
author
author
dc.contributor.author.fl_str_mv Pita, Robespierre
Pinto, Clicia
Sena, Samila
Fiaccone, Rosemeire
Amorim, Leila D.
Reis, Sandra
Barreto, Maurício Lima
Denaxas, Spiros
Barreto, Marcos Ennes
dc.subject.other.pt_BR.fl_str_mv Ligação de dados
Avaliação de precisão
Estudo de coorte
topic Ligação de dados
Avaliação de precisão
Estudo de coorte
Data linkage
Accuracy assessment
Cohort study
dc.subject.en.pt_BR.fl_str_mv Data linkage
Accuracy assessment
Cohort study
description CNPq, FINEP, FAPESB, Bill and Melinda Gates Foundation (OPP1161996), and The Royal Society (NF160879) and also supported by the National Institute for Health Research (RP-PG-040710314), Wellcome Trust (086091/Z/08/Z), and the Farr Institute of Health Informatics Research.
publishDate 2018
dc.date.accessioned.fl_str_mv 2018-05-14T16:16:59Z
dc.date.available.fl_str_mv 2018-05-14T16:16:59Z
dc.date.issued.fl_str_mv 2018
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.citation.fl_str_mv PITA, Robespierre et al. On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort. IEEE Journal of Biomedical and Health Informatics, v. 22, n. 2, p. 346-353, 2018.
dc.identifier.uri.fl_str_mv https://www.arca.fiocruz.br/handle/icict/26425
dc.identifier.issn.pt_BR.fl_str_mv 2168-2194
dc.identifier.doi.none.fl_str_mv 10.1109/JBHI.2018.2796941
url https://www.arca.fiocruz.br/handle/icict/26425
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Institute of Electrical and Electronics Engineers (IEEE)
publisher.none.fl_str_mv Institute of Electrical and Electronics Engineers (IEEE)
dc.source.none.fl_str_mv reponame:Repositório Institucional da FIOCRUZ (ARCA)
instname:Fundação Oswaldo Cruz (FIOCRUZ)
instacron:FIOCRUZ
instname_str Fundação Oswaldo Cruz (FIOCRUZ)
instacron_str FIOCRUZ
institution FIOCRUZ
reponame_str Repositório Institucional da FIOCRUZ (ARCA)
collection Repositório Institucional da FIOCRUZ (ARCA)
bitstream.url.fl_str_mv https://www.arca.fiocruz.br/bitstream/icict/26425/1/license.txt
https://www.arca.fiocruz.br/bitstream/icict/26425/2/Pita%20R%20On%20the%20Accuracy%20and%20Scalability%20of%20Probabilistic%20....pdf
https://www.arca.fiocruz.br/bitstream/icict/26425/3/Pita%20R%20On%20the%20Accuracy%20and%20Scalability%20of%20Probabilistic%20....pdf.txt
bitstream.checksum.fl_str_mv 5a560609d32a3863062d77ff32785d58
00c3d76c863eee3c14952ae212d2fd30
7d51eb575b9a06fde26384929570a049
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ)
repository.mail.fl_str_mv repositorio.arca@fiocruz.br
_version_ 1813008951642423296