On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort
Main author: | Pita, Robespierre |
---|---|
Publication date: | 2018 |
Other authors: | Pinto, Clicia; Sena, Samila; Fiaccone, Rosemeire; Amorim, Leila D.; Reis, Sandra; Barreto, Maurício Lima; Denaxas, Spiros; Barreto, Marcos Ennes |
Document type: | Article |
Language: | eng |
Source title: | Repositório Institucional da FIOCRUZ (ARCA) |
Full text: | https://www.arca.fiocruz.br/handle/icict/26425 |
Abstract: | CNPq, FINEP, FAPESB, Bill and Melinda Gates Foundation (OPP1161996), and The Royal Society (NF160879) and also supported by the National Institute for Health Research (RP-PG-040710314), Wellcome Trust (086091/Z/08/Z), and the Farr Institute of Health Informatics Research. |
id |
CRUZ_5672755c3e0fe8f1bb7654a33d05aa78 |
---|---|
oai_identifier_str |
oai:www.arca.fiocruz.br:icict/26425 |
network_acronym_str |
CRUZ |
network_name_str |
Repositório Institucional da FIOCRUZ (ARCA) |
repository_id_str |
2135 |
spelling |
Authors: Pita, Robespierre; Pinto, Clicia; Sena, Samila; Fiaccone, Rosemeire; Amorim, Leila D.; Reis, Sandra; Barreto, Maurício Lima; Denaxas, Spiros; Barreto, Marcos Ennes
Dates: 2018-05-14T16:16:59Z; 2018-05-14T16:16:59Z; 2018
Citation: PITA, Robespierre et al. On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort. IEEE Journal of Biomedical and Health Informatics, v. 22, n. 2, p. 346-353, 2018.
ISSN: 2168-2194
URI: https://www.arca.fiocruz.br/handle/icict/26425
DOI: 10.1109/JBHI.2018.2796941
Funding: CNPq, FINEP, FAPESB, Bill and Melinda Gates Foundation (OPP1161996), and The Royal Society (NF160879) and also supported by the National Institute for Health Research (RP-PG-040710314), Wellcome Trust (086091/Z/08/Z), and the Farr Institute of Health Informatics Research.
Affiliations (in the order listed in the record):
Federal University of Bahia. Institute of Mathematics and Statistics. Computer Science Department. Salvador, BA, Brazil.
Federal University of Bahia. Institute of Mathematics and Statistics. Computer Science Department. Salvador, BA, Brazil.
Federal University of Bahia. Institute of Mathematics and Statistics. Department of Statistics. Salvador, BA, Brazil.
Federal University of Bahia. Institute of Mathematics and Statistics. Department of Statistics. Salvador, BA, Brazil.
Federal University of Bahia. Institute of Mathematics and Statistics. Department of Statistics. Salvador, BA, Brazil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Centro de Integração de Dados e Conhecimento para a Saúde. Salvador, BA, Brasil / Universidade de São Paulo. São Paulo, SP, Brasil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Centro de Integração de Dados e Conhecimento para a Saúde. Salvador, BA, Brasil / Universidade de São Paulo. São Paulo, SP, Brasil.
University College London. Institute of Health Informatics. London, WC, UK.
Federal University of Bahia. Institute of Mathematics and Statistics. Computer Science Department. Salvador, BA, Brazil.
Abstract: Data linkage refers to the process of identifying and linking records that refer to the same entity across multiple heterogeneous data sources. This method has been widely utilized across scientific domains, including public health, where records from clinical, administrative, and other surveillance databases are aggregated and used for research, decision making, and assessment of public policies. When a common set of unique identifiers does not exist across sources, probabilistic linkage approaches are used to link records using a combination of attributes. These methods require a careful choice of comparison attributes as well as similarity metrics and cutoff values to decide whether a given pair of records matches and to assess the accuracy of the results. In large, complex datasets, linking and assessing accuracy can be challenging due to the volume and complexity of the data, the absence of a gold standard, and the challenges associated with manually reviewing a very large number of record matches. In this paper, we present AtyImo, a hybrid probabilistic linkage tool optimized for high accuracy and scalability in massive data sets. We describe the implementation details around anonymization, blocking, deterministic and probabilistic linkage, and accuracy assessment. We present results from linking a large population-based cohort of 114 million individuals in Brazil to public health and administrative databases for research. In controlled and real scenarios, we observed high accuracy of results: 93%-97% true matches. In terms of scalability, we present AtyImo's ability to link the entire cohort in less than nine days using Spark and scaling up to 20 million records in less than 12 s over heterogeneous (CPU+GPU) architectures.
Language: eng
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Keywords: Ligação de dados; Avaliação de precisão; Estudo de coorte; Data linkage; Accuracy assessment; Cohort study
Title: On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort
Type and rights: info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; info:eu-repo/semantics/openAccess
Source: reponame:Repositório Institucional da FIOCRUZ (ARCA); instname:Fundação Oswaldo Cruz (FIOCRUZ); instacron:FIOCRUZ
Bitstream LICENSE: license.txt; text/plain; charset=utf-8; 2991; https://www.arca.fiocruz.br/bitstream/icict/26425/1/license.txt; 5a560609d32a3863062d77ff32785d58; MD5; 1
Bitstream ORIGINAL: Pita R On the Accuracy and Scalability of Probabilistic ....pdf; application/pdf; 1096764; https://www.arca.fiocruz.br/bitstream/icict/26425/2/Pita%20R%20On%20the%20Accuracy%20and%20Scalability%20of%20Probabilistic%20....pdf; 00c3d76c863eee3c14952ae212d2fd30; MD5; 2
Bitstream TEXT: Pita R On the Accuracy and Scalability of Probabilistic ....pdf.txt; Extracted text; text/plain; 39641; https://www.arca.fiocruz.br/bitstream/icict/26425/3/Pita%20R%20On%20the%20Accuracy%20and%20Scalability%20of%20Probabilistic%20....pdf.txt; 7d51eb575b9a06fde26384929570a049; MD5; 3
Record: icict/26425; 2023-03-15 14:34:05.367; oai:www.arca.fiocruz.br:icict/26425
Repository: Repositório Institucional; PUB; https://www.arca.fiocruz.br/oai/request; repositorio.arca@fiocruz.br; opendoar:2135; 2023-03-15T17:34:05; Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ); false |
dc.title.pt_BR.fl_str_mv |
On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort |
title |
On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort |
spellingShingle |
On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort; Pita, Robespierre; Ligação de dados; Avaliação de precisão; Estudo de coorte; Data linkage; Accuracy assessment; Cohort study |
title_short |
On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort |
title_full |
On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort |
title_fullStr |
On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort |
title_full_unstemmed |
On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort |
title_sort |
On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort |
author |
Pita, Robespierre |
author_facet |
Pita, Robespierre; Pinto, Clicia; Sena, Samila; Fiaccone, Rosemeire; Amorim, Leila D.; Reis, Sandra; Barreto, Maurício Lima; Denaxas, Spiros; Barreto, Marcos Ennes |
author_role |
author |
author2 |
Pinto, Clicia; Sena, Samila; Fiaccone, Rosemeire; Amorim, Leila D.; Reis, Sandra; Barreto, Maurício Lima; Denaxas, Spiros; Barreto, Marcos Ennes |
author2_role |
author author author author author author author author |
dc.contributor.author.fl_str_mv |
Pita, Robespierre; Pinto, Clicia; Sena, Samila; Fiaccone, Rosemeire; Amorim, Leila D.; Reis, Sandra; Barreto, Maurício Lima; Denaxas, Spiros; Barreto, Marcos Ennes |
dc.subject.other.pt_BR.fl_str_mv |
Ligação de dados; Avaliação de precisão; Estudo de coorte |
topic |
Ligação de dados; Avaliação de precisão; Estudo de coorte; Data linkage; Accuracy assessment; Cohort study |
dc.subject.en.pt_BR.fl_str_mv |
Data linkage; Accuracy assessment; Cohort study |
description |
CNPq, FINEP, FAPESB, Bill and Melinda Gates Foundation (OPP1161996), and The Royal Society (NF160879) and also supported by the National Institute for Health Research (RP-PG-040710314), Wellcome Trust (086091/Z/08/Z), and the Farr Institute of Health Informatics Research. |
publishDate |
2018 |
dc.date.accessioned.fl_str_mv |
2018-05-14T16:16:59Z |
dc.date.available.fl_str_mv |
2018-05-14T16:16:59Z |
dc.date.issued.fl_str_mv |
2018 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
PITA, Robespierre et al. On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort. IEEE Journal of Biomedical and Health Informatics, v. 22, n. 2, p. 346-353, 2018. |
dc.identifier.uri.fl_str_mv |
https://www.arca.fiocruz.br/handle/icict/26425 |
dc.identifier.issn.pt_BR.fl_str_mv |
2168-2194 |
dc.identifier.doi.none.fl_str_mv |
10.1109/JBHI.2018.2796941 |
identifier_str_mv |
PITA, Robespierre et al. On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort. IEEE Journal of Biomedical and Health Informatics, v. 22, n. 2, p. 346-353, 2018.; 2168-2194; 10.1109/JBHI.2018.2796941 |
url |
https://www.arca.fiocruz.br/handle/icict/26425 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Institute of Electrical and Electronics Engineers (IEEE) |
publisher.none.fl_str_mv |
Institute of Electrical and Electronics Engineers (IEEE) |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da FIOCRUZ (ARCA); instname:Fundação Oswaldo Cruz (FIOCRUZ); instacron:FIOCRUZ |
instname_str |
Fundação Oswaldo Cruz (FIOCRUZ) |
instacron_str |
FIOCRUZ |
institution |
FIOCRUZ |
reponame_str |
Repositório Institucional da FIOCRUZ (ARCA) |
collection |
Repositório Institucional da FIOCRUZ (ARCA) |
bitstream.url.fl_str_mv |
https://www.arca.fiocruz.br/bitstream/icict/26425/1/license.txt; https://www.arca.fiocruz.br/bitstream/icict/26425/2/Pita%20R%20On%20the%20Accuracy%20and%20Scalability%20of%20Probabilistic%20....pdf; https://www.arca.fiocruz.br/bitstream/icict/26425/3/Pita%20R%20On%20the%20Accuracy%20and%20Scalability%20of%20Probabilistic%20....pdf.txt |
bitstream.checksum.fl_str_mv |
5a560609d32a3863062d77ff32785d58; 00c3d76c863eee3c14952ae212d2fd30; 7d51eb575b9a06fde26384929570a049 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ) |
repository.mail.fl_str_mv |
repositorio.arca@fiocruz.br |
_version_ |
1813008951642423296 |
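
The abstract in this record describes probabilistic linkage as comparing a combination of attributes with similarity metrics and a cutoff value, after blocking to limit the number of comparisons. The following is a minimal Python sketch of that general idea, not AtyImo's implementation. The Dice coefficient over character bigrams, the field names (`id`, `name`, `mother`, `city`), the equal weights, and the cutoff of 0.8 are all assumptions chosen for illustration.

```python
from collections import defaultdict


def bigrams(text):
    """Return the set of character bigrams of a normalized string."""
    s = text.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two strings (0.0 to 1.0)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def pair_score(r1, r2, weights):
    """Weighted average of per-attribute similarities for one record pair."""
    total = sum(weights.values())
    return sum(w * dice(r1[f], r2[f]) for f, w in weights.items()) / total


def link_records(left, right, block_key, weights, cutoff):
    """Link each left record to its best-scoring right record in the same block.

    Blocking: only records sharing the same value of `block_key` are compared.
    A pair is accepted as a match only if its score reaches `cutoff`.
    """
    blocks = defaultdict(list)
    for rec in right:
        blocks[rec[block_key]].append(rec)

    matches = []
    for rec in left:
        candidates = blocks.get(rec[block_key], [])
        if not candidates:
            continue
        best = max(candidates, key=lambda c: pair_score(rec, c, weights))
        score = pair_score(rec, best, weights)
        if score >= cutoff:
            matches.append((rec["id"], best["id"], round(score, 3)))
    return matches


# Hypothetical toy records; real linkage data would be anonymized first.
left = [{"id": "a1", "name": "maria silva", "mother": "ana silva", "city": "salvador"}]
right = [
    {"id": "b1", "name": "maria sliva", "mother": "ana silva", "city": "salvador"},
    {"id": "b2", "name": "joao santos", "mother": "rita santos", "city": "salvador"},
    {"id": "b3", "name": "maria silva", "mother": "ana silva", "city": "recife"},
]

print(link_records(left, right, block_key="city",
                   weights={"name": 1.0, "mother": 1.0}, cutoff=0.8))
```

In this toy data, record `b3` is never compared with `a1` because it falls in a different block. That shows the trade-off blocking makes: fewer comparisons at the risk of missing true matches whose blocking attribute differs.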