A trainable model to assess the accuracy of probabilistic record linkage

Bibliographic details
Main author: Pita, Robespierre
Publication date: 2017
Other authors: Mendonça, Everton, Reis, Sandra, Barreto, Marcos, Denaxas, Spiros
Document type: Conference paper
Language: eng
Source title: Repositório Institucional da UFBA
Full text: http://repositorio.ufba.br/ri/handle/ri/24738
Abstract: Record linkage (RL) is the process of identifying and linking data that relates to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment, the process of manually verifying and validating matched pairs to further refine linkage parameters and increase its overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer’s intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from huge Brazilian socioeconomic and public health care data sources. These models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from a 10-fold cross-validation method. Results show that logistic regression outperforms other classifiers and enables the creation of a generalized, very accurate model to validate linkage results.
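The evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy labels and the contiguous fold split are assumptions made for illustration only, and the record does not say how the paper formed its folds. It shows how sensitivity, specificity and positive predictive value follow from a confusion matrix over candidate pairs, and how a data set can be divided into 10 folds.

```python
from typing import Dict, List, Sequence, Tuple


def confusion_counts(truth: Sequence[int], predicted: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 means 'true match'."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 0:
            tn += 1
        else:
            # Predicted non-match for a true match (labels are assumed to be 0 or 1).
            fn += 1
    return tp, fp, tn, fn


def linkage_metrics(truth: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity and positive predictive value (PPV)."""
    tp, fp, tn, fn = confusion_counts(truth, predicted)
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,  # true matches recovered
        "specificity": tn / (tn + fp) if tn + fp else nan,  # non-matches rejected
        "ppv": tp / (tp + fp) if tp + fp else nan,          # predicted matches that are correct
    }


def k_fold_indices(n: int, k: int = 10) -> List[List[int]]:
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds: List[List[int]] = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Hypothetical toy data: manual review labels vs. a classifier's decisions.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = linkage_metrics(truth, predicted)
folds = k_fold_indices(23, k=10)
```

In a cross-validation run, each fold would serve once as the held-out test set while a classifier is trained on the remaining folds, and the three measures would be collected per fold. In practice the pairs would normally be shuffled (and often stratified by label) before splitting, which this sketch omits.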
id UFBA-2_3f40de6cf92c41bf96557d1b7070fd63
oai_identifier_str oai:repositorio.ufba.br:ri/24738
network_acronym_str UFBA-2
network_name_str Repositório Institucional da UFBA
repository_id_str 1932
spelling Submitted by Marcos Barreto (marcosb@ufba.br) on 2017-12-02T18:00:03Z. No. of bitstreams: 1. DaWaK2017_vFinal_104400016.pdf: 1341281 bytes, checksum: 3d561110d02a92a597b4f7f7db3fbd63 (MD5)
Approved for entry into archive by NUBIA OLIVEIRA (nubia.marilia@ufba.br) on 2017-12-06T20:43:05Z (GMT). No. of bitstreams: 1. DaWaK2017_vFinal_104400016.pdf: 1341281 bytes, checksum: 3d561110d02a92a597b4f7f7db3fbd63 (MD5)
Made available in DSpace on 2017-12-06T20:43:05Z (GMT). No. of bitstreams: 1. DaWaK2017_vFinal_104400016.pdf: 1341281 bytes, checksum: 3d561110d02a92a597b4f7f7db3fbd63 (MD5). Previous issue date: 2017-08-03
Bill & Melinda Gates Foundation, The Royal Society (UK), Wellcome Trust (UK), Medical Research Council (UK), CNPq
Lyon
dc.title.pt_BR.fl_str_mv A trainable model to assess the accuracy of probabilistic record linkage
title A trainable model to assess the accuracy of probabilistic record linkage
spellingShingle A trainable model to assess the accuracy of probabilistic record linkage
Pita, Robespierre
Data linkage
Machine learning
title_short A trainable model to assess the accuracy of probabilistic record linkage
title_full A trainable model to assess the accuracy of probabilistic record linkage
title_fullStr A trainable model to assess the accuracy of probabilistic record linkage
title_full_unstemmed A trainable model to assess the accuracy of probabilistic record linkage
title_sort A trainable model to assess the accuracy of probabilistic record linkage
author Pita, Robespierre
author_facet Pita, Robespierre
Mendonça, Everton
Reis, Sandra
Barreto, Marcos
Denaxas, Spiros
author_role author
author2 Mendonça, Everton
Reis, Sandra
Barreto, Marcos
Denaxas, Spiros
author2_role author
author
author
author
dc.contributor.editor.none.fl_str_mv Bellatreche, Ladjel
Chakravarthy, Sharma
dc.contributor.author.fl_str_mv Pita, Robespierre
Mendonça, Everton
Reis, Sandra
Barreto, Marcos
Denaxas, Spiros
dc.subject.por.fl_str_mv Data linkage
Machine learning
topic Data linkage
Machine learning
description Record linkage (RL) is the process of identifying and linking data that relates to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment, the process of manually verifying and validating matched pairs to further refine linkage parameters and increase its overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer’s intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from huge Brazilian socioeconomic and public health care data sources. These models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from a 10-fold cross-validation method. Results show that logistic regression outperforms other classifiers and enables the creation of a generalized, very accurate model to validate linkage results.
publishDate 2017
dc.date.accessioned.fl_str_mv 2017-12-06T20:43:05Z
dc.date.issued.fl_str_mv 2017-08-03
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/conferenceObject
format conferenceObject
status_str publishedVersion
dc.identifier.citation.fl_str_mv Pita R., Mendonça E., Reis S., Barreto M., Denaxas S. (2017) A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage. In: Bellatreche L., Chakravarthy S. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2017. Lecture Notes in Computer Science, vol 10440. Springer, Cham
dc.identifier.uri.fl_str_mv http://repositorio.ufba.br/ri/handle/ri/24738
dc.identifier.issn.none.fl_str_mv 978-3-319-64282-6
dc.identifier.number.pt_BR.fl_str_mv v.10440
identifier_str_mv Pita R., Mendonça E., Reis S., Barreto M., Denaxas S. (2017) A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage. In: Bellatreche L., Chakravarthy S. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2017. Lecture Notes in Computer Science, vol 10440. Springer, Cham
978-3-319-64282-6
v.10440
url http://repositorio.ufba.br/ri/handle/ri/24738
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Springer, Cham
dc.publisher.country.fl_str_mv Brasil
publisher.none.fl_str_mv Springer, Cham
dc.source.pt_BR.fl_str_mv https://doi.org/10.1007/978-3-319-64283-3_16
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFBA
instname:Universidade Federal da Bahia (UFBA)
instacron:UFBA
instname_str Universidade Federal da Bahia (UFBA)
instacron_str UFBA
institution UFBA
reponame_str Repositório Institucional da UFBA
collection Repositório Institucional da UFBA
bitstream.url.fl_str_mv https://repositorio.ufba.br/bitstream/ri/24738/1/DaWaK2017_vFinal_104400016.pdf
https://repositorio.ufba.br/bitstream/ri/24738/2/license.txt
https://repositorio.ufba.br/bitstream/ri/24738/3/DaWaK2017_vFinal_104400016.pdf.txt
bitstream.checksum.fl_str_mv 3d561110d02a92a597b4f7f7db3fbd63
0d4b811ef71182510d2015daa7c8a900
6e2f3b68a1a3034d81db46dbdd2fe4e8
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFBA - Universidade Federal da Bahia (UFBA)
repository.mail.fl_str_mv
_version_ 1808459551964397568