A trainable model to assess the accuracy of probabilistic record linkage
Main author: | Pita, Robespierre |
---|---|
Publication date: | 2017 |
Other authors: | Mendonça, Everton; Reis, Sandra; Barreto, Marcos; Denaxas, Spiros |
Document type: | Conference paper |
Language: | eng |
Source title: | Repositório Institucional da UFBA |
Full text: | http://repositorio.ufba.br/ri/handle/ri/24738 |
Abstract: | Record linkage (RL) is the process of identifying and linking data that relate to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common, uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment: the process of manually verifying and validating matched pairs to further refine linkage parameters and increase overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer’s intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from huge Brazilian socioeconomic and public health care data sources. The models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from 10-fold cross-validation. Results show that logistic regression outperforms the other classifiers and enables the creation of a generalized, very accurate model to validate linkage results. |
id |
UFBA-2_3f40de6cf92c41bf96557d1b7070fd63 |
---|---|
oai_identifier_str |
oai:repositorio.ufba.br:ri/24738 |
network_acronym_str |
UFBA-2 |
network_name_str |
Repositório Institucional da UFBA |
repository_id_str |
1932 |
spelling |
Submitted by Marcos Barreto (marcosb@ufba.br) on 2017-12-02T18:00:03Z. Approved for entry into archive by NUBIA OLIVEIRA (nubia.marilia@ufba.br) on 2017-12-06T20:43:05Z (GMT). Made available in DSpace on 2017-12-06T20:43:05Z (GMT). Previous issue date: 2017-08-03. Bill & Melinda Gates Foundation; The Royal Society (UK); Wellcome Trust (UK); Medical Research Council (UK); CNPq. Lyon. Bitstreams: ORIGINAL DaWaK2017_vFinal_104400016.pdf (application/pdf, 1341281 bytes); LICENSE license.txt (text/plain, 1345 bytes); TEXT DaWaK2017_vFinal_104400016.pdf.txt (extracted text, text/plain, 36087 bytes). OAI request URL: http://192.188.11.11:8080/oai/request; opendoar:1932; 2022-08-08T15:00:19. |
dc.title.pt_BR.fl_str_mv |
A trainable model to assess the accuracy of probabilistic record linkage |
title |
A trainable model to assess the accuracy of probabilistic record linkage |
author |
Pita, Robespierre |
author_facet |
Pita, Robespierre; Mendonça, Everton; Reis, Sandra; Barreto, Marcos; Denaxas, Spiros |
author_role |
author |
author2 |
Mendonça, Everton; Reis, Sandra; Barreto, Marcos; Denaxas, Spiros |
author2_role |
author author author author |
dc.contributor.editor.none.fl_str_mv |
Bellatreche, Ladjel; Chakravarthy, Sharma |
dc.contributor.author.fl_str_mv |
Pita, Robespierre; Mendonça, Everton; Reis, Sandra; Barreto, Marcos; Denaxas, Spiros |
dc.subject.por.fl_str_mv |
Data linkage; Machine learning |
topic |
Data linkage; Machine learning |
description |
Record linkage (RL) is the process of identifying and linking data that relate to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common, uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment: the process of manually verifying and validating matched pairs to further refine linkage parameters and increase overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer’s intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from huge Brazilian socioeconomic and public health care data sources. The models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from 10-fold cross-validation. Results show that logistic regression outperforms the other classifiers and enables the creation of a generalized, very accurate model to validate linkage results. |
publishDate |
2017 |
dc.date.accessioned.fl_str_mv |
2017-12-06T20:43:05Z |
dc.date.issued.fl_str_mv |
2017-08-03 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/conferenceObject |
format |
conferenceObject |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
Pita R., Mendonça E., Reis S., Barreto M., Denaxas S. (2017) A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage. In: Bellatreche L., Chakravarthy S. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2017. Lecture Notes in Computer Science, vol 10440. Springer, Cham |
dc.identifier.uri.fl_str_mv |
http://repositorio.ufba.br/ri/handle/ri/24738 |
dc.identifier.issn.none.fl_str_mv |
978-3-319-64282-6 |
dc.identifier.number.pt_BR.fl_str_mv |
v.10440 |
identifier_str_mv |
Pita R., Mendonça E., Reis S., Barreto M., Denaxas S. (2017) A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage. In: Bellatreche L., Chakravarthy S. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2017. Lecture Notes in Computer Science, vol 10440. Springer, Cham 978-3-319-64282-6 v.10440 |
url |
http://repositorio.ufba.br/ri/handle/ri/24738 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Springer, Cham |
dc.publisher.country.fl_str_mv |
Brazil |
publisher.none.fl_str_mv |
Springer, Cham |
dc.source.pt_BR.fl_str_mv |
https://doi.org/10.1007/978-3-319-64283-3_16 |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFBA instname:Universidade Federal da Bahia (UFBA) instacron:UFBA |
instname_str |
Universidade Federal da Bahia (UFBA) |
instacron_str |
UFBA |
institution |
UFBA |
reponame_str |
Repositório Institucional da UFBA |
collection |
Repositório Institucional da UFBA |
bitstream.url.fl_str_mv |
https://repositorio.ufba.br/bitstream/ri/24738/1/DaWaK2017_vFinal_104400016.pdf https://repositorio.ufba.br/bitstream/ri/24738/2/license.txt https://repositorio.ufba.br/bitstream/ri/24738/3/DaWaK2017_vFinal_104400016.pdf.txt |
bitstream.checksum.fl_str_mv |
3d561110d02a92a597b4f7f7db3fbd63 0d4b811ef71182510d2015daa7c8a900 6e2f3b68a1a3034d81db46dbdd2fe4e8 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFBA - Universidade Federal da Bahia (UFBA) |
repository.mail.fl_str_mv |
|
_version_ |
1808459551964397568 |
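
The abstract names sensitivity, specificity and positive predictive value, collected over 10-fold cross-validation, as the evaluation measures. The paper's own code is not part of this record, so the following is only a minimal sketch of how those measures and a plain 10-fold split might be computed. The function names (`confusion_counts`, `linkage_metrics`, `k_fold_indices`) and the toy labels are hypothetical and chosen for illustration; a label of 1 is assumed to mean a reviewer-confirmed match.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, assuming 1 = match and 0 = non-match."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            # Remaining case for strictly binary input: pred == 0, truth == 1.
            fn += 1
    return tp, fp, tn, fn


def linkage_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity and positive predictive value (PPV)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)

    def ratio(num: int, den: int) -> float:
        # An undefined ratio (empty denominator) is reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true matches that were found
        "specificity": ratio(tn, tn + fp),  # share of non-matches correctly rejected
        "ppv": ratio(tp, tp + fp),          # share of predicted matches that are correct
    }


def k_fold_indices(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# Toy example (invented labels, not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = linkage_metrics(y_true, y_pred)
folds = list(k_fold_indices(25, k=10))
```

In a cross-validation run, a classifier would be fitted on each `train` index set and `linkage_metrics` applied to its predictions on the matching `test` set, with the per-fold values then averaged. The contiguous split here does no shuffling or stratification, which a real evaluation on imbalanced match/non-match data would likely need.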