Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis

Bibliographic details
Main author: Oliveira, Gisele Pinto de
Publication date: 2016
Other authors: Bierrenbach, Ana Luiza de Souza; Camargo Júnior, Kenneth Rochel de; Coeli, Cláudia Medina; Pinheiro, Rejane Sobrino
Document type: Article
Language: eng
Source title: Revista de Saúde Pública
Full text: http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0034-89102016000100236
Abstract: OBJECTIVE: To analyze the accuracy of deterministic and probabilistic record linkage in identifying duplicate tuberculosis (TB) records, as well as the characteristics of discordant pairs. METHODS: The study analyzed all TB records from 2009 to 2011 in the state of Rio de Janeiro. A deterministic record linkage algorithm was developed using a set of 70 rules, each combining three or more fragments of the key variables, with or without modification (Soundex or substring). The probabilistic approach required a score cutoff above which links would be automatically classified as belonging to the same individual. The cutoff was obtained by linking the Notifiable Diseases Information System – Tuberculosis database with itself, followed by manual review and analysis of ROC and precision-recall curves. Sensitivity and specificity were calculated to assess accuracy. RESULTS: Sensitivity was 87.2% for probabilistic and 95.2% for deterministic record linkage; specificity was 99.8% and 99.9%, respectively. Missing values for the key variables and low similarity scores for name and date of birth were the main reasons the techniques failed to identify records of the same individual. CONCLUSIONS: The two techniques showed a high level of correlation for pair classification. Although deterministic linkage identified more duplicate records than probabilistic linkage, the latter retrieved records not identified by the former. User needs and experience should be considered when choosing which technique to use.
id USP-23_a5e30f011094a47d488fa410e2b07ff3
oai_identifier_str oai:scielo:S0034-89102016000100236
network_acronym_str USP-23
network_name_str Revista de Saúde Pública
repository_id_str
spelling 2016-11-04T00:00:00Z | Revista | http://www.scielo.br/scielo.php?script=sci_serial&pid=0034-8910&lng=pt&nrm=iso | ONG | https://old.scielo.br/oai/scielo-oai.php | 1518-8787 | 0034-8910 | opendoar:2016-11-04T00:00 | false
dc.title.none.fl_str_mv Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis
title Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis
author Oliveira,Gisele Pinto de
author_role author
author2 Bierrenbach,Ana Luiza de Souza
Camargo Júnior,Kenneth Rochel de
Coeli,Cláudia Medina
Pinheiro,Rejane Sobrino
author2_role author
author
author
author
dc.contributor.author.fl_str_mv Oliveira,Gisele Pinto de
Bierrenbach,Ana Luiza de Souza
Camargo Júnior,Kenneth Rochel de
Coeli,Cláudia Medina
Pinheiro,Rejane Sobrino
dc.subject.por.fl_str_mv Tuberculosis, epidemiology
Data Accuracy
Sensitivity and Specificity
Epidemiological Surveillance, statistics & numerical data
publishDate 2016
dc.date.none.fl_str_mv 2016-01-01
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0034-89102016000100236
url http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0034-89102016000100236
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 10.1590/S1518-8787.2016050006327
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv text/html
dc.publisher.none.fl_str_mv Faculdade de Saúde Pública da Universidade de São Paulo
publisher.none.fl_str_mv Faculdade de Saúde Pública da Universidade de São Paulo
dc.source.none.fl_str_mv Revista de Saúde Pública v.50 2016
reponame:Revista de Saúde Pública
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Revista de Saúde Pública
collection Revista de Saúde Pública
repository.name.fl_str_mv Revista de Saúde Pública - Universidade de São Paulo (USP)
repository.mail.fl_str_mv revsp@org.usp.br||revsp1@usp.br
_version_ 1748936503385391104