Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis
Main author: | Oliveira, Gisele Pinto de |
---|---|
Publication date: | 2016 |
Other authors: | Bierrenbach, Ana Luiza de Souza; Camargo Júnior, Kenneth Rochel de; Coeli, Cláudia Medina; Pinheiro, Rejane Sobrino |
Document type: | Article |
Language: | eng |
Source title: | Revista de Saúde Pública |
Full text: | http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0034-89102016000100236 |
Abstract: | OBJECTIVE: To analyze the accuracy of deterministic and probabilistic record linkage in identifying duplicate TB records, as well as the characteristics of discordant pairs. METHODS: The study analyzed all TB records from 2009 to 2011 in the state of Rio de Janeiro. A deterministic record linkage algorithm was developed using a set of 70 rules, based on combinations of fragments of the key variables, with or without modification (Soundex or substring). Each rule was formed by three or more fragments. The probabilistic approach required a cutoff point for the score, above which links would be automatically classified as belonging to the same individual. The cutoff point was obtained by linking the Notifiable Diseases Information System – Tuberculosis database with itself, followed by manual review and analysis of ROC and precision-recall curves. Sensitivity and specificity were calculated to assess accuracy. RESULTS: Sensitivity was 87.2% for probabilistic and 95.2% for deterministic record linkage; specificity was 99.8% and 99.9%, respectively. Missing values in the key variables and low similarity scores for name and date of birth were the main reasons the techniques failed to identify records of the same individual. CONCLUSIONS: The two techniques showed a high level of correlation for pair classification. Although deterministic linkage identified more duplicate records than probabilistic linkage, the latter retrieved records not identified by the former. User needs and experience should be considered when choosing between the techniques. |
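
The deterministic approach described in the abstract (rules built from three or more fragments of key variables, optionally transformed with Soundex or a substring) and the accuracy measures it reports can be illustrated with a minimal Python sketch. This is not the authors' algorithm or their 70 rules: the `Record` fields, the fragment names, and the two example rules are hypothetical, chosen only to show the general shape of the technique.

```python
from dataclasses import dataclass

# Standard American Soundex digit groups.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code; assumes ASCII letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code.append(digit)
        if c not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return ("".join(code) + "000")[:4]


@dataclass(frozen=True)
class Record:
    # Hypothetical key variables; the source names only name and date of birth.
    name: str
    birth_date: str  # "YYYY-MM-DD", or "" when missing


def _name_part(record: Record, index: int) -> str:
    parts = record.name.split()
    return parts[index] if parts else ""


# Each fragment maps a record to a comparison key ("" means missing).
FRAGMENTS = {
    "first_name_soundex": lambda r: soundex(_name_part(r, 0)),
    "last_name_soundex": lambda r: soundex(_name_part(r, -1)),
    "first_name_prefix": lambda r: _name_part(r, 0)[:3].upper(),  # substring
    "birth_year": lambda r: r.birth_date[:4],
    "birth_month_day": lambda r: r.birth_date[5:10],
}

# Illustrative rules only; each combines three or more fragments.
RULES = [
    ("first_name_soundex", "last_name_soundex", "birth_year", "birth_month_day"),
    ("first_name_prefix", "last_name_soundex", "birth_month_day"),
]


def deterministic_match(a: Record, b: Record, rules=RULES) -> bool:
    """True if any rule has all fragments present and equal in both records."""
    for rule in rules:
        keys_a = [FRAGMENTS[f](a) for f in rule]
        keys_b = [FRAGMENTS[f](b) for f in rule]
        if all(keys_a) and keys_a == keys_b:
            return True
    return False


def sensitivity_specificity(predicted, truth):
    """Compare predicted pair labels with a reference (e.g. manual review)."""
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)
    fn = sum(1 for p, t in pairs if not p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    fp = sum(1 for p, t in pairs if p and not t)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true match and one true non-match")
    return tp / (tp + fn), tn / (tn + fp)
```

In this sketch a record with a missing date of birth can never satisfy either rule, which mirrors the abstract's observation that missing key variables were a main cause of missed duplicates.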
id | USP-23_a5e30f011094a47d488fa410e2b07ff3 |
---|---|
oai_identifier_str | oai:scielo:S0034-89102016000100236 |
network_acronym_str | USP-23 |
network_name_str | Revista de Saúde Pública |
repository_id_str | |
spelling | Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis; Tuberculosis, epidemiology; Data Accuracy; Sensitivity and Specificity; Epidemiological Surveillance, statistics & numerical data; Faculdade de Saúde Pública da Universidade de São Paulo; 2016-01-01; info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion; text/html; http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0034-89102016000100236; Revista de Saúde Pública v.50 2016; reponame:Revista de Saúde Pública; instname:Universidade de São Paulo (USP); instacron:USP; 10.1590/S1518-8787.2016050006327; info:eu-repo/semantics/openAccess; Oliveira,Gisele Pinto de; Bierrenbach,Ana Luiza de Souza; Camargo Júnior,Kenneth Rochel de; Coeli,Cláudia Medina; Pinheiro,Rejane Sobrino; eng; 2016-11-04T00:00:00Z; oai:scielo:S0034-89102016000100236; Revista; http://www.scielo.br/scielo.php?script=sci_serial&pid=0034-8910&lng=pt&nrm=iso; ONG; https://old.scielo.br/oai/scielo-oai.php; revsp@org.usp.br||revsp1@usp.br; 1518-8787; 0034-8910; opendoar:2016-11-04T00:00; Revista de Saúde Pública - Universidade de São Paulo (USP); false |
dc.title.none.fl_str_mv | Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis |
title | Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis |
spellingShingle | Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis; Oliveira,Gisele Pinto de; Tuberculosis, epidemiology; Data Accuracy; Sensitivity and Specificity; Epidemiological Surveillance, statistics & numerical data |
author | Oliveira,Gisele Pinto de |
author_facet | Oliveira,Gisele Pinto de; Bierrenbach,Ana Luiza de Souza; Camargo Júnior,Kenneth Rochel de; Coeli,Cláudia Medina; Pinheiro,Rejane Sobrino |
author_role | author |
author2 | Bierrenbach,Ana Luiza de Souza; Camargo Júnior,Kenneth Rochel de; Coeli,Cláudia Medina; Pinheiro,Rejane Sobrino |
author2_role | author; author; author; author |
dc.contributor.author.fl_str_mv | Oliveira,Gisele Pinto de; Bierrenbach,Ana Luiza de Souza; Camargo Júnior,Kenneth Rochel de; Coeli,Cláudia Medina; Pinheiro,Rejane Sobrino |
dc.subject.por.fl_str_mv | Tuberculosis, epidemiology; Data Accuracy; Sensitivity and Specificity; Epidemiological Surveillance, statistics & numerical data |
topic | Tuberculosis, epidemiology; Data Accuracy; Sensitivity and Specificity; Epidemiological Surveillance, statistics & numerical data |
publishDate | 2016 |
dc.date.none.fl_str_mv | 2016-01-01 |
dc.type.driver.fl_str_mv | info:eu-repo/semantics/article |
dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
format | article |
status_str | publishedVersion |
dc.identifier.uri.fl_str_mv | http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0034-89102016000100236 |
url | http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0034-89102016000100236 |
dc.language.iso.fl_str_mv | eng |
language | eng |
dc.relation.none.fl_str_mv | 10.1590/S1518-8787.2016050006327 |
dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
eu_rights_str_mv | openAccess |
dc.format.none.fl_str_mv | text/html |
dc.publisher.none.fl_str_mv | Faculdade de Saúde Pública da Universidade de São Paulo |
publisher.none.fl_str_mv | Faculdade de Saúde Pública da Universidade de São Paulo |
dc.source.none.fl_str_mv | Revista de Saúde Pública v.50 2016; reponame:Revista de Saúde Pública; instname:Universidade de São Paulo (USP); instacron:USP |
instname_str | Universidade de São Paulo (USP) |
instacron_str | USP |
institution | USP |
reponame_str | Revista de Saúde Pública |
collection | Revista de Saúde Pública |
repository.name.fl_str_mv | Revista de Saúde Pública - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv | revsp@org.usp.br||revsp1@usp.br |
_version_ | 1748936503385391104 |