A classification-based approach for bibliographic metadata deduplication

Bibliographic details
Main author: Borges, Eduardo Nunes
Publication date: 2011
Other authors: Becker, Karin, Heuser, Carlos, Galante, Renata
Document type: Conference paper
Language: eng
Source title: Repositório Institucional da FURG (RI FURG)
Full text: http://repositorio.furg.br/handle/1/1701
Abstract: Digital libraries of scientific articles describe them using a set of metadata, including bibliographic references. These references can be represented in several formats and styles. Considerable content variations can occur in some metadata fields, such as title, author names and publication venue. In addition, it is quite common to find references that omit some metadata fields, such as page numbers. Duplicate entries affect the quality of digital library services, so they need to be identified and treated appropriately. This paper presents a comparative analysis of different data classification algorithms used to identify duplicated bibliographic metadata records. We have investigated the discovered patterns by comparing the rules and the decision tree with the heuristics adopted in previous work. Our experiments show that combining the previously proposed specific-purpose similarity functions with classification algorithms yields an improvement of up to 12% over the experiments using our original approach.
id FURG_c3c18ed54010093a113a9f2d989deebe
oai_identifier_str oai:repositorio.furg.br:1/1701
network_acronym_str FURG
network_name_str Repositório Institucional da FURG (RI FURG)
repository_id_str
dc.title.none.fl_str_mv A classification-based approach for bibliographic metadata deduplication
author Borges, Eduardo Nunes
author_facet Borges, Eduardo Nunes
Becker, Karin
Heuser, Carlos
Galante, Renata
author_role author
author2 Becker, Karin
Heuser, Carlos
Galante, Renata
author2_role author
author
author
dc.contributor.author.fl_str_mv Borges, Eduardo Nunes
Becker, Karin
Heuser, Carlos
Galante, Renata
dc.subject.por.fl_str_mv Deduplication
Bibliographic metadata
Classification
Machine learning
publishDate 2011
dc.date.none.fl_str_mv 2011
2012-01-07T22:43:02Z
2012-01-07T22:43:02Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/conferenceObject
format conferenceObject
status_str publishedVersion
dc.identifier.uri.fl_str_mv BORGES, Eduardo et al. A classification-based approach for bibliographic metadata deduplication. In: IADIS INTERNATIONAL CONFERENCE WWW/INTERNET, 2011, Rio de Janeiro. Electronic proceedings... Rio de Janeiro: IADIS, 2011. Available at: <http://www.eduardo.c3.furg.br/arquivos/download/www-internet2011.pdf>. Accessed: 24 Dec. 2011.
http://repositorio.furg.br/handle/1/1701
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Institucional da FURG (RI FURG)
instname:Universidade Federal do Rio Grande (FURG)
instacron:FURG
instname_str Universidade Federal do Rio Grande (FURG)
instacron_str FURG
institution FURG
reponame_str Repositório Institucional da FURG (RI FURG)
collection Repositório Institucional da FURG (RI FURG)
repository.name.fl_str_mv Repositório Institucional da FURG (RI FURG) - Universidade Federal do Rio Grande (FURG)
repository.mail.fl_str_mv
_version_ 1813187272182333440