A classification-based approach for bibliographic metadata deduplication
Main author: | Borges, Eduardo Nunes |
---|---|
Publication date: | 2011 |
Other authors: | Becker, Karin; Heuser, Carlos; Galante, Renata |
Document type: | Conference paper |
Language: | eng |
Source title: | Repositório Institucional da FURG (RI FURG) |
Full text: | http://repositorio.furg.br/handle/1/1701 |
Abstract: | Digital libraries of scientific articles describe them using a set of metadata, including bibliographic references. These references can be represented in several formats and styles. Considerable content variations can occur in some metadata fields, such as title, author names, and publication venue. In addition, it is quite common to find references that omit some metadata fields, such as page numbers. Duplicate entries affect the quality of digital library services and therefore need to be appropriately identified and treated. This paper presents a comparative analysis of different data classification algorithms used to identify duplicated bibliographic metadata records. We have investigated the discovered patterns by comparing the rules and the decision tree with the heuristics adopted in previous work. Our experiments show that combining the previously proposed specific-purpose similarity functions with classification algorithms represents an improvement of up to 12% over the experiments using our original approach. |
id |
FURG_c3c18ed54010093a113a9f2d989deebe |
---|---|
oai_identifier_str |
oai:repositorio.furg.br:1/1701 |
network_acronym_str |
FURG |
network_name_str |
Repositório Institucional da FURG (RI FURG) |
repository_id_str |
|
spelling |
A classification-based approach for bibliographic metadata deduplication; Deduplication; Bibliographic metadata; Classification; Machine learning; Digital libraries of scientific articles describe them using a set of metadata, including bibliographic references. These references can be represented in several formats and styles. Considerable content variations can occur in some metadata fields, such as title, author names, and publication venue. In addition, it is quite common to find references that omit some metadata fields, such as page numbers. Duplicate entries affect the quality of digital library services and therefore need to be appropriately identified and treated. This paper presents a comparative analysis of different data classification algorithms used to identify duplicated bibliographic metadata records. We have investigated the discovered patterns by comparing the rules and the decision tree with the heuristics adopted in previous work. Our experiments show that combining the previously proposed specific-purpose similarity functions with classification algorithms represents an improvement of up to 12% over the experiments using our original approach.; 2012-01-07T22:43:02Z; 2012-01-07T22:43:02Z; 2011; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/conferenceObject; application/pdf; BORGES, Eduardo et al. A classification-based approach for bibliographic metadata deduplication. In: IADIS INTERNATIONAL CONFERENCE WWW/INTERNET, 2011, Rio de Janeiro. Electronic proceedings... Rio de Janeiro: IADIS, 2011. Available at: <http://www.eduardo.c3.furg.br/arquivos/download/www-internet2011.pdf>. Accessed: 24 Dec. 2011.; http://repositorio.furg.br/handle/1/1701; eng; Borges, Eduardo Nunes; Becker, Karin; Heuser, Carlos; Galante, Renata; info:eu-repo/semantics/openAccess; reponame:Repositório Institucional da FURG (RI FURG); instname:Universidade Federal do Rio Grande (FURG); instacron:FURG; 2014-08-22T14:39:30Z; oai:repositorio.furg.br:1/1701; Repositório Institucional; PUB; https://repositorio.furg.br/oai/request || http://200.19.254.174/oai/request; opendoar:2014-08-22T14:39:30; Repositório Institucional da FURG (RI FURG) - Universidade Federal do Rio Grande (FURG); false |
dc.title.none.fl_str_mv |
A classification-based approach for bibliographic metadata deduplication |
title |
A classification-based approach for bibliographic metadata deduplication |
spellingShingle |
A classification-based approach for bibliographic metadata deduplication Borges, Eduardo Nunes Deduplication Bibliographic metadata Classification Machine learning |
title_short |
A classification-based approach for bibliographic metadata deduplication |
title_full |
A classification-based approach for bibliographic metadata deduplication |
title_fullStr |
A classification-based approach for bibliographic metadata deduplication |
title_full_unstemmed |
A classification-based approach for bibliographic metadata deduplication |
title_sort |
A classification-based approach for bibliographic metadata deduplication |
author |
Borges, Eduardo Nunes |
author_facet |
Borges, Eduardo Nunes; Becker, Karin; Heuser, Carlos; Galante, Renata |
author_role |
author |
author2 |
Becker, Karin; Heuser, Carlos; Galante, Renata |
author2_role |
author author author |
dc.contributor.author.fl_str_mv |
Borges, Eduardo Nunes; Becker, Karin; Heuser, Carlos; Galante, Renata |
dc.subject.por.fl_str_mv |
Deduplication; Bibliographic metadata; Classification; Machine learning |
topic |
Deduplication; Bibliographic metadata; Classification; Machine learning |
description |
Digital libraries of scientific articles describe them using a set of metadata, including bibliographic references. These references can be represented in several formats and styles. Considerable content variations can occur in some metadata fields, such as title, author names, and publication venue. In addition, it is quite common to find references that omit some metadata fields, such as page numbers. Duplicate entries affect the quality of digital library services and therefore need to be appropriately identified and treated. This paper presents a comparative analysis of different data classification algorithms used to identify duplicated bibliographic metadata records. We have investigated the discovered patterns by comparing the rules and the decision tree with the heuristics adopted in previous work. Our experiments show that combining the previously proposed specific-purpose similarity functions with classification algorithms represents an improvement of up to 12% over the experiments using our original approach. |
publishDate |
2011 |
dc.date.none.fl_str_mv |
2011 2012-01-07T22:43:02Z 2012-01-07T22:43:02Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/conferenceObject |
format |
conferenceObject |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
BORGES, Eduardo et al. A classification-based approach for bibliographic metadata deduplication. In: IADIS INTERNATIONAL CONFERENCE WWW/INTERNET, 2011, Rio de Janeiro. Electronic proceedings... Rio de Janeiro: IADIS, 2011. Available at: <http://www.eduardo.c3.furg.br/arquivos/download/www-internet2011.pdf>. Accessed: 24 Dec. 2011. http://repositorio.furg.br/handle/1/1701 |
identifier_str_mv |
BORGES, Eduardo et al. A classification-based approach for bibliographic metadata deduplication. In: IADIS INTERNATIONAL CONFERENCE WWW/INTERNET, 2011, Rio de Janeiro. Electronic proceedings... Rio de Janeiro: IADIS, 2011. Available at: <http://www.eduardo.c3.furg.br/arquivos/download/www-internet2011.pdf>. Accessed: 24 Dec. 2011. |
url |
http://repositorio.furg.br/handle/1/1701 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da FURG (RI FURG) instname:Universidade Federal do Rio Grande (FURG) instacron:FURG |
instname_str |
Universidade Federal do Rio Grande (FURG) |
instacron_str |
FURG |
institution |
FURG |
reponame_str |
Repositório Institucional da FURG (RI FURG) |
collection |
Repositório Institucional da FURG (RI FURG) |
repository.name.fl_str_mv |
Repositório Institucional da FURG (RI FURG) - Universidade Federal do Rio Grande (FURG) |
repository.mail.fl_str_mv |
|
_version_ |
1813187272182333440 |