A classification-based approach for bibliographic metadata deduplication

Borges, Eduardo Nunes; Becker, Karin; Heuser, Carlos; Galante, Renata

Bibliographic details
Main author:	Borges, Eduardo Nunes
Publication date:	2011
Other authors:	Becker, Karin; Heuser, Carlos; Galante, Renata
Document type:	Conference paper
Language:	eng
Source title:	Repositório Institucional da FURG (RI FURG)
Full text:	http://repositorio.furg.br/handle/1/1701
Abstract:	Digital libraries of scientific articles describe them using a set of metadata, including bibliographic references. These references can be represented in several formats and styles. Considerable content variations can occur in some metadata fields, such as title, author names and publication venue. In addition, it is quite common to find references that omit some metadata fields, such as page numbers. Duplicate entries affect the quality of digital library services, so they need to be appropriately identified and treated. This paper presents a comparative analysis of different data classification algorithms used to identify duplicated bibliographic metadata records. We have investigated the discovered patterns by comparing the rules and the decision tree with the heuristics adopted in a previous work. Our experiments show that combining the previously proposed specific-purpose similarity functions with classification algorithms yields an improvement of up to 12% over the experiments using our original approach.
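
The general idea in the abstract (per-field similarity scores fed to a classifier that labels a pair of records as duplicate or not) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the field names, the toy records, the use of `difflib.SequenceMatcher` as a generic string similarity, and the one-level decision tree (decision stump). The paper's own specific-purpose similarity functions and the classification algorithms it compares are not described in this record.

```python
from difflib import SequenceMatcher

# Hypothetical field names; the record does not list the fields the paper uses.
FIELDS = ("title", "authors", "venue")


def similarity(a, b):
    """Generic string similarity in [0, 1]; 0.0 when either value is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def features(rec_a, rec_b):
    """One similarity score per metadata field for a pair of records."""
    return [similarity(rec_a.get(f), rec_b.get(f)) for f in FIELDS]


def train_stump(pairs, labels):
    """Learn a one-level decision tree: the (field, threshold) with the
    fewest errors on the labeled training pairs."""
    vectors = [features(a, b) for a, b in pairs]
    best = None
    for i in range(len(FIELDS)):
        for t in sorted({v[i] for v in vectors}):
            errors = sum((v[i] >= t) != y for v, y in zip(vectors, labels))
            if best is None or errors < best[0]:
                best = (errors, i, t)
    _, index, threshold = best
    return index, threshold


def is_duplicate(model, rec_a, rec_b):
    """Classify a pair as duplicate if the chosen field's score meets the threshold."""
    index, threshold = model
    return features(rec_a, rec_b)[index] >= threshold


# Toy records invented for this example (not data from the paper).
r1 = {"title": "A classification-based approach for bibliographic metadata deduplication",
      "authors": "Borges, E. N.; Becker, K.", "venue": "IADIS WWW/Internet"}
r2 = {"title": "Classification-based approach for bibliographic metadata deduplication",
      "authors": "Eduardo Borges, Karin Becker", "venue": None}
r3 = {"title": "An unrelated paper on graph databases",
      "authors": "Doe, J.", "venue": "Some Workshop"}

model = train_stump([(r1, r2), (r1, r3), (r2, r3)], [True, False, False])
print(FIELDS[model[0]], is_duplicate(model, r1, r2), is_duplicate(model, r1, r3))
```

A learned threshold replaces a hand-tuned heuristic, which is the kind of comparison the abstract refers to; a real setup would need far more labeled pairs and a held-out test set.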

Item metadata

id	FURG_c3c18ed54010093a113a9f2d989deebe
oai_identifier_str	oai:repositorio.furg.br:1/1701
network_acronym_str	FURG
network_name_str	Repositório Institucional da FURG (RI FURG)
repository_id_str
dc.title.none.fl_str_mv	A classification-based approach for bibliographic metadata deduplication
title	A classification-based approach for bibliographic metadata deduplication
spellingShingle	A classification-based approach for bibliographic metadata deduplication Borges, Eduardo Nunes Deduplication Bibliographic metadata Classification Machine learning
title_short	A classification-based approach for bibliographic metadata deduplication
title_full	A classification-based approach for bibliographic metadata deduplication
title_fullStr	A classification-based approach for bibliographic metadata deduplication
title_full_unstemmed	A classification-based approach for bibliographic metadata deduplication
title_sort	A classification-based approach for bibliographic metadata deduplication
author	Borges, Eduardo Nunes
author_facet	Borges, Eduardo Nunes Becker, Karin Heuser, Carlos Galante, Renata
author_role	author
author2	Becker, Karin Heuser, Carlos Galante, Renata
author2_role	author author author
dc.contributor.author.fl_str_mv	Borges, Eduardo Nunes Becker, Karin Heuser, Carlos Galante, Renata
dc.subject.por.fl_str_mv	Deduplication Bibliographic metadata Classification Machine learning
topic	Deduplication Bibliographic metadata Classification Machine learning
publishDate	2011
dc.date.none.fl_str_mv	2011 2012-01-07T22:43:02Z 2012-01-07T22:43:02Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/conferenceObject
format	conferenceObject
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	BORGES, Eduardo et al. A classification-based approach for bibliographic metadata deduplication. In: IADIS INTERNATIONAL CONFERENCE WWW/INTERNET, 2011, Rio de Janeiro. Anais eletrônicos... Rio de Janeiro: IADIS, 2011. Disponível em: <http://www.eduardo.c3.furg.br/arquivos/download/www-internet2011.pdf>. Acesso em: 24 dez. 2011. http://repositorio.furg.br/handle/1/1701
identifier_str_mv	BORGES, Eduardo et al. A classification-based approach for bibliographic metadata deduplication. In: IADIS INTERNATIONAL CONFERENCE WWW/INTERNET, 2011, Rio de Janeiro. Anais eletrônicos... Rio de Janeiro: IADIS, 2011. Disponível em: <http://www.eduardo.c3.furg.br/arquivos/download/www-internet2011.pdf>. Acesso em: 24 dez. 2011.
url	http://repositorio.furg.br/handle/1/1701
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Institucional da FURG (RI FURG) instname:Universidade Federal do Rio Grande (FURG) instacron:FURG
instname_str	Universidade Federal do Rio Grande (FURG)
instacron_str	FURG
institution	FURG
reponame_str	Repositório Institucional da FURG (RI FURG)
collection	Repositório Institucional da FURG (RI FURG)
repository.name.fl_str_mv	Repositório Institucional da FURG (RI FURG) - Universidade Federal do Rio Grande (FURG)
repository.mail.fl_str_mv
_version_	1822808175050489856
