Contributions for Solving the Author Name Ambiguity Problem in Bibliographic Citations

Bibliographic details
Main author: Anderson Almeida Ferreira
Publication date: 2012
Document type: Thesis
Language: por
Source title: Repositório Institucional da UFMG
Full text: http://hdl.handle.net/1843/ESSA-998NKM
Abstract: Bibliographic citations are an essential component of digital libraries of scientific publications. Studies of bibliographic citations can yield interesting results about the coverage of topics, trends, the quality and impact of the publications of a specific sub-community or of individuals, patterns of collaboration in social networks, etc. However, author names in bibliographic citations are often ambiguous, either because an author is referenced by multiple name variations (synonyms) or because two or more authors have exactly the same name or share the same name variation (polysems). This can lead to the incorrect assignment of a citation to an author, or to the splitting of several citations of the same author as if they belonged to different authors. Supervised methods, which exploit training examples to distinguish ambiguous author names, are among the most effective solutions to the problem, but they require skilled human annotators to label citations manually, in a laborious and continuous process, in order to provide enough training examples. In this thesis, we describe a new three-step disambiguation method, SAND (Self-training Associative Name Disambiguator). SAND eliminates the need for any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. SAND is also able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.
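As a rough illustration of the clustering idea mentioned in the abstract (grouping citation records by the similarity of their coauthor names), the following Python sketch may help. It is not the thesis's implementation: the record layout, the function names normalize and cluster_by_coauthors, and the similarity rule (two records are linked when they share at least one coauthor name after simple normalization) are all assumptions made for this example.

```python
def normalize(name):
    """Lowercase a name, drop periods, and collapse whitespace (assumed rule)."""
    return " ".join(name.lower().replace(".", " ").split())


def cluster_by_coauthors(records):
    """Group citation records that share at least one normalized coauthor name.

    `records` is a list of dicts with hypothetical keys 'id' and 'coauthors'.
    Returns a list of sets of record ids. Records without any shared
    coauthor stay in singleton clusters.
    """
    # Union-find forest over record ids.
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every record to the first record seen with the same coauthor.
    first_seen = {}
    for r in records:
        for name in r["coauthors"]:
            key = normalize(name)
            if key in first_seen:
                union(first_seen[key], r["id"])
            else:
                first_seen[key] = r["id"]

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), set()).add(r["id"])
    return list(clusters.values())
```

Clusters produced this way tend to be pure but fragmented, which is why such groups are plausible as automatically acquired training examples rather than as a final disambiguation result; how the thesis selects and uses them is not specified in this record.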
To facilitate the evaluation of name disambiguation methods in various realistic scenarios and under controlled conditions, we propose SyGAR, a new Synthetic Generator of Authorship Records that generates citation records based on author profiles. SyGAR can be used to generate successive loads of citation records, simulating a living digital library that evolves according to various desired patterns. We validate SyGAR by comparing the results produced by three representative name disambiguation methods on real and on synthetically generated collections of citation records. We also demonstrate its applicability by evaluating those methods on a time-evolving digital library collection, considering several dynamic and realistic scenarios.
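To make the notion of generating citation records from author profiles more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not SyGAR itself: the profile layout (name variants, coauthors, title terms, venues), the uniform sampling, and the function name generate_load are all hypothetical.

```python
import random


def generate_load(profiles, n_records, seed=0):
    """Draw one load of synthetic citation records from author profiles.

    `profiles` maps a hypothetical author id to a dict with the keys
    'name_variants', 'coauthors', 'title_terms' and 'venues' (all lists).
    Sampling is uniform here; a realistic generator would more likely use
    distributions estimated from real data.
    """
    rng = random.Random(seed)  # seeded so that a load is reproducible
    author_ids = sorted(profiles)
    records = []
    for _ in range(n_records):
        author_id = rng.choice(author_ids)
        profile = profiles[author_id]
        # Pick up to two coauthors, never more than the profile offers.
        k = min(len(profile["coauthors"]), rng.randint(0, 2))
        records.append({
            "true_author": author_id,  # ground truth kept for evaluation
            "author_name": rng.choice(profile["name_variants"]),
            "coauthors": rng.sample(profile["coauthors"], k),
            "title": " ".join(rng.choices(profile["title_terms"], k=4)),
            "venue": rng.choice(profile["venues"]),
        })
    return records
```

Calling generate_load repeatedly with different seeds, or with profiles that change between calls, would roughly mimic successive loads arriving at an evolving digital library. Because each record keeps its true author id, a disambiguation method can be scored against known ground truth.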
id UFMG_95dba70a9356a91153310cbe70b921cc
oai_identifier_str oai:repositorio.ufmg.br:1843/ESSA-998NKM
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
dc.title.pt_BR.fl_str_mv Contributions for Solving the Author Name Ambiguity Problem in Bibliographic Citations
title Contributions for Solving the Author Name Ambiguity Problem in Bibliographic Citations
author Anderson Almeida Ferreira
author_facet Anderson Almeida Ferreira
author_role author
dc.contributor.advisor1.fl_str_mv Marcos Andre Goncalves
dc.contributor.advisor-co1.fl_str_mv Alberto Henrique Frade Laender
dc.contributor.referee1.fl_str_mv Clodoveu Augusto Davis Junior
dc.contributor.referee2.fl_str_mv Gisele Lobo Pappa
dc.contributor.referee3.fl_str_mv Carlos Alberto Heuser
dc.contributor.referee4.fl_str_mv Ricardo da Silva Torres
dc.contributor.author.fl_str_mv Anderson Almeida Ferreira
contributor_str_mv Marcos Andre Goncalves
Alberto Henrique Frade Laender
Clodoveu Augusto Davis Junior
Gisele Lobo Pappa
Carlos Alberto Heuser
Ricardo da Silva Torres
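dc.subject.por.fl_str_mv ambiguidade (ambiguity)
citações bibliográficas (bibliographic citations)
topic ambiguidade (ambiguity)
citações bibliográficas (bibliographic citations)
Bibliotecas digitais (digital libraries)
Computação (computing)
Sistemas de recuperação da informação (information retrieval systems)
dc.subject.other.pt_BR.fl_str_mv Bibliotecas digitais (digital libraries)
Computação (computing)
Sistemas de recuperação da informação (information retrieval systems)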
publishDate 2012
dc.date.issued.fl_str_mv 2012-06-21
dc.date.accessioned.fl_str_mv 2019-08-09T16:01:27Z
dc.date.available.fl_str_mv 2019-08-09T16:01:27Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1843/ESSA-998NKM
url http://hdl.handle.net/1843/ESSA-998NKM
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.publisher.initials.fl_str_mv UFMG
publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
instname_str Universidade Federal de Minas Gerais (UFMG)
instacron_str UFMG
institution UFMG
reponame_str Repositório Institucional da UFMG
collection Repositório Institucional da UFMG
bitstream.url.fl_str_mv https://repositorio.ufmg.br/bitstream/1843/ESSA-998NKM/1/andersonalmeida.pdf
https://repositorio.ufmg.br/bitstream/1843/ESSA-998NKM/2/andersonalmeida.pdf.txt
bitstream.checksum.fl_str_mv e1162dd599bcbf4fb728ae00f6e9afa6
a3ce31c76abf8c8ab48a7e78bffe7d9c
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv
_version_ 1801676784618438656