Contributions for Solving the Author Name Ambiguity Problem in Bibliographic Citations
Main author: | Anderson Almeida Ferreira |
---|---|
Publication date: | 2012 |
Document type: | Thesis |
Language: | por |
Source title: | Repositório Institucional da UFMG |
Full text: | http://hdl.handle.net/1843/ESSA-998NKM |
Abstract: | Bibliographic citations are an essential component of digital libraries of scientific publications. Studies of bibliographic citations can reveal the coverage of topics, trends, the quality and impact of publications by a specific sub-community or individual, patterns of collaboration in social networks, and so on. However, author names in bibliographic citations are often ambiguous, either because one author is referenced by multiple name variations (synonyms) or because two or more authors have exactly the same name or share the same name variation (polysems). This can lead to a citation being assigned to the wrong author, or to the citations of a single author being split as if they belonged to different authors. Supervised methods that exploit training examples to distinguish ambiguous author names are among the most effective solutions to the problem, but they require skilled human annotators to label citations manually, in a laborious and continuous process, in order to provide enough training examples. In this thesis, we describe a new three-step disambiguation method, SAND (Self-training Associative Name Disambiguator). SAND eliminates the need for any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. SAND is also able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.
To facilitate the evaluation of name disambiguation methods in various realistic scenarios and under controlled conditions, we propose SyGAR, a new Synthetic Generator of Authorship Records that generates citation records based on author profiles. SyGAR can be used to generate successive loads of citation records simulating a living digital library that evolves according to various desired patterns. We validate SyGAR by comparing the results produced by three representative name disambiguation methods on real as well as synthetically generated collections of citation records. We also demonstrate its applicability by evaluating those methods on a time-evolving digital library collection, considering several dynamic and realistic scenarios. |
id |
UFMG_95dba70a9356a91153310cbe70b921cc |
---|---|
oai_identifier_str |
oai:repositorio.ufmg.br:1843/ESSA-998NKM |
network_acronym_str |
UFMG |
network_name_str |
Repositório Institucional da UFMG |
repository_id_str |
|
dc.title.none.fl_str_mv |
Contributions for Solving the Author Name Ambiguity Problem in Bibliographic Citations |
title |
Contributions for Solving the Author Name Ambiguity Problem in Bibliographic Citations |
author |
Anderson Almeida Ferreira |
author_facet |
Anderson Almeida Ferreira |
author_role |
author |
dc.contributor.none.fl_str_mv |
Marcos Andre Goncalves; Alberto Henrique Frade Laender; Clodoveu Augusto Davis Junior; Gisele Lobo Pappa; Carlos Alberto Heuser; Ricardo da Silva Torres |
dc.contributor.author.fl_str_mv |
Anderson Almeida Ferreira |
dc.subject.por.fl_str_mv |
ambiguity; bibliographic citations; digital libraries; computing; information retrieval systems |
topic |
ambiguity; bibliographic citations; digital libraries; computing; information retrieval systems |
publishDate |
2012 |
dc.date.none.fl_str_mv |
2012-06-21 2019-08-09T16:01:27Z 2019-08-09T16:01:27Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/1843/ESSA-998NKM |
url |
http://hdl.handle.net/1843/ESSA-998NKM |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais UFMG |
publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais UFMG |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG |
instname_str |
Universidade Federal de Minas Gerais (UFMG) |
instacron_str |
UFMG |
institution |
UFMG |
reponame_str |
Repositório Institucional da UFMG |
collection |
Repositório Institucional da UFMG |
repository.name.fl_str_mv |
Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG) |
repository.mail.fl_str_mv |
repositorio@ufmg.br |