Extracting information from PDF documents for use in automatic indexing of e-books

Bibliographic details
Main author: Gil-Leiva, Isidoro
Publication date: 2022
Other authors: Fujita, Mariângela Spotti Lopes; Redigolo, Franciele Marques; Saran, Jordan Ferreira
Document type: Article
Language: spa
Source title: Transinformação (Online)
Full text: https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870
Abstract: The number of electronic books entering libraries in PDF format grows every day, complicating some processes traditionally carried out manually by librarians, such as subject assignment, and making them almost unfeasible. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, this work presents an evaluation of tools for extracting information from books in PDF format that could later serve as raw material for an automatic indexing system. We first evaluated five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and Grobib). Because PDFAct achieved the best performance, we then carried out a second evaluation of its ability to identify and extract information from books, such as titles, indexes, sections, titles of tables and graphs, and bibliographic references, all of which is relevant to any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest.
id PUC_CAMP-4_56614557488b08b188f9e8367ec4b78d
oai_identifier_str oai:ojs.periodicos.puc-campinas.edu.br:article/6870
network_acronym_str PUC_CAMP-4
network_name_str Transinformação (Online)
repository_id_str
dc.title.none.fl_str_mv Extracting information from PDF documents for use in automatic indexing of e-books
Extracción de información de documentos PDF para su uso en la indización automática de e-books
title Extracting information from PDF documents for use in automatic indexing of e-books
author Gil-Leiva, Isidoro
author_facet Gil-Leiva, Isidoro
Fujita, Mariângela Spotti Lopes
Redigolo, Franciele Marques
Saran, Jordan Ferreira
author_role author
author2 Fujita, Mariângela Spotti Lopes
Redigolo, Franciele Marques
Saran, Jordan Ferreira
author2_role author
author
author
dc.contributor.author.fl_str_mv Gil-Leiva, Isidoro
Fujita, Mariângela Spotti Lopes
Redigolo, Franciele Marques
Saran, Jordan Ferreira
dc.subject.por.fl_str_mv Software evaluation
PDFMiner.six
PDFAct
PDF-extract
PDFExtract
Grobib
Automatic indexing
publishDate 2022
dc.date.none.fl_str_mv 2022-09-23
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
Peer-reviewed Article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870
url https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870
dc.language.iso.fl_str_mv spa
language spa
dc.relation.none.fl_str_mv https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870/4480
dc.rights.driver.fl_str_mv https://creativecommons.org/licenses/by/4.0
info:eu-repo/semantics/openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by/4.0
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Núcleo de Editoração - PUC-Campinas
publisher.none.fl_str_mv Núcleo de Editoração - PUC-Campinas
dc.source.none.fl_str_mv Transinformação; Vol. 34 (2022); 1-11
2318-0889
0103-3786
reponame:Transinformação (Online)
instname:Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS)
instacron:PUC_CAMP
instname_str Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS)
instacron_str PUC_CAMP
institution PUC_CAMP
reponame_str Transinformação (Online)
collection Transinformação (Online)
repository.name.fl_str_mv Transinformação (Online) - Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS)
repository.mail.fl_str_mv sbi.nucleodeeditoracao@puc-campinas.edu.br
_version_ 1799125986822848512