Extracting information from PDF documents for use in automatic indexing of e-books

Gil-Leiva, Isidoro; Fujita, Mariângela Spotti Lopes; Redigolo, Franciele Marques; Saran, Jordan Ferreira

Bibliographic details
Main author:	Gil-Leiva, Isidoro
Publication date:	2022
Other authors:	Fujita, Mariângela Spotti Lopes; Redigolo, Franciele Marques; Saran, Jordan Ferreira
Document type:	Article
Language:	spa
Source title:	Transinformação (Online)
Full text:	https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870
Abstract:	The number of electronic books entering libraries in PDF format grows every day, which complicates some processes traditionally carried out manually by librarians, such as subject assignment, and makes them almost unfeasible. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, we present an evaluation of tools for extracting information from books in PDF format that could later be used as raw material for an automatic indexing system. We first evaluated five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and Grobid). Because PDFAct achieved the best performance, we then ran a second evaluation of its ability to identify and extract information from books, such as titles, indexes, sections, table and graph titles, and bibliographic references, all of which are relevant to any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest.

Item metadata

id	PUC_CAMP-4_56614557488b08b188f9e8367ec4b78d
oai_identifier_str	oai:ojs.periodicos.puc-campinas.edu.br:article/6870
network_acronym_str	PUC_CAMP-4
network_name_str	Transinformação (Online)
repository_id_str
dc.title.none.fl_str_mv	Extracting information from PDF documents for use in automatic indexing of e-books / Extracción de información de documentos PDF para su uso en la indización automática de e-books
title	Extracting information from PDF documents for use in automatic indexing of e-books
author	Gil-Leiva, Isidoro
author_facet	Gil-Leiva, Isidoro; Fujita, Mariângela Spotti Lopes; Redigolo, Franciele Marques; Saran, Jordan Ferreira
author_role	author
author2	Fujita, Mariângela Spotti Lopes; Redigolo, Franciele Marques; Saran, Jordan Ferreira
author2_role	author author author
dc.contributor.author.fl_str_mv	Gil-Leiva, Isidoro; Fujita, Mariângela Spotti Lopes; Redigolo, Franciele Marques; Saran, Jordan Ferreira
dc.subject.por.fl_str_mv	Software evaluation; PDFMiner.six; PDFAct; PDF-extract; PDFExtract; Grobid; Automatic indexing; Evaluación de software; Indización automática
topic	Software evaluation; PDFMiner.six; PDFAct; PDF-extract; PDFExtract; Grobid; Automatic indexing; Evaluación de software; Indización automática
publishDate	2022
dc.date.none.fl_str_mv	2022-09-23
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion Peer-reviewed Article Artículo revisado por pares Avaliado pelos Pares
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870
url	https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870
dc.language.iso.fl_str_mv	spa
language	spa
dc.relation.none.fl_str_mv	https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870/4480
dc.rights.driver.fl_str_mv	https://creativecommons.org/licenses/by/4.0 info:eu-repo/semantics/openAccess
rights_invalid_str_mv	https://creativecommons.org/licenses/by/4.0
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Núcleo de Editoração - PUC-Campinas
publisher.none.fl_str_mv	Núcleo de Editoração - PUC-Campinas
dc.source.none.fl_str_mv	Transinformação; Vol. 34 (2022); 1-11; 2318-0889; 0103-3786; reponame:Transinformação (Online); instname:Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS); instacron:PUC_CAMP
instname_str	Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS)
instacron_str	PUC_CAMP
institution	PUC_CAMP
reponame_str	Transinformação (Online)
collection	Transinformação (Online)
repository.name.fl_str_mv	Transinformação (Online) - Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS)
repository.mail.fl_str_mv	sbi.nucleodeeditoracao@puc-campinas.edu.br
_version_	1799125986822848512
