Extracting information from PDF documents for use in automatic indexing of e-books
Main author: | Gil-Leiva, Isidoro |
---|---|
Publication date: | 2022 |
Other authors: | Fujita, Mariângela Spotti Lopes; Redigolo, Franciele Marques; Saran, Jordan Ferreira |
Document type: | Article |
Language: | spa |
Source title: | Transinformação (Online) |
Full text: | https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870 |
Abstract: | The number of electronic books entering libraries in PDF format grows every day, which complicates, and makes almost unfeasible, some processes traditionally carried out manually by librarians, such as subject assignment. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, we present an evaluation of tools for extracting information from books in PDF format that could later serve as raw material for an automatic indexing system. We first evaluated five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and Grobib). Because PDFAct achieved the best performance, we then carried out a second evaluation of its ability to identify and extract information from books, such as titles, indexes, sections, titles of tables and graphs, and bibliographic references, all of which is relevant to any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest. |
id |
PUC_CAMP-4_56614557488b08b188f9e8367ec4b78d |
---|---|
oai_identifier_str |
oai:ojs.periodicos.puc-campinas.edu.br:article/6870 |
network_acronym_str |
PUC_CAMP-4 |
network_name_str |
Transinformação (Online) |
repository_id_str |
|
spelling |
Extracting information from PDF documents for use in automatic indexing of e-books Extracción de información de documentos PDF para su uso en la indización automática de e-books Software evaluation PDFMiner.six PDFAct PDF-extract PDFExtract Grobib Automatic indexing Evaluación de software Grobib Indización automática PDFMiner.six PDFAct PDF-extract PDFExtract The number of electronic books entering libraries in PDF format grows every day, which complicates, and makes almost unfeasible, some processes traditionally carried out manually by librarians, such as subject assignment. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, we present an evaluation of tools for extracting information from books in PDF format that could later serve as raw material for an automatic indexing system. We first evaluated five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and Grobib). Because PDFAct achieved the best performance, we then carried out a second evaluation of its ability to identify and extract information from books, such as titles, indexes, sections, titles of tables and graphs, and bibliographic references, all of which is relevant to any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest. El número de libros electrónicos que ingresan en las bibliotecas en formato PDF cada día es mayor, complicando y haciendo casi inviables algunos procesos realizados tradicionalmente de forma manual por los bibliotecarios, como es la asignación de materias. En este contexto, se hace necesario el diseño y desarrollo de aplicaciones que asistan a los bibliotecarios. Teniendo esto en consideración, presentamos en este trabajo la evaluación de herramientas de extracción de información de libros en PDF que podrían usarse posteriormente como materia prima para un sistema de indización automática. Para ello, realizamos una primera evaluación de cinco softwares (PDFMiner.six, PDFAct, PDF-extract, PDFExtract y Grobib) y, posteriormente, como PDFAct consiguió el mejor rendimiento, hicimos una segunda evaluación para averiguar su capacidad para identificar y extraer informaciones de los libros, tales como títulos, índices, secciones, títulos de tablas y gráficos y referencias bibliográficas, informaciones relevantes para cualquier sistema de indización. Se concluye que ninguna de las herramientas evaluadas extrae adecuadamente las diferentes partes de libros en PDF, si bien, PDFAct ha logrado un rendimiento superior al del resto. Núcleo de Editoração - PUC-Campinas 2022-09-23 info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion Peer-reviewed Article Artículo revisado por pares Avaliado pelos Pares application/pdf https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870 Transinformação; Vol. 34 (2022); 1-11 Transinformação; Vol. 34 (2022); 1-11 Transinformação; v. 34 (2022); 1-11 2318-0889 0103-3786 reponame:Transinformação (Online) instname:Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS) instacron:PUC_CAMP spa https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870/4480 https://creativecommons.org/licenses/by/4.0 info:eu-repo/semantics/openAccess Gil-Leiva, Isidoro Fujita, Mariângela Spotti Lopes Redigolo, Franciele Marques Saran, Jordan Ferreira 2024-04-02T12:31:03Z oai:ojs.periodicos.puc-campinas.edu.br:article/6870 Revista http://periodicos.puc-campinas.edu.br/seer/index.php/transinfo/index PRI https://old.scielo.br/oai/scielo-oai.php sbi.nucleodeeditoracao@puc-campinas.edu.br 2318-0889 0103-3786 opendoar: 2024-04-02T12:31:03 Transinformação (Online) - Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS) false |
dc.title.none.fl_str_mv |
Extracting information from PDF documents for use in automatic indexing of e-books Extracción de información de documentos PDF para su uso en la indización automática de e-books |
title |
Extracting information from PDF documents for use in automatic indexing of e-books |
spellingShingle |
Extracting information from PDF documents for use in automatic indexing of e-books Gil-Leiva, Isidoro Software evaluation PDFMiner.six PDFAct PDF-extract PDFExtract Grobib Automatic indexing Evaluación de software Grobib Indización automática PDFMiner.six PDFAct PDF-extract PDFExtract |
title_short |
Extracting information from PDF documents for use in automatic indexing of e-books |
title_full |
Extracting information from PDF documents for use in automatic indexing of e-books |
title_fullStr |
Extracting information from PDF documents for use in automatic indexing of e-books |
title_full_unstemmed |
Extracting information from PDF documents for use in automatic indexing of e-books |
title_sort |
Extracting information from PDF documents for use in automatic indexing of e-books |
author |
Gil-Leiva, Isidoro |
author_facet |
Gil-Leiva, Isidoro Fujita, Mariângela Spotti Lopes Redigolo, Franciele Marques Saran, Jordan Ferreira |
author_role |
author |
author2 |
Fujita, Mariângela Spotti Lopes Redigolo, Franciele Marques Saran, Jordan Ferreira |
author2_role |
author author author |
dc.contributor.author.fl_str_mv |
Gil-Leiva, Isidoro Fujita, Mariângela Spotti Lopes Redigolo, Franciele Marques Saran, Jordan Ferreira |
dc.subject.por.fl_str_mv |
Software evaluation PDFMiner.six PDFAct PDF-extract PDFExtract Grobib Automatic indexing Evaluación de software Grobib Indización automática PDFMiner.six PDFAct PDF-extract PDFExtract |
topic |
Software evaluation PDFMiner.six PDFAct PDF-extract PDFExtract Grobib Automatic indexing Evaluación de software Grobib Indización automática PDFMiner.six PDFAct PDF-extract PDFExtract |
description |
The number of electronic books entering libraries in PDF format grows every day, which complicates, and makes almost unfeasible, some processes traditionally carried out manually by librarians, such as subject assignment. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, we present an evaluation of tools for extracting information from books in PDF format that could later serve as raw material for an automatic indexing system. We first evaluated five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and Grobib). Because PDFAct achieved the best performance, we then carried out a second evaluation of its ability to identify and extract information from books, such as titles, indexes, sections, titles of tables and graphs, and bibliographic references, all of which is relevant to any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest. |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-09-23 |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion Peer-reviewed Article Artículo revisado por pares Avaliado pelos Pares |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870 |
url |
https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870 |
dc.language.iso.fl_str_mv |
spa |
language |
spa |
dc.relation.none.fl_str_mv |
https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870/4480 |
dc.rights.driver.fl_str_mv |
https://creativecommons.org/licenses/by/4.0 info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
https://creativecommons.org/licenses/by/4.0 |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Núcleo de Editoração - PUC-Campinas |
publisher.none.fl_str_mv |
Núcleo de Editoração - PUC-Campinas |
dc.source.none.fl_str_mv |
Transinformação; Vol. 34 (2022); 1-11 Transinformação; Vol. 34 (2022); 1-11 Transinformação; v. 34 (2022); 1-11 2318-0889 0103-3786 reponame:Transinformação (Online) instname:Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS) instacron:PUC_CAMP |
instname_str |
Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS) |
instacron_str |
PUC_CAMP |
institution |
PUC_CAMP |
reponame_str |
Transinformação (Online) |
collection |
Transinformação (Online) |
repository.name.fl_str_mv |
Transinformação (Online) - Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS) |
repository.mail.fl_str_mv |
sbi.nucleodeeditoracao@puc-campinas.edu.br |
_version_ |
1799125986822848512 |