Fine-tuning a transformers-based model to extract relevant fields from invoices

Bibliographic details
Main author: Cruz, Rui Francisco Pereira Moital Loureiro da
Publication date: 2021
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10362/130277
Summary: Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
Keywords: Document data extraction; Deep Learning; Transformers; Invoice dataset
Contributors: Castell, Mauro; RUN
Dates: 2021-12-20, 2022-01-05T15:06:12Z
Type: master's thesis (published version)
Format: application/pdf
Identifier: TID:202946142
OAI identifier: oai:run.unl.pt:10362/130277
Rights: open access
Institution: Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação (RCAAP)
Abstract: Extracting relevant fields from documents has been an active problem for decades. Although well-established algorithms for this task have existed since the late 20th century, the field has regained attention with the fast growth of deep learning models and transfer learning. One of these models is LayoutLM, a Transformer-based architecture pre-trained with additional features that represent the 2D position of the words. In this dissertation, LayoutLM is fine-tuned on a set of invoices to extract some of their relevant fields, such as company name, address and document date, among others. Given the objective of deploying the model in a company’s internal accounting software, an end-to-end machine learning pipeline is presented. The training layer receives batches of document images and their corresponding annotations and fine-tunes the model for a sequence labeling task. The production layer takes images as input and predicts the relevant fields. The images are pre-processed by extracting the whole document text and bounding boxes using OCR. To label the samples automatically in the Transformers-based input format, an algorithm searches for parts of the text that are equal or highly similar to the annotations. A new dataset supporting this work is also created and made publicly available. The dataset consists of 813 pictures and the annotation text for every relevant field: company name, company address, document date, document number, buyer tax number, seller tax number, total amount and tax amount.

The models are fine-tuned and compared with two baseline models, showing performance very close to that reported by the model's authors. A sensitivity analysis is carried out to understand the impact of two datasets with different characteristics. In addition, the learning curves for different datasets show empirically that 100 to 200 samples are enough to fine-tune the model and achieve top performance. Based on the results, a strategy for model deployment is defined. Empirical results show that the already fine-tuned model is enough to guarantee top performance in production without the need for online learning algorithms.
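
The automatic labeling step described in the abstract (finding parts of the OCR text that are equal or highly similar to each annotation) could be sketched roughly as follows. This is not the dissertation's algorithm, only a minimal illustration under stated assumptions: the function names `similarity` and `label_words`, the BIO tagging scheme, the sliding-window search, the 0.85 threshold and the sample words are all hypothetical, and similarity is measured with Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def label_words(words, annotations, threshold=0.85):
    """Assign BIO tags to OCR words by fuzzy-matching annotation strings.

    words: OCR tokens in reading order.
    annotations: dict mapping a field name to its annotated text.
    threshold: minimum similarity for a match (an assumed value).
    """
    labels = ["O"] * len(words)
    for field, text in annotations.items():
        n = len(text.split())
        if n == 0 or n > len(words):
            continue
        best_score, best_start = 0.0, None
        # Slide a window of n words over the document and keep the best match.
        for start in range(len(words) - n + 1):
            window = " ".join(words[start:start + n])
            score = similarity(window, text)
            if score > best_score:
                best_score, best_start = score, start
        # Tag the best window only if it is similar enough to the annotation.
        if best_start is not None and best_score >= threshold:
            labels[best_start] = f"B-{field}"
            for i in range(best_start + 1, best_start + n):
                labels[i] = f"I-{field}"
    return labels


# Hypothetical OCR output; "2O21-03-15" contains a typical OCR error (O for 0).
words = ["ACME", "Lda", "Invoice", "2021/0042", "Date", "2O21-03-15", "Total", "123.45"]
annotations = {"COMPANY": "ACME Lda", "DATE": "2021-03-15", "TOTAL": "123.45"}
labels = label_words(words, annotations)
```

In this sketch the misread date still receives the `B-DATE` tag because its similarity to the annotation exceeds the threshold. A fuller version would also need to handle OCR output that splits one annotated word into several tokens, and fields whose matched windows overlap.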