Fine-tuning a transformers-based model to extract relevant fields from invoices
Main author: | Cruz, Rui Francisco Pereira Moital Loureiro da |
---|---|
Publication date: | 2021 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10362/130277 |
Summary: | Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science |
id |
RCAP_23da8d0021aec8f9e1856c1333d5cc36 |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/130277 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Fine-tuning a transformers-based model to extract relevant fields from invoices. Document data extraction; Deep Learning; Transformers; Invoice dataset. Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. Extracting relevant fields from documents has been an important problem for decades. Although well-established algorithms for this task have existed since the late 20th century, the field has attracted renewed attention with the rapid growth of deep learning models and transfer learning. One of these models is LayoutLM, a Transformer-based architecture pre-trained with additional features that represent the 2D position of the words. In this dissertation, LayoutLM is fine-tuned on a set of invoices to extract some of their relevant fields, such as company name, address, and document date. Given the objective of deploying the model in a company’s internal accounting software, an end-to-end machine learning pipeline is presented. The training layer receives batches of document images with their corresponding annotations and fine-tunes the model for a sequence labeling task. The production layer takes images as input and predicts the relevant fields. The images are pre-processed by extracting the full document text and bounding boxes using OCR. To automatically label the samples in the input format required by Transformer-based models, an algorithm searches for parts of the text that are equal or highly similar to the annotations. A new dataset supporting this work is also created and made publicly available. The dataset consists of 813 pictures and the annotation text for every relevant field: company name, company address, document date, document number, buyer tax number, seller tax number, total amount and tax amount.
The models are fine-tuned and compared with two baseline models, showing performance very close to that reported by the model's authors. A sensitivity analysis is performed to understand the impact of two datasets with different characteristics. In addition, the learning curves for different datasets show empirically that 100 to 200 samples are enough to fine-tune the model and achieve top performance. Based on the results, a strategy for model deployment is defined. Empirical results show that the already fine-tuned model is enough to guarantee top performance in production without the need for online learning algorithms. Castell, Mauro; RUN; Cruz, Rui Francisco Pereira Moital Loureiro da; 2022-01-05T15:06:12Z; 2021-12-20; 2021-12-20T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; application/pdf; http://hdl.handle.net/10362/130277; TID:202946142; eng; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2024-03-11T05:08:59Z; oai:run.unl.pt:10362/130277; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-20T03:46:44.173308; Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
dc.title.none.fl_str_mv |
Fine-tuning a transformers-based model to extract relevant fields from invoices |
title |
Fine-tuning a transformers-based model to extract relevant fields from invoices |
spellingShingle |
Fine-tuning a transformers-based model to extract relevant fields from invoices; Cruz, Rui Francisco Pereira Moital Loureiro da; Document data extraction; Deep Learning; Transformers; Invoice dataset |
title_short |
Fine-tuning a transformers-based model to extract relevant fields from invoices |
title_full |
Fine-tuning a transformers-based model to extract relevant fields from invoices |
title_fullStr |
Fine-tuning a transformers-based model to extract relevant fields from invoices |
title_full_unstemmed |
Fine-tuning a transformers-based model to extract relevant fields from invoices |
title_sort |
Fine-tuning a transformers-based model to extract relevant fields from invoices |
author |
Cruz, Rui Francisco Pereira Moital Loureiro da |
author_facet |
Cruz, Rui Francisco Pereira Moital Loureiro da |
author_role |
author |
dc.contributor.none.fl_str_mv |
Castell, Mauro; RUN |
dc.contributor.author.fl_str_mv |
Cruz, Rui Francisco Pereira Moital Loureiro da |
dc.subject.por.fl_str_mv |
Document data extraction; Deep Learning; Transformers; Invoice dataset |
topic |
Document data extraction; Deep Learning; Transformers; Invoice dataset |
description |
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science |
publishDate |
2021 |
dc.date.none.fl_str_mv |
2021-12-20 2021-12-20T00:00:00Z 2022-01-05T15:06:12Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/130277 TID:202946142 |
url |
http://hdl.handle.net/10362/130277 |
identifier_str_mv |
TID:202946142 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799138070831824896 |
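
The abstract in this record describes an automatic labeling step that searches the OCR text for parts equal or highly similar to the annotations. The following is a minimal sketch of that idea, not the dissertation's actual implementation: the function names `similarity` and `label_words`, the BIO tagging scheme, the sliding-window search, the 0.85 threshold, and the sample words are all assumptions made for illustration. It uses only Python's standard-library `difflib`.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def label_words(words, annotations, threshold=0.85):
    """Assign one BIO tag per OCR word (hypothetical helper).

    words: OCR tokens in reading order.
    annotations: dict mapping a field name to its annotated text.
    threshold: assumed cut-off for "highly similar".
    """
    labels = ["O"] * len(words)
    for field, target in annotations.items():
        n = len(target.split())
        best_score, best_start, best_len = 0.0, None, 0
        # Try windows around the annotation's word count, since OCR
        # may split or merge tokens.
        for size in (n - 1, n, n + 1):
            if size < 1:
                continue
            for start in range(len(words) - size + 1):
                candidate = " ".join(words[start:start + size])
                score = similarity(candidate, target)
                if score > best_score:
                    best_score, best_start, best_len = score, start, size
        # Label only the single best window, and only if it is similar enough.
        if best_start is not None and best_score >= threshold:
            labels[best_start] = f"B-{field}"
            for i in range(best_start + 1, best_start + best_len):
                labels[i] = f"I-{field}"
    return labels


# Made-up OCR output with one misread character ("O" instead of "0").
words = ["ACME", "Lda", "Invoice", "No", "FT2021/17", "Date", "2021-12-2O"]
annotations = {"company": "ACME Lda", "date": "2021-12-20"}
labels = label_words(words, annotations)
```

In this sketch the exact match "ACME Lda" and the near match "2021-12-2O" are both tagged, while unrelated words keep the "O" tag. Per-word labels of this kind are the usual input for a sequence labeling task; a real pipeline would also carry each word's bounding box alongside its label.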