Fine-tuning a transformers-based model to extract relevant fields from invoices

Cruz, Rui Francisco Pereira Moital Loureiro da

Bibliographic details
Main author:	Cruz, Rui Francisco Pereira Moital Loureiro da
Publication date:	2021
Document type:	Dissertation
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text:	http://hdl.handle.net/10362/130277
Summary:	Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science

Item metadata

id	RCAP_23da8d0021aec8f9e1856c1333d5cc36
oai_identifier_str	oai:run.unl.pt:10362/130277
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str	7160
spelling	Fine-tuning a transformers-based model to extract relevant fields from invoices. Document data extraction; Deep Learning; Transformers; Invoice dataset. Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science.
Extraction of relevant fields from documents has been a relevant matter for decades. Although well-established algorithms for this task have existed since the late 20th century, the field has regained attention with the fast growth of deep learning models and transfer learning. One of these models is LayoutLM, a Transformer-based architecture pre-trained with additional features that represent the 2D position of the words. In this dissertation, LayoutLM is fine-tuned on a set of invoices to extract some of their relevant fields, such as company name, address and document date, among others. Given the objective of deploying the model in a company’s internal accounting software, an end-to-end machine learning pipeline is presented. The training layer receives batches of document images with their corresponding annotations and fine-tunes the model for a sequence labeling task. The production layer takes images as input and predicts the relevant fields. The images are pre-processed with OCR to extract the full document text and its bounding boxes. To label the samples automatically in the input format expected by Transformer-based models, an algorithm searches the text for parts that are equal or highly similar to the annotations. In addition, a new dataset to support this work is created and made publicly available. The dataset consists of 813 pictures and the annotation text for every relevant field: company name, company address, document date, document number, buyer tax number, seller tax number, total amount and tax amount.
The models are fine-tuned and compared with two baseline models, showing a performance very close to that reported by the model's authors. A sensitivity analysis is performed to understand the impact of two datasets with different characteristics. In addition, the learning curves for different datasets show empirically that 100 to 200 samples are enough to fine-tune the model and achieve top performance. Based on the results, a strategy for model deployment is defined. Empirical results show that the already fine-tuned model is enough to guarantee top performance in production, without the need for online learning algorithms.
Castell, Mauro; RUN; Cruz, Rui Francisco Pereira Moital Loureiro da; 2022-01-05T15:06:12Z; 2021-12-20; 2021-12-20T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; application/pdf; http://hdl.handle.net/10362/130277; TID:202946142; eng; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2024-03-11T05:08:59Z; oai:run.unl.pt:10362/130277; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-20T03:46:44.173308; Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false
dc.title.none.fl_str_mv	Fine-tuning a transformers-based model to extract relevant fields from invoices
title	Fine-tuning a transformers-based model to extract relevant fields from invoices
spellingShingle	Fine-tuning a transformers-based model to extract relevant fields from invoices Cruz, Rui Francisco Pereira Moital Loureiro da Document data extraction Deep Learning Transformers Invoice dataset
title_short	Fine-tuning a transformers-based model to extract relevant fields from invoices
title_full	Fine-tuning a transformers-based model to extract relevant fields from invoices
title_fullStr	Fine-tuning a transformers-based model to extract relevant fields from invoices
title_full_unstemmed	Fine-tuning a transformers-based model to extract relevant fields from invoices
title_sort	Fine-tuning a transformers-based model to extract relevant fields from invoices
author	Cruz, Rui Francisco Pereira Moital Loureiro da
author_facet	Cruz, Rui Francisco Pereira Moital Loureiro da
author_role	author
dc.contributor.none.fl_str_mv	Castell, Mauro RUN
dc.contributor.author.fl_str_mv	Cruz, Rui Francisco Pereira Moital Loureiro da
dc.subject.por.fl_str_mv	Document data extraction Deep Learning Transformers Invoice dataset
topic	Document data extraction Deep Learning Transformers Invoice dataset
description	Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
publishDate	2021
dc.date.none.fl_str_mv	2021-12-20 2021-12-20T00:00:00Z 2022-01-05T15:06:12Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10362/130277 TID:202946142
url	http://hdl.handle.net/10362/130277
identifier_str_mv	TID:202946142
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799138070831824896
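
The abstract mentions that OCR text is labeled automatically by searching for parts of the text that are equal or highly similar to the annotations. The dissertation's actual algorithm is not given in this record, so the following is only a minimal sketch of that idea under stated assumptions: it uses Python's standard `difflib.SequenceMatcher` as the similarity measure, a sliding window over OCR words, and a similarity threshold of 0.85. The function names (`best_span`, `label_words`), the threshold, and the sample words are all hypothetical and chosen for illustration.

```python
from difflib import SequenceMatcher


def best_span(words, target, threshold=0.85):
    """Return (start, end, score) for the word window most similar to
    `target`, or None if no window reaches `threshold`.
    Hypothetical helper, not taken from the dissertation."""
    target_norm = target.lower().strip()
    n = len(target_norm.split())
    best = None
    # Also try windows one word shorter or longer than the annotation,
    # since OCR may merge or split tokens.
    for size in range(max(1, n - 1), n + 2):
        for start in range(0, len(words) - size + 1):
            candidate = " ".join(words[start:start + size]).lower()
            score = SequenceMatcher(None, candidate, target_norm).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (start, start + size, score)
    return best


def label_words(words, annotations, threshold=0.85):
    """Assign a field name to each OCR word, or "O" when no field matches."""
    labels = ["O"] * len(words)
    for field, text in annotations.items():
        span = best_span(words, text, threshold)
        if span is None:
            continue  # annotation not found in the OCR output
        start, end, _ = span
        for i in range(start, end):
            labels[i] = field
    return labels


# Made-up OCR output; "2021-03-O1" simulates an OCR error (letter O for zero).
words = ["ACME", "Lda", "Invoice", "FT", "2021/15",
         "Date:", "2021-03-O1", "Total", "123.45"]
annotations = {"company": "ACME Lda", "date": "2021-03-01", "total": "123.45"}

for word, label in zip(words, label_words(words, annotations)):
    print(f"{word}\t{label}")
```

The threshold trades recall for precision: a lower value tolerates more OCR noise but increases the risk of labeling an unrelated span, so in practice it would need tuning on real data.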
