Post-OCR Correction on Complaint Processing

Gonçalo Batalhão Alves

Bibliographic details
Main author:	Gonçalo Batalhão Alves
Publication date:	2023
Document type:	Dissertation
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text:	https://hdl.handle.net/10216/151937
Abstract:	Health regulation entities deal with a high number of customer complaints, which they receive through various channels such as online forms, emails, letters, or a physical complaint book. To automate complaint screening and prioritisation, they have been using Natural Language Processing (NLP) models to help guide the decision process. Most of the complaints passed to these models come not from online forms but from digitised versions of printed documents. These entities aim to promptly resolve any issue raised in a complaint about a health service, and analysing each complaint manually is inefficient because complaints arrive faster than they can be processed by hand. Applying NLP models aims to reduce complaint processing time and improve decision quality relative to the manual process. Although NLP has improved considerably in recent years and its impact has grown, most models are trained on curated text. When documents need to be digitised, an optical character recognition (OCR) technique is applied. The extracted text contains errors caused by poor scan quality, so it may not be good enough for an NLP complaint classification model to produce the correct output. Although recent approaches tackle this issue through Post-OCR correction, well-performing models are lacking, and there is little work on languages with less abundant linguistic resources, such as Portuguese. This thesis intends to present a novel approach to handling these poorly extracted texts by applying NLP models to the Post-OCR process. We believe that an approach similar to that of an intelligent smartphone keyboard will improve the quality of the extracted texts.
By using a dictionary to predict each correct character in the OCR output and then applying the same technique to words, we aim to reduce the number of errors in the extracted text. We also expect the NLP models that classify complaints later in the processing pipeline to perform better thanks to the increased text quality. To validate our methodology, we intend to carry out both intrinsic and extrinsic evaluation. In an intrinsic evaluation, the quality of an NLP system's outputs is measured against a pre-determined ground truth. For this, we intend to apply our approach to curated datasets from known competitions, such as the ICDAR 2019 Competition on Post-OCR Text Correction [71], and compare our results with the ground truth provided. In contrast, an extrinsic evaluation assesses a system's outputs by their impact on the performance of other NLP systems. In this case, we will compare the outputs of the NLP complaint classification system before and after applying our novel approach.
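The dictionary-based correction and the intrinsic evaluation described in the abstract can be illustrated with a rough sketch. This is not the thesis's method: it works at word level only, swaps in Python's standard difflib similarity for whatever predictor the author uses, and relies on a toy lexicon. The names LEXICON, correct_word, correct_text and char_error_rate are hypothetical, and character error rate is only one plausible intrinsic metric.

```python
from difflib import get_close_matches

# Hypothetical toy lexicon (an assumption for illustration); a real system
# would load a full word list for the target language.
LEXICON = ["the", "at", "was", "complaint", "hospital", "service", "doctor"]


def correct_word(word, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon entry, or the word unchanged if none is close."""
    if word.lower() in lexicon:
        return word
    matches = get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Correct a whitespace-tokenised string word by word."""
    return " ".join(correct_word(w) for w in text.split())


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def char_error_rate(hypothesis, reference):
    """Intrinsic metric: edit distance normalised by the reference length."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)


if __name__ == "__main__":
    ocr_output = "the cornplaint at the hospita1"
    ground_truth = "the complaint at the hospital"
    corrected = correct_text(ocr_output)
    print(corrected)
    print(char_error_rate(ocr_output, ground_truth))
    print(char_error_rate(corrected, ground_truth))
```

An extrinsic check would follow the same before/after pattern, but would compare the labels produced by the downstream complaint classifier on the raw and corrected text instead of comparing characters against a ground truth.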

Item metadata

id	RCAP_86b745c87904d1c99b6bdf79c6a5cd68
oai_identifier_str	oai:repositorio-aberto.up.pt:10216/151937
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str	7160
dc.title.none.fl_str_mv	Post-OCR Correction on Complaint Processing
author	Gonçalo Batalhão Alves
author_facet	Gonçalo Batalhão Alves
author_role	author
dc.contributor.author.fl_str_mv	Gonçalo Batalhão Alves
dc.subject.por.fl_str_mv	Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering
topic	Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering
publishDate	2023
dc.date.none.fl_str_mv	2023-07-20 2023-07-20T00:00:00Z 2026-07-19T00:00:00Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://hdl.handle.net/10216/151937 TID:203428412
url	https://hdl.handle.net/10216/151937
identifier_str_mv	TID:203428412
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/embargoedAccess
eu_rights_str_mv	embargoedAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_	1799135822647132160
