Post-OCR Correction on Complaint Processing

Bibliographic details
Main author: Gonçalo Batalhão Alves
Publication date: 2023
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: https://hdl.handle.net/10216/151937
Abstract: Health regulation entities deal with a high number of customer complaints, receiving them through various means such as online forms, emails, letters, or a physical complaint book. In an effort to automatise their complaint screening and prioritisation process, they have been using Natural Language Processing (NLP) models to help guide the decision process. Most of the complaints passed to these models do not come from online forms but from digitised copies of printed documents. The objective of these entities is to promptly resolve any issues mentioned in a complaint regarding a health service. Manually analysing each complaint is therefore inefficient, because complaints arrive faster than they can be processed by hand. The application of NLP models aims to reduce the processing time of the complaints and improve the decision quality compared to the manual process. Although NLP has seen considerable improvements in past years and its impact has increased, most models are trained on curated text. When documents need to be digitised, an optical character recognition (OCR) technique is applied. The extracted text contains errors caused by poor scan quality, and as such, it might not have the quality an NLP complaint classification model needs to produce the correct classification output. Although there are recent approaches to tackle this issue, in the form of Post-OCR correction, there is a lack of good-performing models, and there is little work for languages in which linguistic resources are less abundant, such as Portuguese. This thesis intends to present a novel approach to combat these poorly retrieved texts through the use of NLP models applied to the Post-OCR process. We believe that using an approach similar to that of an intelligent keyboard in a smartphone will improve the quality of these extracted texts. By using a dictionary to predict each correct character from the text extracted by the OCR process and then applying the same technique to words, we expect to reduce the number of errors in the extracted text. We also expect that the NLP models used to classify complaints will perform better later in the complaint processing pipeline due to the increased text quality. To validate that our methodology produces good results, we intend to carry out both intrinsic and extrinsic evaluation. In an intrinsic evaluation, the quality of an NLP system's outputs is evaluated against a pre-determined ground truth. For this, we intend to apply our approach to curated datasets from known competitions, such as the ICDAR 2019 Competition on Post-OCR Text Correction [71], and compare our results with the ground truth provided. In contrast, an extrinsic evaluation assesses a system's outputs based on their impact on the performance of other NLP systems. In this case, we will compare the outputs of the NLP complaint classification system before and after applying our novel approach.
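The dictionary-based correction and the intrinsic evaluation described in the abstract can be illustrated with a minimal sketch. This is not the thesis's method, which the record does not specify: the sketch assumes a plain nearest-word lookup by Levenshtein distance and uses character error rate (CER) as the intrinsic metric. The vocabulary, the sample sentence, and all function names are hypothetical.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def correct_word(word: str, vocabulary: Iterable[str], max_distance: int = 2) -> str:
    """Return the closest vocabulary entry, or the word unchanged if none is close enough."""
    vocab = list(vocabulary)
    if word in vocab:
        return word
    best = min(vocab, key=lambda v: levenshtein(word, v), default=word)
    return best if levenshtein(word, best) <= max_distance else word


def correct_text(text: str, vocabulary: Iterable[str]) -> str:
    """Correct each whitespace-separated token independently."""
    vocab = list(vocabulary)
    return " ".join(correct_word(token, vocab) for token in text.split())


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Intrinsic metric: edit distance to the ground truth, normalised by its length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)


# Hypothetical toy data: "1" read for "l" and "rn" read for "m" are typical OCR confusions.
VOCABULARY = ["a", "consulta", "foi", "cancelada", "sem", "aviso"]
ocr_output = "a consu1ta foi cance1ada sern aviso"
ground_truth = "a consulta foi cancelada sem aviso"

corrected = correct_text(ocr_output, VOCABULARY)
cer_before = character_error_rate(ocr_output, ground_truth)
cer_after = character_error_rate(corrected, ground_truth)
print(corrected, cer_before, cer_after)
```

An extrinsic evaluation would instead feed both `ocr_output` and `corrected` to the downstream complaint classifier and compare its predictions. That step is omitted here because the record gives no details about the classifier.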
id RCAP_86b745c87904d1c99b6bdf79c6a5cd68
oai_identifier_str oai:repositorio-aberto.up.pt:10216/151937
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Post-OCR Correction on Complaint Processing
title Post-OCR Correction on Complaint Processing
spellingShingle Post-OCR Correction on Complaint Processing
Gonçalo Batalhão Alves
Engenharia electrotécnica, electrónica e informática
Electrical engineering, Electronic engineering, Information engineering
author Gonçalo Batalhão Alves
author_facet Gonçalo Batalhão Alves
author_role author
dc.contributor.author.fl_str_mv Gonçalo Batalhão Alves
dc.subject.por.fl_str_mv Engenharia electrotécnica, electrónica e informática
Electrical engineering, Electronic engineering, Information engineering
topic Engenharia electrotécnica, electrónica e informática
Electrical engineering, Electronic engineering, Information engineering
publishDate 2023
dc.date.none.fl_str_mv 2023-07-20
2023-07-20T00:00:00Z
2026-07-19T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://hdl.handle.net/10216/151937
TID:203428412
url https://hdl.handle.net/10216/151937
identifier_str_mv TID:203428412
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/embargoedAccess
eu_rights_str_mv embargoedAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799135822647132160