Post-OCR Correction on Complaint Processing
Main author: | Gonçalo Batalhão Alves |
---|---|
Publication date: | 2023 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | https://hdl.handle.net/10216/151937 |
Abstract: | Health regulation entities deal with a high number of customer complaints, receiving them through various means such as online forms, emails, letters, or a physical complaint book. In an effort to automatise their complaint screening and prioritisation process, they have been using Natural Language Processing (NLP) models to help guide the decision process. Most of the complaints passed to these models do not come from online forms but from digitised copies of printed documents. These entities aim to promptly resolve any issue raised in a complaint about a health service, so manually analysing each complaint is inefficient: the volume of incoming data outpaces the manual process. The application of NLP models aims to reduce the processing time of the complaints and improve the decision quality compared to the manual process. Although NLP has improved considerably in recent years and its impact has grown, most models are trained on curated text. When documents need to be digitised, an optical character recognition (OCR) technique is applied. The extracted text contains errors caused by the quality of the scan, so it may not be clean enough for an NLP complaint classification model to produce the correct classification output. Although recent approaches tackle this issue in the form of post-OCR correction, well-performing models are lacking, and there is little work for languages in which linguistic resources are less abundant, such as Portuguese. This thesis intends to present a novel approach to combat these poorly retrieved texts through the use of NLP models applied to the post-OCR process. We believe that an approach similar to that of an intelligent keyboard in a smartphone will improve the quality of these extracted texts. By using a dictionary to predict each correct character from the text extracted by the OCR process and then applying the same technique to words, we will reduce the number of errors in the extracted text. We also expect that the NLP models used to classify complaints will perform better later in the complaint processing pipeline because of the increased text quality. To validate that our methodology produces good results, we intend to carry out intrinsic and extrinsic evaluation. In an intrinsic evaluation, the quality of an NLP system's outputs is evaluated against a pre-determined ground truth. For this, we intend to apply our approach to curated datasets from known competitions, such as the ICDAR 2019 Competition on Post-OCR Text Correction [71], and compare our results with the ground truth provided. In contrast, an extrinsic evaluation assesses a system's outputs by their impact on the performance of other NLP systems. In this case, we will compare the outputs of the NLP complaint classification system before and after applying our novel approach. |
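
As a rough illustration of the dictionary-based word correction and the intrinsic evaluation described in the abstract, the following Python sketch may help. It is not the thesis's implementation: the nearest-neighbour lookup by edit distance, the character error rate (CER) metric, the tiny `LEXICON`, the `max_distance` threshold, and all function names are assumptions introduced here for illustration only.

```python
from typing import Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_word(word: str, lexicon: Set[str], max_distance: int = 2) -> str:
    """Return the closest lexicon entry, or the word itself if none is close.

    Hypothetical rule: accept a replacement only within max_distance edits.
    """
    if word in lexicon or not lexicon:
        return word
    # Sorting makes tie-breaking deterministic.
    best = min(sorted(lexicon), key=lambda entry: levenshtein(word, entry))
    return best if levenshtein(word, best) <= max_distance else word


def correct_text(text: str, lexicon: Set[str]) -> str:
    """Correct a whitespace-tokenised text word by word."""
    return " ".join(correct_word(token, lexicon) for token in text.split())


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)


# Toy data, invented for this example.
LEXICON = {"the", "patient", "waited", "three", "hours"}
ocr_text = "tbe patlent waited tnree h0urs"
truth = "the patient waited three hours"

corrected = correct_text(ocr_text, LEXICON)
print("CER before:", cer(ocr_text, truth))
print("CER after: ", cer(corrected, truth))
```

Comparing the CER before and after correction against a ground truth mirrors the intrinsic evaluation; the extrinsic evaluation would instead feed both versions of the text to the downstream complaint classifier and compare its outputs.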
id | RCAP_86b745c87904d1c99b6bdf79c6a5cd68 |
---|---|
oai_identifier_str | oai:repositorio-aberto.up.pt:10216/151937 |
network_acronym_str | RCAP |
network_name_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str | 7160 |
dc.title.none.fl_str_mv | Post-OCR Correction on Complaint Processing |
title | Post-OCR Correction on Complaint Processing |
author | Gonçalo Batalhão Alves |
author_facet | Gonçalo Batalhão Alves |
author_role | author |
dc.contributor.author.fl_str_mv | Gonçalo Batalhão Alves |
dc.subject.por.fl_str_mv | Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering |
topic | Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering |
publishDate | 2023 |
dc.date.none.fl_str_mv | 2023-07-20 2023-07-20T00:00:00Z 2026-07-19T00:00:00Z |
dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv | info:eu-repo/semantics/masterThesis |
format | masterThesis |
status_str | publishedVersion |
dc.identifier.uri.fl_str_mv | https://hdl.handle.net/10216/151937 TID:203428412 |
url | https://hdl.handle.net/10216/151937 |
identifier_str_mv | TID:203428412 |
dc.language.iso.fl_str_mv | eng |
language | eng |
dc.rights.driver.fl_str_mv | info:eu-repo/semantics/embargoedAccess |
eu_rights_str_mv | embargoedAccess |
dc.format.none.fl_str_mv | application/pdf |
dc.source.none.fl_str_mv | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str | Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str | RCAAP |
institution | RCAAP |
reponame_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv | |
_version_ | 1799135822647132160 |