Automatic classification of complaints from public administration

Bibliographic details
Main author: Caldeira, Francisco Miguel Silva
Publication date: 2022
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10071/26805
Abstract: Complaint management is a problem faced by many organizations: it is vital to customer satisfaction and retention, yet highly dependent on human resources. This work tackles part of the problem by classifying summaries of complaints with machine learning models so that they can be redirected to the appropriate responders. To this end, text mining, and more specifically natural language processing, was used alongside machine learning algorithms for automatic classification. The main challenge of the task stems from the characteristics of real-world datasets, which in this case means a dataset that is both small and highly imbalanced; either property can have a large impact on the performance of classification models. The dataset analyzed in this work is relatively small, and its labels occur in very different proportions: the three most common labels account for around 95% of the dataset. Two techniques are analyzed: multistage classification, which classifies the more common labels first and the remaining labels in a second step; and generating new artificial examples for some classes via translation into other languages. The classification models explored were k-NN, SVM, Naïve Bayes, boosting, and deep learning approaches, including transformers. Although using summaries generally leads to better results, we also experimented with the full documents, classifying them with the models trained on the summarized documents. Even though the results were not on par with those for the summarized dataset, the experiment gave good results for signaling the most common label of the documents. We conclude that although, as expected, classes with little representation are hard to classify, the techniques explored helped to boost performance, especially in the classes with a low number of elements. SVM and Transformer-based models outperformed their peers.
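The multistage idea in the abstract can be illustrated with a minimal sketch. The record does not say how the stages were implemented, so everything below is an assumption: the `NaiveBayes` and `TwoStageClassifier` classes, the `OTHER` placeholder label, and the toy complaints are hypothetical and are not taken from the thesis. A small hand-written Naïve Bayes (one of the model families the abstract lists) stands in for whichever models the author actually used. Stage one separates the most common labels from a pooled "other" class; stage two is trained only on the rare classes.

```python
import math
from collections import Counter, defaultdict

OTHER = "__other__"  # hypothetical placeholder that pools all rare labels


class NaiveBayes:
    """Tiny multinomial Naive Bayes over whitespace tokens, Laplace smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_one(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in text.lower().split():
                if w in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TwoStageClassifier:
    """Stage 1: common labels vs. OTHER. Stage 2: rare labels only."""

    def __init__(self, n_common=3):
        self.n_common = n_common

    def fit(self, texts, labels):
        self.common = {l for l, _ in Counter(labels).most_common(self.n_common)}
        stage1_labels = [l if l in self.common else OTHER for l in labels]
        self.stage1 = NaiveBayes().fit(texts, stage1_labels)
        rare = [(t, l) for t, l in zip(texts, labels) if l not in self.common]
        self.stage2 = NaiveBayes().fit(*zip(*rare)) if rare else None
        return self

    def predict_one(self, text):
        label = self.stage1.predict_one(text)
        if label == OTHER and self.stage2 is not None:
            return self.stage2.predict_one(text)
        return label


# Toy data invented for illustration only (not from the thesis dataset).
texts = [
    "late payment of pension", "pension amount is wrong", "pension not received",
    "tax refund delayed", "wrong tax assessment", "tax notice error",
    "noise from road works", "broken street light",
]
labels = ["pensions"] * 3 + ["taxes"] * 3 + ["environment", "infrastructure"]

clf = TwoStageClassifier(n_common=2).fit(texts, labels)
print(clf.predict_one("street light is broken"))
print(clf.predict_one("pension payment not received"))
```

Pooling the rare labels gives stage one a less skewed problem, and stage two never has to compete with the dominant classes. The cost is that a stage-one mistake cannot be corrected later, so this is only one possible way to realize the approach.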
id RCAP_b807f04e1c7977c5771c077f4ad7933e
oai_identifier_str oai:repositorio.iscte-iul.pt:10071/26805
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling Automatic classification of complaints from public administration
Text classification
Natural language processing
Machine learning
BERT
Classificação de texto
Processamento de linguagem natural
Text classification is an open area of study within artificial intelligence: depending on the problem, the available data, and the study in question, the best method is not always the same. In companies, classifying complaints (as in this work) or even incidents is a task that still requires a great deal of manual work. This work addresses the automatic classification of complaints received by a public institution. Within the complaint-handling process, classification is one part of the bigger picture, and automating it can greatly speed up the manual processes currently in use. In this context, the work focused on the complaint summaries and on the techniques used to apply automatic classification models. The dataset is considerably small and shows a large imbalance in the class distribution, with the three largest classes holding close to 95% of the data. To address this problem, two approaches were analyzed: two-stage classification and augmenting the training set with translations of the summaries. Classification models such as k-NN, SVM, Naïve Bayes, boosting, and BERT were used. Using models trained on the summaries, an experiment was also carried out on classifying the full texts of the complaints. Although the results are worse than those obtained with the summarized data, they show some degree of success, especially in classifying the most frequent class. Based on this work, it was possible to conclude that classifying the classes with less representation is a challenge, but that training-set augmentation techniques can substantially improve the results obtained. Using a multistage classification strategy also improves the results. The best models for classification were SVM and BERT.
2022-12-27T15:37:22Z
2022-12-02T00:00:00Z
2022-12-02
2022-10
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/masterThesis
application/pdf
http://hdl.handle.net/10071/26805
TID:203129750
eng
Caldeira, Francisco Miguel Silva
info:eu-repo/semantics/openAccess
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
2023-11-09T17:44:24Z
oai:repositorio.iscte-iul.pt:10071/26805
Portal Agregador
ONG
https://www.rcaap.pt/oai/openaire
opendoar:7160
2024-03-19T22:21:04.551928
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
false
dc.title.none.fl_str_mv Automatic classification of complaints from public administration
title Automatic classification of complaints from public administration
spellingShingle Automatic classification of complaints from public administration
Caldeira, Francisco Miguel Silva
Text classification
Natural language processing
Machine learning
BERT
Classificação de texto
Processamento de linguagem natural
title_short Automatic classification of complaints from public administration
title_full Automatic classification of complaints from public administration
title_fullStr Automatic classification of complaints from public administration
title_full_unstemmed Automatic classification of complaints from public administration
title_sort Automatic classification of complaints from public administration
author Caldeira, Francisco Miguel Silva
author_facet Caldeira, Francisco Miguel Silva
author_role author
dc.contributor.author.fl_str_mv Caldeira, Francisco Miguel Silva
dc.subject.por.fl_str_mv Text classification
Natural language processing
Machine learning
BERT
Classificação de texto
Processamento de linguagem natural
topic Text classification
Natural language processing
Machine learning
BERT
Classificação de texto
Processamento de linguagem natural
publishDate 2022
dc.date.none.fl_str_mv 2022-12-27T15:37:22Z
2022-12-02T00:00:00Z
2022-12-02
2022-10
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10071/26805
TID:203129750
url http://hdl.handle.net/10071/26805
identifier_str_mv TID:203129750
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799134771759022080