Automatic classification of complaints from public administration

Caldeira, Francisco Miguel Silva

Bibliographic details
Main author:	Caldeira, Francisco Miguel Silva
Publication date:	2022
Document type:	Dissertation
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10071/26805
Abstract:	Complaint management is a problem faced by many organizations: it is vital to customer satisfaction and retention, yet highly dependent on human resources. This work tackles part of the problem by classifying summaries of complaints with machine learning models, so that complaints can be redirected to the appropriate responders. To this end, text mining, and more specifically natural language processing, was used alongside machine learning algorithms for automatic classification. The main challenge of this task relates to the characteristics of real-world datasets, which in this case means being small and highly imbalanced; both can have a large impact on the performance of classification models. The dataset analyzed in this work is relatively small and its labels occur in very different proportions: the three most common labels account for around 95% of the dataset. Two techniques are analyzed: multistage classification, which classifies the more common labels first and the remaining ones in a second step; and generating new artificial examples for some classes via translation into other languages. The classification models explored were k-NN, SVM, Naïve Bayes, boosting, and deep learning approaches, including transformers. Although using summaries generally leads to better results, we also experimented with the full documents, classifying them with the models trained on the summarized documents. Even though the results were not on par with those for the summarized dataset, the experiment gave good results for signaling the most common label of the documents. We conclude that although, as expected, the classes with little representation are hard to classify, the techniques explored helped to boost performance, especially in the classes with a low number of elements. SVM and Transformer-based models outperformed their peers.
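
The multistage idea described in the abstract can be sketched in a few lines. What follows is a minimal illustration, not the implementation used in the thesis. It assumes a tiny hand-written Naive Bayes classifier (one of the model families the abstract lists) in place of the SVM and transformer models, invented toy complaints and labels, and a hypothetical catch-all label "__other__" for the first stage.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TwoStageClassifier:
    """Stage 1: frequent labels plus a catch-all; stage 2: rare labels only."""

    OTHER = "__other__"  # hypothetical catch-all label, not from the thesis

    def __init__(self, frequent_labels):
        self.frequent = set(frequent_labels)

    def fit(self, texts, labels):
        # Stage 1 sees every example, with all rare labels merged into OTHER.
        stage1_labels = [l if l in self.frequent else self.OTHER for l in labels]
        self.stage1 = TinyNaiveBayes().fit(texts, stage1_labels)
        # Stage 2 is trained only on the rare-label examples.
        rare = [(t, l) for t, l in zip(texts, labels) if l not in self.frequent]
        self.stage2 = TinyNaiveBayes().fit(*zip(*rare)) if rare else None
        return self

    def predict(self, text):
        first = self.stage1.predict(text)
        if first != self.OTHER or self.stage2 is None:
            return first
        return self.stage2.predict(text)


# Invented toy data: three frequent labels and two rare ones.
texts = [
    "late tax refund", "tax refund delayed", "tax payment error",
    "hospital waiting time", "hospital appointment cancelled", "long hospital queue",
    "school enrolment refused", "school transport missing", "school meal quality",
    "noise from road works", "water supply cut",
]
labels = ["tax"] * 3 + ["health"] * 3 + ["education"] * 3 + ["noise", "water"]

clf = TwoStageClassifier(frequent_labels=["tax", "health", "education"]).fit(texts, labels)
print(clf.predict("tax refund missing"))
print(clf.predict("water supply noise"))
```

The design point is that the second-stage model never competes with the dominant classes, so the rare labels are separated only from one another. Whether this helps on real data depends on how reliably the first stage routes rare complaints to the catch-all label.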

Item metadata

id	RCAP_b807f04e1c7977c5771c077f4ad7933e
oai_identifier_str	oai:repositorio.iscte-iul.pt:10071/26805
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
spelling	Automatic classification of complaints from public administration | Text classification | Natural language processing | Machine learning | BERT | Classificação de texto (text classification) | Processamento de linguagem natural (natural language processing) | Text classification is an open area of study: depending on the problem, the available data, and the study in question, the best method is not always the same. Within the field of artificial intelligence, and in the case of companies, classifying complaints (as in this work) or even incidents is a task that still requires a great deal of manual work. This work addresses the automatic classification of complaints received by a public institution. In the complaint-handling process, classification is one part of the larger picture, and automating it can greatly speed up the manual processes currently in use. In this context, the complaint summaries were processed and techniques were applied to fit automatic classification models. The dataset is considerably small and shows a large imbalance in the class distribution, with the three largest classes holding close to 95% of the data. To mitigate this problem, two approaches were analyzed: two-stage classification and augmenting the training set with translations of the summaries. Classification models such as k-NN, SVM, Naïve Bayes, boosting, and BERT were used. Using models trained on the summaries, an experiment was also carried out on classifying the full texts of the complaints. Although the results are worse than those obtained with the summarized data, they show some degree of success, especially for classifying the most frequent class. Based on this work, it was possible to conclude that classifying the classes with less representation is a challenge, but that training-set augmentation techniques can substantially improve the results. A multistage classification strategy also improves the results. The best models for classification were SVM and BERT. | 2022-12-27T15:37:22Z | 2022-12-02T00:00:00Z | 2022-12-02 | 2022-10 | info:eu-repo/semantics/publishedVersion | info:eu-repo/semantics/masterThesis | application/pdf | http://hdl.handle.net/10071/26805 | TID:203129750 | eng | Caldeira, Francisco Miguel Silva | info:eu-repo/semantics/openAccess | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) | instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação | instacron:RCAAP | 2023-11-09T17:44:24Z | oai:repositorio.iscte-iul.pt:10071/26805 | Portal Agregador | ONG | https://www.rcaap.pt/oai/openaire | opendoar:7160 | 2024-03-19T22:21:04.551928 | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação | false
dc.title.none.fl_str_mv	Automatic classification of complaints from public administration
title	Automatic classification of complaints from public administration
spellingShingle	Automatic classification of complaints from public administration; Caldeira, Francisco Miguel Silva; Text classification; Natural language processing; Machine learning; BERT; Classificação de texto (text classification); Processamento de linguagem natural (natural language processing)
author	Caldeira, Francisco Miguel Silva
author_facet	Caldeira, Francisco Miguel Silva
author_role	author
dc.contributor.author.fl_str_mv	Caldeira, Francisco Miguel Silva
dc.subject.por.fl_str_mv	Text classification; Natural language processing; Machine learning; BERT; Classificação de texto (text classification); Processamento de linguagem natural (natural language processing)
topic	Text classification; Natural language processing; Machine learning; BERT; Classificação de texto (text classification); Processamento de linguagem natural (natural language processing)
publishDate	2022
dc.date.none.fl_str_mv	2022-12-27T15:37:22Z 2022-12-02T00:00:00Z 2022-12-02 2022-10
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10071/26805 TID:203129750
url	http://hdl.handle.net/10071/26805
identifier_str_mv	TID:203129750
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799134771759022080
