Comparison of anomaly detection techniques applied to different problems in the telecom industry

Bibliographic details
Main author: Rechena, Pedro Miguel David
Publication date: 2021
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10362/127796
Abstract: With the growth of digital transformation in companies, a huge amount of data is generated every second by a variety of processes. This data often contains important information which, when properly analyzed, can help a company gain a competitive advantage. One data processing task common to many applications is the detection of anomalies, that is, data points or groups of data points that stand out from most of the others. Because data volumes are generally too large for an operator to inspect constantly for anomalous values, this dissertation explores the area of Data Mining known as anomaly detection. We first develop anomaly detection software in Python that applies 10 different anomaly detection algorithms to an arbitrary dataset, after automatically optimizing their parameters. Before applying these algorithms, the software scales the data and imputes missing values. It outputs the performance metrics of each algorithm, the values of the optimized parameters, and plots for visualizing the results, generated using t-SNE. The software was then applied to three case studies to compare the performance of different anomaly detection approaches on real-world datasets. These datasets present an increasing level of difficulty, in terms of both the amount of missing data and the uncertainty of the ground truth regarding the anomalies. In the first case study, we detected fraudulent bank transactions using a public dataset. In the second, we identified clients of a telecommunications company who were likely to miss their payments, leading to contract termination, using a dataset from a telecommunications company. In the third, we detected low quality of internet service, again using a large dataset of real measurements from a telecommunications company. Finally, we implemented a state-of-the-art neural network model especially suited to identifying anomalies in time-series data. We optimized the parameters of the network and applied it to the problem of low quality of service.
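The workflow described in the abstract (imputation, scaling, several detectors, parameter selection, performance metrics) can be sketched roughly as follows. This is an illustrative sketch only, not the dissertation's software: the abstract does not name its 10 algorithms, so Isolation Forest and Local Outlier Factor from scikit-learn are assumed stand-ins, the dataset is synthetic, and selecting the `contamination` parameter by F1 score against known labels is an assumed tuning strategy. The helper names `make_toy_data`, `best_f1` and `compare_detectors` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_toy_data(seed=0):
    """Synthetic stand-in for a real dataset (assumption, not thesis data)."""
    rng = np.random.default_rng(seed)
    inliers = rng.normal(0.0, 1.0, size=(500, 4))
    outliers = rng.uniform(-8.0, 8.0, size=(25, 4))  # scattered anomalies
    X = np.vstack([inliers, outliers])
    y = np.r_[np.zeros(500, dtype=int), np.ones(25, dtype=int)]  # 1 = anomaly
    X[rng.random(X.shape) < 0.05] = np.nan  # roughly 5% missing values
    return X, y


def best_f1(make_detector, grid, X, y):
    """Try each contamination value; return (best F1, best contamination)."""
    best_score, best_param = -1.0, None
    for contamination in grid:
        # Impute missing values, scale, then run the detector.
        model = make_pipeline(
            SimpleImputer(strategy="median"),
            StandardScaler(),
            make_detector(contamination),
        )
        labels = model.fit_predict(X)       # scikit-learn marks outliers as -1
        pred = (labels == -1).astype(int)   # convert to 1 = anomaly
        score = f1_score(y, pred)
        if score > best_score:
            best_score, best_param = score, contamination
    return best_score, best_param


def compare_detectors(X, y, grid=(0.02, 0.05, 0.10)):
    """Run every assumed detector and collect its best score and parameter."""
    detectors = {
        "IsolationForest": lambda c: IsolationForest(contamination=c, random_state=0),
        "LocalOutlierFactor": lambda c: LocalOutlierFactor(n_neighbors=35, contamination=c),
    }
    return {name: best_f1(factory, grid, X, y) for name, factory in detectors.items()}


if __name__ == "__main__":
    X, y = make_toy_data()
    for name, (score, param) in compare_detectors(X, y).items():
        print(f"{name}: best F1 = {score:.2f} at contamination = {param}")
```

Tuning against ground-truth labels, as done here, is only possible when labels are reliable. The abstract notes that ground-truth uncertainty grows across the three case studies, so a real comparison would need a selection criterion that tolerates noisy or missing labels.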
id RCAP_8b38625a51fbc4775b175f59af418d89
oai_identifier_str oai:run.unl.pt:10362/127796
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str 7160
dc.title.none.fl_str_mv Comparison of anomaly detection techniques applied to different problems in the telecom industry
author Rechena, Pedro Miguel David
author_facet Rechena, Pedro Miguel David
author_role author
dc.contributor.none.fl_str_mv Bernardo, Luís
Zejnilović, Sabina
RUN
dc.contributor.author.fl_str_mv Rechena, Pedro Miguel David
dc.subject.por.fl_str_mv Anomaly Detection
Machine learning
Unsupervised Learning
Time series
LSTM
Domain/Scientific Area::Engineering and Technology::Electrical, Electronic and Computer Engineering
publishDate 2021
dc.date.none.fl_str_mv 2021-11-16T17:08:01Z
2021-02
2021-02-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/127796
url http://hdl.handle.net/10362/127796
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799138065433755648