Comparison of anomaly detection techniques applied to different problems in the telecom industry
Main author: | Rechena, Pedro Miguel David |
---|---|
Publication date: | 2021 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10362/127796 |
Abstract: | With the growth of digital transformation in companies, a huge amount of data is generated every second as a result of various processes. This data often contains important information which, when properly analyzed, can help a company gain a competitive advantage. One data processing task common to many applications is the detection of anomalies, that is, data points or groups of data points that stand out from most of the others. Because the volumes of data are generally too large for an operator to inspect continuously for anomalous values, this dissertation explores the area of Data Mining known as anomaly detection. We first develop an anomaly detection tool in Python that applies 10 different anomaly detection algorithms to an arbitrary dataset after automatically optimizing their parameters. Before applying these algorithms, the tool scales the data and imputes missing values. It outputs the performance metrics of each algorithm, the values of the optimized parameters, and plots of the results generated with t-SNE. The tool was then applied to three case studies to compare the performance of different anomaly detection approaches on real-world datasets. The datasets are of increasing difficulty in two respects: the amount of missing data and the uncertainty of the ground truth regarding the anomalies. In the first case study, we detected fraudulent bank transactions using a public dataset. In the second, we used a dataset from a telecommunications company to identify clients who were likely to miss a payment, leading to contract termination. In the third, we detected low quality of internet service, again using a large dataset of real measurements from a telecommunications company. Finally, we implemented a state-of-the-art neural network model designed for identifying anomalies in time-series data, optimized its parameters, and applied it to the low quality of service problem. |
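The workflow summarized in the abstract (impute missing values, scale, run several detectors, report a metric per detector) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: this record does not name the ten algorithms, so scikit-learn's `IsolationForest` and `LocalOutlierFactor` stand in for them, the data is synthetic, the helper names `make_toy_data` and `compare_detectors` are hypothetical, and the automatic parameter optimization and t-SNE plots are left out.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_toy_data(seed=0):
    """Synthetic stand-in data: 300 normal points, 15 shifted anomalies, ~5% missing values."""
    rng = np.random.default_rng(seed)
    normal = rng.normal(0.0, 1.0, size=(300, 4))
    anomalies = rng.normal(8.0, 1.0, size=(15, 4))
    X = np.vstack([normal, anomalies])
    y = np.r_[np.zeros(300, dtype=int), np.ones(15, dtype=int)]  # 1 = anomaly
    X[rng.random(X.shape) < 0.05] = np.nan  # simulate missing measurements
    return X, y


def compare_detectors(X, y, contamination=0.05):
    """Impute and scale X, then return the F1 score of each stand-in detector."""
    # Preprocessing as described in the abstract: imputation, then scaling.
    preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    X_clean = preprocess.fit_transform(X)

    # Two illustrative detectors; the dissertation's own set of ten is not listed here.
    detectors = {
        "isolation_forest": IsolationForest(
            contamination=contamination, random_state=0
        ),
        "local_outlier_factor": LocalOutlierFactor(
            n_neighbors=20, contamination=contamination
        ),
    }

    scores = {}
    for name, detector in detectors.items():
        pred = detector.fit_predict(X_clean)  # -1 marks an anomaly, +1 a normal point
        scores[name] = f1_score(y, (pred == -1).astype(int))
    return scores


if __name__ == "__main__":
    X, y = make_toy_data()
    for name, score in compare_detectors(X, y).items():
        print(f"{name}: F1 = {score:.3f}")
```

The ground-truth labels are used only for scoring, not for fitting, which mirrors the unsupervised setting named in the record's subject terms.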
id |
RCAP_8b38625a51fbc4775b175f59af418d89 |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/127796 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Comparison of anomaly detection techniques applied to different problems in the telecom industry |
author |
Rechena, Pedro Miguel David |
author_role |
author |
dc.contributor.none.fl_str_mv |
Bernardo, Luís; Zejnilović, Sabina; RUN |
dc.subject.por.fl_str_mv |
Anomaly Detection; Machine learning; Unsupervised Learning; Time series; LSTM; Domain/Scientific Area::Engineering and Technology::Electrical, Electronic and Computer Engineering |
publishDate |
2021 |
dc.date.none.fl_str_mv |
2021-11-16T17:08:01Z; 2021-02; 2021-02-01T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/127796 |
url |
http://hdl.handle.net/10362/127796 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799138065433755648 |