Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets

Bibliographic details
Main author: Henriques, João
Publication date: 2020
Other authors: Caldeira, Filipe, Cruz, Tiago, Simões, Paulo
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10400.19/7410
Abstract: Computing and networking systems traditionally record their activity in log files, which have been used for multiple purposes, such as troubleshooting, accounting, post-incident analysis of security breaches, capacity planning and anomaly detection. In earlier systems those log files were processed manually by system administrators, or with the support of basic applications for filtering, compiling and pre-processing the logs for specific purposes. However, as the volume of these log files continues to grow (more logs per system, more systems per domain), it is becoming increasingly difficult to process those logs using traditional tools, especially for less straightforward purposes such as anomaly detection. On the other hand, as systems continue to become more complex, the potential of using large datasets built of logs from heterogeneous sources for detecting anomalies without prior domain knowledge becomes higher. Anomaly detection tools for such scenarios face two challenges. First, devising appropriate data analysis solutions for effectively detecting anomalies from large data sources, possibly without prior domain knowledge. Second, adopting data processing platforms able to cope with the large datasets and complex data analysis algorithms required for such purposes. In this paper we address those challenges by proposing an integrated scalable framework that aims at efficiently detecting anomalous events on large amounts of unlabeled data logs. Detection is supported by clustering and classification methods that take advantage of parallel computing environments. We validate our approach using the well-known NASA Hypertext Transfer Protocol (HTTP) log datasets. Fourteen features were extracted in order to train a k-means model for separating anomalous and normal events in highly coherent clusters. A second model, making use of the XGBoost system implementing a gradient tree boosting algorithm, uses the previous binary clustered data for producing a set of simple interpretable rules. These rules represent the rationale for generalizing its application over a massive number of unseen events in a distributed computing environment. The classified anomaly events produced by our framework can be used, for instance, as candidates for further forensic and compliance auditing analysis in security management.
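The two-stage idea in the abstract (cluster unlabeled events into two groups, then learn interpretable rules from the cluster labels) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the two features (requests per minute, error count), the toy data, the farthest-point initialization, and the choice of the smaller cluster as "anomalous" are all hypothetical, and a single-threshold decision stump stands in for the XGBoost gradient tree boosting model so that the example needs only the Python standard library.

```python
# Stage 1: k-means with k=2 over unlabeled events (hypothetical 2-feature toy data).
# Stage 2: a one-rule decision stump trained on the cluster labels.
# The stump is a simplified stand-in for XGBoost, not the paper's model.

def dist2(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans2(points, iters=10):
    """Plain k-means with k=2; returns a cluster index (0 or 1) per point."""
    # Assumed deterministic init: first point, and the point farthest from it.
    first = points[0]
    centers = [first, max(points, key=lambda p: dist2(p, first))]
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                  for p in points]
        # Update step: each center becomes the mean of its members.
        for c in (0, 1):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def fit_stump(points, y):
    """Find the rule 'feature j > t => anomaly' that best reproduces y."""
    best = (0, 0.0, -1.0)  # (feature index, threshold, accuracy)
    for j in range(len(points[0])):
        values = sorted({p[j] for p in points})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between adjacent values
            acc = sum((p[j] > t) == bool(label)
                      for p, label in zip(points, y)) / len(y)
            if acc > best[2]:
                best = (j, t, acc)
    return best

# Hypothetical events: (requests_per_minute, error_count).
normal = [(5 + i % 3, 2 + i % 2) for i in range(95)]
unusual = [(60 + i, 40 + i) for i in range(5)]
events = normal + unusual

labels = kmeans2(events)
# Assumption: anomalies are rare, so the smaller cluster is treated as anomalous.
minority = min((0, 1), key=labels.count)
y = [1 if l == minority else 0 for l in labels]

rule = fit_stump(events, y)
feature, threshold, accuracy = rule

# Apply the learned rule to unseen events.
unseen = [(6, 3), (70, 50)]
predictions = [1 if e[feature] > threshold else 0 for e in unseen]
print(rule, predictions)
```

The point of the second stage is that the learned rule is cheap to evaluate and easy to read, which is what makes it practical to apply to large volumes of unseen events; a real deployment along the lines of the abstract would use the fourteen extracted features and a boosted tree ensemble rather than a single threshold.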
id RCAP_b2ab9d25c05f70dbc530c8c19bdc93e6
oai_identifier_str oai:repositorio.ipv.pt:10400.19/7410
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets
dc.contributor.none.fl_str_mv Repositório Científico do Instituto Politécnico de Viseu
dc.contributor.author.fl_str_mv Henriques, João
Caldeira, Filipe
Cruz, Tiago
Simões, Paulo
dc.subject.por.fl_str_mv anomaly detection
clustering
k-means
gradient tree boosting
XGBoost
publishDate 2020
dc.date.none.fl_str_mv 2020-07-17
2020-07-17T00:00:00Z
2022-11-18T11:34:36Z
2022-11-15T18:48:22Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.19/7410
url http://hdl.handle.net/10400.19/7410
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv Henriques J, Caldeira F, Cruz T, Simões P. Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets. Electronics. 2020; 9(7):1164. https://doi.org/10.3390/electronics9071164
cv-prod-1995214
10.3390/electronics9071164
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799130922772070400