Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets

Henriques, João; Caldeira, Filipe; Cruz, Tiago; Simões, Paulo

Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets

Detalhes bibliográficos
Autor(a) principal:	Henriques, João
Data de Publicação:	2020
Outros Autores:	Caldeira, Filipe, Cruz, Tiago, Simões, Paulo
Tipo de documento:	Artigo
Idioma:	eng
Título da fonte:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Texto Completo:	http://hdl.handle.net/10400.19/7410
Resumo:	Abstract: Computing and networking systems traditionally record their activity in log files, which have been used for multiple purposes, such as troubleshooting, accounting, post-incident analysis of security breaches, capacity planning and anomaly detection. In earlier systems those log files were processed manually by system administrators, or with the support of basic applications for filtering, compiling and pre-processing the logs for specific purposes. However, as the volume of these log files continues to grow (more logs per system, more systems per domain), it is becoming increasingly difficult to process those logs using traditional tools, especially for less straightforward purposes such as anomaly detection. On the other hand, as systems continue to become more complex, the potential of using large datasets built of logs from heterogeneous sources for detecting anomalies without prior domain knowledge becomes higher. Anomaly detection tools for such scenarios face two challenges. First, devising appropriate data analysis solutions for effectively detecting anomalies from large data sources, possibly without prior domain knowledge. Second, adopting data processing platforms able to cope with the large datasets and complex data analysis algorithms required for such purposes. In this paper we address those challenges by proposing an integrated scalable framework that aims at efficiently detecting anomalous events on large amounts of unlabeled data logs. Detection is supported by clustering and classification methods that take advantage of parallel computing environments. We validate our approach using the the well known NASA Hypertext Transfer Protocol (HTTP) logs datasets. Fourteen features were extracted in order to train a k-means model for separating anomalous and normal events in highly coherent clusters. A second model, making use of the XGBoost system implementing a gradient tree boosting algorithm, uses the previous binary clustered data for producing a set of simple interpretable rules. These rules represent the rationale for generalizing its application over a massive number of unseen events in a distributed computing environment. The classified anomaly events produced by our framework can be used, for instance, as candidates for further forensic and compliance auditing analysis in security management.

Metadados do item

id	RCAP_b2ab9d25c05f70dbc530c8c19bdc93e6
oai_identifier_str	oai:repositorio.ipv.pt:10400.19/7410
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
spelling	Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasetsanomaly detectionclusteringk-meansgradient tree boostingXGBoostAbstract: Computing and networking systems traditionally record their activity in log files, which have been used for multiple purposes, such as troubleshooting, accounting, post-incident analysis of security breaches, capacity planning and anomaly detection. In earlier systems those log files were processed manually by system administrators, or with the support of basic applications for filtering, compiling and pre-processing the logs for specific purposes. However, as the volume of these log files continues to grow (more logs per system, more systems per domain), it is becoming increasingly difficult to process those logs using traditional tools, especially for less straightforward purposes such as anomaly detection. On the other hand, as systems continue to become more complex, the potential of using large datasets built of logs from heterogeneous sources for detecting anomalies without prior domain knowledge becomes higher. Anomaly detection tools for such scenarios face two challenges. First, devising appropriate data analysis solutions for effectively detecting anomalies from large data sources, possibly without prior domain knowledge. Second, adopting data processing platforms able to cope with the large datasets and complex data analysis algorithms required for such purposes. In this paper we address those challenges by proposing an integrated scalable framework that aims at efficiently detecting anomalous events on large amounts of unlabeled data logs. Detection is supported by clustering and classification methods that take advantage of parallel computing environments. We validate our approach using the the well known NASA Hypertext Transfer Protocol (HTTP) logs datasets. Fourteen features were extracted in order to train a k-means model for separating anomalous and normal events in highly coherent clusters. A second model, making use of the XGBoost system implementing a gradient tree boosting algorithm, uses the previous binary clustered data for producing a set of simple interpretable rules. These rules represent the rationale for generalizing its application over a massive number of unseen events in a distributed computing environment. The classified anomaly events produced by our framework can be used, for instance, as candidates for further forensic and compliance auditing analysis in security management.Repositório Científico do Instituto Politécnico de ViseuHenriques, JoãoCaldeira, FilipeCruz, TiagoSimões, Paulo2022-11-18T11:34:36Z2020-07-172022-11-15T18:48:22Z2020-07-17T00:00:00Zinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/articleapplication/pdfhttp://hdl.handle.net/10400.19/7410engHenriques J, Caldeira F, Cruz T, Simões P. Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets. Electronics. 2020; 9(7):1164. https://doi.org/10.3390/electronics9071164cv-prod-199521410.3390/electronics9071164info:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2023-07-08T02:30:36Zoai:repositorio.ipv.pt:10400.19/7410Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-19T16:45:08.400863Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse
dc.title.none.fl_str_mv	Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets
title	Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets
spellingShingle	Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets Henriques, João anomaly detection clustering k-means gradient tree boosting XGBoost
title_short	Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets
title_full	Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets
title_fullStr	Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets
title_full_unstemmed	Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets
title_sort	Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets
author	Henriques, João
author_facet	Henriques, João Caldeira, Filipe Cruz, Tiago Simões, Paulo
author_role	author
author2	Caldeira, Filipe Cruz, Tiago Simões, Paulo
author2_role	author author author
dc.contributor.none.fl_str_mv	Repositório Científico do Instituto Politécnico de Viseu
dc.contributor.author.fl_str_mv	Henriques, João Caldeira, Filipe Cruz, Tiago Simões, Paulo
dc.subject.por.fl_str_mv	anomaly detection clustering k-means gradient tree boosting XGBoost
topic	anomaly detection clustering k-means gradient tree boosting XGBoost
description	Abstract: Computing and networking systems traditionally record their activity in log files, which have been used for multiple purposes, such as troubleshooting, accounting, post-incident analysis of security breaches, capacity planning and anomaly detection. In earlier systems those log files were processed manually by system administrators, or with the support of basic applications for filtering, compiling and pre-processing the logs for specific purposes. However, as the volume of these log files continues to grow (more logs per system, more systems per domain), it is becoming increasingly difficult to process those logs using traditional tools, especially for less straightforward purposes such as anomaly detection. On the other hand, as systems continue to become more complex, the potential of using large datasets built of logs from heterogeneous sources for detecting anomalies without prior domain knowledge becomes higher. Anomaly detection tools for such scenarios face two challenges. First, devising appropriate data analysis solutions for effectively detecting anomalies from large data sources, possibly without prior domain knowledge. Second, adopting data processing platforms able to cope with the large datasets and complex data analysis algorithms required for such purposes. In this paper we address those challenges by proposing an integrated scalable framework that aims at efficiently detecting anomalous events on large amounts of unlabeled data logs. Detection is supported by clustering and classification methods that take advantage of parallel computing environments. We validate our approach using the the well known NASA Hypertext Transfer Protocol (HTTP) logs datasets. Fourteen features were extracted in order to train a k-means model for separating anomalous and normal events in highly coherent clusters. A second model, making use of the XGBoost system implementing a gradient tree boosting algorithm, uses the previous binary clustered data for producing a set of simple interpretable rules. These rules represent the rationale for generalizing its application over a massive number of unseen events in a distributed computing environment. The classified anomaly events produced by our framework can be used, for instance, as candidates for further forensic and compliance auditing analysis in security management.
publishDate	2020
dc.date.none.fl_str_mv	2020-07-17 2020-07-17T00:00:00Z 2022-11-18T11:34:36Z 2022-11-15T18:48:22Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10400.19/7410
url	http://hdl.handle.net/10400.19/7410
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	Henriques J, Caldeira F, Cruz T, Simões P. Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets. Electronics. 2020; 9(7):1164. https://doi.org/10.3390/electronics9071164 cv-prod-1995214 10.3390/electronics9071164
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799130922772070400

Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets

Registros relacionados