
anomaly-detection-log-datasets

This repository contains scripts to analyze publicly available log data sets (HDFS, BGL, OpenStack, Hadoop, Thunderbird, ADFA, AWSCTD) that are commonly used to evaluate sequence-based anomaly detection techniques. The following sections show how to get the data sets, parse and group them into sequences of event types, and apply some basic anomaly detection techniques. If you use any of the resources provided in this repository, please cite the publications stated at the end of this README.

The repository comes with some pre-processed samples in each data set directory, which make it possible to get started without having to download all the data sets. These files are named <dataset>_train (which contains approximately 1% of all normal log sequences for training), <dataset>_test_normal (which contains the remaining normal log sequences for testing), and <dataset>_test_abnormal (which contains all anomalous log sequences). Running the anomaly detection techniques on these samples yields the F1 scores reported in the repository's results table (averaged over 25 runs; highest score in bold; maximum in brackets).

Getting the data sets

HDFS

There are three versions of this data set: the original logs from Wei Xu et al. and two alternative versions from Loghub and LogDeep.

Original logs from Wei Xu et al.

The original logs can be retrieved from Wei Xu's website as follows.

cd hdfs_xu/
wget http://iiis.tsinghua.edu.cn/~weixu/demobuild.zip
unzip demobuild.zip
gunzip -c data/online1/lg/sorted.log.gz > sorted.log

For more information on this data set, see

  • Xu, W., Huang, L., Fox, A., Patterson, D., & Jordan, M. I. (2009, October). Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (pp. 117-132).

Loghub

Another version of this data set is provided by Loghub. Note that this version appears to be missing some lines that are present in the original logs.

cd hdfs_loghub/
wget https://zenodo.org/record/3227177/files/HDFS_1.tar.gz
tar -xvf HDFS_1.tar.gz 

For more information on Loghub, see

  • He, S., Zhu, J., He, P., & Lyu, M. R. (2020). Loghub: a large collection of system log datasets towards automated log analytics. arXiv preprint arXiv:2008.06448.

LogDeep

A pre-processed version of the data set is provided in the LogDeep repository. Note that timestamps and sequence identifiers are missing from this data set.

cd hdfs_logdeep/
git clone https://github.com/donglee-afar/logdeep.git
mv logdeep/data/hdfs/hdfs_t* .

BGL

There are two versions of this data set: CFDR and Loghub.

CFDR

The original logs from the Computer Failure Data Repository can be retrieved as follows.

cd bgl_cfdr/
wget http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/hpc4/bgl2.gz
gunzip bgl2.gz

For more information on this data set, see

  • Oliner, A., & Stearley, J. (2007, June). What supercomputers say: A study of five system logs. In 37th annual IEEE/IFIP international conference on dependable systems and networks (DSN'07) (pp. 575-584). IEEE.

Loghub

An alternative version of this data set is provided by Loghub. Note that some logs carry different labels in this version of the data set.

cd bgl_loghub/
wget https://zenodo.org/record/3227177/files/BGL.tar.gz
tar -xvf BGL.tar.gz

For more information on Loghub, see

  • He, S., Zhu, J., He, P., & Lyu, M. R. (2020). Loghub: a large collection of system log datasets towards automated log analytics. arXiv preprint arXiv:2008.06448.

OpenStack

There are two versions of this data set: the one provided by Loghub and an updated version by Kalaki et al. that addresses some known problems.

Du et al.

The original OpenStack logs are not available anymore; however, Loghub provides a version of this data set.

cd openstack_loghub/
wget https://zenodo.org/record/3227177/files/OpenStack.tar.gz
tar -xvf OpenStack.tar.gz

For more information on this data set, see

  • Du, M., Li, F., Zheng, G., & Srikumar, V. (2017, October). Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security (pp. 1285-1298).
  • He, S., Zhu, J., He, P., & Lyu, M. R. (2020). Loghub: a large collection of system log datasets towards automated log analytics. arXiv preprint arXiv:2008.06448.

Kalaki et al.

Since the original logs are known to be difficult to use for anomaly detection, Kalaki et al. provide an updated version.

cd openstack_parisakalaki/
git clone https://github.com/ParisaKalaki/openstack-logs.git

For more information on this data set, see

  • Kalaki, P. S., Shameli‐Sendi, A., & Abbasi, B. K. E. (2023). Anomaly detection on OpenStack logs based on an improved robust principal component analysis model and its projection onto column space. Software: Practice and Experience, 53(3), 665-681.

Hadoop

The original Hadoop logs are not available anymore; however, Loghub provides a version of this data set.

cd hadoop_loghub/
wget https://zenodo.org/record/3227177/files/Hadoop.tar.gz
mkdir logs
tar -xvf Hadoop.tar.gz -C logs

For more information on this data set, see

  • Lin, Q., Zhang, H., Lou, J. G., Zhang, Y., & Chen, X. (2016, May). Log clustering based problem identification for online service systems. In Proceedings of the 38th International Conference on Software Engineering Companion (pp. 102-111).
  • He, S., Zhu, J., He, P., & Lyu, M. R. (2020). Loghub: a large collection of system log datasets towards automated log analytics. arXiv preprint arXiv:2008.06448.

Thunderbird

The original logs from the Computer Failure Data Repository can be retrieved as follows.

cd thunderbird_cfdr/
wget http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/hpc4/tbird2.gz
gunzip tbird2.gz

For more information on this data set, see

  • Oliner, A., & Stearley, J. (2007, June). What supercomputers say: A study of five system logs. In 37th annual IEEE/IFIP international conference on dependable systems and networks (DSN'07) (pp. 575-584). IEEE.

ADFA

The verazuo repository provides a labeled version of this data set.

cd adfa_verazuo/
git clone https://github.com/verazuo/a-labelled-version-of-the-ADFA-LD-dataset
unzip a-labelled-version-of-the-ADFA-LD-dataset/ADFA-LD.zip -d .

For more information on this data set, see

  • Creech, G., & Hu, J. (2013, April). Generation of a new IDS test dataset: Time to retire the KDD collection. In 2013 IEEE Wireless Communications and Networking Conference (WCNC) (pp. 4487-4492). IEEE.

AWSCTD

The DjPasco repository provides a labeled version of this data set.

cd awsctd_djpasco/
git clone https://github.com/DjPasco/AWSCTD.git
p7zip -d AWSCTD/CSV.7z

For more information on this data set, see

  • Čeponis, D., & Goranin, N. (2018). Towards a robust method of dataset generation of malicious activity for anomaly-based HIDS training and presentation of AWSCTD dataset. Baltic Journal of Modern Computing, 6(3), 217-234.

Parsing the data sets

To parse the data, run the respective <dataset>_parse.py script. For example, use the following command to parse the HDFS log data set:

python3 hdfs_parse.py

The templates used for parsing are taken from Logpai/Logparser and adapted or extended to make sure that all logs are parsed and that each log event fits only one template. The Thunderbird log data set is an exception: due to its complexity and length, we used our aecid-incremental-clustering and aecid-parsergenerator to generate event templates; however, some of them are overly specific or generic, and log lines may match multiple events. If you create a better templates file for the entire Thunderbird data set, we kindly ask you to contribute to this repository by creating a pull request or to contact us via [email protected].
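
The parsing step can be pictured as matching each log line against a set of event templates and emitting the corresponding event ID. The following minimal sketch assumes templates are given as regular expressions mapped to event IDs; the two templates and their IDs are illustrative and not taken from the actual templates files.

import re

# Illustrative template set: regular expression -> event ID.
TEMPLATES = {
    r"Receiving block blk_\S+ src: \S+ dest: \S+": "5",
    r"PacketResponder \d+ for block blk_\S+ terminating": "21",
}
COMPILED = [(re.compile(pattern), event) for pattern, event in TEMPLATES.items()]

def parse_line(line):
    # Return the event ID of the matching template, or None if the line is unparsed.
    matches = [event for regex, event in COMPILED if regex.fullmatch(line.strip())]
    if len(matches) > 1:
        raise ValueError("line fits more than one template: " + line)
    return matches[0] if matches else None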

Sampling the data sets

Sampling splits the data sets into training files (containing only normal sequences) and test files containing the remaining normal and the anomalous sequences, respectively. There are two ways of sampling the data sets. First, samples can be generated from the parsed logs; this requires that the raw data sets are available and that the <dataset>_parse.py script has been executed as described in the previous section. Second, if the templates have not changed and the sampled files already exist, it is also possible to just shuffle the normal training and test logs and thereby generate new samples.
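
Conceptually, the first approach boils down to the following split, sketched here for parsed sequences given as (sequence, is_anomalous) pairs; names are illustrative and do not mirror sample.py.

import random

def split_sequences(sequences, train_ratio=0.01, seed=None):
    # Split labeled sequences into train, test_normal, and test_abnormal parts.
    rng = random.Random(seed)
    normal = [s for s, anomalous in sequences if not anomalous]
    abnormal = [s for s, anomalous in sequences if anomalous]
    rng.shuffle(normal)  # random selection, as in the default mode
    n_train = int(len(normal) * train_ratio)  # e.g., 1% of normal sequences
    return normal[:n_train], normal[n_train:], abnormal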

Sample from parsed data sets

Run the sample.py script and specify the directory of the log data to be sampled; the sampling ratio can also be specified. For example, use the following command to sample the HDFS log data set so that the training file comprises 1% of the normal sequences.

python3 sample.py --data_dir hdfs_xu --train_ratio 0.01

This will generate the files <dataset>_train, <dataset>_test_normal, and <dataset>_test_abnormal in the respective directory. If fine-granular anomaly labels are available, use --anomaly_types True to also generate the files <dataset>_test_abnormal_<anomaly>, which contain only those sequences that correspond to the respective anomaly class. Use the sample_ratio parameter if only a randomly sampled fraction of all (both normal and anomalous) sequences should be used. Use the time_window parameter if time windows should be used for grouping instead of sequence identifiers; e.g., --time_window 3600 generates sequences by grouping events into time windows of 1 hour, independently of any available sequence identifiers. By default, random sequences are selected for training; if only the first ones (i.e., the ones that occur first in parsed.csv) should be used, set the --sort chronological parameter.
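
The time-window grouping can be sketched as follows, assuming parsed events are available as (timestamp, event_id) pairs with timestamps in seconds; the function name is illustrative.

from collections import defaultdict

def group_by_time_window(events, window=3600):
    # Assign each event to a fixed-size window based on its timestamp.
    windows = defaultdict(list)
    for timestamp, event_id in events:
        windows[int(timestamp) // window].append(event_id)
    # Return the event sequences in chronological window order.
    return [tuple(windows[key]) for key in sorted(windows)]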

Shuffle existing samples

Running the sample_shuffle.py script is a faster way of generating samples; it requires that correctly generated samples are already available. The script is executed with the following command.

python3 sample_shuffle.py --data_dir hdfs_xu --train_ratio 0.01

This will read in all normal sequences from <dataset>_train and <dataset>_test_normal, shuffle them, and overwrite both files with new samples.
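
In essence, the script re-draws the split from the combined pool of normal sequences, as in this sketch (file handling simplified; train_ratio corresponds to the command-line parameter above):

import random

def shuffle_samples(train_path, test_normal_path, train_ratio=0.01):
    # Pool all normal sequences, shuffle, and overwrite both sample files.
    with open(train_path) as f_train, open(test_normal_path) as f_test:
        normal = f_train.readlines() + f_test.readlines()
    random.shuffle(normal)
    n_train = int(len(normal) * train_ratio)
    with open(train_path, "w") as f_train:
        f_train.writelines(normal[:n_train])
    with open(test_normal_path, "w") as f_test:
        f_test.writelines(normal[n_train:])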

Analyzing the data sets

Run the analysis script to output some basic information about the data sets, specifically regarding the distributions of normal and anomalous samples. The script will also show the most frequent normal and anomalous sequences and count vectors; the number of displayed samples is specified with the show_samples parameter.

python3 analyze.py --data_dir hdfs_xu --show_samples 3

Load parsed sequences ...
Parsed lines total: 12580989
Parsed lines normal: 12255230 (97.4%)
Parsed lines anomalous: 325759 (2.6%)
Event types total: 33
Event types normal: 20 (60.6%)
Event types anomalous: 32 (97.0%)
Sequences total: 575061
Sequences normal: 558223 (97.1%)
Sequences anomalous: 16838 (2.9%)
Unique sequences: 26814 (4.7%)
Unique sequences normal: 21690 (80.9%)
Unique sequences anomalous: 5133 (19.1%)
Sequences labeled normal that also occur as anomalous: 14 (0.003%), 9 unique
Sequences labeled anomalous that also occur as normal: 17 (0.101%), 9 unique
Common normal sequences:
73691: ('5', '5', '5', '22', '11', '9', '11', '9', '11', '9', '26', '26', '26', '23', '23', '23', '30', '30', '30', '21', '21', '21')
40156: ('5', '22', '5', '5', '11', '9', '11', '9', '11', '9', '26', '26', '26', '23', '23', '23', '30', '30', '30', '21', '21', '21')
37788: ('5', '5', '22', '5', '11', '9', '11', '9', '11', '9', '26', '26', '26', '23', '23', '23', '30', '30', '30', '21', '21', '21')
Common anomalous sequences:
1643: ('5', '22')
1361: ('22', '5', '5', '7')
1307: ('22', '5')
Unique count vectors: 666 (2.5% of unique sequences or 0.1% of all sequences)
Unique count vectors normal: 257 (38.6%)
Unique count vectors anomalous: 418 (62.8%)
Count vectors labeled normal that also occur as anomalous: 230 (0.041%), 9 unique
Count vectors labeled anomalous that also occur as normal: 316 (1.877%), 9 unique
Common normal count vectors:
300011: (('11', 3), ('21', 3), ('22', 1), ('23', 3), ('26', 3), ('30', 3), ('5', 3), ('9', 3))
96316: (('11', 3), ('22', 1), ('26', 3), ('5', 3), ('9', 3))
21127: (('11', 3), ('21', 3), ('22', 1), ('23', 3), ('26', 3), ('3', 1), ('30', 3), ('4', 2), ('5', 3), ('9', 3))
Common anomalous count vectors:
3225: (('22', 1), ('5', 2), ('7', 1))
3182: (('11', 3), ('20', 1), ('21', 3), ('22', 1), ('23', 3), ('26', 3), ('30', 3), ('5', 3), ('9', 3))
2950: (('22', 1), ('5', 1))
Number of distinct events following any event in normal sequences: Average: 8.86 Stddev: 4.91
Number of distinct events following any event in all sequences: Average: 10.03 Stddev: 6.96
Processed events: 12580989
Lempel-Ziv complexity: 70847
Number of bits to represent all sequences before encoding: 75485934.0
Number of bits to represent all sequences after encoding: 7931677.0
Compression ratio: 89.49%
Entropy of ngrams:
 - n=1: Number of 1-grams: 33, H=3.24, H_norm=0.64
 - n=2: Number of 2-grams: 319, H=4.38, H_norm=0.43
 - n=3: Number of 3-grams: 1257, H=5.41, H_norm=0.36
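
The count vectors shown above record how often each event type occurs in a sequence, irrespective of order. They can be computed with a few lines; this sketch reproduces the display format of the analysis output but is not the script's exact code.

from collections import Counter

def count_vector(sequence):
    # Map an event sequence to a sorted tuple of (event, count) pairs.
    return tuple(sorted(Counter(sequence).items()))

print(count_vector(('5', '22', '5', '5', '11', '9', '11', '9', '11', '9')))
# (('11', 3), ('22', 1), ('5', 3), ('9', 3))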

Evaluation of anomaly detection techniques

Sequence-based detection

The evaluate.py script provides some basic anomaly detection mechanisms, in particular, detection based on new event types, sequence lengths, event count vectors, n-grams, edit distance, and event inter-arrival times. For time-based detection it is necessary to download the data sets and run the respective <dataset>_parse.py script, because timestamp information is not available in the pre-processed logs. If the parsed.csv file is available, the time-based detector can be enabled by setting --time_detection True. The following output shows the results of running the evaluation script on the HDFS logs, where a maximum F1 score of 95.76% is achieved by detection based on count vectors; a sketch of the simplest of these detectors follows the output.

python3 evaluate.py --data_dir hdfs_xu

New event detection
 Time=0.44497013092041016
 TP=6065
 FP=95
 TN=552545
 FN=10773
 TPR=R=0.36019717306093363
 FPR=0.0001719021424435437
 TNR=0.9998280978575564
 P=0.984577922077922
 F1=0.5274371684494303
 ACC=0.980915856275396
 MCC=0.5895657230081826

Sequence length detection
 Time=0.11799049377441406
 TP=6232
 FP=56
 TN=552584
 FN=10606
 TPR=R=0.37011521558379856
 FPR=0.0001013317892298784
 TNR=0.9998986682107701
 P=0.9910941475826972
 F1=0.5389604773847617
 ACC=0.9812775910570734
 MCC=0.5997920384634181

New events + sequence length detection
 Time=0.5629606246948242
 TP=9034
 FP=148
 TN=552492
 FN=7804
 TPR=R=0.5365245278536643
 FPR=0.00026780544296467864
 TNR=0.9997321945570353
 P=0.9838815072968852
 F1=0.6943889315910838
 ACC=0.9860363350296236
 MCC=0.7212100247967879

Count vector clustering
 Threshold=0.06
 Time=3.152735471725464
 TP=16674
 FP=1373
 TN=551267
 FN=164
 TPR=R=0.9902601259056895
 FPR=0.0024844383323682686
 TNR=0.9975155616676318
 P=0.9239208732753367
 F1=0.9559409488318762
 ACC=0.9973010370901071
 MCC=0.9551611400464348

Count vector clustering with idf
 Threshold=0.11
 Time=3.234915018081665
 TP=16646
 FP=1283
 TN=551357
 FN=192
 TPR=R=0.9885972205725145
 FPR=0.002321583671105964
 TNR=0.9976784163288941
 P=0.9284399576105751
 F1=0.9575747116518538
 ACC=0.9974099087234274
 MCC=0.9567415416969752

2-gram detection
 Threshold=0.02
 Time=4.8256494998931885
 TP=13535
 FP=1311
 TN=551329
 FN=3303
 TPR=R=0.8038365601615394
 FPR=0.002372249565720903
 TNR=0.9976277504342791
 P=0.9116933854236832
 F1=0.8543744476707487
 ACC=0.9918978432880639
 MCC=0.8520074824446313

3-gram detection
 Threshold=0.02
 Time=4.991070032119751
 TP=15715
 FP=3273
 TN=549367
 FN=1123
 TPR=R=0.9333056182444471
 FPR=0.005922481181239143
 TNR=0.9940775188187608
 P=0.8276279755635138
 F1=0.8772958186791715
 ACC=0.9922806499987709
 MCC=0.875006494847835

10-gram detection
 Time=5.027217626571655
 TP=16838
 FP=552640
 TN=0
 FN=0
 TPR=R=1.0
 FPR=1.0
 TNR=0.0
 P=0.029567428416901093
 F1=0.05743660415202723
 ACC=0.029567428416901093
 MCC=inf

2-gram + sequence length detection
 Time=4.9436399936676025
 TP=15183
 FP=1358
 TN=551282
 FN=1655
 TPR=R=0.9017104169141228
 FPR=0.0024572958888245513
 TNR=0.9975427041111754
 P=0.9179009733389759
 F1=0.9097336648791157
 ACC=0.9947091898194487
 MCC=0.9070467207448273

New events + sequence length detection + count vector clustering
 Threshold=0.06
 Time=3.715696096420288
 TP=16705
 FP=1381
 TN=551259
 FN=133
 TPR=R=0.9921011996674189
 FPR=0.0024989143022582515
 TNR=0.9975010856977418
 P=0.9236425964834679
 F1=0.9566487229412438
 ACC=0.9973414249540807
 MCC=0.9559289328409916

New events + sequence length detection + count vector clustering with idf
 Threshold=0.11
 Time=3.7978756427764893
 TP=16659
 FP=1296
 TN=551344
 FN=179
 TPR=R=0.9893692837629172
 FPR=0.002345107122177186
 TNR=0.9976548928778228
 P=0.9278195488721804
 F1=0.9576064150834938
 ACC=0.9974099087234274
 MCC=0.9567967296426484

Edit distance detection
 Threshold=0.19
 Time=38.88217902183533
 TP=9699
 FP=543
 TN=552097
 FN=7139
 TPR=R=0.5760185295165696
 FPR=0.0009825564562825709
 TNR=0.9990174435437175
 P=0.9469830111306385
 F1=0.7163220088626292
 ACC=0.9865104534327929
 MCC=0.7329451552899283

New events + sequence length detection + edit distance
 Threshold=0.21
 Time=39.44513964653015
 TP=12231
 FP=516
 TN=552124
 FN=4607
 TPR=R=0.726392683216534
 FPR=0.0009337000579038796
 TNR=0.9990662999420962
 P=0.9595198870322429
 F1=0.8268379246239649
 ACC=0.9910040422983856
 MCC=0.8307160056585571
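
To illustrate the simplest of these techniques: detection based on new event types flags a test sequence as anomalous if it contains an event type that never occurs in the training data. A minimal sketch, not the exact evaluate.py implementation:

def detect_new_events(train_sequences, test_sequences):
    # Collect all event types seen during training.
    known_events = {event for sequence in train_sequences for event in sequence}
    # Flag each test sequence that contains an unseen event type.
    return [any(event not in known_events for event in sequence)
            for sequence in test_sequences]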

Event-based detection

Event-based detection requires that the data sets have been downloaded and that the parsed.csv files have been created. The following command can then be used to evaluate event-based detection.

python3 evaluate_events.py --data_dir bgl_cfdr

Read in parsed events ...
Randomly selecting 43992 events from 4399265 normal events for training
Testing 4355273 normal events
Testing 348698 anomalous events

New events
 Time=0.9108223915100098
 TP=348532
 FP=4912
 TN=4350361
 FN=166
 TPR=R=0.9995239433549948
 FPR=0.0011278282670225265
 TNR=0.9988721717329775
 P=0.9861024660200768
 F1=0.9927678446809904
 ACC=0.9989204865421152
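
The metrics reported in the outputs above follow the standard confusion-matrix definitions and can be recomputed from the TP, FP, TN, and FN counts, for example:

def metrics(tp, fp, tn, fn):
    # Standard detection metrics derived from the confusion matrix.
    r = tp / (tp + fn)   # recall / true positive rate (TPR)
    p = tp / (tp + fp)   # precision
    return {
        "TPR": r,
        "FPR": fp / (fp + tn),
        "P": p,
        "F1": 2 * p * r / (p + r),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
    }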

Citation

If you use any resources from this repository, please cite the following publications:

  • Landauer, M., Skopik, F., & Wurzenberger, M. (2023). A Critical Review of Common Log Data Sets Used for Evaluation of Sequence-based Anomaly Detection Techniques. arXiv preprint arXiv:2309.02854. [PDF]
  • Landauer, M., Onder, S., Skopik, F., & Wurzenberger, M. (2023). Deep Learning for Anomaly Detection in Log Data: A Survey. Machine Learning with Applications, 12, 100470, Elsevier. [PDF]
