anomaly-detection's Introduction

Anomaly Detection using Spark MLlib and Spark Streaming

An Anomaly Detection example using Spark MLlib for training and Spark Streaming for testing. Slides are available here.

The Model

Anomaly Detection Model

This model uses a K-means (Spark MLlib K-means) approach and is trained on the "normal" dataset only. After the model is trained, the centroid of the "normal" dataset is returned along with a threshold. During the validation stage, any data point that is farther than the threshold from the centroid is considered an "anomaly".
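The decision rule itself is simple. A minimal pure-Scala sketch (the names `distance` and `isAnomaly` are illustrative, not from the project's code):

```scala
// Euclidean distance between two feature vectors.
def distance(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// A point is an anomaly if it lies farther from the trained
// centroid than the threshold.
def isAnomaly(point: Array[Double],
              centroid: Array[Double],
              threshold: Double): Boolean =
  distance(point, centroid) > threshold

val centroid  = Array(0.0, 0.0)
val threshold = 5.0
println(isAnomaly(Array(3.0, 4.0), centroid, threshold)) // distance 5.0 -> false
println(isAnomaly(Array(6.0, 8.0), centroid, threshold)) // distance 10.0 -> true
```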

Dataset

The dataset is downloaded from KDD Cup 1999 Data for Anomaly Detection.

Training Set: The training set contains only the data points from the full dataset that are labeled "normal".

Validation Set: The validation set uses the whole dataset. All data points that are NOT labeled "normal" are considered "anomalies".

Spark

This application is for learning and testing purposes, so the program runs in Spark local mode on a Mac Pro. However, the code should be similar if deployed onto a cluster.

The Code

The code largely follows the tutorial from Sean Owen, Cloudera (Video, Slides-1, Slides-2). A couple of modifications have been made to fit personal interest:

  • Instead of training multiple clusters, the code trains on "normal" data points only
  • Only one cluster center is recorded, and the threshold is set to the distance of the last of the 2000 furthest data points
  • During the later validation stage, all points farther than the threshold are labeled as "anomalies"
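The threshold selection in the second bullet can be sketched in plain Scala (the name `chooseThreshold` is illustrative; in the project the distances would come from an RDD):

```scala
// Given every training point's distance from the centroid, take the
// n-th largest distance as the anomaly threshold (n = 2000 in the
// project), i.e. the last of the n furthest points.
def chooseThreshold(distances: Seq[Double], n: Int = 2000): Double =
  distances.sorted(Ordering[Double].reverse).take(n).last

val distances = Seq(1.0, 9.0, 3.0, 7.0, 5.0)
// With n = 3, the three furthest distances are 9, 7, 5; the last is 5.
println(chooseThreshold(distances, n = 3)) // 5.0
```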

Spark Application

The code is organized to run as a Spark application. The application does "offline training" (Spark) and "online validation" (Spark Streaming).

Training: Training is run as a batch job. To compile and run, go to the folder spark-train and run:

sbt assembly
sbt package
spark-submit --class AnomalyDetection \
			target/scala-2.11/anomalydetection_2.11-1.0.jar
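The training job itself is not shown here; a rough sketch of what such a job looks like with the MLlib RDD API is below. The file path, parsing, and object name are assumptions, not the project's actual code, and it needs a Spark runtime (e.g. via spark-submit) to run:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object AnomalyDetection {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AnomalyDetection"))

    // Parse the "normal"-only training file into feature vectors.
    // (Path and parsing are placeholders for the project's own.)
    val data = sc.textFile("normal.csv")
      .map(_.split(',').map(_.toDouble))
      .map(Vectors.dense)
      .cache()

    // Train a single-cluster K-means model: k = 1, 20 iterations.
    val model = KMeans.train(data, 1, 20)
    val centroid = model.clusterCenters.head

    // Threshold: the distance of the last of the 2000 furthest points.
    val distances = data.map(v => math.sqrt(Vectors.sqdist(v, centroid)))
    val threshold = distances.top(2000).last

    println(s"centroid=$centroid threshold=$threshold")
    sc.stop()
  }
}
```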

Validation: Validation is run as a streaming job. Currently the application reads the input data from a local file; ideally, it would read the data from an ingestion tool such as Kafka (to connect Spark Streaming with Kafka, my other project can be used as an example). The trained model (centroid and threshold) is likewise saved in a local file; in production, this information should be saved in a database, as should the output of the validation. To compile and run, go to the folder streaming-validation and run:

sbt assembly
sbt package
spark-submit --class AnomalyDetectionTest \
	 	--jars target/scala-2.11/AnomalyDetectionTest-assembly-1.0.jar \
	 		target/scala-2.11/anomalydetectiontest_2.11-1.0.jar
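A rough sketch of what the streaming side could look like is below. The input directory, hard-coded model values, and object body are assumptions standing in for the project's file-based model loading, and it needs a Spark runtime to run:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.linalg.Vectors

object AnomalyDetectionTest {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("AnomalyDetectionTest"), Seconds(5))

    // In the project the centroid and threshold are read from a local
    // file; hard-coded placeholder values stand in for them here.
    val centroid  = Vectors.dense(0.0, 0.0)
    val threshold = 5.0

    // Each batch: parse incoming records and keep only those farther
    // from the centroid than the threshold (the "anomalies").
    val anomalies = ssc.textFileStream("input/")
      .map(_.split(',').map(_.toDouble))
      .map(Vectors.dense)
      .filter(v => math.sqrt(Vectors.sqdist(v, centroid)) > threshold)

    anomalies.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

In production the `filter` output would be written to a database instead of printed.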

Spark Shell

You can also play around with the code in Spark Shell. In terminal, start Spark shell:

./spark-shell

Follow the steps in the file train-shell.scala. Note that the validation code for the Spark Shell does NOT use Spark Streaming; it runs as a batch job.

Apache Zeppelin

Alternatively, you can also use Apache Zeppelin for testing purposes. To install, follow the steps here. To configure it with Spark, edit the conf/zeppelin-env.sh file:

# ./conf/zeppelin-env.sh
export SPARK_HOME=...

For more details, please visit here.

After installing Zeppelin, you can access the notebook in the browser:

localhost:8080

Download my notebook here and use the Import Note option to import it.

anomaly-detection's People

Contributors

keiraqz

