anomaly-detection's Introduction

Anomaly Detection using Spark MLlib and Spark Streaming

An Anomaly Detection example using Spark MLlib for training and Spark Streaming for testing. Slides are available here.

The Model

Anomaly Detection Model

This model uses a K-means (Spark MLlib K-means) approach and is trained on the "normal" dataset only. After the model is trained, the centroid of the "normal" dataset is returned along with a threshold. During the validation stage, any data point that is farther than the threshold from the centroid is considered an "anomaly".
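The decision rule itself is simple. A minimal pure-Scala sketch (the names `distance` and `isAnomaly` are illustrative, not from the project's code):

```scala
// Euclidean distance between two feature vectors.
def distance(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// A point is an anomaly if it lies farther from the trained
// centroid than the threshold.
def isAnomaly(point: Array[Double],
              centroid: Array[Double],
              threshold: Double): Boolean =
  distance(point, centroid) > threshold

val centroid  = Array(0.0, 0.0)
val threshold = 5.0
println(isAnomaly(Array(3.0, 4.0), centroid, threshold)) // distance 5.0 -> false
println(isAnomaly(Array(6.0, 8.0), centroid, threshold)) // distance 10.0 -> true
```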

Dataset

The dataset is downloaded from KDD Cup 1999 Data for Anomaly Detection.

Training Set: The training set contains only the data points from the full dataset that are labeled "normal".

Validation Set: The validation set uses the whole dataset. All data points that are NOT labeled "normal" are considered "anomalies".

Spark

This application is for learning and testing purposes, so the program runs in Spark local mode on a Mac Pro. However, the code should be similar if deployed onto a cluster.

The Code

The code largely follows the tutorial from Sean Owen, Cloudera (Video, Slides-1, Slides-2). A couple of modifications have been made to fit personal interest:

  • Instead of training multiple clusters, the code trains on "normal" data points only
  • Only one cluster center is recorded, and the threshold is set to the distance of the last of the 2000 furthest data points
  • During the later validation stage, all points farther than the threshold are labeled as "anomalies"
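The threshold selection in the second bullet can be sketched in plain Scala (the name `chooseThreshold` is illustrative; in the project the distances would come from an RDD):

```scala
// Given every training point's distance from the centroid, take the
// n-th largest distance as the anomaly threshold (n = 2000 in the
// project), i.e. the last of the n furthest points.
def chooseThreshold(distances: Seq[Double], n: Int = 2000): Double =
  distances.sorted(Ordering[Double].reverse).take(n).last

val distances = Seq(1.0, 9.0, 3.0, 7.0, 5.0)
// With n = 3, the three furthest distances are 9, 7, 5; the last is 5.
println(chooseThreshold(distances, n = 3)) // 5.0
```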

Spark Application

The code is organized to run as a Spark application. The application does "offline training" (Spark) and "online validation" (Spark Streaming).

Training: Training is run as a batch job. To compile and run, go to the folder spark-train and run:

sbt assembly
sbt package
spark-submit --class AnomalyDetection \
			target/scala-2.11/anomalydetection_2.11-1.0.jar
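The training job itself is not shown here; a rough sketch of what such a job looks like with the MLlib RDD API is below. The file path, parsing, and object name are assumptions, not the project's actual code, and it needs a Spark runtime (e.g. via spark-submit) to run:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object AnomalyDetection {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AnomalyDetection"))

    // Parse the "normal"-only training file into feature vectors.
    // (Path and parsing are placeholders for the project's own.)
    val data = sc.textFile("normal.csv")
      .map(_.split(',').map(_.toDouble))
      .map(Vectors.dense)
      .cache()

    // Train a single-cluster K-means model: k = 1, 20 iterations.
    val model = KMeans.train(data, 1, 20)
    val centroid = model.clusterCenters.head

    // Threshold: the distance of the last of the 2000 furthest points.
    val distances = data.map(v => math.sqrt(Vectors.sqdist(v, centroid)))
    val threshold = distances.top(2000).last

    println(s"centroid=$centroid threshold=$threshold")
    sc.stop()
  }
}
```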

Validation: Validation is run as a streaming job. Currently the application reads the input data from a local file; ideally, it would read the data from an ingestion tool such as Kafka (to connect Spark Streaming with Kafka, my other project can be used as an example). The trained model (centroid and threshold) is likewise saved in a local file; in production, this information should be saved in a database, as should the output of the validation. To compile and run, go to the folder streaming-validation and run:

sbt assembly
sbt package
spark-submit --class AnomalyDetectionTest \
	 	--jars target/scala-2.11/AnomalyDetectionTest-assembly-1.0.jar \
	 		target/scala-2.11/anomalydetectiontest_2.11-1.0.jar
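A rough sketch of what the streaming side could look like is below. The input directory, hard-coded model values, and object body are assumptions standing in for the project's file-based model loading, and it needs a Spark runtime to run:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.linalg.Vectors

object AnomalyDetectionTest {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("AnomalyDetectionTest"), Seconds(5))

    // In the project the centroid and threshold are read from a local
    // file; hard-coded placeholder values stand in for them here.
    val centroid  = Vectors.dense(0.0, 0.0)
    val threshold = 5.0

    // Each batch: parse incoming records and keep only those farther
    // from the centroid than the threshold (the "anomalies").
    val anomalies = ssc.textFileStream("input/")
      .map(_.split(',').map(_.toDouble))
      .map(Vectors.dense)
      .filter(v => math.sqrt(Vectors.sqdist(v, centroid)) > threshold)

    anomalies.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

In production the `filter` output would be written to a database instead of printed.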

Spark Shell

You can also play around with the code in Spark Shell. In terminal, start Spark shell:

./spark-shell

Follow the steps in the file train-shell.scala. Note that the validation code for the Spark Shell does NOT use Spark Streaming; it runs as a batch job.

Apache Zeppelin

Alternatively, you can also use Apache Zeppelin for testing purposes. To install, follow the steps here. To configure it with Spark, edit the conf/zeppelin-env.sh file:

# ./conf/zeppelin-env.sh
export SPARK_HOME=...

For more details, please visit here.

After installing Zeppelin, you can access the notebook in the browser:

localhost:8080

Download my notebook here and use the Import Note option to import it.

anomaly-detection's People

Contributors

keiraqz

