Giter VIP home page Giter VIP logo

spark-is-streaming's Introduction

A Nearest Neighbor Classifier for High-Speed Big Data Streams with Instance Selection (spark-IS-streaming)

Here we present an efficient nearest neighbor solution to classify fast and massive data streams using Apache Spark. It is formed by a distributed case-base and an instance selection method that enhances its performance and effectiveness. A distributed metric tree (based on M-trees) has been designed to organize the case-base and consequently to speed up the neighbor searches. This distributed tree consists of a top-tree (in the master node) that routes the searches in the first levels and several leaf nodes (in the slaves nodes) that solve the searches in next levels through a completely parallel scheme.

A improved local version of RNGE [1] in order to control the insertion and removal of noisy instances. For each incoming example, a relative graph is built around each new instance and its subset of neighbors. The local graphs are then used to edit the case-base by deciding what instances should be inserted, removed or left intact.

Associated Spark package: https://spark-packages.org/package/sramirez/spark-IS-streaming

Associated journal paper: S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak, J. M. Benítez and F. Herrera, "Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark," in IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 10, pp. 2727-2739, Oct. 2017. doi: 10.1109/TSMC.2017.2700889 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7993020&isnumber=8038140

Parameters:

Our distributed approach includes several user-defined input parameters, which are described below:

  • "type": type of file. "keel" for KEEL/ARFF files, "labeled" for Spark generated files with labeled points, or "csv" for standard csv files. Default: "csv".
  • "header": path of header file (only valid for KEEL files).
  • "output": path of output directory.
  • "npart": number of partitions for the first partitioning process. Default: 4.
  • "k": number of neighbors selected for prediction. Default: 1.
  • "kGraph": number of neighors selected to construct local graphs. Default: 10.
  • "rate": number of examples in each batch. Default: 2500.
  • "interval": the time interval at which streaming data will be divided into batches. Default: 1000 (ms).
  • "seed": seed value for random generator. Default: 237597430.
  • "ntrees": number of sub-trees in the slave nodes. It should be set equal or higher than the number of initial partitions. Default: 4.
  • "overlap": the distance between elements in sub-trees. 0 means sub-trees are disjoint. Default: 0.
  • "edited": if filtering of new noisy examples ocurrs. Default: false.
  • "removedOld": if removal of already inserted examples considered as noise ocurrs. Default: false.
  • "timeout": milleseconds before shutting the streaming execution. Default: 600000.
  • "sampling": percent of sampling w/o replacement on original static data. Default: 0.0.
  • "nClass": number of classes in data. Default: 2 (binary).

Example:

spark-submit --class org.ugr.sci2s.mllib.test.QueuRDDStreamingTest spark-IS-streaming.jar --input=hdfs://localhost:8020/train.data --output=hdfs://localhost:8020/output/streaming-test --type=csv --interval=1000 --rate=100000 --ntrees=460 --npart=460 --edited=true

For a more thorough sourc code example, please refer to: src/main/scala/org/ugr/sci2s/mllib/test/QueuRDDStreamingTest.scala

Contributors

References

[1] J.S. Sánchez, F. Pla, F.J. Ferri, Prototype selection for the nearest neighbour rule through proximity graphs, Pattern Recognition Lett. 18 (1997) 507–513. KEEL project: https://github.com/SCI2SUGR/KEEL / http://sci2s.ugr.es/keel/description.php

spark-is-streaming's People

Contributors

sramirez avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.