
activity_from_sensors's Introduction

Real Time Human Activity Classification from IMU data on Spark

This project implements, in Scala and Spark, a multiclass classifier of human activity from smartphone IMU sensor data.

The work is split into two main apps:

  • TrainingApp fits the chosen model, a decision tree (DT) or a multilayer perceptron (MLP), to labelled data.

  • StreamingApp reads input data from a TCP socket and classifies it on the fly over a sliding window.

Both apps run in local or cloud mode via bash scripts. Local mode takes the Spark installation directory as an argument:

script/run_local_training.sh /path/to/spark

Cloud mode runs on the AWS Elastic MapReduce (EMR) platform, see section 6:

script/run_emr_training.sh

Our best MLP model achieves an accuracy of 96% or more on unseen data.


1. The Data

The dataset is maintained by the UC Irvine Machine Learning Repository here: http://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition

The Heterogeneity Human Activity Recognition (HHAR) dataset from Smartphones and Smartwatches is devised to benchmark human activity recognition algorithms in real-world contexts; specifically, it is gathered with a variety of different device models and use-scenarios, in order to reflect sensing heterogeneities to be expected in real deployments.

Around 13 million phone accelerometer and gyroscope entries are provided, each with a millisecond-precision record time and an activity label; these are the fields we use.

Activities: ‘Biking’, ‘Sitting’, ‘Standing’, ‘Walking’, ‘Stair Up’ and ‘Stair down’.

Sensors: Two embedded sensors, i.e., Accelerometer and Gyroscope, sampled at the highest frequency the respective device allows.

Devices: 8 smartphones (2 Samsung Galaxy S3 mini, 2 Samsung Galaxy S3, 2 LG Nexus 4, 2 Samsung Galaxy S+)

Recordings: 9 users

The data files look like this:

Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a,nexus4,nexus4_1,stand
1,1424696633909,1424696631918283972,-5.95224,0.6702118,8.136536,a,nexus4,nexus4_1,stand
2,1424696633918,1424696631923288855,-5.9950867,0.6535491999999999,8.204376,a,nexus4,nexus4_1,stand
.
.
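As an illustration of the row layout above, here is a minimal sketch of a parser in plain Scala. The `SensorReading` type and its field names are hypothetical (they simply mirror the CSV header) and are not taken from the project's code.

```scala
// Hypothetical record type: one field per CSV column shown above.
final case class SensorReading(
    index: Long,
    arrivalTime: Long,  // Arrival_Time column
    creationTime: Long, // Creation_Time column
    x: Double,
    y: Double,
    z: Double,
    user: String,
    model: String,
    device: String,
    gt: String)         // labelled activity

object SensorReading {
  /** Parses one CSV row; returns None for the header line or a malformed row. */
  def parse(line: String): Option[SensorReading] = {
    val f = line.split(",", -1).map(_.trim)
    if (f.length != 10) None
    else scala.util.Try(
      SensorReading(
        f(0).toLong, f(1).toLong, f(2).toLong,
        f(3).toDouble, f(4).toDouble, f(5).toDouble,
        f(6), f(7), f(8), f(9))
    ).toOption
  }
}
```

Returning an `Option` lets a caller drop the header and any corrupt rows with a single `flatMap` instead of special-casing them.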

2. Time series Classification

To extract useful features, a windowing approach is used: the readings are grouped into 10-second windows.

5 statistics are then computed for each sensor axis and window:

  • Mean
  • Variance
  • Covariance
  • Skewness (measures distribution asymmetry)
  • Kurtosis (measures outliers)

The last two are standardized moments: skewness measures distribution asymmetry and kurtosis measures the weight of the tails. Adding these two alone raises accuracy from about 93% to about 96%.

In total, 5 statistics × 3 axes × 2 sensors give 30 features for the classification task.
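The feature computation for one sensor and one window can be sketched in plain Scala as follows. Several details are assumptions, since the text does not specify them: windows are keyed by integer division of a millisecond timestamp, moments use the population form (divide by n), and the three covariances are taken over the axis pairs (x,y), (y,z) and (x,z). All names are hypothetical.

```scala
object WindowFeatures {
  val WindowMs: Long = 10000L // 10-second windows, as described in the text

  /** Window key for a millisecond timestamp (assumed keying scheme). */
  def windowId(timeMs: Long): Long = timeMs / WindowMs

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  /** k-th central moment, population form (assumption: divide by n). */
  def centralMoment(xs: Seq[Double], k: Int): Double = {
    val m = mean(xs)
    xs.map(x => math.pow(x - m, k)).sum / xs.size
  }

  def variance(xs: Seq[Double]): Double = centralMoment(xs, 2)

  def covariance(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size, "axes must have the same number of samples")
    val mx = mean(xs)
    val my = mean(ys)
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / xs.size
  }

  // Standardized third and fourth moments (undefined for constant input).
  def skewness(xs: Seq[Double]): Double =
    centralMoment(xs, 3) / math.pow(variance(xs), 1.5)

  def kurtosis(xs: Seq[Double]): Double =
    centralMoment(xs, 4) / math.pow(variance(xs), 2)

  /** 15 features for one sensor in one window: 5 statistics x 3 axes. */
  def features(x: Seq[Double], y: Seq[Double], z: Seq[Double]): Vector[Double] = {
    val axes = Vector(x, y, z)
    axes.map(mean) ++ axes.map(variance) ++
      Vector(covariance(x, y), covariance(y, z), covariance(x, z)) ++
      axes.map(skewness) ++ axes.map(kurtosis)
  }
}
```

Concatenating the accelerometer and gyroscope vectors for the same window yields the 30 features mentioned above.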


3. Preprocessing

  • Spark SQL as reference:

Spark SQL provides optimized implementations of mean, variance and covariance. We use it as the reference for our Spark Core-only implementation.

  • Spark Core: the preprocessing Spark job pipeline for the accelerometer input file is shown in the figure. Stages 1-6 are replicated for the gyroscope data, and the two are then joined in stage 12.

[Figure: stages of the preprocessing Spark job]

  • partitionBy (by key) and persist are used to improve the performance of the join and of key-based operations.
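The ideas in the list above can be sketched with the Spark Core RDD API. This is not the project's code: the object name, the toy data and the partition count are hypothetical, and only mean and variance of a single axis are computed. The per-window statistics are built from a mergeable accumulator (count, sum, sum of squares), so `reduceByKey` can combine values on the map side instead of collecting every sample of a window as a grouping operation would.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PreprocessSketch {
  // (count, sum, sum of squares) for one axis within one window
  type Acc = (Long, Double, Double)

  def lift(v: Double): Acc = (1L, v, v * v)

  def merge(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2, a._3 + b._3)

  /** Population mean and variance; this formula can lose precision on large values. */
  def finish(acc: Acc): (Double, Double) = {
    val (n, s, ss) = acc
    val m = s / n
    (m, ss / n - m * m)
  }

  /** Input: (windowId, sample). Output: (windowId, (mean, variance)). */
  def meanAndVariance(readings: RDD[(Long, Double)]): RDD[(Long, (Double, Double))] =
    readings
      .mapValues(v => lift(v))
      .reduceByKey((a, b) => merge(a, b))
      .mapValues(acc => finish(acc))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("preprocess-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val partitioner = new HashPartitioner(4) // hypothetical partition count

    // Toy data standing in for the accelerometer and gyroscope files.
    val accel = meanAndVariance(sc.parallelize(Seq((0L, 1.0), (0L, 3.0), (1L, 2.0))))
      .partitionBy(partitioner)
      .persist(StorageLevel.MEMORY_ONLY)
    val gyro = meanAndVariance(sc.parallelize(Seq((0L, 0.5), (1L, 0.1), (1L, 0.3))))
      .partitionBy(partitioner)

    // Both sides share the same partitioner, so the join needs no extra shuffle.
    accel.join(gyro).collect().foreach(println)
    sc.stop()
  }
}
```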

4. Training with MLlib pipeline

A Spark ML pipeline is used to train the model. Its stages are:

  1. label indexer: converts activity labels to indices
  2. min-max scaler
  3. classifier
  4. label converter: converts predicted indices back to labels

Multi-layer perceptron and decision tree algorithms are implemented, with MLP achieving the best result.
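A pipeline with these four stages could be assembled with the `org.apache.spark.ml` API roughly as below. This is a sketch under assumptions, not the project's configuration: the column names, the hidden layer size, the iteration count and the seed are hypothetical. Only the input size (30 features) and output size (6 activities) come from the text. `labelsArray` requires Spark 3.0 or later; older versions expose `labels` instead.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.{IndexToString, MinMaxScaler, StringIndexer}
import org.apache.spark.sql.DataFrame

object TrainingPipelineSketch {
  // Stage 2: rescales every feature to a common range ([0, 1] by default).
  val scaler: MinMaxScaler = new MinMaxScaler()
    .setInputCol("features")
    .setOutputCol("scaledFeatures")

  // Stage 3: 30 inputs and 6 outputs follow the text; the hidden layer is a guess.
  val mlp: MultilayerPerceptronClassifier = new MultilayerPerceptronClassifier()
    .setLayers(Array(30, 20, 6))
    .setFeaturesCol("scaledFeatures")
    .setLabelCol("label")
    .setMaxIter(100)
    .setSeed(42L)

  /** Expects a String column "gt" and an ML Vector column "features" of length 30. */
  def fit(training: DataFrame): PipelineModel = {
    // Stage 1: activity label -> numeric index.
    val labelIndexer = new StringIndexer()
      .setInputCol("gt")
      .setOutputCol("label")
      .fit(training)

    // Stage 4: predicted index -> activity label.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedActivity")
      .setLabels(labelIndexer.labelsArray(0))

    new Pipeline()
      .setStages(Array(labelIndexer, scaler, mlp, labelConverter))
      .fit(training)
  }
}
```

Swapping `mlp` for a `DecisionTreeClassifier` with the same feature and label columns would give the decision tree variant without touching the other stages.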


5. Spark Streaming

To classify data in real time, the input stream is grouped into windows of 10 seconds. A sliding window of this length is computed every 5 seconds for a smoother response.

The time series arrives as a DStream and is processed into predictions, which are also available as a DStream.
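The windowing described above maps onto the DStream API as in the following sketch. The host, port and batch interval are assumptions, and the per-window work is reduced to counting lines; in the real app this is where feature extraction and the trained model would be applied.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

object StreamingSketch {
  val BatchInterval: Duration = Seconds(5)  // assumed; must divide window and slide
  val WindowLength: Duration = Seconds(10)  // window length from the text
  val SlideInterval: Duration = Seconds(5)  // slide interval from the text

  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing.
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, BatchInterval)

    // Hypothetical endpoint serving CSV lines over TCP.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Every 5 seconds, emit the lines received during the last 10 seconds.
    val windowed = lines.window(WindowLength, SlideInterval)

    // Placeholder for feature extraction and classification.
    windowed.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because the slide is half the window length, each reading contributes to two consecutive predictions, which is what smooths the response.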


6. AWS deployment

  • Training data is stored on Amazon S3 and accessed directly by TrainingApp.
  • StreamingApp reads the data to classify from a TCP socket. For this reason server_stream.py runs on an EC2 instance, serving one or more test files to the socket.
  • Classification results can be viewed live on port 8888 and are also available as a DStream.

7. Challenges

  • optimizing collection operations (GC, groupBy vs reduceByKey)
  • code refactoring
  • local vs cloud deployment

Contributors

danieleveri, buoi
