Real Time Human Activity Classification from IMU data on Spark

The project is a Scala + Spark implementation of a multiclass classifier of human activity from smartphone IMU sensor data.

The work is split into two main apps:

TrainingApp fits the chosen model, a decision tree (DT) or a multilayer perceptron (MLP), to labelled data.
StreamingApp reads input data from a TCP socket and classifies it on the fly over a sliding window.

Both run in local or cloud mode via bash scripts. In local mode, pass the Spark installation directory:

script/run_local_training.sh /path/to/spark

In cloud mode they run on the AWS Elastic MapReduce (EMR) platform (see section 6):

script/run_emr_training.sh

Our best MLP model achieves an accuracy of 96% or more on unseen data.

1. The Data

The dataset is maintained by the UC Irvine Machine Learning Repository here: http://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition

The Heterogeneity Human Activity Recognition (HHAR) dataset from Smartphones and Smartwatches is devised to benchmark human activity recognition algorithms in real-world contexts; specifically, it is gathered with a variety of different device models and use-scenarios, in order to reflect sensing heterogeneities to be expected in real deployments.

Around 13 million phone accelerometer and gyroscope samples are provided, each with a millisecond-precision timestamp and an activity label. These are the records we use.

Activities: ‘Biking’, ‘Sitting’, ‘Standing’, ‘Walking’, ‘Stair Up’ and ‘Stair down’.

Sensors: Two embedded sensors, i.e., Accelerometer and Gyroscope, sampled at the highest frequency the respective device allows.

Devices: 8 smartphones (2 Samsung Galaxy S3 mini, 2 Samsung Galaxy S3, 2 LG Nexus 4, 2 Samsung Galaxy S+)

Recordings: 9 users

The data is presented as CSV, like this:

Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a,nexus4,nexus4_1,stand
1,1424696633909,1424696631918283972,-5.95224,0.6702118,8.136536,a,nexus4,nexus4_1,stand
2,1424696633918,1424696631923288855,-5.9950867,0.6535491999999999,8.204376,a,nexus4,nexus4_1,stand
.
.
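As a minimal sketch, one such row could be parsed into a typed record as follows. The `Sample` case class and the `parse` helper are hypothetical names introduced for illustration, not taken from the project; only the column order comes from the header above.

```scala
object SampleParsing {
  // One row of the CSV; field names mirror the header shown above.
  // (Sample and parse are illustrative names, not the project's.)
  final case class Sample(
      index: Long,
      arrivalTime: Long,  // milliseconds since the epoch
      creationTime: Long, // device timestamp, kept as-is
      x: Double,
      y: Double,
      z: Double,
      user: String,
      model: String,
      device: String,
      gt: String          // ground-truth activity label
  )

  // Returns None for the header line and for any malformed row.
  def parse(line: String): Option[Sample] =
    line.split(",", -1) match {
      case Array(i, at, ct, x, y, z, user, model, device, gt) =>
        scala.util.Try(
          Sample(i.toLong, at.toLong, ct.toLong,
                 x.toDouble, y.toDouble, z.toDouble,
                 user, model, device, gt)
        ).toOption
      case _ => None
    }
}
```

Returning an `Option` lets a caller drop the header and broken rows with a single `flatMap` over the input lines.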

2. Time series Classification

To extract valuable features a window approach is used: the dataset is grouped into 10-second windows.
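One simple way to assign samples to 10-second windows is integer division of a millisecond timestamp by the window length. The sketch below is plain Scala and assumes that Arrival_Time is the timestamp used and that windows are aligned to the epoch; the project may key windows differently (for example per user and device as well).

```scala
object Windowing {
  val WindowMillis: Long = 10000L // 10-second windows, as in the text

  // Index of the fixed window that a millisecond timestamp falls into.
  def windowId(arrivalTimeMillis: Long): Long =
    arrivalTimeMillis / WindowMillis

  // Groups (timestamp, value) pairs by window id.
  // Within each window the input order is preserved.
  def groupByWindow[A](samples: Seq[(Long, A)]): Map[Long, Seq[A]] =
    samples
      .groupBy { case (t, _) => windowId(t) }
      .map { case (w, rows) => w -> rows.map(_._2) }
}
```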

Five statistics are then computed for each sensor axis and window:

Mean
Variance
Covariance
Skewness (measures distribution asymmetry)
Kurtosis (measures tail weight, i.e. how prominent outliers are)

Adding the latter two standardized moments alone raises accuracy from approximately 93% to 96%.

A total of 5 statistics x 3 axes x 2 sensors gives 30 features for the classification task.
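The five statistics can be sketched in plain Scala as below. This is an illustration under stated assumptions rather than the project's code: it uses population moments (dividing by n) and non-excess kurtosis, and it leaves open which axis pairs the covariance is taken over.

```scala
object WindowStats {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // k-th central moment, population form (divides by n).
  def centralMoment(xs: Seq[Double], k: Int): Double = {
    val m = mean(xs)
    xs.map(x => math.pow(x - m, k)).sum / xs.size
  }

  def variance(xs: Seq[Double]): Double = centralMoment(xs, 2)

  // Population covariance of two equally long series.
  def covariance(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size, "series must have the same length")
    val (mx, my) = (mean(xs), mean(ys))
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / xs.size
  }

  // Third standardized moment: 0 for a symmetric distribution.
  def skewness(xs: Seq[Double]): Double =
    centralMoment(xs, 3) / math.pow(variance(xs), 1.5)

  // Fourth standardized moment (not excess kurtosis).
  def kurtosis(xs: Seq[Double]): Double =
    centralMoment(xs, 4) / math.pow(variance(xs), 2)
}
```

A window whose values are all identical has zero variance, so skewness and kurtosis evaluate to NaN here; a real pipeline would need to decide how to handle that case.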

3. Preprocessing

Spark SQL as state of the art:

Spark SQL provides optimized implementations of the mean, variance and covariance computations. It is used as the state-of-the-art reference for our Spark Core-only implementation.

Spark Core:

In the preprocessing Spark job pipeline for the accelerometer input file, stages 1-6 are replicated for the gyroscope data, and the two are then joined in stage 12.

partitionBy on the key and persist are applied to improve the performance of the join and of other key-based operations.
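The effect of partitioning and persisting before a join can be sketched with the Spark Core RDD API as follows. The key type `(user, device, windowId)`, the feature arrays and the function name are assumptions made for the example; the project's actual keys and storage level are not given in this text.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CoPartitionedJoin {
  // Hypothetical key: (user, device, windowId).
  type Key = (String, String, Long)

  def joinFeatures(
      accel: RDD[(Key, Array[Double])],
      gyro: RDD[(Key, Array[Double])],
      numPartitions: Int
  ): RDD[(Key, Array[Double])] = {
    val partitioner = new HashPartitioner(numPartitions)
    // Both sides share one partitioner, so matching keys already sit in the
    // same partition and the join itself does not need another shuffle.
    val a = accel.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)
    val g = gyro.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)
    // Concatenate accelerometer and gyroscope features per key.
    a.join(g).mapValues { case (af, gf) => af ++ gf }
  }
}
```

Persisting pays off only when the partitioned RDDs are reused by later key-based operations; for a single join the partitioning alone is what matters.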

4. Training with mllib pipeline

A Spark ML pipeline is used to train the model. Its stages are:

label indexer: converts activity labels to indices
min-max scaler: rescales each feature to a common range
classifier: the decision tree or MLP model
label converter: converts predicted indices back to activity labels

Multi-layer perceptron and decision tree algorithms are implemented, with MLP achieving the best result.
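A pipeline with these four stages might look roughly like the sketch below, written against the Spark ML API. The column names, the hidden layer size and the iteration count are assumptions chosen for illustration; only the 30 inputs and 6 output classes follow from the text.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.{IndexToString, MinMaxScaler, StringIndexer}
import org.apache.spark.sql.DataFrame

object TrainingPipelineSketch {
  // Expects a string column "gt" and a Vector column "features" (30 values).
  // Both column names are assumptions for this sketch.
  def build(training: DataFrame): Pipeline = {
    // Label indexer: activity name -> numeric index.
    val indexer = new StringIndexer()
      .setInputCol("gt")
      .setOutputCol("label")
      .fit(training)

    // Min-max scaler: rescales every feature to [0, 1] by default.
    val scaler = new MinMaxScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")

    // Classifier: 30 inputs, one hypothetical hidden layer, 6 activities.
    val mlp = new MultilayerPerceptronClassifier()
      .setFeaturesCol("scaledFeatures")
      .setLabelCol("label")
      .setLayers(Array(30, 20, 6))
      .setMaxIter(100)

    // Label converter: predicted index -> activity name.
    val converter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedActivity")
      .setLabels(indexer.labels)

    new Pipeline().setStages(Array(indexer, scaler, mlp, converter))
  }
}
```

Calling `build(training).fit(training)` would then yield a `PipelineModel` whose `transform` adds the `predictedActivity` column. Swapping the MLP for a `DecisionTreeClassifier` should only require changing the classifier stage.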

5. SparkStreaming

To classify data in real time, the input stream is processed over a sliding window of length 10 seconds, which is recomputed every 5 seconds for a smoother response.

The time-series input arrives as a DStream and is processed into predictions, which are also available as a DStream.
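In Spark Streaming terms, a 10-second window that slides every 5 seconds can be expressed with `DStream.window`. The sketch below assumes a 5-second batch interval and uses placeholder values for host and port; it only counts the lines in each window where the real application would extract features and classify.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object SlidingWindowSketch {
  // host and port are placeholders for wherever the sample server runs.
  def windowedLines(ssc: StreamingContext, host: String, port: Int): DStream[String] =
    ssc
      .socketTextStream(host, port)
      .window(Seconds(10), Seconds(5)) // 10 s of data, recomputed every 5 s

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SlidingWindowSketch").setMaster("local[2]")
    // Window length and slide interval must be multiples of the batch interval.
    val ssc = new StreamingContext(conf, Seconds(5))
    // Stand-in for feature extraction and classification.
    windowedLines(ssc, "localhost", 9999).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With these settings every sample contributes to two consecutive windows, which is what smooths the sequence of predictions.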

6. AWS deployment

Training data is stored on Amazon S3 and accessed directly by TrainingApp.
StreamingApp reads the data to classify from a TCP socket; for this reason server_stream.py runs on an EC2 instance, serving one or more test files to the socket.
Classification results can be viewed live on port 8888, and are also available as a DStream.

7. Challenges

Optimizing collection operations (garbage collection, groupBy vs reduceByKey)
Code refactoring
Local vs cloud deployment
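The groupBy vs reduceByKey trade-off can be illustrated with a per-key mean on an RDD. This is a generic sketch, not the project's code, and the `Long` key stands in for whatever window key is actually used.

```scala
import org.apache.spark.rdd.RDD

object KeyedMean {
  // groupByKey: every value of a key is shuffled and buffered before averaging.
  def meanWithGroupBy(rdd: RDD[(Long, Double)]): RDD[(Long, Double)] =
    rdd.groupByKey().mapValues(vs => vs.sum / vs.size)

  // reduceByKey: (sum, count) pairs are combined on the map side first,
  // so far less data crosses the shuffle and fewer objects are allocated.
  def meanWithReduceBy(rdd: RDD[(Long, Double)]): RDD[(Long, Double)] =
    rdd
      .mapValues(v => (v, 1L))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (s, n) => s / n }
}
```

Both functions return the same means; the second usually puts less pressure on the garbage collector because it never materializes all the values of a key at once.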
