capataz

A Data Analytics Platform for Connected Transportation

Summary

capataz allows users to explore, analyze and visualize both near-real-time and historical data, composed of geospatial location, distance traveled, number of passengers and time.

The platform was developed in 3 weeks as part of Insight's Data Engineering Fellowship in NYC. The project allowed me to explore several open-source Big Data technologies, their paradigms and limitations.

Features:

  • Near-real-time data ingestion and processing
  • Near-real-time queries on users and vehicles
  • Batch processing on historical data
  • Trip-time prediction based on historical data
  • User-interface for data exploration and visualization
  • Distributed and scalable

Description

capataz (Spanish), translates to overseer/controller:

(n) a person or thing that directs or regulates something.

capataz was inspired by the rise of connected transportation and the imminent deployment of self-driving vehicles. It is a proof-of-concept platform for users who need to monitor, explore and visualize vast amounts of streaming data. Such users may be city officials, urban planners, emergency and mass-transportation services, and upcoming automated services such as delivery and carpooling. capataz was also designed to be used by data scientists who want to incorporate predictive models for both real-time and batch streams of data. As it stands, capataz can process and filter a simulated real-time stream of data that includes geospatial location, distance traveled, number of passengers and time. It can also process and filter large batches of data while simultaneously training a predictive model. The processed data can then be queried, visualized and explored through a user interface.

As a proof of concept, a Decision Tree Regressor was chosen to predict the time delta (in minutes), given a start and goal location at a given hour of the day. This information would be valuable for ridesharing and carpooling services that need to optimize their fleet logistics.
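To make the prediction setup concrete, here is a minimal sketch of how one trip record could be turned into a feature vector and a label for a regressor. The field names (`pickup_datetime`, `pickup_longitude`, and so on) and the timestamp format are assumptions for illustration; the project's actual schema may differ.

```python
from datetime import datetime

# Assumed timestamp layout; adjust to match the real data.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def trip_to_example(trip):
    """Turn one trip record into (features, label).

    features: [start_lon, start_lat, goal_lon, goal_lat, hour_of_day]
    label:    trip duration (time delta) in minutes
    """
    pickup = datetime.strptime(trip["pickup_datetime"], TIME_FORMAT)
    dropoff = datetime.strptime(trip["dropoff_datetime"], TIME_FORMAT)
    features = [
        float(trip["pickup_longitude"]),
        float(trip["pickup_latitude"]),
        float(trip["dropoff_longitude"]),
        float(trip["dropoff_latitude"]),
        float(pickup.hour),
    ]
    minutes = (dropoff - pickup).total_seconds() / 60.0
    return features, minutes


# Hypothetical trip record, used only to exercise the function.
example_trip = {
    "pickup_datetime": "2015-01-01 08:05:00",
    "dropoff_datetime": "2015-01-01 08:20:00",
    "pickup_longitude": "-73.99",
    "pickup_latitude": "40.75",
    "dropoff_longitude": "-73.97",
    "dropoff_latitude": "40.76",
}
features, minutes = trip_to_example(example_trip)
```

A list of such `(features, label)` pairs is the kind of input a decision tree regressor would be trained on.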

The data used for both real-time simulation and batch processing comes from the NYC Taxi and Limousine Commission, and spans multiple .csv files totaling 170 GB.

Pipeline

Ingestion:

Apache Kafka serves as the primary messaging system between technologies within the pipeline.

  • Kafka receives messages from:

    • a .csv file that simulates a real-time stream (currently ~100 messages per second)
    • a completed Spark Streaming process
    • a completed Spark Batch process
  • Kafka sends messages to:

    • a Spark Streaming process
    • a new .csv file on HDFS
    • Elasticsearch, where they are indexed for querying
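The simulated stream described above can be sketched as a loop that reads rows from a .csv source and emits each one as a JSON message at a fixed rate. In this sketch, `send` is a hypothetical stand-in for a Kafka producer's send call, and the column names in the sample are made up for illustration.

```python
import csv
import io
import json
import time


def replay_csv(lines, send, rate_per_second=100.0, sleep=time.sleep):
    """Read CSV rows and pass each one to `send` as a JSON string.

    `lines` is any iterable of CSV text lines (an open file works).
    `send` stands in for a real producer call (assumption).
    Returns the number of messages sent.
    """
    delay = 1.0 / rate_per_second if rate_per_second > 0 else 0.0
    sent = 0
    for row in csv.DictReader(lines):
        send(json.dumps(row))
        sent += 1
        sleep(delay)  # throttle to roughly `rate_per_second` messages
    return sent


# Usage with an in-memory CSV and a list acting as the message sink.
sample = io.StringIO("vehicle_id,passenger_count\nA1,2\nB7,1\n")
outbox = []
count = replay_csv(sample, outbox.append, sleep=lambda _: None)
```

Passing `sleep` as a parameter keeps the throttle easy to disable when testing; with the default `time.sleep`, the loop emits about 100 messages per second.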

Stream:

Apache Spark Streaming serves as the near-real-time analytics framework, processing incoming messages from Kafka in micro-batches.

  • Spark Stream performs the following operations:
    • Create a Direct Stream from Kafka, listening for incoming JSON messages
    • Parse JSON messages into DataFrames
    • Filter out invalid GPS coordinates and empty/null entries
    • Send messages back to Kafka, where they are then consumed by Elasticsearch
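The filtering step above can be illustrated in plain Python, independent of Spark. This is a sketch under stated assumptions: the required field names are hypothetical, and treating the point (0, 0) as invalid is an assumption about how missing GPS fixes show up in the data, not something the project confirms.

```python
import json

# Hypothetical field names; the real message schema may differ.
REQUIRED_FIELDS = ("latitude", "longitude", "passenger_count", "timestamp")


def is_valid(record):
    """Return True if the record has all fields and plausible coordinates."""
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            return False
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
    except (TypeError, ValueError):
        return False
    # Assumption: (0, 0) marks a missing GPS fix rather than a real location.
    if lat == 0.0 and lon == 0.0:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def filter_messages(messages):
    """Parse JSON strings and keep only the valid records."""
    valid = []
    for raw in messages:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # drop malformed messages
        if isinstance(record, dict) and is_valid(record):
            valid.append(record)
    return valid


messages = [
    '{"latitude": 40.75, "longitude": -73.99, "passenger_count": 2, "timestamp": "2015-01-01 08:05:00"}',
    '{"latitude": 0, "longitude": 0, "passenger_count": 1, "timestamp": "2015-01-01 08:06:00"}',
    '{"latitude": 140.0, "longitude": -73.99, "passenger_count": 1, "timestamp": "2015-01-01 08:07:00"}',
    '{"latitude": 40.75, "longitude": -73.99, "passenger_count": null, "timestamp": "2015-01-01 08:08:00"}',
    'not json',
]
kept = filter_messages(messages)
```

In a Spark job the same predicate would typically be expressed as DataFrame filter conditions; the pure-Python version is only meant to show the rules.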

Batch:

Apache Spark SQL + ML is used for batch processing of historical data, which comes from multiple large (2.5 GB) .csv files stored in HDFS.

  • Spark SQL+ML performs the following operations: *

Datastore+Search Engine:

Elasticsearch serves as the datastore and search engine, where users, cars and predictions are indexed for later querying.
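As an illustration of what indexing a record might involve, the sketch below builds a request body in the newline-delimited format used by the Elasticsearch bulk API: one action line followed by one document line per record. The index name `vehicles` and the field names are hypothetical, and for `location` to be searchable as a geographic point, the index mapping would need to declare it as a `geo_point`.

```python
import json


def to_bulk_body(index_name, records):
    """Build a bulk-API body (NDJSON) that indexes each record.

    Latitude and longitude are merged into a single `location` object
    with `lat` and `lon` keys.
    """
    lines = []
    for record in records:
        doc = dict(record)  # copy so the caller's record is not modified
        lat = doc.pop("latitude")
        lon = doc.pop("longitude")
        doc["location"] = {"lat": lat, "lon": lon}
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical record, used only to exercise the function.
body = to_bulk_body(
    "vehicles",
    [{"vehicle_id": "A1", "latitude": 40.75, "longitude": -73.99}],
)
```

The resulting string can be sent as the body of a bulk request by whichever client the pipeline uses.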

Frontend:

Kibana is used as the user interface for Elasticsearch.

How to run

TODO

Dependencies

TODO

Contributors

alcedok
