nyc-taxi's Introduction

Project Description

This project is mainly based on Apache Spark Streaming, Kafka, and Hadoop, and uses the New York City taxi dataset.

Data Source: https://www.kaggle.com/c/nyc-taxi-trip-duration/data

Data Generator:

The Data Generator replays the dataset as a stream to mimic a real-world scenario. Using its CLI commands, you can either send the data to an Apache Kafka topic or save it as a log file in a directory of your choice.

There are 2 Python scripts: one streams data to a file (dataframe_to_log.py) and the other streams data to Kafka (dataframe_to_kafka.py). You must use **Python 3**. Using a virtual environment is recommended.

You can find the installation guide in the data-generator directory.

git clone https://github.com/erkansirin78/data-generator.git

cd data-generator

python dataframe_to_kafka.py -h

python dataframe_to_kafka.py -i ~/datasets/nyc_taxi_subset.csv -t test1

python dataframe_to_log.py -h
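To make the idea of replaying a static dataset as a stream more concrete, here is a minimal sketch of the file-based variant. It is not the code of dataframe_to_log.py; the function name `stream_csv_to_log` and its parameters (`delay_seconds`, `max_rows`) are hypothetical and chosen only for illustration. It assumes the input CSV has a header row.

```python
import csv
import time


def stream_csv_to_log(csv_path, log_path, delay_seconds=0.5, max_rows=None):
    """Append CSV rows to a log file one at a time to imitate a live feed.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    sent = 0
    with open(csv_path, newline="") as src, open(log_path, "a", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, lineterminator="\n")
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            if max_rows is not None and sent >= max_rows:
                break
            writer.writerow(row)
            dst.flush()  # make the new line visible to downstream readers right away
            sent += 1
            time.sleep(delay_seconds)  # pause between rows to simulate arrival over time
    return sent
```

A downstream consumer (for example a file-based streaming source) could then watch the log file while rows arrive. Lowering `delay_seconds` speeds up the replay.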

The project has 2 main sections:

1-) Data Engineering Section

The main purpose of this section is to maintain the ETL process for the streaming NYC Taxi dataset and to create a live dashboard using Kibana.

2-) Machine Learning and Streaming Section

The main purpose of this section is to predict the Estimated Time of Arrival (ETA) while streaming the NYC Taxi dataset and to route records to different Kafka topics based on the ETA value.
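The routing step can be sketched as a small pure function that maps a predicted ETA to a topic name. This is only an illustration under stated assumptions: the topic names `eta-short` and `eta-long`, the 900-second threshold, and the function names are all hypothetical and not taken from the project, and the ETA is assumed to be expressed in seconds.

```python
def choose_topic(eta_seconds, threshold_seconds=900.0,
                 short_topic="eta-short", long_topic="eta-long"):
    """Pick a Kafka topic name from a predicted ETA (all names are hypothetical)."""
    return short_topic if eta_seconds < threshold_seconds else long_topic


def group_by_topic(predictions):
    """Group (record_id, eta_seconds) pairs by the topic they would be sent to."""
    grouped = {}
    for record_id, eta_seconds in predictions:
        grouped.setdefault(choose_topic(eta_seconds), []).append(record_id)
    return grouped
```

In a streaming job, the value returned by `choose_topic` would be used as the destination topic when each prediction is written back to Kafka. Keeping the decision in a pure function makes the threshold easy to test without a running broker.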
