TaxiDataPipeLine - Data Pipeline Project

Description

TaxiDataPipeLine is an application that demonstrates how to build a simple data pipeline using Yellow Taxi trip data.

Build Status

  • Windows x64 - Build & Test
  • Linux x64 - Build & Test

Folder Structure

  • docs - Project documentation
  • src - Python source code
  • test - Unit tests

Getting Started

Follow these instructions to get the source code and run it on your local machine.

Prerequisites

You need Python 3.7.3 (Official download link) to run this project.
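
A quick way to confirm the interpreter before installing anything is a small version check. The sketch below assumes that any interpreter at or above 3.7 is acceptable, which this README does not state; `MINIMUM_VERSION` and `python_is_supported` are illustrative names and not part of the project.

```python
import sys

# Assumption: 3.7 is treated as the minimum version; the README only names 3.7.3.
MINIMUM_VERSION = (3, 7)


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM_VERSION):
    """Return True when the given (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_is_supported():
        print(f"Python {current} meets the assumed minimum.")
    else:
        print(f"Python {current} is older than the assumed minimum 3.7.")
```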

Clone repository

git clone https://github.com/write2sushma/TaxiDataPipeLine.git

Set-up development environment

Navigate to source folder

cd TaxiDataPipeLine

Create a virtual environment

In Linux OS

python3 -m venv env
source env/bin/activate

In Windows OS

python -m venv env
env\Scripts\activate

Install project dependencies

Project dependencies are listed in the requirements.txt file. Use the command below to install them:

pip3 install -r requirements.txt

If installing dask from requirements.txt fails, run the commands below in a command prompt or terminal window:

pip3 install "dask[complete]"

pip3 install dask distributed

How to run

Navigate to the TaxiDataPipeLine\taxidata folder and run data_processor.py:

python data_processor.py

How to Unit Test

Unit tests are written using Python's unittest library. Run them with either of the commands below:

pytest

or 

python -m unittest test/test_data_processor.py
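
For readers new to unittest, the following is a minimal sketch of the shape such a test might take. It does not reproduce the contents of test_data_processor.py; `average_trip_distance`, `AverageTripDistanceTest`, and `run_tests` are hypothetical names invented for illustration.

```python
import unittest


def average_trip_distance(distances):
    """Hypothetical helper: mean of a list of trip distances, 0.0 when empty."""
    if not distances:
        return 0.0
    return sum(distances) / len(distances)


class AverageTripDistanceTest(unittest.TestCase):
    def test_mean_of_values(self):
        self.assertAlmostEqual(average_trip_distance([1.0, 2.0, 3.0]), 2.0)

    def test_empty_input(self):
        self.assertEqual(average_trip_distance([]), 0.0)


def run_tests():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AverageTripDistanceTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Both pytest and `python -m unittest` discover `unittest.TestCase` subclasses like this one, which is why either command can run the same test file.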

How to check coverage

Run the command below to measure code coverage:

python -m coverage run test/test_data_processor.py

Then print the coverage summary and generate an HTML report:

coverage report
coverage html

Data Source

The pipeline uses the following data source URLs:

https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-06.csv
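
The six URLs above differ only in the month, so they can be generated rather than hard-coded. This is a sketch of one way to do that; `BASE_URL` and `yellow_trip_urls` are illustrative names, and nothing here is taken from data_processor.py.

```python
# Common prefix shared by every URL listed above.
BASE_URL = "https://s3.amazonaws.com/nyc-tlc/trip+data/"


def yellow_trip_urls(year=2019, first_month=1, last_month=6):
    """Build the monthly Yellow Taxi CSV URLs for an inclusive month range."""
    return [
        f"{BASE_URL}yellow_tripdata_{year}-{month:02d}.csv"
        for month in range(first_month, last_month + 1)
    ]


if __name__ == "__main__":
    for url in yellow_trip_urls():
        print(url)
```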

Automated build setup

Azure DevOps Pipelines is used to set up and configure the automated build pipeline.

Future Enhancement Plan:

• Optimize performance by using the dask scheduler for faster parallel processing.
    - This is already implemented in the 'enhancements' feature branch.
• Scale the pipeline to datasets several times larger than one machine can hold by using multi-node clusters in the cloud (e.g. AWS).
• Set up performance monitoring.
• Automate deployment using Azure DevOps Pipelines.
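
The first item in the plan amounts to running one independent step per monthly file and collecting the results. The sketch below illustrates that fan-out pattern only; it swaps in the standard library's `ThreadPoolExecutor` as a stand-in for a dask scheduler and is not the code in the 'enhancements' branch. `summarize_month`, `summarize_all`, and the `trip_distance` key are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_month(rows):
    """Hypothetical per-file step: trip count and total distance for one month."""
    return {
        "trips": len(rows),
        "distance": sum(row["trip_distance"] for row in rows),
    }


def summarize_all(months, max_workers=4):
    """Run the per-month step across a worker pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(summarize_month, months))


if __name__ == "__main__":
    january = [{"trip_distance": 1.5}, {"trip_distance": 2.5}]
    february = [{"trip_distance": 3.0}]
    print(summarize_all([january, february]))
```

Because each month is processed independently, the same structure maps onto a scheduler that distributes tasks across cores or machines, which is the direction the plan describes.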

Contributors

sushma-goutam, write2sushma
