
yellow-proto's Introduction

New York Yellow Cab example REST API

The purpose is to create a simple API service that allows querying data from the dataset.

The following are the requirements from the client:

  • Use the dataset yellow_tripdata_2020-01.csv from Amazon; since the CSV data no longer seems to be available, we use the 'parquet' format instead
  • Build an application to serve the API that can run locally, with all requirements properly specified in the package
  • The application should be built as if it were to be put in production; keep scalability and performance in mind

Testing of correctness

The application is currently not tested for correctness of its results; the following tests must be made for each route:

| Id | Test | Responsible |
|----|------|-------------|
| 1 | Response should be valid according to the specification for all routes when the interval contains no data (needs to be agreed with the customer; suggest a response of 0 or an HTTP 404) | CG |
| 2 | Check that the data types and units are correct (minutes, km, USD) for all endpoints | CG |
| 3 | Check that results are correct with one sample | VJ |
| 4 | Results are selected correctly according to the query terms (manually specify a dataset and verify the selection at the borders) | VJ |
| 5 | Results are correct with two samples (especially check how the median is computed) | VJ |
| 6 | Results are correctly loaded when coming from multiple data files (for example, a median across data from January and February should give the agreed-upon result) | CG |
| 7 | Performance of a single GET: no requirement exists yet, what should it be? | CG |
| 8 | How many concurrent users will the system have? | CG |
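Tests 3 and 5 above hinge on how the statistics degenerate with very few samples. A minimal sketch of the expected behaviour, using Python's statistics module (the service's actual implementation may differ):

```python
import statistics

# Test 5: with exactly two samples, the median must be the mean of the
# two values (hypothetical trip durations in minutes).
two_samples = [10.0, 20.0]
assert statistics.median(two_samples) == 15.0

# Test 3: with a single sample, every statistic collapses to that value.
one_sample = [12.5]
assert statistics.median(one_sample) == 12.5
assert statistics.mean(one_sample) == 12.5
```

Agreeing on this degenerate behaviour up front makes the "two samples" case in test 5 unambiguous.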

Additional specification needed to decide on requirements

  • How different will the queries be?
  • Can the results be cached? For instance, for plots the query might always ask for the time range of a full month; if so, the result can be cached
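If the full-month assumption holds, the cache key is just the (year, month) pair. A minimal sketch using functools.lru_cache; `monthly_stats` is a hypothetical stand-in for the real aggregation:

```python
from functools import lru_cache

# Hypothetical cached aggregate: if plot queries always cover a full
# month, (year, month) is a natural cache key.
@lru_cache(maxsize=128)
def monthly_stats(year: int, month: int) -> dict:
    # Placeholder for the real dataset aggregation.
    return {"year": year, "month": month, "trips": 0}

monthly_stats(2020, 1)
monthly_stats(2020, 1)  # second call is served from the cache
assert monthly_stats.cache_info().hits == 1
```

A production service would more likely use an external cache (so results survive restarts and are shared across instances), but the keying idea is the same.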

Starting the application

wget -O storage/tripdata/yellow_tripdata_2020-01.parquet https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.parquet 
hostname -I
docker-compose build
docker-compose up

Note the IP address from above, navigate to it, and test the REST API.

http://{IP}/trip-dur/2020-01-01/00:00:00/2020-01-02/00:00:00

Routes documentation

Requirement: all metrics can be queried with a configurable start and end date.

All routes take a start and an end timestamp, specified with the template {metric}/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss

  • The first timestamp is the minimum time; the trip must have started after it
  • The second timestamp is the maximum time; the trip must have started before it
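The template above can be parsed with the standard library alone. A sketch with two hypothetical helpers, `parse_route` and `trip_selected` (not the app's actual parser); the boundary semantics (start inclusive, end exclusive) are an assumption that test 4 should pin down:

```python
from datetime import datetime

def parse_route(path: str):
    """Split '{metric}/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss'
    into (metric, start, end). Illustrative helper only."""
    metric, d1, t1, d2, t2 = path.strip("/").split("/")
    fmt = "%Y-%m-%d/%H:%M:%S"
    return (metric,
            datetime.strptime(f"{d1}/{t1}", fmt),
            datetime.strptime(f"{d2}/{t2}", fmt))

def trip_selected(pickup: datetime, start: datetime, end: datetime) -> bool:
    # Assumed boundary semantics: a trip is selected when it *starts*
    # inside [start, end).
    return start <= pickup < end

metric, start, end = parse_route("trip-dur/2020-01-01/00:00:00/2020-01-02/00:00:00")
assert metric == "trip-dur"
assert trip_selected(datetime(2020, 1, 1, 12, 0), start, end)
```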

Average, median trip length

(km and minutes). To get either the range (distance) or the duration, use the trip-range or trip-dur route:

trip-range/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss 
trip-dur/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss 
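The two metrics can be computed from the pickup/dropoff timestamps and the trip_distance column. A sketch over two hypothetical trips (units as per the spec above; the real endpoints would first apply the time-interval selection):

```python
import statistics
from datetime import datetime

# Two hypothetical trips: (pickup, dropoff, trip_distance).
trips = [
    (datetime(2020, 1, 1, 0, 5), datetime(2020, 1, 1, 0, 20), 3.0),
    (datetime(2020, 1, 1, 1, 0), datetime(2020, 1, 1, 1, 30), 8.0),
]

# trip-dur: duration in minutes, derived from the two timestamps.
durations_min = [(off - on).total_seconds() / 60 for on, off, _ in trips]
trip_dur = {"mean": statistics.mean(durations_min),
            "median": statistics.median(durations_min)}

# trip-range: distance statistics from trip_distance.
distances = [d for _, _, d in trips]
trip_range = {"mean": statistics.mean(distances),
              "median": statistics.median(distances)}

assert trip_dur == {"mean": 22.5, "median": 22.5}
assert trip_range == {"mean": 5.5, "median": 5.5}
```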

Billing statistics

(total, mean, median), computed from the total_amount variable.

bills/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss 
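The three billing statistics over total_amount are a straightforward aggregation. A sketch over hypothetical amounts (the real route would first select trips inside the queried interval):

```python
import statistics

# Hypothetical total_amount values for trips inside the queried interval.
amounts = [10.0, 15.0, 35.0]

bills = {
    "total": sum(amounts),
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
}
assert bills == {"total": 60.0, "mean": 20.0, "median": 15.0}
```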

Billing statistics by location

(total, mean, median), using the total_amount variable for trips starting at a given PULocationID or ending at a given DOLocationID.

bills-start-in/PULocationID/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss
bills-end-in/DOLocationID/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss
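The location sub-selection (reported as broken under "Known bugs") is a filter on the zone column applied before the aggregation. A sketch with hypothetical rows and helper names:

```python
import statistics

# Hypothetical rows: (PULocationID, DOLocationID, total_amount).
rows = [(132, 230, 52.0), (132, 48, 18.0), (7, 230, 12.5)]

def bill_stats(amounts):
    return {"total": sum(amounts),
            "mean": statistics.mean(amounts),
            "median": statistics.median(amounts)}

def bills_start_in(pu_id, rows):
    # Sub-select on the pickup zone first, then aggregate total_amount.
    return bill_stats([amt for pu, _, amt in rows if pu == pu_id])

def bills_end_in(do_id, rows):
    # Same aggregation, filtering on the dropoff zone instead.
    return bill_stats([amt for _, do, amt in rows if do == do_id])

assert bills_start_in(132, rows)["total"] == 70.0
assert bills_end_in(230, rows)["median"] == 32.25
```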

Known bugs / issues

  • The response time is around 1 second for most routes after removing the additional columns (is this good enough? there are no requirements in the spec)
  • With one dataset loaded, the application consumes ~500 MB of RAM for January 2020 (loading all datasets would scale this up accordingly)
  • bills-start-in and bills-end-in do not work; the sub-selection on PULocationID needs to be fixed

Design modifications

  • The data should be loaded into a database that can handle the entire dataset quickly, since calculating the median requires all the data to be available at the same time.
  • Using a standard database might also be a good solution: the dataset should not be too big for one, and it is easy to scale across nodes. The danger of using a cloud database such as Cosmos DB is that the price could become a concern if someone queries all the data.
  • It is also possible that, after removing all the columns that are not needed, it would be feasible to keep everything in memory. Right now we use 500 MB of RAM for one month; assuming the same amount of data for each month gives 6 GB per year, and for 10 years that is 60 GB.
  • Maybe removing the unused columns would solve the problem. We use the following columns:
    • trip_distance
    • total_amount
    • trip_dur
    • tpep_pickup_datetime
  • The dataset consists of 19 columns and we use only 4; multiplying the 60 GB for 10 years by 4/19 suggests we need ~12.6 GB of memory (a rough estimate, since columns differ in width)
  • Once the data is moved to a database, no additional performance work may be needed; if it is, the application itself can be scaled using Kubernetes, or provisioned on a serverless architecture
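Recomputing the memory estimate from the numbers above (500 MB per month, 19 columns of which 4 are used) gives roughly 12.6 GB for 10 years, under the simplifying assumption that all columns are the same width:

```python
# Back-of-envelope memory estimate from the figures in this section.
mb_per_month = 500
gb_per_year = mb_per_month * 12 / 1000   # 12 months -> 6.0 GB/year
gb_ten_years = gb_per_year * 10          # -> 60.0 GB for 10 years
gb_needed = gb_ten_years * 4 / 19        # keeping 4 of the 19 columns

assert gb_per_year == 6.0
assert round(gb_needed, 1) == 12.6
```

Since columns differ in width (timestamps vs small integers), measuring the actual footprint after dropping the unused columns would give a firmer number.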

yellow-proto's People

Contributors

buildcomplete

