The purpose is to create a simple API service that serves data from the dataset.
- Use the dataset yellow_tripdata_2020-01.csv from Amazon. The CSV data no longer seems to be available, so we use the Parquet format instead.
- Build an application to serve the API that can run locally, with all requirements properly specified in the package.
- The application should be built as if it were to be put in production; keep scalability and performance in mind.
The application has not yet been tested for correctness. The following tests must be made for each route:
Id | Test | Responsible |
---|---|---|
1 | Responses from all routes should be valid according to the specification when the interval contains no data (needs to be agreed with the customer; suggestion: a response of 0 or an HTTP 404 status) | CG |
2 | Check that the data type and unit are correct (minutes, km, USD) for all endpoints | CG |
3 | Check that results are correct with a single sample | VJ |
4 | Results are selected correctly according to the query terms (manually specify a dataset and verify the selection at the interval borders) | VJ |
5 | Results are correct with two samples (especially check the agreed definition of the median) | VJ |
6 | Results are loaded correctly when they come from multiple data files (for example, the median across data from January and February should give the agreed result) | CG |
7 | Performance of a single GET request: no requirement has been specified. What should it be? | CG |
8 | How many concurrent users will be on the system? | CG |
- How different will the queries be?
- Can the results be cached? For instance, if queries for plots always ask for the time range of a full month, the result can be cached.
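
If most queries do ask for the same time ranges, results could be memoised in the application process. The following is a minimal sketch using the standard library's `functools.lru_cache`; `compute_metric` is a hypothetical stand-in for the real dataset query, and the cache size is an assumption.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive path actually runs


def compute_metric(metric: str, start: str, end: str) -> float:
    """Hypothetical stand-in for the expensive dataset query."""
    CALLS["count"] += 1
    return float(len(metric) + len(start) + len(end))  # placeholder result


@lru_cache(maxsize=256)  # assumed size; tune to the number of distinct queries
def cached_metric(metric: str, start: str, end: str) -> float:
    # Arguments must be hashable, so timestamps are passed as strings.
    return compute_metric(metric, start, end)


if __name__ == "__main__":
    cached_metric("trip-dur", "2020-01-01 00:00:00", "2020-02-01 00:00:00")
    cached_metric("trip-dur", "2020-01-01 00:00:00", "2020-02-01 00:00:00")
    print(CALLS["count"])  # the second call is served from the cache
```

A cache like this lives inside one process, so it is not shared between workers and must be cleared (`cached_metric.cache_clear()`) whenever new data files are loaded.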
- Before starting the application, yellow_tripdata_2020-01.parquet must be copied to /storage/tripdata (it is not included here).
- Build and run the image
wget -O storage/tripdata/yellow_tripdata_2020-01.parquet https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.parquet
hostname -I
docker-compose build
docker-compose up
Note the IP address printed by hostname -I above, then navigate to that address and test the REST API.
http://{IP}/trip-dur/2020-01-01/00:00:00/2020-01-02/00:00:00
Requirements: all metrics can be queried with a configurable start and end date.
All routes take a start and an end, specified with the following template: {metric}/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss
- The first timestamp is the minimum time; the trip must have started after it.
- The second timestamp is the maximum time; the trip must have started before it.
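
The route template and the two bounds above can be sketched as follows. Whether the bounds are inclusive or exclusive is not stated here, so the sketch assumes an inclusive start and an exclusive end; `parse_interval` and `in_interval` are hypothetical helper names.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_interval(start_date, start_time, end_date, end_time):
    """Turn the four path segments into a (start, end) pair of datetimes."""
    start = datetime.strptime(f"{start_date} {start_time}", TIMESTAMP_FORMAT)
    end = datetime.strptime(f"{end_date} {end_time}", TIMESTAMP_FORMAT)
    if end <= start:
        raise ValueError("end must be later than start")
    return start, end


def in_interval(pickup, start, end):
    # Assumption: start is inclusive and end is exclusive.
    return start <= pickup < end


if __name__ == "__main__":
    start, end = parse_interval("2020-01-01", "00:00:00", "2020-01-02", "00:00:00")
    print(in_interval(datetime(2020, 1, 1, 12, 0, 0), start, end))
```

Rejecting malformed or reversed intervals early lets the route return a client error instead of an empty or misleading result.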
To get either the trip range (km) or the trip duration (minutes), use the trip-range or trip-dur route:
trip-range/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss
trip-dur/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss
Bills (total, mean, median) are computed from the total_amount variable:
bills/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss
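
A minimal sketch of the three bill statistics (total, mean, median), using only the standard library. `bill_stats` is a hypothetical helper, and returning `None` for an empty interval is an assumption, since the behaviour for intervals without data still has to be agreed.

```python
from statistics import mean, median


def bill_stats(total_amounts):
    """Return total, mean and median of the given total_amount values."""
    values = list(total_amounts)
    if not values:
        # Assumption: an empty interval yields no statistics.
        return None
    return {
        "total": sum(values),
        "mean": mean(values),
        "median": median(values),
    }


if __name__ == "__main__":
    print(bill_stats([10.0]))        # one sample: all three values coincide
    print(bill_stats([10.0, 20.0]))  # two samples: median is their midpoint
```

`statistics.median` averages the two middle values for an even number of samples, which is one possible definition to confirm when agreeing on how the median should behave.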
Bills (total, mean, median) are computed from the total_amount variable for trips starting at a given PULocationID or ending at a given DOLocationID:
bills-start-in/PULocationID/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss
bills-end-in/DOLocationID/YYYY-MM-DD/hh:mm:ss/YYYY-MM-DD/hh:mm:ss
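
The sub-selection by location could look like the sketch below. It works on plain dictionaries rather than the application's real data structures, the sample rows are made up, and `select_by_location` is a hypothetical helper, not the current implementation.

```python
def select_by_location(trips, column, location_id):
    """Keep the trips whose `column` equals `location_id`.

    `column` is "PULocationID" for bills-start-in and
    "DOLocationID" for bills-end-in.
    """
    if column not in ("PULocationID", "DOLocationID"):
        raise ValueError(f"unsupported column: {column}")
    # Path parameters arrive as strings, so compare as integers.
    wanted = int(location_id)
    return [trip for trip in trips if int(trip[column]) == wanted]


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    trips = [
        {"PULocationID": 132, "DOLocationID": 48, "total_amount": 52.3},
        {"PULocationID": 48, "DOLocationID": 132, "total_amount": 11.8},
    ]
    print(select_by_location(trips, "PULocationID", "132"))
```

Comparing as integers avoids a mismatch between a string path parameter and an integer column, which is one possible source of an empty selection.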
- The response time is around 1 second for most routes after removing the unused columns (is that good enough? There are no requirements in the spec).
- With one dataset loaded (January 2020), the application consumes ~500 MB of RAM (Loading all datasets would )
- bills-start-in and bills-end-in do not work; the sub-selection on PULocationID needs to be fixed.
- The data should be loaded into a database that can handle the entire dataset fast, since calculating the median requires all the data to be in memory at the same time.
- Using a standard database might also be a good solution: the data needed should not be too big, and it is easy to scale across nodes. The danger of using a cloud database such as Cosmos DB is that the price could become a concern if someone searches all the data.
- It may also be feasible to keep everything in memory once the unneeded columns are removed. One month currently uses 500 MB of RAM; assuming the same amount of data for each month, that is 6 GB per year and 60 GB for 10 years.
- Removing the unused columns might solve the problem. We use only the following columns:
  - trip_distance
  - total_amount
  - trip_dur
  - tpep_pickup_datetime
- The dataset consists of 19 columns and we use only 4. Multiplying the 60 GB for 10 years by 4/19, we seem to need only ~12.6 GB of memory.
- Once the data is moved to a database, no further performance work may be needed. If more is needed, the application itself can be scaled using Kubernetes or provisioned on a serverless architecture.
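
The memory estimate in the notes above can be reproduced with a few lines of arithmetic. The sketch assumes, as the notes do, that every month is the same size and that all 19 columns take equal space, which is unlikely to hold exactly; `estimate_gb` is a hypothetical helper.

```python
MB_PER_MONTH = 500        # measured for January 2020
MONTHS_PER_YEAR = 12
YEARS = 10
TOTAL_COLUMNS = 19
USED_COLUMNS = 4


def estimate_gb(years, used_columns=TOTAL_COLUMNS):
    """Estimated RAM in GB (1 GB = 1000 MB) for the given number of years."""
    full_gb = MB_PER_MONTH * MONTHS_PER_YEAR * years / 1000
    return full_gb * used_columns / TOTAL_COLUMNS


if __name__ == "__main__":
    print(estimate_gb(1))                              # all columns, one year
    print(estimate_gb(YEARS))                          # all columns, ten years
    print(round(estimate_gb(YEARS, USED_COLUMNS), 1))  # four columns, ten years
```

Measuring the actual footprint of the four columns on one month of data would give a more reliable figure than this proportional estimate, since column types differ in size.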