nyc_tlc's Introduction

New York City - Taxi & Limo Commission:
Yellow Taxi Trip Records

Objective

Create a scalable, easily maintainable solution that does the following:

  • Ingests the dataset (NYC Taxi and Limousine Commission yellow taxi trip records)
  • Summarizes:
    • Mean and median costs, prices, and passenger counts
    • Aggregated by payment type, year, and month
  • Outputs results in CSV or Parquet format
  • Includes source code and a README file explaining how to run the script and listing any other dependencies
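As a rough illustration of the summarization requirement, the sketch below groups rows by payment type, year, and month, then computes the mean and median of selected fields. It uses plain Python rather than the Spark code in the notebook, so it only shows the logic on a handful of rows. The `summarize` helper and the sample rows are hypothetical; the column names are the ones listed under "Columns of Interest".

```python
from collections import defaultdict
from statistics import mean, median


def summarize(trips, value_fields):
    """Group trips by (paymentType, puYear, puMonth); return mean/median per field."""
    groups = defaultdict(list)
    for trip in trips:
        key = (trip["paymentType"], trip["puYear"], trip["puMonth"])
        groups[key].append(trip)

    summary = {}
    for key, rows in groups.items():
        stats = {}
        for field in value_fields:
            # Skip missing values so they do not distort the statistics.
            values = [row[field] for row in rows if row.get(field) is not None]
            if values:
                stats[field] = {"mean": mean(values), "median": median(values)}
        summary[key] = stats
    return summary


# Made-up sample rows, for illustration only (not taken from the dataset).
sample_trips = [
    {"paymentType": "cash", "puYear": 2015, "puMonth": 1, "totalAmount": 10.0, "passengerCount": 1},
    {"paymentType": "cash", "puYear": 2015, "puMonth": 1, "totalAmount": 20.0, "passengerCount": 2},
    {"paymentType": "cash", "puYear": 2015, "puMonth": 1, "totalAmount": 60.0, "passengerCount": 1},
    {"paymentType": "credit", "puYear": 2015, "puMonth": 2, "totalAmount": 12.5, "passengerCount": 3},
]

result = summarize(sample_trips, ["totalAmount", "passengerCount"])
```

The mean and median can differ noticeably when a group contains a few large fares, which is one reason to report both.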

Approach

  1. Review NYC TLC Dataset Documentation:
    • Timeframe: 2009 to 2018
    • Format: Parquet
    • Size: ~50 GB, roughly 1.5B rows as of 2018
    • Azure Region: East US
    • The data dictionary is not 100% accurate
      • ex1: vendorID is listed twice, as string and as int. The description for both entries matches the int entry, but the field's data type in the table is string
      • ex2: paymentType has data type string, but its description reads "numeric code signifying how the passenger paid for the trip". The data is highly inconsistent (1, CAS, CSH, CASH, cash, etc.) and requires a transform
  2. Create a Microsoft account and sign in to the Azure Portal (new accounts come with $200 in free credits)
  3. Create & configure an Azure Databricks workspace instance
    • name: gray_matter
    • resource group: nyc_tlc
    • region: East US (chosen for affinity with dataset, also stored in East US)
  4. Launch workspace
  5. Create & configure Personal Compute Cluster
    • runtime version: 14.2 ML (includes Apache Spark 3.5.0, Scala 2.12)
    • node type: Standard_DS3_v2 (14 GB memory, 4 cores)
  6. Launch compute cluster
  7. Review data dictionary and document column requirements and notes (see below)
  8. Create Notebook (NYC_TLC)
    • Ingest data (using a modified Spark sample from the documentation)
    • Review relevant columns
    • Transform paymentType for consistency and proper grouping
    • Find mean and median using Spark SQL and PySpark
    • Filter to the year range given in the documentation (2009 to 2018)
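The paymentType transform and the year filter from step 8 might look roughly like the sketch below. It is plain Python rather than the notebook's Spark code, and the alias table is an assumption: the README only names `1`, `CAS`, `CSH`, `CASH`, and `cash` as examples of inconsistent values. The numeric codes (1 for credit card, 2 for cash) and the credit-card spellings are assumed here and should be checked against the actual data. The function names are hypothetical.

```python
# Hypothetical alias table: raw paymentType values mapped to canonical labels.
# Assumption: "1" means credit card and "2" means cash; verify against the data.
PAYMENT_TYPE_ALIASES = {
    "1": "credit", "CRD": "credit", "CRE": "credit", "CREDIT": "credit",
    "2": "cash", "CAS": "cash", "CSH": "cash", "CASH": "cash",
}

# Timeframe stated in the dataset documentation.
FIRST_YEAR, LAST_YEAR = 2009, 2018


def normalize_payment_type(raw):
    """Return a canonical label, or "unknown" for missing or unrecognized values."""
    if raw is None:
        return "unknown"
    # Trim whitespace and ignore case so "cash", " CASH " and "Cash" all match.
    return PAYMENT_TYPE_ALIASES.get(str(raw).strip().upper(), "unknown")


def in_documented_range(year):
    """True when the pickup year falls inside the documented timeframe."""
    return FIRST_YEAR <= year <= LAST_YEAR


def clean_trips(trips):
    """Drop out-of-range years and normalize paymentType on the remaining rows."""
    return [
        dict(trip, paymentType=normalize_payment_type(trip.get("paymentType")))
        for trip in trips
        if in_documented_range(trip["puYear"])
    ]
```

Unrecognized values fall into a single "unknown" bucket instead of being dropped, so they still show up in the grouped output and can be reviewed.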

Columns of Interest + notes:

  • payment type: paymentType - very messy/inconsistent field, requires transforms
  • passengers: passengerCount - # of passengers in the vehicle, driver-entered value
  • dates: puMonth, puYear
  • financial:
    • extra - $0.50 and $1 rush hour and overnight charges
    • fareAmount - time-and-distance fare calculated by the meter.
    • improvementSurcharge - $0.30 improvement surcharge assessed on trips at the flag drop.
      • The improvement surcharge began being levied in 2015
    • mtaTax - $0.50 MTA tax that is automatically triggered based on the metered rate in use
    • tipAmount - automatically populated for credit card tips.
      • Cash tips are not included in tipAmount
      • tipAmount not included in totalAmount
    • tollsAmount - Total amount of all tolls paid in trip
    • totalAmount - total amount charged to passengers. Does not include cash tips
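To show how the financial columns relate, the sketch below adds up the non-tip components of a trip. It assumes that fareAmount, extra, mtaTax, improvementSurcharge, and tollsAmount together make up the pre-tip charge; this has not been verified against the dataset. The helper name and the sample trips are hypothetical. Missing values, such as improvementSurcharge on trips before 2015, are treated as zero.

```python
# Columns assumed to make up the pre-tip charge (an assumption, not verified).
COMPONENT_FIELDS = (
    "fareAmount",
    "extra",
    "mtaTax",
    "improvementSurcharge",
    "tollsAmount",
)


def total_excluding_tips(trip):
    """Sum the non-tip components, treating missing or None values as 0."""
    total = sum((trip.get(field) or 0.0) for field in COMPONENT_FIELDS)
    # Round to cents to hide floating-point noise from the addition.
    return round(total, 2)


# Made-up trips for illustration; the 2014 trip predates the improvement surcharge.
trip_2014 = {"fareAmount": 12.0, "extra": 0.5, "mtaTax": 0.5,
             "improvementSurcharge": None, "tollsAmount": 0.0, "tipAmount": 2.0}
trip_2016 = {"fareAmount": 12.0, "extra": 0.5, "mtaTax": 0.5,
             "improvementSurcharge": 0.3, "tollsAmount": 5.54, "tipAmount": 3.0}

pre_tip_2014 = total_excluding_tips(trip_2014)
pre_tip_2016 = total_excluding_tips(trip_2016)
```

Comparing this value with totalAmount on real rows is a quick way to check how tips are reflected in the recorded total.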

Results

The final results are saved as a CSV file: nyc_tlc_output

I also created a quick bar chart showing the average total payment (not including tips) by payment type and date.

Contributors

  • bjenk1
