New York City Taxi Trip Duration

The project aims to predict the total ride duration of taxi trips in New York City. In cities like New York, where traffic is heavy and the distances between destinations are short, everyone wants to reach their destination as quickly as possible. The dependent variable in this project is "trip_duration", the duration of the trip in seconds. There are 10 independent variables (features) that we use to generate our predictions. We performed feature engineering, data exploration, time series analysis, model building and validation to build a low-latency system that predicts trip duration from the pickup and drop-off geographical coordinates.

image

Team Members (Team 5):

  • Shanun Randev
  • Anand Raj
  • Mowzli Sre
  • Bala krishna Reddy

DataSet Details:

  • Id: Unique identifier for each trip
  • Vendor_id: Code indicating the provider associated with the trip record
  • Pickup_datetime: Date and time when the meter was engaged
  • Dropoff_datetime: Date and time when the meter was disengaged
  • Passenger_count: Number of passengers in the vehicle
  • Pickup_longitude: Longitude where the meter was engaged
  • Pickup_latitude: Latitude where the meter was engaged
  • Dropoff_latitude: Latitude where the meter was disengaged
  • Dropoff_longitude: Longitude where the meter was disengaged
  • Store_and_fwd_flag: Flag indicating whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server
    • Y = store and forward
    • N = not a store and forward trip
  • Target Variable:
    • Trip Duration (seconds)

Data Analysis:

  • How can we compute the trip distance from the pickup and dropoff latitude and longitude, and what is the relationship between trip duration and distance?
  • On which days and at what times of day is the trip duration longest?
  • What are the top 4 locations with high demand for taxis in NYC?
  • How do we compute the speed of vehicles, and how is it related to the trip duration?

Observation on the Dataset

image

Cleaning and Feature Engineering

image

Data Cleaning and Preprocessing

Outlier Correction

  • We use the 1.5 × IQR rule to identify outliers in the dataset
  • The IQR (interquartile range) is a measure of statistical dispersion: the range between the first quartile (Q1) and the third quartile (Q3)
  • Outliers are observations that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR
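
As a rough illustration of the 1.5 × IQR rule described above, the following dependency-free Python sketch computes the two fences and filters a list of values. The function names `iqr_bounds` and `remove_outliers` and the sample data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; the project may compute quartiles differently.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: quartiles use the "inclusive" interpolation convention.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie within the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical sample: the value 100 lies far outside the bulk of the data.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(iqr_bounds(sample))
print(remove_outliers(sample))
```

Different quartile conventions can shift the fences slightly on small samples, so the exact set of flagged points may vary between tools.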

image

Cleaning of Passenger Counts

  • Practical considerations were applied: a taxi is unlikely to carry more than 6 passengers, and a trip with no passengers is not meaningful
  • All trips with 0 passengers or more than 6 passengers were removed from the dataset
  • Domain-based outlier detection of this kind is a crucial part of data cleaning for accurate modelling

Cleaning of trip durations

  • A practical upper threshold of 100 hours was chosen for trip duration
  • Trips exceeding this limit were considered outliers and were removed from the dataset
  • Trips lasting more than 100 hours are extreme outliers that can easily distort the accuracy of the model
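
The two domain rules above (passenger counts between 1 and 6, and durations of at most 100 hours) could be applied as a simple record filter. This is a minimal sketch: the helper `is_valid_trip`, the lowercase field names and the sample records are assumptions for illustration, not the project's actual code.

```python
MAX_PASSENGERS = 6
MAX_DURATION_SECONDS = 100 * 3600  # 100 hours, expressed in seconds


def is_valid_trip(trip):
    """Apply the passenger-count and trip-duration rules to one record."""
    passengers_ok = 1 <= trip["passenger_count"] <= MAX_PASSENGERS
    duration_ok = trip["trip_duration"] <= MAX_DURATION_SECONDS
    return passengers_ok and duration_ok


# Hypothetical records; only the two fields used by the rules are shown.
trips = [
    {"passenger_count": 1, "trip_duration": 455},
    {"passenger_count": 0, "trip_duration": 600},      # no passengers
    {"passenger_count": 7, "trip_duration": 900},      # too many passengers
    {"passenger_count": 2, "trip_duration": 400_000},  # longer than 100 hours
]
cleaned = [t for t in trips if is_valid_trip(t)]
print(len(cleaned))
```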

Haversine Formula

a = sin²(Δφ/2) + cos(φ₁) * cos(φ₂) * sin²(Δλ/2)
c = 2 * atan2(√a, √(1-a))
d = R * c

Where:

  • φ₁ and φ₂ are the latitudes of the two points in radians.
  • Δφ is the difference between the latitudes of the two points.
  • Δλ is the difference between the longitudes of the two points.
  • R is the radius of the Earth (mean radius = 6,371 kilometers).

The Haversine formula calculates the great-circle distance, which is the shortest distance over the Earth's surface between two points. The result is the distance between the two points along the surface of the sphere. This formula takes into account the curvature of the Earth and provides accurate distance calculations for geographic coordinates.
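
The formula above translates almost line for line into code. The sketch below assumes coordinates are given in decimal degrees and returns kilometers; the function name `haversine_km` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as stated above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_KM * c


# A quarter of the equator: should equal (pi / 2) * R.
print(haversine_km(0.0, 0.0, 0.0, 90.0))
```

Note that this is a straight-line distance over the sphere, so it will generally underestimate the distance actually driven along city streets.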

Cleaning of Trip Distance

  • A minimum meaningful trip distance of 1 meter was chosen
  • Any trip with a distance of less than 1 meter is considered an outlier
  • Other short distances such as 2, 3, 4, etc. were not clipped, because there is a high chance that the user cancelled the trip
  • As a result, only outliers with a very short distance and a very long duration were clipped

Computing Directional Angle

  • We calculate the compass bearing (direction) between two geographic points specified by their latitude and longitude coordinates.
  • The bearing is the angle, in degrees, measured clockwise from north. It is often used in navigation and geographical applications.
  • The resulting bearing is typically expressed as a value between 0 and 360 degrees, where 0 is north, 90 degrees is east, 180 degrees is south and 270 degrees is west
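
The text above does not give the bearing formula itself, so the sketch below uses the standard initial-bearing formula as an assumption: θ = atan2(sin Δλ · cos φ₂, cos φ₁ · sin φ₂ - sin φ₁ · cos φ₂ · cos Δλ), normalised to the range 0 to 360 degrees. The function name `bearing_degrees` is hypothetical.

```python
import math


def bearing_degrees(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_lambda = math.radians(lon2 - lon1)
    y = math.sin(d_lambda) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(d_lambda))
    theta = math.degrees(math.atan2(y, x))  # atan2 yields -180..180 degrees
    return (theta + 360.0) % 360.0          # shift into 0..360


# Heading due east and due west along the equator.
print(bearing_degrees(0.0, 0.0, 0.0, 1.0))
print(bearing_degrees(0.0, 0.0, 0.0, -1.0))
```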

Feature Engineering

  • Many new columns were derived from the existing columns in the dataset
  • Hour of the day, day of the trip, week, month, direction of the trip, avg. traffic speed and more were calculated
  • These derived variables tend to give better accuracy than the default variables in the dataset alone
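
As an illustration of the time-based features listed above, the sketch below derives the hour, weekday, week and month from a pickup timestamp. The function name `time_features`, the timestamp format and the use of ISO week numbers are assumptions; the project's own column names and conventions may differ.

```python
from datetime import datetime


def time_features(pickup_datetime):
    """Derive calendar features from a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    ts = datetime.strptime(pickup_datetime, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),      # Monday = 0 ... Sunday = 6
        "week": ts.isocalendar()[1],  # ISO week number
        "month": ts.month,
    }


# Hypothetical pickup time: Monday, 14 March 2016, 5:24 pm.
print(time_features("2016-03-14 17:24:55"))
```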

Density Plots after preprocessing of Data

image

Correlation of Trip Duration with other features

Trip duration has a strong positive correlation of 0.8 with trip_distance(km), which suggests that as the trip distance increases, the trip duration tends to increase as well

image

Relationship of Trip Distance with Trip Duration

The scatter plot clearly shows a strong positive correlation between trip duration and trip distance.

The longer the trip distance, the more likely the trip duration is to be long as well.

image

Time Series Analysis

  • A few peaks and drops are visible in the trend graph, and weather appears to account for part of the trend
  • At the end of January 2016, New York saw its highest snowfall of 27 inches, which accounted for the drop in taxi demand in early February
  • The trend shows another bump in mid-February, for which Valentine's Day could be responsible
  • A gradual rise in demand post summer may be attributed to people who travelled to different places (over longer distances) in the early fall

image

A regular pattern is observed in the seasonal graph, repeating 4 times every month, which clearly reflects weekend demand. We therefore analysed the weekly and daily trends throughout the dataset.

Daily Taxi Ride Trends

We observe a gradual increase in the number of rides from Monday to Saturday. Friday and Saturday are the busiest days in terms of ride demand, and there is a drop on Sunday compared to Saturday.

image

Taxi Ride Trends by Hour

There are fewer rides in the early morning hours, followed by a steady increase from the late morning. During the evening hours, from 6 pm to 10 pm, demand for taxis is high.

image

Weekly Taxi Ride patterns

Each day shows a similar pattern with 2 prominent peaks, one occurring at 8 to 9 am and another at 6 to 7 pm. The evening peak tends to last longer than the morning peak. There is a sharper decline in rides after the evening peak on weekdays compared to weekends.
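
Patterns like the daily and hourly ones described above come from counting rides per weekday and hour. A minimal sketch of that aggregation follows; the function name `rides_by_weekday_hour`, the timestamp format and the sample pickups are hypothetical.

```python
from collections import Counter
from datetime import datetime


def rides_by_weekday_hour(pickup_datetimes):
    """Count rides per (weekday, hour) pair; Monday = 0 ... Sunday = 6."""
    counts = Counter()
    for text in pickup_datetimes:
        ts = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
        counts[(ts.weekday(), ts.hour)] += 1
    return counts


# Hypothetical pickup timestamps.
pickups = [
    "2016-03-14 08:15:00",  # Monday, 8 am
    "2016-03-14 08:45:00",  # Monday, 8 am
    "2016-03-14 18:05:00",  # Monday, 6 pm
    "2016-03-19 23:30:00",  # Saturday, 11 pm
]
counts = rides_by_weekday_hour(pickups)
print(counts.most_common(1))
```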

image

Demand Prediction using K-means clustering

Demand Prediction

image

Determining ‘k’ for Clusters

  • The elbow method is used to determine the number of clusters for K-means clustering.
  • There is no significant improvement in the clustering beyond 10 clusters.
  • We therefore fit K-means with 10 clusters.
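
To make the elbow method concrete, the sketch below runs a small, dependency-free K-means (Lloyd's algorithm) for several values of k and records the within-cluster sum of squares (inertia). Everything here is illustrative: the function names, the toy points and the random initialisation are assumptions, and in practice a library implementation (for example scikit-learn's `KMeans`, which exposes `inertia_`) would typically be used instead.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_inertia(points, k, seed=0, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k random data points
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
print(inertias)
```

The "elbow" is the value of k after which the inertia stops dropping sharply; for this toy data the large drop happens between k = 1 and k = 2.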

image

Fitting and Identifying the Clusters

image

Top Locations (Displayed using Folium)

image

Avg. Speed and Trip Duration Analysis

Computing Avg. Traffic Speeds

  • Avg. Traffic Speed = distance traveled / time taken
  • The avg. traffic speed is relatively high in the morning
  • The traffic speed increases again in the evening, but less than in the morning
  • The avg. traffic speed was found to be higher during the weekends
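
The speed definition above amounts to a single division once the units are aligned. The sketch below assumes the distance is in kilometers and the duration in seconds, as in the target variable; the function name `average_speed_kmh` is hypothetical.

```python
def average_speed_kmh(distance_km, duration_seconds):
    """Average speed in km/h from a distance in km and a duration in seconds."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return distance_km / (duration_seconds / 3600.0)


# Hypothetical trip: 10 km covered in 30 minutes (1800 seconds).
print(average_speed_kmh(10.0, 1800))
```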

image

Linear Regression using CV

Performance Metric Used: “RMSE” (Root Mean Square Error)

  • Root Mean Squared Error on Training Set: 317.35
  • RMSE for Each Fold (cross-validation): [318.51, 317.32, 317.31, 316.88, 317.55]
  • Mean Root Mean Squared Error (cross-validation): 317.52
  • Root Mean Squared Error on Test Set: 317.75

The RMSE is around 317.75 seconds (~5 minutes), which is decent, but to further improve model performance we trained an XGBoost model.
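
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted durations, so it is expressed in seconds here. The sketch below shows the calculation on made-up numbers; the function name `rmse` and the sample values are hypothetical and unrelated to the results reported above.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical durations in seconds: actual vs. predicted.
actual = [600, 900, 1200]
predicted = [630, 870, 1260]
error_seconds = rmse(actual, predicted)
print(round(error_seconds, 2), "seconds =", round(error_seconds / 60, 2), "minutes")
```

Because the errors are squared before averaging, a few badly mispredicted trips raise the RMSE more than many small errors do.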

XGBoost

XGBoost without Hyper Parameter Tuning

  • Root Mean Squared Error on Training Set: 225.24
  • Root Mean Squared Error on Test Set: 227.98

XGBoost with Hyper Parameter Tuning

  • Root Mean Squared Error on Training Set: 221.09
  • Root Mean Squared Error on Test Set: 224.68

Best Hyperparameters: subsample: 0.9, n_estimators: 200, max_depth: 6, learning_rate: 0.2

Compare LR vs XGBoost

image

Conclusion

  • The Haversine formula was used to compute the trip distance between the pickup and dropoff coordinates
  • We found that the clusters around Manhattan had the most pickup points, including Harlem, Wall Street, Columbus Circle and Manhattan Community Board
  • We analysed the time series and found that the peak hours were from 6 pm to 10 pm and that taxi demand was highest on Friday and Saturday
  • The maximum avg. speed recorded was 24 km/h, with higher activity on the weekends
  • XGBoost with hyperparameter tuning turned out to be the best-fitting model for our dataset, with an error of ~3.7 minutes (test RMSE of 224.68 seconds) for a given pair of pickup and dropoff coordinates

Contributors

anandr07, shanunds, balakrishnareddy08, mowzlisre
