aidanstack / cycling_classification_project

Using 1.8 million rows of traffic collision data from NYC's Open Data initiative, we ran thousands of classification models to discover which variables made collisions lethal for cyclists. The data was cleaned with Pandas, then scikit-learn was used to instantiate and grid-search Decision Tree, Random Forest, Logistic Regression, and K-Nearest Neighbors models. Logistic Regression proved the most reliable and was tuned for recall, as fatal collisions made up only 0.04% of all collisions involving cyclists. The variables most associated with lethal cycling collisions were then extracted as actionable insights.


cycling_classification_project's Introduction

Repo Structure

.gitignore - Our dataset has nearly 2 million rows, making it far too large for GitHub. The data source is linked at the bottom of this README.

Main Analysis Notebook - Our entire data science process for this project.

README - YOU ARE HERE

Business Problem and Overview

Our proposed client is the City of New York itself. The city government has come under fire from constituents and cyclist advocacy groups for the dangerous conditions cyclists endure. The city's transportation leadership wants to know what conditions lead to deadly traffic collisions for New York's cyclists. The classification model in this instance is not the end product: once the city has a record of a collision, it already knows whether a cyclist died. The final product of this analysis is instead the set of patterns revealed by which parameters make for a classification model capable of predicting fatal cycling collisions. Our data comes from the NYC Open Data project.

1. Data Understanding

In this phase of exploratory data analysis, we look over our dataset to find out what cleaning and preprocessing steps it requires before we can apply scikit-learn models. This is also where we discover just how severe the imbalance of our dataset is, with only 0.5% of our data in the minority class (lethal collisions).
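A minimal sketch of that imbalance check, assuming the raw CSV export from NYC Open Data and its NUMBER OF CYCLIST INJURED / NUMBER OF CYCLIST KILLED columns (the notebook's exact steps may differ):

```python
import pandas as pd

# Load the raw NYC Open Data export (filename is illustrative).
df = pd.read_csv("Motor_Vehicle_Collisions_-_Crashes.csv", low_memory=False)

# Keep only collisions that involved a cyclist, then flag the fatal ones.
# Column names follow the NYC Open Data schema; adjust if they differ locally.
cyclist = df[(df["NUMBER OF CYCLIST INJURED"] > 0) | (df["NUMBER OF CYCLIST KILLED"] > 0)]
cyclist = cyclist.assign(fatal=(cyclist["NUMBER OF CYCLIST KILLED"] > 0).astype(int))

# Quantify the class imbalance that drives the rest of the modeling choices.
print(cyclist["fatal"].value_counts(normalize=True))
```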

2. Data Cleaning

This step is critical for our analysis, as the data straight out of the CSV from NYC Open Data is messy, filled with NaN values, and lacks certain columns we will want to use in our modeling process. The data cleaning is simply a matter of methodically going through each of the original columns and manipulating the data into the formats we need.
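The notebook works column by column; the sketch below only illustrates the flavor of those steps, assuming a handful of the dataset's columns (CRASH TIME, BOROUGH, ZIP CODE, VEHICLE TYPE CODE 1) and a hypothetical clean_collisions helper:

```python
import pandas as pd

def clean_collisions(cyclist: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass; column names follow the NYC Open Data schema."""
    out = cyclist.copy()

    # Derive an hour-of-day column from the crash time string (e.g. "17:30").
    out["HOUR"] = pd.to_datetime(out["CRASH TIME"], format="%H:%M", errors="coerce").dt.hour

    # Normalize the categorical text fields and replace missing values with a sentinel.
    for col in ["BOROUGH", "ZIP CODE", "VEHICLE TYPE CODE 1"]:
        out[col] = out[col].astype(str).str.strip().str.upper().replace("NAN", "UNKNOWN")

    # Drop rows missing the count the target is built from.
    out = out.dropna(subset=["NUMBER OF CYCLIST KILLED"])
    return out

cleaned = clean_collisions(cyclist)
```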

3. Preprocessing

To avoid any possible data leakage that could hurt the validity of our final model, we performed two train-test splits. The first separated our data into training data and holdout data; the second split the training data again into training and test sets, so that while iterating on our models we could evaluate each iteration on the test set. We then set up a Pipeline containing all of our preprocessing steps, for consistency and ease of use later on at final model evaluation time.
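A sketch of the double split and the preprocessing setup, continuing from the cleaned frame above; the column lists and split sizes are illustrative, not the project's exact choices:

```python
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Target and features from the cleaned frame built above.
y = cleaned["fatal"]
X = cleaned.drop(columns=["fatal"])

# First split: carve off a holdout set that is only touched at final evaluation.
X_train_full, X_holdout, y_train_full, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Second split: train/test sets used while iterating on candidate models.
X_train, X_test, y_train, y_test = train_test_split(
    X_train_full, y_train_full, test_size=0.25, stratify=y_train_full, random_state=42
)

# Preprocessing steps bundled once so they are applied identically everywhere.
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["BOROUGH", "ZIP CODE", "VEHICLE TYPE CODE 1"]),
    ("num", StandardScaler(), ["HOUR"]),
])
```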

4. Model Type Exploration

We knew that our final model was never going to perform especially well, due simply to the extreme imbalance in our dataset. However, we still wanted the best possible model, so we took the kitchen-sink approach and ran all kinds of classification algorithms through grid searches, eventually trying thousands of models. In the end, however, logistic regression looked the most promising.
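A condensed sketch of that kitchen-sink search, reusing the preprocessor and the inner train/test split from above; the grids shown are deliberately tiny stand-ins for the thousands of models actually tried:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Candidate model families with small illustrative grids.
candidates = {
    "decision_tree": (DecisionTreeClassifier(class_weight="balanced"),
                      {"clf__max_depth": [3, 5, 10, None]}),
    "random_forest": (RandomForestClassifier(class_weight="balanced"),
                      {"clf__n_estimators": [100, 300]}),
    "log_reg": (LogisticRegression(class_weight="balanced", max_iter=1000),
                {"clf__C": [0.1, 1, 10]}),
    "knn": (KNeighborsClassifier(), {"clf__n_neighbors": [5, 15, 51]}),
}

for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("prep", preprocessor), ("clf", estimator)])
    # Score on recall, since missing a fatal collision is the costly error.
    search = GridSearchCV(pipe, grid, scoring="recall", cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    print(name, search.best_params_, search.score(X_test, y_test))
```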

5. Modeling Iterations

Once we knew that Logistic Regression was our classifier of choice, we ran an even more extensive grid search until we had the best model we could make.
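The final logistic regression search might look roughly like this, again scored on recall; the grid values are illustrative, not the exact ones used:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

logreg_pipe = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=5000)),
])

# Illustrative grid; the project's actual search was broader.
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10, 100],
    "clf__penalty": ["l1", "l2"],
    "clf__solver": ["liblinear", "saga"],
}

final_search = GridSearchCV(logreg_pipe, param_grid, scoring="recall", cv=5, n_jobs=-1)
final_search.fit(X_train, y_train)
best_model = final_search.best_estimator_
```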

6. Model Evaluation and Feature Analysis

With our final model in hand, we could assemble it into a pipeline with our preprocessor, refit it on the full training data, and evaluate it on our holdout data. Our model performed as well as we could have hoped, with an accuracy of 72% and a recall of 68.75%. This is exactly what we expected, as in this business case false positives are far more acceptable than false negatives (missed fatal collisions). The false positives point to collisions whose input conditions the model judges more likely to be lethal, and which may have been non-fatal only by chance. False positives are also more tolerable because the constituents currently criticizing city leadership are unlikely to object if some streets are made safer than strictly necessary.
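Roughly, the final evaluation and feature read-out look like this; the coefficient inspection assumes a recent scikit-learn with get_feature_names_out, and continues from the sketches above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Refit the winning pipeline on everything outside the holdout, then score the
# holdout exactly once.
best_model.fit(X_train_full, y_train_full)
holdout_preds = best_model.predict(X_holdout)

print("accuracy:", accuracy_score(y_holdout, holdout_preds))
print("recall:  ", recall_score(y_holdout, holdout_preds))
print(confusion_matrix(y_holdout, holdout_preds))

# Map logistic regression coefficients back to feature names to see which
# conditions push a collision toward the "fatal" class.
feature_names = best_model.named_steps["prep"].get_feature_names_out()
coefficients = best_model.named_steps["clf"].coef_[0]
top_features = sorted(zip(feature_names, coefficients), key=lambda t: t[1], reverse=True)[:10]
print(top_features)
```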

7. Key Takeaways and Next Steps

  1. Large Trucks and Buses. One of the most effective changes the city could enact would be reducing the number of cyclists killed by large vehicles, whether through changes to bike lane infrastructure, restrictions on which streets large vehicles can drive on, cyclist safety training programs, or special training for city bus drivers and holders of commercial driver's licenses in NYC.
  2. Zipcode. As the second strongest indicator of lethality, zipcode is an aspect of crashes that cannot be ignored by the city. Luckily, this indicator is highly informative for the city in terms of where geographically to focus its efforts.
  3. Brooklyn. With its huge number of cyclists, Brooklyn is another key piece of the geographical puzzle when it comes to reducing cycling fatalities.
  4. Time of Day. The city could do a lot by cracking down on drunk and reckless driving in the late night and early morning, as well as by running campaigns urging cyclists to wear more visible clothing and use lights after sundown. More traffic control officers during rush hour could also make a large difference.

Possible Next Steps

  1. Fill in NaN values in Borough and Zipcode using latitude and longitude data.
  2. Build an intersection feature using Street and Cross Street that finds the city's most dangerous intersections (see the sketch after this list).
  3. Look at bicycles vs E-Bikes in terms of collision lethality.
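For next step 2, a hypothetical rank_intersections helper could look like the following, assuming the dataset's ON STREET NAME and CROSS STREET NAME columns and the fatal flag built in the earlier sketches:

```python
import pandas as pd

def rank_intersections(cleaned: pd.DataFrame) -> pd.DataFrame:
    """Rank intersections by cyclist fatalities (columns per the NYC Open Data schema)."""
    streets = cleaned["ON STREET NAME"].astype(str).str.strip().str.upper()
    crosses = cleaned["CROSS STREET NAME"].astype(str).str.strip().str.upper()
    return (
        cleaned.assign(intersection=streets + " & " + crosses)
        .groupby("intersection")["fatal"]
        .agg(fatal_collisions="sum", total_collisions="count")
        .sort_values("fatal_collisions", ascending=False)
    )
```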

Presentation Link: https://www.canva.com/design/DAEuDWwpohA/1pH6i6Tejg5N8YkQkCM45Q/view?utm_content=DAEuDWwpohA&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton

NYC Open Data Motor Vehicle Collisions: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/data
