
san_francisco_crime_resolution_model's Introduction

San Francisco Crime Data

Collaborators:

Name github.com
Betty Zhou @bettybhzhou
Ian Flores @ian-flores

Project Overview

Objective

The goal of this project is to classify the San Francisco crime data with a decision tree to predict the resolution of a crime instance. The resolution of a crime instance can be either "processed" or "non-processed", where processed indicates that a crime instance resulted in a subject being processed into the justice system. The following table outlines which crime resolutions were classified as processed or non-processed (a small labelling sketch in Python follows the table):

| Non-Processed                          | Processed                               |
|----------------------------------------|-----------------------------------------|
| NONE                                   | ARREST, BOOKED                          |
| CLEARED-CONTACT JUVENILE FOR MORE INFO | ARREST, CITED                           |
| UNFOUNDED                              | NOT PROSECUTED                          |
| JUVENILE ADMONISHED                    | PSYCHOPATHIC CASE                       |
| EXCEPTIONAL CLEARANCE                  | JUVENILE CITED                          |
| JUVENILE DIVERTED                      | JUVENILE BOOKED                         |
|                                        | LOCATED                                 |
|                                        | PROSECUTED BY OUTSIDE AGENCY            |
|                                        | COMPLAINANT REFUSES TO PROSECUTE        |
|                                        | DISTRICT ATTORNEY REFUSES TO PROSECUTE  |
|                                        | PROSECUTED FOR LESSER OFFENSE           |
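
As a rough illustration (not the project's actual code), the table above could be turned into a binary target in Python along these lines; the DataFrame and the "Resolution" column name are assumptions about the raw data:

# Minimal labelling sketch; the "Resolution" column name is an assumption
# about the raw export and may differ in practice.
PROCESSED = {
    "ARREST, BOOKED", "ARREST, CITED", "NOT PROSECUTED", "PSYCHOPATHIC CASE",
    "JUVENILE CITED", "JUVENILE BOOKED", "LOCATED", "PROSECUTED BY OUTSIDE AGENCY",
    "COMPLAINANT REFUSES TO PROSECUTE", "DISTRICT ATTORNEY REFUSES TO PROSECUTE",
    "PROSECUTED FOR LESSER OFFENSE",
}

def label_resolution(resolution):
    """Return 1 if the raw resolution counts as processed, otherwise 0."""
    return int(str(resolution).strip().upper() in PROCESSED)

# Example usage (the DataFrame and "Resolution" column name are assumed):
# crime_df["processed"] = crime_df["Resolution"].apply(label_resolution)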

This project will address the following predictive question:
What are the strongest predictors for whether a crime instance in San Francisco resulted in someone being "processed" or "not processed" into the justice system?

Dataset

The dataset contains San Francisco crime data from 2003 to May 2018 with each crime instance having the following features: incident ID, category of crime, description of crime, day of week of incident, date of incident, time of incident, police district, resolution of incident, address, longitude, latitude and location of incident.

The dataset can be found at the following link:

The dataset can be loaded using this R script or this Python script.
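
As a rough sketch of loading and previewing the data in Python (the file path below is a placeholder, not the project's actual path, and the "Resolution" column name is an assumption about the raw export):

# Preview sketch; "data/san_francisco_raw.csv" is a placeholder path.
import pandas as pd

crime_df = pd.read_csv("data/san_francisco_raw.csv")
print(crime_df.shape)    # number of crime instances and features
print(crime_df.head())   # first few rows of the dataset
print(crime_df["Resolution"].value_counts())  # how often each resolution occurs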

Analysis and Results

The final report of the project can be found here.

Analysis Pipeline

Usage with Docker:

To run this analysis using Docker, execute the following steps:

  1. Clone this repository

  2. Pull the Docker image from Docker Hub using the following command:

    docker pull bettybhz/san_francisco_crime_resolution_model
    
  3. Navigate to the root of this project on your computer using the command line.

  4. Execute the following command, replacing <PATH_ON_YOUR_COMPUTER> with the absolute path to the root of this project on your computer:

    docker run --rm -v <PATH_ON_YOUR_COMPUTER>:/home/San_Francisco_Crime_Resolution_Model bettybhz/san_francisco_crime_resolution_model make -C '/home/San_Francisco_Crime_Resolution_Model' all
    

    Execute the following command to clean up the analysis:

    docker run --rm -v <PATH_ON_YOUR_COMPUTER>:/home/San_Francisco_Crime_Resolution_Model bettybhz/san_francisco_crime_resolution_model make -C '/home/San_Francisco_Crime_Resolution_Model' clean
    
    

Usage from command line:

  1. Clone this repository

  2. From the root of the project, run the following commands:

python src/01_clean_data.py 1000 data/san_francisco_clean.csv
python src/02_EDA.py data/san_francisco_clean.csv results/figures/
python src/03_feature_engineering.py data/san_francisco_clean.csv data/san_francisco_features.csv
python src/04_decison_tree.py data/san_francisco_features.csv data/feature_results.csv
Rscript src/03_Exploratory_SF_map.R data/san_francisco_clean.csv results/figures/
Rscript -e "rmarkdown::render('docs/san_francisco_report.Rmd')"

Or, run all scripts from the root using the following command:

make all

To clean up analysis, run:

make clean

Dependencies:

  • R (v 3.5.1) & R libraries

    • rmarkdown v1.10
    • knitr v1.20
    • tidyverse v1.2.1
    • ggmap v2.6.1
    • maps v3.3.0
  • Python (v 3.6.5) & Python libraries:

    • matplotlib v2.2.2
    • numpy v1.14.3
    • seaborn v0.9.0
    • pandas v0.23.0
    • argparse v1.1
    • sklearn v0.19.1
    • pendulum v2.0.4


san_francisco_crime_resolution_model's Issues

adding to report analysis

Hi Ian,

I took another look at the report. I was wondering if you could add to the analysis that we optimized max_depth and min_samples_split, and that we balanced the classifier (based on the code below)?

# Tune tree depth and minimum split size with 10-fold cross-validation,
# using balanced class weights to account for the class imbalance.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth': np.arange(1, 30, 5),
              'min_samples_split': np.arange(2, 400, 20)}

tree = DecisionTreeClassifier(class_weight='balanced')
clf = GridSearchCV(tree, parameters, cv=10)
clf.fit(X_train, y_train)

I missed it last time, but I think it's important to specify in our analysis.

Thank you,
Betty
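
For reference, once a grid search like the one above has been fit, the tuned values could be surfaced in the report with a short snippet along these lines (a sketch only; X_test and y_test are assumed to come from the same train/test split as X_train and y_train):

# Sketch: report the tuned hyperparameters and held-out accuracy.
print(clf.best_params_)           # the selected max_depth and min_samples_split
print(clf.best_score_)            # mean cross-validated accuracy of the best model
print(clf.score(X_test, y_test))  # accuracy on the held-out test set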

Feedback from Peer Review

These are the suggestions provided by peers on our project:

  1. Convert the feature importance table to a bar graph to provide better visualization and a clearer summary of results (a plotting sketch follows this list).

  2. Include details about the analysis, such as the hyperparameters used for building the classifier (i.e. k for cross-validation, the min_samples_split value, and the max_depth used).

  3. Add to the future directions how our decision tree classifier would compare to a Naive Bayes classifier.
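
As a rough sketch of suggestion 1 (not the project's actual plotting code), feature importances from a fitted decision tree could be drawn as a bar graph like this; fitted_tree, feature_names, and the output path are placeholder names:

# Sketch: plot decision-tree feature importances as a horizontal bar graph.
# `fitted_tree` and `feature_names` are placeholders for the fitted model and
# its feature names; the actual variable names in the project scripts may differ.
import matplotlib.pyplot as plt
import pandas as pd

importances = pd.Series(fitted_tree.feature_importances_, index=feature_names)
top_importances = importances.sort_values(ascending=False).head(15)

plt.figure(figsize=(6, 4))
top_importances.plot(kind="barh")
plt.xlabel("Feature importance")
plt.title("Top predictors of crime resolution")
plt.tight_layout()
plt.savefig("results/figures/feature_importance.png")  # placeholder output path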

Proposal feedback

Mechanics

  • The link in your proposal repo only redirects to the release page of the project repo. You can also create a direct link to v1.0 of the repo with https://github.com/UBC-MDS/DSCI_522_SF_crime/tree/v0.1.

  • Would be nice to show in README that the data importing script does run and maybe print the first few rows of the data.

Reasoning

  • Maybe this wasn’t covered at the time of writing your proposal, but you might want to mention how you are going to train and tune your decision tree, and how you will transform the predictors, in addition to what packages you will use. Since a good portion of the predictors are categorical, are you going to use all classes as is? Maybe some classes for these predictors will be quite rare as well (one possible encoding approach is sketched below).
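
One common way to handle rare levels in categorical predictors is to lump infrequent categories together before one-hot encoding. A rough sketch, assuming a crime_df DataFrame; the column names and the 1% threshold are illustrative assumptions:

# Sketch: collapse rare categories, then one-hot encode categorical predictors.
# Column names ("Category", "PdDistrict", "DayOfWeek") are assumptions.
import pandas as pd

counts = crime_df["Category"].value_counts(normalize=True)
rare = counts[counts < 0.01].index                      # categories under 1% of rows
crime_df["Category"] = crime_df["Category"].where(
    ~crime_df["Category"].isin(rare), other="OTHER")    # collapse rare levels
X = pd.get_dummies(crime_df[["Category", "PdDistrict", "DayOfWeek"]])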

Other comments

  • Might be interesting to plot the incidents on a map of SF as part of an initial data exploration. With 2 million data points the map might look very busy, though; you might need to subsample or make a heat map instead (see the sketch below).
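
A rough sketch of the subsample-and-heat-map idea, assuming a crime_df DataFrame; the column names "X" (longitude) and "Y" (latitude) are assumptions about the raw export:

# Sketch: subsample the incidents and draw a simple density map of SF.
import matplotlib.pyplot as plt

sample = crime_df.sample(n=50_000, random_state=522)   # subsample the ~2M rows
plt.hexbin(sample["X"], sample["Y"], gridsize=80, cmap="viridis", mincnt=1)
plt.colorbar(label="Incident count")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Density of reported crime incidents in San Francisco")
plt.savefig("results/figures/incident_heatmap.png")    # placeholder output path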

Define non-processed?

It is not clear in the overview what non-processed means. Maybe also provide some examples of what counts as processing, to help the reader understand what problem you are trying to solve.

Clustering the data into arrest and not arrest

Processed:

  • arrest, booked
  • arrest, cited
  • prosecuted by outside agency
  • juvenile cited
  • complainant refused to prosecute
  • district attorney refused to prosecute
  • not prosecuted
  • juvenile booked
  • psychopathic case
  • located
  • prosecuted for lesser offense

Not processed

  • none
  • cleared-contact juvenile for more info
  • juvenile diverted
  • juvenile admonished
  • exceptional clearance
  • unfounded

Milestone 1 feedback

Mechanics

  • Might want to add version number for the packages in the dependency section.
  • Might want to add a link to the final report in README.
  • Might want to explain the usage of the scripts a little more. Say, 01_clean_data.py has three arguments.

Quality

  • In the documentation at the top of a Python or R script, you can explain what the arguments are in addition to showing a usage example (see the sketch below).
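
For example, the top of a cleaning script could document its arguments roughly as follows (a hypothetical sketch; the actual arguments of 01_clean_data.py may differ):

# Hypothetical docstring/argparse sketch for the top of a script such as
# src/01_clean_data.py; the real argument names and count may differ.
"""Clean the raw San Francisco crime data.

Usage: python src/01_clean_data.py <sample_size> <output_path>

Arguments:
  sample_size  number of rows to sample from the raw data (e.g. 1000)
  output_path  path to write the cleaned CSV (e.g. data/san_francisco_clean.csv)
"""
import argparse

parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("sample_size", type=int, help="number of rows to sample")
parser.add_argument("output_path", help="path to write the cleaned CSV")
args = parser.parse_args()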

Report

  • Would be nice to link the description of variables in Section 1.0 to the variable names in subsequent sections. Say, there are x and y in Table 2, are they longitude and latitude?
  • Also, I am not 100% sure what stolen, building, and vehicle are when I just look at the table. Are they among the 50 words extracted? Maybe list these 50 words somewhere?
  • How did you choose the depth of your decision tree?
  • How did you compute the testing accuracy in section 4.0?
  • Style-wise, would be nice to replace the 'snake_case' with the usual case throughout.

Make figures smaller

Figures should be as small as possible (yet still clearly readable and understandable), so that you can easily read the figure captions and surrounding text while looking at the figure. In the case of the figures in your report, the text size in the figures is great, but the plot areas and the bars/boxes could be much smaller. This is really only a problem for Figure 1 in your report.
