
san_francisco_crime_resolution_model's Introduction

San Francisco Crime Data

Collaborators:

Name github.com
Betty Zhou @bettybhzhou
Ian Flores @ian-flores

Project Overview

Objective

The goal of this project is to classify the San Francisco crime data with a decision tree to predict the resolution of a crime instance. The resolution of a crime instance can be either "processed" or "non-processed", where processed indicates that a crime instance resulted in a subject being processed into the justice system. The following table outlines which crime resolutions were classified as processed or non-processed (a small labelling sketch in Python follows the table):

| Non-Processed                          | Processed                               |
|----------------------------------------|-----------------------------------------|
| NONE                                   | ARREST, BOOKED                          |
| CLEARED-CONTACT JUVENILE FOR MORE INFO | ARREST, CITED                           |
| UNFOUNDED                              | NOT PROSECUTED                          |
| JUVENILE ADMONISHED                    | PSYCHOPATHIC CASE                       |
| EXCEPTIONAL CLEARANCE                  | JUVENILE CITED                          |
| JUVENILE DIVERTED                      | JUVENILE BOOKED                         |
|                                        | LOCATED                                 |
|                                        | PROSECUTED BY OUTSIDE AGENCY            |
|                                        | COMPLAINANT REFUSES TO PROSECUTE        |
|                                        | DISTRICT ATTORNEY REFUSES TO PROSECUTE  |
|                                        | PROSECUTED FOR LESSER OFFENSE           |
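
As a rough illustration (not the project's actual code), the table above could be turned into a binary target in Python along these lines; the DataFrame and the "Resolution" column name are assumptions about the raw data:

# Minimal labelling sketch; the "Resolution" column name is an assumption
# about the raw export and may differ in practice.
PROCESSED = {
    "ARREST, BOOKED", "ARREST, CITED", "NOT PROSECUTED", "PSYCHOPATHIC CASE",
    "JUVENILE CITED", "JUVENILE BOOKED", "LOCATED", "PROSECUTED BY OUTSIDE AGENCY",
    "COMPLAINANT REFUSES TO PROSECUTE", "DISTRICT ATTORNEY REFUSES TO PROSECUTE",
    "PROSECUTED FOR LESSER OFFENSE",
}

def label_resolution(resolution):
    """Return 1 if the raw resolution counts as processed, otherwise 0."""
    return int(str(resolution).strip().upper() in PROCESSED)

# Example usage (the DataFrame and "Resolution" column name are assumed):
# crime_df["processed"] = crime_df["Resolution"].apply(label_resolution)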

This project will address the following predictive question:
What are the strongest predictors for whether a crime instance in San Francisco resulted in someone being "processed" or "not processed" into the justice system?

Dataset

The dataset contains San Francisco crime data from 2003 to May 2018 with each crime instance having the following features: incident ID, category of crime, description of crime, day of week of incident, date of incident, time of incident, police district, resolution of incident, address, longitude, latitude and location of incident.

The dataset can be found at the following link:

The dataset can be loaded using this R script or this Python script.
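
As a rough sketch of loading and previewing the data in Python (the file path below is a placeholder, not the project's actual path, and the "Resolution" column name is an assumption about the raw export):

# Preview sketch; "data/san_francisco_raw.csv" is a placeholder path.
import pandas as pd

crime_df = pd.read_csv("data/san_francisco_raw.csv")
print(crime_df.shape)    # number of crime instances and features
print(crime_df.head())   # first few rows of the dataset
print(crime_df["Resolution"].value_counts())  # how often each resolution occurs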

Analysis and Results

The final report of the project can be found here.

Analysis Pipeline

Usage with Docker:

To run this analysis using Docker, execute the following steps:

  1. Clone this repository

  2. Pull the Docker image from Docker Hub using the following command:

    docker pull bettybhz/san_francisco_crime_resolution_model
    
  3. Navigate to the root of this project on your computer using the command line.

  4. Execute the following command, replacing <PATH_ON_YOUR_COMPUTER> with the absolute path to the root of this project on your computer:

    docker run --rm -v <PATH_ON_YOUR_COMPUTER>:/home/San_Francisco_Crime_Resolution_Model bettybhz/san_francisco_crime_resolution_model make -C '/home/San_Francisco_Crime_Resolution_Model' all
    

    Execute the following command to clean up the analysis:

    docker run --rm -v <PATH_ON_YOUR_COMPUTER>:/home/San_Francisco_Crime_Resolution_Model bettybhz/san_francisco_crime_resolution_model make -C '/home/San_Francisco_Crime_Resolution_Model' clean
    
    

Usage from command line:

  1. Clone this repository

  2. From the root of the project, run the following commands:

python src/01_clean_data.py 1000 data/san_francisco_clean.csv
python src/02_EDA.py data/san_francisco_clean.csv results/figures/
python src/03_feature_engineering.py data/san_francisco_clean.csv data/san_francisco_features.csv
python src/04_decison_tree.py data/san_francisco_features.csv data/feature_results.csv
Rscript src/03_Exploratory_SF_map.R data/san_francisco_clean.csv results/figures/
Rscript -e "rmarkdown::render('docs/san_francisco_report.Rmd')"

Or, run all scripts from the root using the following command:

make all

To clean up analysis, run:

make clean

Dependencies:

  • R (v 3.5.1) & R libraries

    • rmarkdown v1.10
    • knitr v1.20
    • tidyverse v1.2.1
    • ggmap v2.6.1
    • maps v3.3.0
  • Python (v 3.6.5) & Python libraries:

    • matplotlib v2.2.2
    • numpy v1.14.3
    • seaborn v0.9.0
    • pandas v0.23.0
    • argparse v1.1
    • sklearn v0.19.1
    • pendulum v2.0.4


san_francisco_crime_resolution_model's Issues

adding to report analysis

Hi Ian,

I took another look at the report. I was wondering if you could add to the analysis that we optimized max_depth and min_samples_split, and that we balanced the classifier (based on the code below)?

# Tune tree depth and minimum split size with 10-fold cross-validation,
# using balanced class weights to account for the class imbalance.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth': np.arange(1, 30, 5),
              'min_samples_split': np.arange(2, 400, 20)}

tree = DecisionTreeClassifier(class_weight='balanced')
clf = GridSearchCV(tree, parameters, cv=10)
clf.fit(X_train, y_train)

I missed it last time, but I think it's important to specify in our analysis.

Thank you,
Betty
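
For reference, once a grid search like the one above has been fit, the tuned values could be surfaced in the report with a short snippet along these lines (a sketch only; X_test and y_test are assumed to come from the same train/test split as X_train and y_train):

# Sketch: report the tuned hyperparameters and held-out accuracy.
print(clf.best_params_)           # the selected max_depth and min_samples_split
print(clf.best_score_)            # mean cross-validated accuracy of the best model
print(clf.score(X_test, y_test))  # accuracy on the held-out test set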

Feedback from Peer Review

These are the suggestions provided by peers on our project:

  1. Convert the feature importance table to a bar graph to provide better visualization and a clearer summary of results (a plotting sketch follows this list).

  2. Include details about the analysis, such as the hyperparameters used for building the classifier (i.e. k for cross-validation, the min_samples_split value, and the max_depth used).

  3. Add to the future directions how our decision tree classifier would compare to a Naive Bayes classifier.
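
As a rough sketch of suggestion 1 (not the project's actual plotting code), feature importances from a fitted decision tree could be drawn as a bar graph like this; fitted_tree, feature_names, and the output path are placeholder names:

# Sketch: plot decision-tree feature importances as a horizontal bar graph.
# `fitted_tree` and `feature_names` are placeholders for the fitted model and
# its feature names; the actual variable names in the project scripts may differ.
import matplotlib.pyplot as plt
import pandas as pd

importances = pd.Series(fitted_tree.feature_importances_, index=feature_names)
top_importances = importances.sort_values(ascending=False).head(15)

plt.figure(figsize=(6, 4))
top_importances.plot(kind="barh")
plt.xlabel("Feature importance")
plt.title("Top predictors of crime resolution")
plt.tight_layout()
plt.savefig("results/figures/feature_importance.png")  # placeholder output path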

Proposal feedback

Mechanics

  • The link in your proposal repo only redirects to the release page of the project repo. You can also create a direct link to v1.0 of the repo with https://github.com/UBC-MDS/DSCI_522_SF_crime/tree/v0.1.

  • Would be nice to show in README that the data importing script does run and maybe print the first few rows of the data.

Reasoning

  • Maybe this wasn’t covered at the time of writing your proposal, but you might want to mention how you are going to train and tune your decision tree, and how you will transform the predictors, in addition to what packages you will use. Since a good portion of the predictors are categorical, are you going to use all classes as is? Maybe some classes for these predictors will be quite rare as well (one possible encoding approach is sketched below).
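
One common way to handle rare levels in categorical predictors is to lump infrequent categories together before one-hot encoding. A rough sketch, assuming a crime_df DataFrame; the column names and the 1% threshold are illustrative assumptions:

# Sketch: collapse rare categories, then one-hot encode categorical predictors.
# Column names ("Category", "PdDistrict", "DayOfWeek") are assumptions.
import pandas as pd

counts = crime_df["Category"].value_counts(normalize=True)
rare = counts[counts < 0.01].index                      # categories under 1% of rows
crime_df["Category"] = crime_df["Category"].where(
    ~crime_df["Category"].isin(rare), other="OTHER")    # collapse rare levels
X = pd.get_dummies(crime_df[["Category", "PdDistrict", "DayOfWeek"]])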

Other comments

  • Might be interesting to plot the incidents on a map of SF as part of an initial data exploration. With 2 million data points the map might look very busy, though; you might need to subsample or make a heat map instead (see the sketch below).
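
A rough sketch of the subsample-and-heat-map idea, assuming a crime_df DataFrame; the column names "X" (longitude) and "Y" (latitude) are assumptions about the raw export:

# Sketch: subsample the incidents and draw a simple density map of SF.
import matplotlib.pyplot as plt

sample = crime_df.sample(n=50_000, random_state=522)   # subsample the ~2M rows
plt.hexbin(sample["X"], sample["Y"], gridsize=80, cmap="viridis", mincnt=1)
plt.colorbar(label="Incident count")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Density of reported crime incidents in San Francisco")
plt.savefig("results/figures/incident_heatmap.png")    # placeholder output path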

Define non-processed?

It is not clear in the overview what non-processed means. Maybe also provide some examples of what counts as processing, to help the reader understand what problem you are trying to solve.

Clustering the data into arrest and not arrest

Processed:

  • arrest, booked
  • arrest, cited
  • prosecuted by outside agency
  • juvenile cited
  • complainant refused to prosecute
  • district attorney refused to prosecute
  • not prosecuted
  • juvenile booked
  • psychopathic case
  • located
  • prosecuted for lesser offense

Not processed

  • none
  • cleared-contact juvenile for more info
  • juvenile diverted
  • juvenile admonished
  • exceptional clearance
  • unfounded

Milestone 1 feedback

Mechanics

  • Might want to add version number for the packages in the dependency section.
  • Might want to add a link to the final report in README.
  • Might want to explain the usage of the scripts a little more. Say, 01_clean_data.py has three arguments.

Quality

  • In the documentation at the top of a Python or R script, you can explain what the arguments are in addition to showing a usage example (see the sketch below).
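
For example, the top of a cleaning script could document its arguments roughly as follows (a hypothetical sketch; the actual arguments of 01_clean_data.py may differ):

# Hypothetical docstring/argparse sketch for the top of a script such as
# src/01_clean_data.py; the real argument names and count may differ.
"""Clean the raw San Francisco crime data.

Usage: python src/01_clean_data.py <sample_size> <output_path>

Arguments:
  sample_size  number of rows to sample from the raw data (e.g. 1000)
  output_path  path to write the cleaned CSV (e.g. data/san_francisco_clean.csv)
"""
import argparse

parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("sample_size", type=int, help="number of rows to sample")
parser.add_argument("output_path", help="path to write the cleaned CSV")
args = parser.parse_args()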

Report

  • Would be nice to link the description of variables in Section 1.0 to the variable names in subsequent sections. Say, there are x and y in Table 2, are they longitude and latitude?
  • Also, I am not 100% sure what stolen, building, and vehicle are when I just look at the table. Are they among the 50 words extracted? Maybe list these 50 words somewhere?
  • How did you choose the depth of your decision tree?
  • How did you compute the testing accuracy in section 4.0?
  • Style-wise, would be nice to replace the 'snake_case' with the usual case throughout.

Make figures smaller

Figures should be as small as possible (yet still clearly readable and understandable), so that you can easily read the figure captions and surrounding text while looking at the figure. In the case of the figures in your report, the text size in the figures is great, but the plot areas and the bars/boxes could be much smaller. This is really only a problem for Figure 1 in your report.
