code-312 / rescue-chicago

Repository for work related to an interactive data dashboard that can be used to analyze how different dog characteristics may correlate with average length of stay in a shelter prior to adoption.

Home Page: https://code312-rescue-trends-2659be78e6b4.herokuapp.com/

Shell 0.02% Procfile 0.01% Python 20.70% Jupyter Notebook 79.27%

data-visualization pandas python

rescue-chicago's Issues

Quality Control on House Trained

Make sure that nulls aren't turning into False values.
#38
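
One way this can happen, sketched with pandas: casting an object column that contains None to plain bool silently turns the missing values into False, while the nullable "boolean" dtype keeps them as missing. The column name house_trained and the sample rows are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Hypothetical raw records; None stands for a missing house_trained value.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "house_trained": [True, None, False, None],
})

# Pitfall: a plain bool cast calls bool() on each value, so None becomes False.
naive = raw["house_trained"].astype(bool)

# The nullable "boolean" dtype keeps missing values as <NA>.
safe = raw["house_trained"].astype("boolean")

# A simple quality-control check: null counts before and after each cast.
nulls_before = int(raw["house_trained"].isna().sum())
nulls_after_naive = int(naive.isna().sum())
nulls_after_safe = int(safe.isna().sum())
print(nulls_before, nulls_after_naive, nulls_after_safe)
```

Comparing the null count before and after each pipeline step is a cheap check that could also run as a unit test.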

Data Pipeline Script

Do all of the data work in one script, parameterized by city. The script will run everything from calling the Petfinder API to cleaning to syncing data up to Postgres.

Plot Trends Over Time

Right now we show aggregate counts / LOS across all dogs over all time. Plotting trends over time would let us see how LOS or counts have varied.

Create a new "Trends Over Time" page(s)
Add plot with x-axis of month - year, y-axis of length of stay
Add plot with x-axis of month - year, y-axis of count of dogs
Extra Credit: Let users add different colored lines to the plot corresponding to "groups". There are several ways we could do this that would be valuable; some ideas / options below:
- "group by" an attribute (age, size, etc) and make a colored line on the plots per unique value in that category
- "group by" breed, but since there's a huge number of breeds, we'd probably want to do something like we have in the sidebar where you can select a number of random breeds, or can specifically type in a list of breeds
- "filter by" where you can pick from a bunch of attributes to define each group, sort of like the two columns we have on the other pages
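
The aggregation behind both plots could be sketched with pandas as below. The column names published_at, los, and age are assumptions for illustration, and the optional group_col argument covers the "group by" idea; the plotting itself is left out.

```python
import pandas as pd

def monthly_trends(dogs, group_col=None):
    """Return mean LOS and dog count per month, optionally split by one attribute."""
    df = dogs.copy()
    # Bucket each posting into its month-year (hypothetical published_at column).
    df["month"] = pd.to_datetime(df["published_at"]).dt.to_period("M")
    keys = ["month"] + ([group_col] if group_col else [])
    return (
        df.groupby(keys)
        .agg(mean_los=("los", "mean"), dog_count=("los", "count"))
        .reset_index()
    )

# Made-up sample rows, only to show the shape of the result.
dogs = pd.DataFrame({
    "published_at": ["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-10"],
    "los": [10, 30, 5, 15],
    "age": ["Young", "Adult", "Young", "Young"],
})

overall = monthly_trends(dogs)        # one row per month
by_age = monthly_trends(dogs, "age")  # one row per month and age group
print(overall)
print(by_age)
```

Each row of by_age would map to one point on a colored line, with month on the x-axis and either mean_los or dog_count on the y-axis.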

Parameterize Data Pipeline by City

Or some other approach; the goal is to create data files by city. Also, when the data goes into the Postgres table, add a column with that city's name.
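
A minimal sketch of the per-city output step, assuming the cleaned data is a pandas DataFrame. The function name tag_and_save and the file naming pattern are hypothetical; the Petfinder call and the Postgres sync are deliberately not shown.

```python
import tempfile
from pathlib import Path

import pandas as pd

def tag_and_save(cleaned, city, out_dir):
    """Add a city column and write one CSV per city (hypothetical naming scheme)."""
    tagged = cleaned.assign(city=city)
    slug = city.lower().replace(" ", "_")
    path = Path(out_dir) / f"{slug}_dogs.csv"
    tagged.to_csv(path, index=False)
    return tagged, path

# Usage with made-up rows, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tagged, path = tag_and_save(pd.DataFrame({"id": [1, 2]}), "New York", tmp)
    written = pd.read_csv(path)

print(path.name)
print(list(written.columns))
```

The same tagged frame, with its city column, is what would be appended to the Postgres table.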

Email Petfinder and ask if there's something about

Right now Heroku is looking at @ecooperman 's personal GitHub repo clone, which seems a little fragile. We should figure out a way to either point to our C4C repo, or some type of manual sync over process.

More Trend Options

Give more options for trends to look at!

Current state:

Ideas of some things to add:

organization (once @Jared-Kunhart 's PR is merged and the db is updated)
color (primary)
location (once we add it to the db)

Pull Pet Data on the fly

Automated Feature Selection

Feature values in "breed_primary", "gender", "coat", "color_primary", "color_secondary", and "color_tertiary" vary significantly. The model features are manually typed out, but the values in each feature change depending on the data pull.

To reduce the amount of manual work and stay consistent, the features for X should be derived automatically from the values present in the data.
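
One possible approach, sketched with pandas: build the one-hot columns from whatever values appear in a pull, then align later pulls to the columns the model was trained on. The helper names build_feature_matrix and align_to are hypothetical, and the sample rows are made up.

```python
import pandas as pd

CATEGORICAL = [
    "breed_primary", "gender", "coat",
    "color_primary", "color_secondary", "color_tertiary",
]

def build_feature_matrix(df, categorical=CATEGORICAL):
    """One-hot encode the categorical columns using the values present in df."""
    present = [c for c in categorical if c in df.columns]
    return pd.get_dummies(df[present], columns=present, dtype=int)

def align_to(X, columns):
    """Reorder X to the training columns; unseen columns are dropped, missing ones filled with 0."""
    return X.reindex(columns=columns, fill_value=0)

train_pull = pd.DataFrame({
    "breed_primary": ["Labrador Retriever", "Beagle"],
    "gender": ["Male", "Female"],
})
X_train = build_feature_matrix(train_pull)

new_pull = pd.DataFrame({"breed_primary": ["Beagle"], "gender": ["Female"]})
X_new = align_to(build_feature_matrix(new_pull), X_train.columns)
print(list(X_train.columns))
print(X_new)
```

Saving the training column list alongside the model would keep later pulls consistent without retyping feature names.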

Adoptability Model Improvements

Feature Selection

Descriptive stats on color, breed to learn which are correlated to LOS
Test for multicollinearity between features
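
A common multicollinearity check is the variance inflation factor (VIF), which is one option among several and not something the ticket prescribes. The sketch below computes it with NumPy on synthetic data; values well above roughly 5 to 10 are usually read as a warning sign.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column and fit by ordinary least squares.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return vifs

# Synthetic features: c is nearly a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

Constant columns would need to be dropped first, since their total variance is zero.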

Data Quality

Remove outliers (based on LOS)

Model Evaluation

Add additional evaluation metrics

Research PetFinder Resources

Some datasets from PetFinder have already been curated and used for other applications like Kaggle competitions.

This ticket is to research prior competitions or community projects that have used PetFinder data, and see if anything from this could be useful. For example, there might be an existing dataset we could pull from, ideas of how to engineer features, etc.

One starting place could be this Kaggle competition, or this corresponding dataset in TensorFlow.

You can document your findings either in this GitHub repo, or our shared Google Drive

Remove Outliers

There's some crazy data that we ingest from PetFinder, like dogs that have apparently been up for adoption for 10+ years. We'll want to remove any obviously wrong data from our database, so that it doesn't lead to misleading conclusions in the dashboards.

Some ideas for analysis to help inform outlier removal:

Plot of typical LOS by posted date (thought being that maybe older postings have less reliable data)
Histogram of LOS data (thought being that maybe there's some process to automatically remove dogs after some amount of time, e.g. maybe their postings "expire")
Boxplot of LOS by organization (thought being that maybe some organizations are better or worse than others at being diligent about their data)
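
Once the analysis above suggests a cutoff, a simple starting rule could be the interquartile-range (IQR) fence used in boxplots. This is one common heuristic and not a decision the ticket makes; the multiplier k and the sample values are assumptions.

```python
import pandas as pd

def flag_los_outliers(los, k=1.5):
    """Flag negative LOS values and values above Q3 + k * IQR."""
    q1, q3 = los.quantile(0.25), los.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return (los < 0) | (los > upper)

# Made-up LOS values in days; the last one mimics a 10-year posting.
los = pd.Series([3, 7, 10, 14, 21, 30, 45, 3650])
flags = flag_los_outliers(los)
cleaned = los[~flags]
print(flags.tolist())
```

Because LOS is typically right-skewed, it may be worth applying the fence to log-transformed LOS or per organization, so that legitimately long stays are not dropped.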

Adoptability Rating

Forecasting Intake V2

Christine from Rescue Chicago expressed that having forecasts for intake would both help them plan internally, and communicate out their anticipated needs to transfer facilities. For example, this can help transfer facilities plan their long-distance transfers in anticipation of CACC needs.

This ticket is to iterate on the initial forecasting POC that @TheeChris completed, and see if we can improve it, and also explore any of the driving factors for intake rates.

Some ideas to get started might include exploring forecasting methods, accounting for known trends like seasonality, handling the crazy data from 2020, etc.
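
Before trying heavier methods, a seasonal-naive baseline gives a yardstick: each future month is forecast as the value from the same month one year earlier. The sketch below is a baseline chosen for illustration, not the method used in the initial POC, and it assumes intake is a monthly pandas Series with a PeriodIndex.

```python
import pandas as pd

def seasonal_naive_forecast(monthly_intake, horizon=12, season=12):
    """Forecast by repeating the last observed season (same month, previous year)."""
    last_season = monthly_intake.iloc[-season:].to_numpy()
    start = monthly_intake.index[-1] + 1  # the month after the last observation
    idx = pd.period_range(start=start, periods=horizon, freq="M")
    values = [last_season[i % season] for i in range(horizon)]
    return pd.Series(values, index=idx, name="forecast")

# Made-up history: 24 months of steadily rising intake.
history = pd.Series(
    range(100, 124),
    index=pd.period_range("2021-01", periods=24, freq="M"),
)
forecast = seasonal_naive_forecast(history, horizon=3)
print(forecast)
```

Any candidate model can then be compared against this baseline on held-out months; anomalous years such as 2020 could be excluded from the history before forecasting.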

Unit Tests for Streamlit Functions

Plotly Comparison Visualization

Currently Breed Trends by Length of Stay has a comparison chart using Plotly. However Other Trends by LoS, Breed Trends by Count, Other Trends by Count don't use Plotly and have two separate charts instead. Combine them into one comparison chart.

Organization Trends by Length of Stay

There are pages for Breed Trends by Length of Stay and Breed Trends by Count. Kayla and I thought it would be great to compare by organization as well. This could be a separate page or just select boxes on those pages.

Mobility Visualization

Preprocess Features for a Model

Most machine learning models expect exclusively numeric input features. Some (most?) of our features are categories (puppy, young, adult... or breed names for example).

Let's use pandas.DataFrame as the data structure in preparing our dataset for modeling. Scikit-learn, the most commonly used ML package, supports this datatype for running models.

Preprocessing ideas:

Any True/False feature can be converted to 0/1 values
Any ordinal features (like age category, or size category) should be mapped to numbers (e.g. baby: 0, young: 1, adult: 2, senior: 3)
Categorical variables without ordering should be one-hot encoded. Scikit-learn has a helpful function here. For dog breeds, since there's a huge number of possibilities, I'd recommend only keeping the most popular breeds. So, you might play around with the parameter min_frequency or max_categories so we don't end up with more than ~50 or so breeds
Drop any columns that we don't want included in the model. For example, name or ID number should probably be dropped

I think it would make the most sense to organize this as a new step that runs on the output from data_cleaner. We could call it data_preprocessor?
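
The preprocessing ideas above could be sketched as follows. This version uses pandas.get_dummies with a manual top-N cutoff for breeds rather than scikit-learn's OneHotEncoder, to keep the example dependency-light; the column names (id, name, age, house_trained, breed_primary) and the function name preprocess are assumptions.

```python
import pandas as pd

AGE_ORDER = {"baby": 0, "young": 1, "adult": 2, "senior": 3}

def preprocess(df, max_breeds=50):
    """Turn cleaned dog records into an all-numeric feature frame."""
    # Drop identifier-like columns that should not feed the model.
    out = df.drop(columns=["id", "name"], errors="ignore").copy()
    # True/False -> 1/0 (assumes missing values were handled upstream).
    out["house_trained"] = out["house_trained"].astype(int)
    # Ordinal mapping for the age category.
    out["age"] = out["age"].str.lower().map(AGE_ORDER)
    # Keep only the most frequent breeds; lump the rest into "Other".
    top = out["breed_primary"].value_counts().head(max_breeds).index
    out["breed_primary"] = out["breed_primary"].where(
        out["breed_primary"].isin(top), "Other"
    )
    # One-hot encode the unordered breed category.
    return pd.get_dummies(out, columns=["breed_primary"], dtype=int)

# Made-up records, with max_breeds=1 to show the "Other" bucket.
dogs = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Rex", "Bo", "Ace", "Max"],
    "age": ["Baby", "Adult", "Senior", "Young"],
    "house_trained": [True, False, True, False],
    "breed_primary": ["Labrador Retriever", "Labrador Retriever", "Beagle", "Pug"],
})
X = preprocess(dogs, max_breeds=1)
print(X)
```

With scikit-learn, the breed step would be handled by the encoder's min_frequency or max_categories parameter instead of the manual cutoff.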

Unit Tests for Data Pipeline

To follow best practices, we should write unit tests 😄 This ticket is to add unit tests for our data pipeline functions
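
As a starting point, a test could look like the sketch below, using the standard-library unittest module. The function compute_los and its column names are hypothetical stand-ins for a real pipeline step.

```python
import unittest

import pandas as pd

def compute_los(df):
    """Hypothetical pipeline step: days between the publish and adoption dates."""
    out = df.copy()
    delta = pd.to_datetime(out["adopted_at"]) - pd.to_datetime(out["published_at"])
    out["los"] = delta.dt.days
    return out

class ComputeLosTest(unittest.TestCase):
    def setUp(self):
        self.df = pd.DataFrame({
            "published_at": ["2023-01-01"],
            "adopted_at": ["2023-01-11"],
        })

    def test_los_in_days(self):
        self.assertEqual(compute_los(self.df)["los"].tolist(), [10])

    def test_input_not_mutated(self):
        compute_los(self.df)
        self.assertNotIn("los", self.df.columns)
```

Tests written this way can be run with python -m unittest from the repository root.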

More Features!

Not all the data we get from the API makes it through the data cleaner and into our database. Let's change that!

Some ideas of features to add:

description
photo urls
tags
location
