chicago-justice-project / chicago-justice
Chicago Justice Project
License: GNU General Public License v3.0
Some tags used for coding articles imply other, more general tags. For example, "Gun Violence (GUNV)" implies "Violence (VIOL)". However, volunteers may not code articles consistently: for an article involving gun violence, the coder may have tagged it as just GUNV or as both GUNV and VIOL. There are a couple of ways we could address this.
We should also check whether there are other implied-tag pairs with the same problem.
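One way to normalize this is to expand each coding with the tags it implies before analysis. A minimal sketch in Python (the IMPLIES mapping here is an assumption; the real hierarchy would come from the category definitions):
# Hypothetical implication map; only GUNV -> VIOL is taken from this issue.
IMPLIES = {
    "GUNV": {"VIOL"},
}
def expand_tags(tags):
    """Return the tag set plus every tag it transitively implies."""
    expanded = set(tags)
    frontier = list(tags)
    while frontier:
        for parent in IMPLIES.get(frontier.pop(), ()):
            if parent not in expanded:
                expanded.add(parent)
                frontier.append(parent)
    return expanded
assert expand_tags({"GUNV"}) == {"GUNV", "VIOL"}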
I was starting to work on a project that involved developing a corpus of Chicago media articles, so I was delighted to find that y'all have done a great job writing the scrapers for that already.
Would you consider breaking newsarticle into a separate project? I'd be happy to help with that.
Users should be able to copy/paste strings from the article that provide location info. These strings should be stored on the EnteredData record, and can be used by the algo to learn what location data looks like.
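A minimal sketch of that storage, assuming Django and invented field names (not the project's actual schema):
from django.db import models
class EnteredData(models.Model):
    # Hypothetical: the verbatim snippet the coder copied from the article,
    # kept raw so it can later serve as training data for the algo.
    article = models.ForeignKey('newsarticles.Article', on_delete=models.CASCADE)
    location_text = models.TextField()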
The scraper for the Daily Line seems to have stopped scraping on Jan 13th.
From the data dump I got a while ago, there are 2,733 articles out of 271,808 total (~1%) that have tags but are also marked as not relevant.
It would be a nice-to-have to prevent this from happening. Front-end validation would probably be sufficient since I don't think we're super worried about malicious actors...
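To re-check that count against a live database, a sketch with the Django ORM (model and field names are guesses; untested):
# Hypothetical; assumes a UserCoding model with a relevant flag and a
# categories many-to-many, which may not match the real schema.
from newsarticles.models import UserCoding
inconsistent = (UserCoding.objects
                .filter(relevant=False, categories__isnull=False)
                .distinct()
                .count())
print(inconsistent)  # the dump showed 2733 of 271808 (~1%)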
I guess one of our coders is having problems with the site. I have tried repeatedly to log in with the admin credentials and I have had varying degrees of success. Sometimes I get a failure message and sometimes it lets me log in - all with the same credentials.
Currently we are setting these in /etc/init/chicagojustice.conf, but should they be read from the repo's .env file? Or somewhere else?
Currently the article model has Created and Modified columns, but not the actual date of publication. Do we need both Created and Modified? Idea: Published and Accessed (scraped) times.
Adding/removing categories or other metadata should probably not update either of the timestamps.
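A sketch of the proposed columns, assuming Django (field names from the idea above):
from django.db import models
class Article(models.Model):
    # Hypothetical replacement for Created/Modified:
    published = models.DateTimeField(null=True)         # outlet's publication date
    accessed = models.DateTimeField(auto_now_add=True)  # when our scraper fetched it
Because neither field uses auto_now, later edits to categories or other metadata would leave both timestamps untouched, which is the behavior asked for above.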
Django 1.11 was just released and is the next LTS release after 1.8 (what the app is currently on).
Might be a good opportunity to upgrade to Python3 as well.
Display data on a map using clusters that break apart as the user zooms in.
This will be used as a possible way to display either the crime report data or the media data.
Example:
https://www.mapbox.com/mapbox-gl-js/example/cluster/
We should do a full db restore from RDS snapshots and document the process.
Chicago News Cooperative
Chicago Journal
Utilize NLP data to filter out non-relevant articles, either hiding them from volunteers or automatically coding them as non-relevant, based on some confidence level (85% or 90% ??).
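A sketch of the thresholding itself (the exact cutoff is the open question; the function and its probability semantics are assumptions):
def should_auto_exclude(p_relevant, threshold=0.90):
    """True when the model is at least `threshold` confident the article
    is NOT relevant; such articles could be hidden or auto-coded."""
    return (1.0 - p_relevant) >= threshold
assert should_auto_exclude(0.05)        # 95% confident non-relevant
assert not should_auto_exclude(0.20)    # only 80% confident, keep for humans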
Categories in production instance: http://data.chicagojustice.org/admin/newsarticles/category/
Currently the model does not keep good track of when a user overwrites the coding done by another user. A better history of article codings by user should be kept. This will be especially important if article codings done by the NLP model are "coded" using a dummy user. We will need to have multiple users for multiple versions of the NLP and will want to keep track of how the different versions coded articles differently.
Syncing two maps so that they zoom and pan at the same time.
Ongoing work:
https://github.com/kyaroch/cjp-test-maps/blob/master/test_map.html
Examples:
http://blog.mastermaps.com/2013/06/creating-synchronized-view-of-two-maps.html
http://util.io/compare-maps
Display data on a choropleth map that updates as the user zooms in.
This will be used as a possible way to display either the crime report data or the media data.
Example:
https://www.mapbox.com/mapbox-gl-js/example/updating-choropleth/
This would allow the efficient coding of articles that are not relevant.
Tracy wants to scrape all articles. Right now there are a few news sources where we're only scraping crime-related news feeds (i.e., bettergov, Sun-Times, & Fox). I'll create a PR soon that will scrape all articles from those sources. Should double check with Tracy before merging, though, that he's aware that it won't pick up previous articles. We'll only have non-crime-related articles from those sources from the time changes are made to prod.
Trying to load the Admin view for a single UserCoding, e.g. http://data.chicagojustice.org/admin/newsarticles/usercoding/202234/, fails with a 502 on production.
Ideas:
When we run the classification algo on the full dataset, we probably don't want to modify the article records directly, especially if human coders have already classified them.
Idea: derived metadata (categories, relevant/not, location) should live in its own related table, which includes fields for how it was derived (human or algo). Could there be multiple classifications for an article, and choose which one to use?
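A sketch of that related table (all names are assumptions; this would also give the per-user/per-model-version history the previous issue asks for):
from django.db import models
class ArticleClassification(models.Model):
    # Hypothetical: one row per classification pass, human or algorithmic.
    article = models.ForeignKey('newsarticles.Article', on_delete=models.CASCADE)
    source = models.CharField(max_length=8, choices=[('human', 'human'), ('algo', 'algo')])
    model_version = models.CharField(max_length=32, blank=True)  # e.g. NLP model tag
    relevant = models.BooleanField()
    categories = models.ManyToManyField('newsarticles.Category')
    created = models.DateTimeField(auto_now_add=True)
Choosing which classification "wins" could then be a precedence rule (human over algo, newest over oldest) rather than an overwrite.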
The current JavaScript handling location selection when coding articles does not handle white space very robustly. There are a couple of things we can do to improve it.
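One likely improvement is normalizing whitespace before matching snippets back to the article text. The real fix belongs in the front-end JS; the idea, in Python terms:
import re
def normalize_snippet(text):
    """Collapse runs of whitespace and trim, so a copied location string
    matches regardless of how the selection grabbed spaces/newlines."""
    return re.sub(r"\s+", " ", text).strip()
assert normalize_snippet("  6100 block of\n S. King Dr ") == "6100 block of S. King Dr"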
If this can be done, we would like a random un-coded article to pop up for the coder to code rather than the coder having to click to call up an article. This would eliminate the need for us to give each coder a specific time range and track that for every coder.
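A sketch of the selection query, assuming Django (model and relation names are guesses):
import random
from newsarticles.models import Article
def next_article_to_code():
    # Hypothetical: articles with no UserCoding yet.
    uncoded = Article.objects.filter(usercoding__isnull=True)
    count = uncoded.count()
    if count == 0:
        return None
    # Random offset; cheaper than order_by('?') on a 270k-row table.
    return uncoded[random.randrange(count)]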
I tried to log in and got the error message below. When I received it, I hit back and hit log in again and it worked fine. I am using Firefox, either fully up to date or very close. I am also on a Mac.
Forbidden (403)
CSRF verification failed. Request aborted.
More information is available with DEBUG=True.
I think something is going wrong when dumping the tables? I downloaded the latest .tar.gz file (updated 9/1/2017) from the SFTP server, but when I try to extract it I get an error:
$ tar -xvzf cjp_tables.tgz
cjp_tables/
cjp_tables/newsarticles_usercoding_categories.csv
cjp_tables/newsarticles_article.csv
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Opening it in an archive manager shows only two files.
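A quick way to check whether a downloaded dump is truncated before unpacking it (sketch; filename taken from the command above):
import tarfile
# Scanning all member headers forces a full read of the gzip stream,
# so a truncated archive raises instead of failing halfway through extraction.
try:
    with tarfile.open("cjp_tables.tgz", "r:gz") as tf:
        print("OK:", len(tf.getnames()), "members")
except (tarfile.ReadError, EOFError) as err:
    print("Archive is corrupt or truncated:", err)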
Now that this app has been migrated to Elastic Beanstalk (:tada:), I'd love to get the nightly DB exports happening again.
(I'm mainly adding this issue so I have something to reference in article-tagging's CONTRIBUTING.md file.)
Some categories aren't relevant anymore, but we probably shouldn't outright delete them. Add something like an active field to the Category model, which we can then filter on when rendering the Article coding view.
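A sketch of the change (the active field name is from this issue; the rest is assumed):
from django.db import models
class Category(models.Model):
    # ...existing fields elided...
    active = models.BooleanField(default=True)
The Article coding view would then render Category.objects.filter(active=True), leaving retired categories in place for old codings.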
We would like to add a page to the coding site that allows coders to access training videos and the coding documents. This page should only be accessible by coders who have credentials and have signed in.
Ideally we want to be able to capture the author of each article as it is scraped.
We also want to go back through the database of scraped articles and parse out the author from each one.
https://www.propublica.org/illinois
As the procedure for coding/tagging articles by volunteers changes more frequently, we should add some kind of notification (e.g., a pop-up) shown when users log in that lets them know about recent changes to the interface and how they should be coding articles.
Make it easy to manually input an article. The admin interface could suffice if it's not too error-prone.
Right now volunteers tag/code articles using all ~30 categories at once, which are too many to keep in mind while tagging. Categories should be broken into groups (e.g., crimes, agencies, etc.) and volunteers should be assigned to one group for a while, only tagging articles for the group of tags they're assigned to.
Partially coded articles will be prioritized when a volunteer logs in and starts coding. Once there are no remaining partially coded articles, the volunteer will be presented a random article to code.
Volunteers should always tag addresses/location data regardless of the group of categories they are currently tagging.
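A sketch of how grouping and assignment could be modeled (all names are assumptions):
from django.contrib.auth.models import User
from django.db import models
class CategoryGroup(models.Model):
    # Hypothetical: e.g. "crimes", "agencies".
    name = models.CharField(max_length=64)
class GroupAssignment(models.Model):
    # Which group of tags a volunteer is currently responsible for.
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    group = models.ForeignKey(CategoryGroup, on_delete=models.CASCADE)
The coding view would then show only the categories in the volunteer's assigned group, with the location fields always visible per the note above.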
Update the UI to let the user drop a pin for each location text snippet. Default to the Google Maps search result?
Store the lat/lng in EnteredData
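A sketch of the extra fields on EnteredData (names assumed; plain floats avoid deepening the PostGIS dependency a later issue wants to remove):
from django.db import models
class EnteredData(models.Model):
    # ...existing snippet fields elided; hypothetical pin coordinates:
    lat = models.FloatField(null=True)
    lng = models.FloatField(null=True)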
Add form for regular users to be able to manually add an article.
Is it possible to add the location information from the volunteers to the data dump?
Some WGN articles have a hidden div with lat/long that seems to be the location where the story occurred. We might want to capture this in the scrapers.
Example: http://wgntv.com/2017/06/27/police-trying-to-locate-parents-or-guardians-of-baby-boy/
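A sketch of what the scraper addition might look like (the data-attribute markup is a guess; the actual hidden div's structure needs to be confirmed against the page):
from bs4 import BeautifulSoup
def extract_wgn_latlng(html):
    """Return (lat, lng) from the hidden geo div, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", attrs={"data-lat": True, "data-lng": True})  # assumed markup
    if node is None:
        return None
    return float(node["data-lat"]), float(node["data-lng"])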
Ideally we would like to adjust the UI so that the article is brought up near the top and the codes are moved to the right edge of the screen, to make coding more efficient.
http://www.thebluevoiceblog.com/
This is a blog of the union that represents the patrol officers. I would like to start scraping it.
The dependency on PostGIS comes from the geocode_point spatial field type and the use of the GeoManager model manager in the crimedata model. These may not be used much right now, but we need a migration strategy to remove them.
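A sketch of the removal migration (model name and dependency are placeholders; dropping GeoManager itself is only a model-code change and needs no migration):
from django.db import migrations
class Migration(migrations.Migration):
    # Hypothetical; the dependency must point at the last real crimedata migration.
    dependencies = [('crimedata', '0001_initial')]
    operations = [
        migrations.RemoveField(model_name='crimereport', name='geocode_point'),
    ]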
Can we add this website to the list of sites we scrape?
At one point there was also a file column_names.txt that came with the data dump. Could that be added as well? Maybe something like
psql cjpweb_prd -h $DATABASE_URL -U cjpuser -c "\d *" > column_names.txt
? (UNTESTED! I know nothing about psql. Hacked that together from man pages.)