chicago-justice-project / chicago-justice
Chicago Justice Project
License: GNU General Public License v3.0
Some tags used for coding articles imply other, more general tags. For example, "Gun Violence (GUNV)" implies "Violence (VIOL)". However, volunteers may not code articles consistently: for an article involving gun violence, the coder may have tagged it as just GUNV or as both GUNV and VIOL. There are a couple of ways we could address this.
We should also check whether there are other implied-tag pairs with the same problem.
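One way to normalize this is to expand each coding with the tags it implies before analysis. A minimal sketch in Python (the IMPLIES mapping here is an assumption; the real hierarchy would come from the category definitions):
# Hypothetical implication map; only GUNV -> VIOL is taken from this issue.
IMPLIES = {
    "GUNV": {"VIOL"},
}
def expand_tags(tags):
    """Return the tag set plus every tag it transitively implies."""
    expanded = set(tags)
    frontier = list(tags)
    while frontier:
        for parent in IMPLIES.get(frontier.pop(), ()):
            if parent not in expanded:
                expanded.add(parent)
                frontier.append(parent)
    return expanded
assert expand_tags({"GUNV"}) == {"GUNV", "VIOL"}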
I was starting to work on a project that involved developing a corpus of Chicago media articles, so I was delighted to find that y'all have done a great job writing the scrapers for that already.
Would you consider breaking newsarticle into a separate project? I'd be happy to help with that.
Users should be able to copy/paste strings from the article that provide location info. These strings should be stored on the EnteredData record, and can be used by the algo to learn what location data looks like.
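A minimal sketch of that storage, assuming Django and invented field names (not the project's actual schema):
from django.db import models
class EnteredData(models.Model):
    # Hypothetical: the verbatim snippet the coder copied from the article,
    # kept raw so it can later serve as training data for the algo.
    article = models.ForeignKey('newsarticles.Article', on_delete=models.CASCADE)
    location_text = models.TextField()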
The scraper for the Daily Line seems to have stopped scraping on Jan 13th.
From the data dump I got a while ago, there are 2,733 articles out of 271,808 total (~1%) that have tags but are also marked as not relevant.
It would be a nice-to-have to prevent this from happening. Front-end validation would probably be sufficient since I don't think we're super worried about malicious actors...
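To re-check that count against a live database, a sketch with the Django ORM (model and field names are guesses; untested):
# Hypothetical; assumes a UserCoding model with a relevant flag and a
# categories many-to-many, which may not match the real schema.
from newsarticles.models import UserCoding
inconsistent = (UserCoding.objects
                .filter(relevant=False, categories__isnull=False)
                .distinct()
                .count())
print(inconsistent)  # the dump showed 2733 of 271808 (~1%)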
I guess one of our coders is having problems with the site. I have tried repeatedly to log in with the admin credentials and I have had varying degrees of success. Sometimes I get a failure message and sometimes it lets me log in - all with the same credentials.
Currently we are setting these in /etc/init/chicagojustice.conf, but should they be read from the repo's .env file? Or somewhere else?
Currently the article model has Created and Modified columns, but not the actual date of publication. Do we need both Created and Modified? Idea: Published and Accessed (scraped) times.
Adding/removing categories or other metadata should probably not update either of the timestamps.
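A sketch of the proposed columns, assuming Django (field names from the idea above):
from django.db import models
class Article(models.Model):
    # Hypothetical replacement for Created/Modified:
    published = models.DateTimeField(null=True)         # outlet's publication date
    accessed = models.DateTimeField(auto_now_add=True)  # when our scraper fetched it
Because neither field uses auto_now, later edits to categories or other metadata would leave both timestamps untouched, which is the behavior asked for above.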
Django 1.11 was just released and is the next LTS release after 1.8 (what the app is currently on).
Might be a good opportunity to upgrade to Python3 as well.
Display data on a map using clusters that break apart as the user zooms in.
This will be used as a possible way to display either the crime report data or the media data.
Example:
https://www.mapbox.com/mapbox-gl-js/example/cluster/
We should do a full db restore from RDS snapshots and document the process.
Chicago News Cooperative
Chicago Journal
Utilize NLP data to filter out non-relevant articles, either hiding them from volunteers or automatically coding them as non-relevant, based on some confidence level (85% or 90% ??).
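A sketch of the thresholding itself (the exact cutoff is the open question; the function and its probability semantics are assumptions):
def should_auto_exclude(p_relevant, threshold=0.90):
    """True when the model is at least `threshold` confident the article
    is NOT relevant; such articles could be hidden or auto-coded."""
    return (1.0 - p_relevant) >= threshold
assert should_auto_exclude(0.05)        # 95% confident non-relevant
assert not should_auto_exclude(0.20)    # only 80% confident, keep for humans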
Categories in production instance: http://data.chicagojustice.org/admin/newsarticles/category/
Currently the model does not keep good track of when a user overwrites the coding done by another user. A better history of article codings by user should be kept. This will be especially important if article codings done by the NLP model are "coded" using a dummy user. We will need to have multiple users for multiple versions of the NLP and will want to keep track of how the different versions coded articles differently.
Syncing two maps so that they zoom and pan at the same time.
Ongoing work:
https://github.com/kyaroch/cjp-test-maps/blob/master/test_map.html
Examples:
http://blog.mastermaps.com/2013/06/creating-synchronized-view-of-two-maps.html
http://util.io/compare-maps
Display data on a choropleth map that updates as the user zooms in.
This will be used as a possible way to display either the crime report data or the media data.
Example:
https://www.mapbox.com/mapbox-gl-js/example/updating-choropleth/
This would allow the efficient coding of articles that are not relevant.
Tracy wants to scrape all articles. Right now there are a few news sources where we're only scraping crime-related news feeds (i.e., bettergov, Sun-Times, & Fox). I'll create a PR soon that will scrape all articles from those sources. Should double check with Tracy before merging, though, that he's aware that it won't pick up previous articles. We'll only have non-crime-related articles from those sources from the time changes are made to prod.
Trying to load the Admin view for a single UserCoding, e.g. http://data.chicagojustice.org/admin/newsarticles/usercoding/202234/, fails with a 502 on production.
Ideas:
When we run the classification algo on the full dataset, we probably don't want to modify the article records directly, especially if human coders have already classified them.
Idea: derived metadata (categories, relevant/not, location) should live in its own related table, which includes fields for how it was derived (human or algo). Could there be multiple classifications for an article, and choose which one to use?
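A sketch of that related table (all names are assumptions; this would also give the per-user/per-model-version history the previous issue asks for):
from django.db import models
class ArticleClassification(models.Model):
    # Hypothetical: one row per classification pass, human or algorithmic.
    article = models.ForeignKey('newsarticles.Article', on_delete=models.CASCADE)
    source = models.CharField(max_length=8, choices=[('human', 'human'), ('algo', 'algo')])
    model_version = models.CharField(max_length=32, blank=True)  # e.g. NLP model tag
    relevant = models.BooleanField()
    categories = models.ManyToManyField('newsarticles.Category')
    created = models.DateTimeField(auto_now_add=True)
Choosing which classification "wins" could then be a precedence rule (human over algo, newest over oldest) rather than an overwrite.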
The current JavaScript handling location selection when coding articles does not handle white space very robustly. There are a couple of things we can do to improve it.
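One likely improvement is normalizing whitespace before matching snippets back to the article text. The real fix belongs in the front-end JS; the idea, in Python terms:
import re
def normalize_snippet(text):
    """Collapse runs of whitespace and trim, so a copied location string
    matches regardless of how the selection grabbed spaces/newlines."""
    return re.sub(r"\s+", " ", text).strip()
assert normalize_snippet("  6100 block of\n S. King Dr ") == "6100 block of S. King Dr"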
If this can be done, we would like a random un-coded article to pop up for the coder to code rather than the coder having to click to call up an article. This would eliminate the need for us to give each coder a specific time range and track that for every coder.
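A sketch of the selection query, assuming Django (model and relation names are guesses):
import random
from newsarticles.models import Article
def next_article_to_code():
    # Hypothetical: articles with no UserCoding yet.
    uncoded = Article.objects.filter(usercoding__isnull=True)
    count = uncoded.count()
    if count == 0:
        return None
    # Random offset; cheaper than order_by('?') on a 270k-row table.
    return uncoded[random.randrange(count)]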
I tried to log in and got the error message below. When I received it, I hit back and hit log in again and it worked fine. I am using Firefox, either fully up to date or very close. I am also on a Mac.
Forbidden (403)
CSRF verification failed. Request aborted.
More information is available with DEBUG=True.
I think something is going wrong when dumping the tables? I downloaded the latest .tar.gz file (updated 9/1/2017) from the SFTP server, but when I try to extract it I get an error:
$ tar -xvzf cjp_tables.tgz
cjp_tables/
cjp_tables/newsarticles_usercoding_categories.csv
cjp_tables/newsarticles_article.csv
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Opening it in an archive manager shows only two files.
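A quick way to check whether a downloaded dump is truncated before unpacking it (sketch; filename taken from the command above):
import tarfile
# Scanning all member headers forces a full read of the gzip stream,
# so a truncated archive raises instead of failing halfway through extraction.
try:
    with tarfile.open("cjp_tables.tgz", "r:gz") as tf:
        print("OK:", len(tf.getnames()), "members")
except (tarfile.ReadError, EOFError) as err:
    print("Archive is corrupt or truncated:", err)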
Now that this app has been migrated to Elastic Beanstalk (:tada:), I'd love to get the nightly DB exports happening again.
(I'm mainly adding this issue so I have something to reference in article-tagging's CONTRIBUTING.md file.)
Some categories aren't relevant anymore, but we probably shouldn't outright delete them. Add something like an active field to the Category model, which we can then filter on when rendering the Article coding view.
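A sketch of the change (the active field name is from this issue; the rest is assumed):
from django.db import models
class Category(models.Model):
    # ...existing fields elided...
    active = models.BooleanField(default=True)
The Article coding view would then render Category.objects.filter(active=True), leaving retired categories in place for old codings.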
We would like to add a page to the coding site that allows coders to access training videos and the coding documents. This page should only be accessible by coders who have credentials and have signed in.
Ideally we want to be able to capture the author of each article as it is scraped.
We also want to go back through the database of scraped articles and parse out the author from each one.
https://www.propublica.org/illinois
As the procedure for coding/tagging articles by volunteers changes more frequently, we should add some kind of notification (e.g., a pop-up) shown when users log in that lets them know about recent changes to the interface and how they should be coding articles.
Make it easy to manually input an article. The admin interface could suffice if it's not too error-prone.
Right now volunteers tag/code articles using all ~30 categories at once, which are too many to keep in mind while tagging. Categories should be broken into groups (e.g., crimes, agencies, etc.) and volunteers should be assigned to one group for a while, only tagging articles for the group of tags they're assigned to.
Partially coded articles will be prioritized when a volunteer logs in and starts coding. Once there are no remaining partially coded articles, the volunteer will be presented a random article to code.
Volunteers should always tag addresses/location data regardless of the group of categories they are currently tagging.
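A sketch of how grouping and assignment could be modeled (all names are assumptions):
from django.contrib.auth.models import User
from django.db import models
class CategoryGroup(models.Model):
    # Hypothetical: e.g. "crimes", "agencies".
    name = models.CharField(max_length=64)
class GroupAssignment(models.Model):
    # Which group of tags a volunteer is currently responsible for.
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    group = models.ForeignKey(CategoryGroup, on_delete=models.CASCADE)
The coding view would then show only the categories in the volunteer's assigned group, with the location fields always visible per the note above.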
Update the UI to let the user drop a pin for each location text snippet. Default to the Google Maps search result?
Store the lat/lng in EnteredData
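A sketch of the extra fields on EnteredData (names assumed; plain floats avoid deepening the PostGIS dependency a later issue wants to remove):
from django.db import models
class EnteredData(models.Model):
    # ...existing snippet fields elided; hypothetical pin coordinates:
    lat = models.FloatField(null=True)
    lng = models.FloatField(null=True)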
Add form for regular users to be able to manually add an article.
Is it possible to add the location information from the volunteers to the data dump?
Some WGN articles have a hidden div with lat/long that seems to be the location where the story occurred. We might want to capture this in the scrapers.
Example: http://wgntv.com/2017/06/27/police-trying-to-locate-parents-or-guardians-of-baby-boy/
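A sketch of what the scraper addition might look like (the data-attribute markup is a guess; the actual hidden div's structure needs to be confirmed against the page):
from bs4 import BeautifulSoup
def extract_wgn_latlng(html):
    """Return (lat, lng) from the hidden geo div, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", attrs={"data-lat": True, "data-lng": True})  # assumed markup
    if node is None:
        return None
    return float(node["data-lat"]), float(node["data-lng"])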
Ideally we would like to adjust the UI so that the article is brought up near the top and the codes are moved to the right edge of the screen, to make coding more efficient.
http://www.thebluevoiceblog.com/
This is a blog of the union that represents the patrol officers. I would like to start scraping it.
The dependency on PostGIS comes from the geocode_point spatial field type and the use of the GeoManager model manager in the crimedata model. These may not be used much right now, but we need a migration strategy to remove them.
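A sketch of the removal migration (model name and dependency are placeholders; dropping GeoManager itself is only a model-code change and needs no migration):
from django.db import migrations
class Migration(migrations.Migration):
    # Hypothetical; the dependency must point at the last real crimedata migration.
    dependencies = [('crimedata', '0001_initial')]
    operations = [
        migrations.RemoveField(model_name='crimereport', name='geocode_point'),
    ]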
Can we add this website to the list of sites we scrape?
At one point there was also a file column_names.txt that came with the data dump. Could that be added as well? Maybe something like
psql cjpweb_prd -h $DATABASE_URL -U cjpuser -c "\d *" > column_names.txt
? (UNTESTED! I know nothing about psql. Hacked that together from man pages.)