chicago-justice's Issues

Some categories imply other, more general categories

Some tags used for coding articles imply other, more general tags. For example, "Gun Violence (GUNV)" implies "Violence (VIOL)". However, volunteers may not code articles consistently. For an article involving gun violence, the coder may have tagged the article as just GUNV or as both GUNV and VIOL. There are a couple of ways to address this:

  • When analyzing the data, infer these more general categories automatically
  • Better educate volunteers to not worry about tagging something as VIOL if they've already tagged it as GUNV

We should also check if there are any other issues similar to this.
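
To illustrate the first option, here is a minimal sketch of inferring the more general tags at analysis time. The `IMPLIED_BY` mapping and the `expand_tags` name are hypothetical; only the GUNV to VIOL relationship comes from this issue, and the real mapping would need to be filled in by someone who knows the category list.

```python
# Hypothetical mapping from a specific tag to the more general tags it implies.
# Only GUNV -> VIOL is taken from the issue; other entries would be added here.
IMPLIED_BY = {"GUNV": {"VIOL"}}


def expand_tags(tags, implied_by=IMPLIED_BY):
    """Return the given tags plus every more general tag they imply."""
    expanded = set(tags)
    pending = list(expanded)
    while pending:
        tag = pending.pop()
        for parent in implied_by.get(tag, ()):
            if parent not in expanded:
                expanded.add(parent)
                pending.append(parent)  # follow chains such as A -> B -> C
    return expanded
```

Because the expansion happens when the data is read, existing codings stay untouched and it does not matter whether a volunteer tagged only GUNV or both GUNV and VIOL.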

Break out newsarticle app into a separate project

I was starting to work on a project that involved developing a corpus of Chicago media articles, so I was delighted to find that y'all have done a great job writing the scrapers for that already.

Would you consider breaking newsarticle into a separate project? I'd be happy to help with that.

Geolocation part 1: user-defined location strings

Users should be able to copy/paste strings from the article that provide location info. These strings should be stored on the EnteredData record, and can be used by the algo to learn what location data looks like.

Prevent tags being selected if an article is marked as not relevant

From the data dump I got a while ago, there are 2733 articles out of 271808 total (= ~1% of articles) with tags, but also marked as not relevant.

It would be a nice-to-have to prevent this from happening. Front-end validation would probably be sufficient since I don't think we're super worried about malicious actors...

Look into erratic coding site behavior

I guess one of our coders is having problems with the site. I have tried repeatedly to log in with the admin credentials and I have had varying degrees of success. Sometimes I get a failure message and sometimes it lets me log in, all with the same credentials.

Article publication date column

Currently the article model has a Created and Modified column, but not the actual date of publication. Do we need both created and modified? Idea: Published and Accessed (scraped) times.

Adding/removing categories or other metadata should probably not update either of the timestamps.

Upgrade to Django 1.11

1.11 was just released, and is the next LTS release after 1.8 (what the app is currently on).

Might be a good opportunity to upgrade to Python 3 as well.

Automatically filter non-relevant articles

Utilize NLP data to filter out non-relevant articles, either hiding them from volunteers or automatically coding them as non-relevant, based on some confidence level (85% or 90% ??).
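
A minimal sketch of the thresholding idea, assuming the NLP model produces a probability that an article is relevant. The function name `triage`, the return labels, and the default threshold of 0.90 are placeholders for illustration; the actual confidence level is still an open question above.

```python
def triage(p_relevant, threshold=0.90):
    """Decide how to route an article given the model's P(relevant).

    Returns "auto_not_relevant" when the model is at least `threshold`
    confident the article is NOT relevant, otherwise "needs_human".
    """
    if not 0.0 <= p_relevant <= 1.0:
        raise ValueError("p_relevant must be a probability between 0 and 1")
    p_not_relevant = 1.0 - p_relevant
    if p_not_relevant >= threshold:
        return "auto_not_relevant"
    return "needs_human"
```

Whether "auto_not_relevant" means hiding the article from volunteers or coding it as non-relevant is a separate decision; the sketch only shows where the confidence level would plug in.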

Better handle multiple users coding same article

Currently the model does not keep good track of when a user overwrites the coding done by another user. A better history of article codings by user should be kept. This will be especially important if article codings done by the NLP model are "coded" using a dummy user. We will need to have multiple users for multiple versions of the NLP and will want to keep track of how the different versions coded articles differently.

Scrape non-crime-related articles

Tracy wants to scrape all articles. Right now there are a few news sources where we're only scraping crime-related news feeds (i.e., bettergov, Sun-Times, & Fox). I'll create a PR soon that will scrape all articles from those sources. Should double check with Tracy before merging, though, that he's aware that it won't pick up previous articles. We'll only have non-crime-related articles from those sources from the time changes are made to prod.

Make onboarding easier

Ideas:

  • Redo the README
  • Watch a new person try to get everything set up, fix or document the sticking points
  • Provide a sample dataset for local testing
  • Remove postgis as a dependency (harder to get set up on some OSes)

Models for machine classifications

When we run the classification algo on the full dataset, we probably don't want to modify the article records directly, especially if human coders have already classified them.

Idea: derived metadata (categories, relevant/not, location) should live in its own related table, which includes fields for how it was derived (human or algo). Could there be multiple classifications for an article, with a way to choose which one to use?
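
One way to picture this, sketched in plain Python rather than as Django models. The `Classification` fields, the `preferred` helper, and the rule "human beats algo, newest wins" are all assumptions made for illustration, not a decided design.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Classification:
    """One derived-metadata record for an article (hypothetical shape)."""
    article_id: int
    source: str                      # "human" or "algo"
    relevant: bool
    categories: List[str] = field(default_factory=list)
    model_version: Optional[str] = None  # only meaningful when source == "algo"


def preferred(classifications):
    """Pick the classification to use for one article.

    Assumes the list is ordered oldest to newest. Human codings win over
    algo codings; within the chosen group the newest record wins.
    """
    humans = [c for c in classifications if c.source == "human"]
    pool = humans or list(classifications)
    return pool[-1] if pool else None
```

Keeping every record and choosing at read time means a new model version can be run over the full dataset without overwriting anything a volunteer entered.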

Improve white space handling when selecting location data

The current JavaScript handling location selection when coding articles does not handle white space very robustly. A couple of things we can do to improve it:

  • Check if only white space is selected and, if so, don't mark as location
  • Strip any beginning or ending white space of a selection, part of general selection cleaning
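
The two steps above can be sketched as follows. This is written in Python only to show the logic; the real code is JavaScript, where the same steps would apply. The function name `clean_selection` and the (start, end) offset convention are assumptions.

```python
def clean_selection(text, start, end):
    """Trim white space from the selection text[start:end].

    Returns (new_start, new_end, selected_text), or None when the
    selection contains nothing but white space.
    """
    selected = text[start:end]
    stripped = selected.strip()
    if not stripped:
        return None  # whitespace-only selection: do not mark as a location
    leading = len(selected) - len(selected.lstrip())
    new_start = start + leading
    return new_start, new_start + len(stripped), stripped
```

Returning adjusted offsets along with the cleaned text keeps any highlight in the article body aligned with what is actually stored.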

Make articles pop up for coders

If this can be done, we would like a random uncoded article to pop up for the coder to code, rather than the coder having to click to call up an article. This would eliminate the need for us to provide each coder a specific time range and track that for every coder.

Login problems

The error message I received is below. I tried to log in and got this error message. When I received it, I hit back and hit log in again and it worked fine. I am using Firefox with either the most up-to-date updates or very close. I am also on a Mac.

Forbidden (403)

CSRF verification failed. Request aborted.

More information is available with DEBUG=True.

Table dump problems

I think something is going wrong when dumping the tables? I downloaded the latest .tar.gz file (Updated 9/1/2017) from the sftp server, but when I try to extract it I get an error:

$ tar -xvzf cjp_tables.tgz 
cjp_tables/
cjp_tables/newsarticles_usercoding_categories.csv
cjp_tables/newsarticles_article.csv

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

Opening in archive manager shows two files:

  • newsarticles_article.csv : a 1.4GB file (the previous data dump's version of this file was 2 GB)
  • newsarticles_usercoding_categories.csv : a 0 byte file.
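
A check along these lines could be run against a dump before it is published, so a truncated archive is caught early. This is a sketch using only the Python standard library; the function name `archive_is_complete` is made up here, and it does not explain why the dump was cut short.

```python
import tarfile
import zlib


def archive_is_complete(path, chunk_size=1 << 20):
    """Return True if every member of a gzipped tar can be read to its end."""
    try:
        with tarfile.open(path, "r:gz") as tar:
            for member in tar:
                if member.isfile():
                    stream = tar.extractfile(member)
                    # Read and discard the data; a truncated gzip stream
                    # raises an error once the missing part is reached.
                    while stream.read(chunk_size):
                        pass
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False
    return True
```

For the file above, a call such as `archive_is_complete("cjp_tables.tgz")` would be expected to return False, matching the "unexpected end of file" error from tar.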

Restart nightly DB exports

Now that this app has been migrated to Elastic Beanstalk (:tada:), I'd love to get the nightly DB exports happening again.

(I'm mainly adding this issue so I have something to reference in the article-tagging's CONTRIBUTING.md file.)

Be able to hide obsolete categories in the coding form

Some categories aren't relevant anymore, but we probably shouldn't outright delete them. Add something like an active field to the Category model, which we can then filter on when rendering the Article coding view.

Add new page with docs & training videos

We would like to add a page to the coding site that allows coders to access training videos and the coding documents. This page should only be accessible by coders who have credentials and have signed in.

Capture Author for new & old articles

Ideally we want to be able to capture the author of each article as it is scraped.

We also want to go back through the database of scraped articles and parse out the author from each article.

Add "What's New" notification on login for volunteers

As the procedure for coding/tagging articles by volunteers changes more quickly, we should add some kind of notification (e.g., pop-up) when users log in that lets them know of recent changes to the interface and how they should be coding articles.

Manual article entry

Make it easy to manually input an article. Could just be the admin interface if it's not too error-prone

Have volunteers tag articles over several passes with each pass being a subset of tags

Right now volunteers tag/code articles using all ~30 categories at once, which is too many to keep in mind while tagging. Categories should be broken into groups (e.g., crimes, agencies, etc.) and volunteers should be assigned to one group for a while, only tagging articles for the group of tags they're assigned to.

Partially coded articles will be prioritized when a volunteer logs in and starts coding. Once there are no remaining partially coded articles, the volunteer will be presented a random article to code.

Volunteers should always tag addresses/location data regardless of the group of categories they are currently tagging.
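
A rough sketch of the article selection described above. It assumes "partially coded" means an article that has been coded for at least one tag group but not yet for the volunteer's current group; the data shape (a dict of article id to the set of groups already coded) and the name `next_article` are invented for the example.

```python
import random


def next_article(coded_groups, group, rng=random):
    """Choose the next article id for a volunteer assigned to `group`.

    coded_groups maps article id -> set of tag groups already coded.
    Partially coded articles are served first; otherwise a random article
    still missing this group is returned, or None if nothing is left.
    """
    remaining = {a: done for a, done in coded_groups.items() if group not in done}
    partial = sorted(a for a, done in remaining.items() if done)
    if partial:
        return rng.choice(partial)
    if remaining:
        return rng.choice(sorted(remaining))
    return None
```

Location tagging is not modeled here, since it applies on every pass regardless of the assigned group.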

Remove PostGIS Dependency

The dependency on PostGIS comes from the geocode_point spatial field type and the use of the GeoManager model manager in the crimedata model. These may not be used much right now, but we need a migration strategy to remove them.

Add column description to data dump

At one point there was also a file column_names.txt that came with the data dump. Could that be added as well? Maybe something like

psql cjpweb_prd -h $DATABASE_URL -U cjpuser -c "\d *" > column_names.txt

? (UNTESTED! I know nothing about psql. Hacked that together from man pages.)
