
article-tagging's Introduction


tagnews

tagnews is a Python library that can

  • Automatically categorize the text from news articles with type-of-crime tags, e.g. homicide, arson, gun violence, etc.
  • Automatically extract the locations discussed in the news article text, e.g. "55th and Woodlawn" and "1700 block of S. Halsted".
  • Retrieve the latitude/longitude pairs for those locations using an instance of the Pelias geocoder hosted by CJP.
  • Get the community areas those lat/long pairs belong to, using a shapefile downloaded from the city data portal and parsed by the Shapely Python library.

Sound interesting? There's example usage below!

You can find the source code on GitHub.

Installation

You can install tagnews with pip:

pip install tagnews

NOTE: You will need to install some NLTK packages as well:

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')

Beware: tagnews requires Python >= 3.5.

Example

The main classes are tagnews.CrimeTags and tagnews.GeoCoder.

>>> import tagnews
>>> crimetags = tagnews.CrimeTags()
>>> article_text = ('The homicide occurred at the 1700 block of S. Halsted Ave.'
...   ' It happened just after midnight. Another person was killed at the'
...   ' intersection of 55th and Woodlawn, where a lone gunman')
>>> crimetags.tagtext_proba(article_text)
HOMI     0.739159
VIOL     0.146943
GUNV     0.134798
...
>>> crimetags.tagtext(article_text, prob_thresh=0.5)
['HOMI']
>>> geoextractor = tagnews.GeoCoder()
>>> prob_out = geoextractor.extract_geostring_probs(article_text)
>>> list(zip(*prob_out))
[..., ('at', 0.0044685714), ('the', 0.005466637), ('1700', 0.7173856),
 ('block', 0.81395197), ('of', 0.82227415), ('S.', 0.7940061),
 ('Halsted', 0.70529455), ('Ave.', 0.60538065), ...]
>>> geostrings = geoextractor.extract_geostrings(article_text, prob_thresh=0.5)
>>> geostrings
[['1700', 'block', 'of', 'S.', 'Halsted', 'Ave.'], ['55th', 'and', 'Woodlawn,']]
>>> coords, scores = geoextractor.lat_longs_from_geostring_lists(geostrings)
>>> coords
         lat       long
0  41.859021 -87.646934
1  41.794816 -87.597422
>>> scores # confidence in the lat/longs as returned by pelias, higher is better
array([0.878, 1.   ])
>>> geoextractor.community_area_from_coords(coords)
['LOWER WEST SIDE', 'HYDE PARK']
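Under the hood, community_area_from_coords is a point-in-polygon lookup against the community-area shapes (the library uses Shapely for this). A minimal pure-Python sketch of the ray-casting test, with a toy polygon standing in for the real shapefile:

```python
# Minimal ray-casting point-in-polygon test, illustrating the lookup that
# community_area_from_coords performs. The actual library uses Shapely and the
# city data portal's community-area shapefile; this polygon is a toy example.

def point_in_polygon(lat, lon, polygon):
    """Return True if (lat, lon) falls inside polygon, a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        y1, x1 = polygon[i]
        y2, x2 = polygon[(i + 1) % n]
        # Count crossings of a horizontal ray extending left from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Toy "community area" roughly covering Hyde Park.
hyde_park = [(41.809, -87.607), (41.809, -87.580),
             (41.780, -87.580), (41.780, -87.607)]
print(point_in_polygon(41.794816, -87.597422, hyde_park))  # True
```

The real lookup just repeats this test (via Shapely) against every community area's polygon and returns the name of the containing one.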

Limitations

This project uses Machine Learning to automate data cleaning/preparation tasks that would be cost- and time-prohibitive to perform manually. Like all Machine Learning projects, the results are not perfect, and in some cases may look just plain bad.

We have strived to build the best models possible, but perfect accuracy is rarely attainable. If you have thoughts on how to do better, please consider reporting an issue, or better yet contributing.

How can I contribute?

Great question! Please see CONTRIBUTING.md.

Problems?

If you have problems, please report an issue. Anything that is behaving unexpectedly is an issue, and should be reported. If you are getting bad or unexpected results, that is also an issue, and should be reported. We may not be able to do anything about it, but more data rarely degrades performance.

Background

We want to compare how much different types of crime are reported in certain areas versus how often those crimes actually occur there. In essence, are some crimes under-represented in certain areas but over-represented in others? This is the main question driving the analysis.

This question came from the Chicago Justice Project. They have been interested in answering this question for quite a while, and have been collecting the data necessary to have a data-backed answer. Their efforts include

  1. Scraping RSS feeds of articles written by Chicago area news outlets for several years, allowing them to collect almost half a million articles.
  2. Organizing an amazing group of volunteers that have helped them tag these articles with crime categories like "Gun Violence" and "Drugs", but also organizations such as "Cook County State's Attorney's Office", "Illinois State Police", "Chicago Police Department", and other miscellaneous categories such as "LGBTQ", "Immigration".
  3. Updating the web UI used for this tagging to allow highlighting of geographic information, resulting in several hundred articles with labeled location sub-strings.

Most of the code for those components can be found here.

A group actively working on this project meets every Tuesday at Chi Hack Night.

article-tagging's People

Contributors

alabavery, jlherzberg, kbrose, m00nd00r, mattesweeney, mchladek, rjworth


article-tagging's Issues

Bag of Words

Try a simple bag-of-words model for article tagging.
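A minimal sketch of the feature extraction behind such a model, using a toy fixed vocabulary (a real implementation would likely use scikit-learn's CountVectorizer fitted on the article corpus):

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent text as a vector of word counts over a fixed vocabulary.

    The vocabulary here is a hand-picked toy list; in practice it would be
    learned from the training articles.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]

vocab = ['homicide', 'arson', 'police', 'drugs']
vec = bag_of_words('Police responded to the homicide. Police are investigating.', vocab)
print(vec)  # [1, 0, 2, 0]
```

These count vectors would then be fed to any standard classifier (logistic regression, naive Bayes, etc.) to predict the type-of-crime tags.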

Semantic word vectorizer

Are there easy-to-use, pre-trained, cross-platform word vectorizers? Preferably with a python interface?

Add continuous integration

I've been struggling with how to do this because:

  • I don't want to put trained models in version control
  • In order to train a model you need the data
  • We can't pull the 3GB of data down to the CI server each time we want to run the unit tests
  • Without a trained model there's not much to test

I think instead it may make sense to put a very small subset of data in version control, both of article data and word vectors. Just enough to train any model. The tests then won't be measuring accuracy or anything, but we could test that everything is hooked up correctly.

Determine what post-processing of geostrings should be.

Things like removing the string "block of ", or appending "Chicago Illinois" to the geostring if it doesn't already have it.

This will be somewhat geocoding-provider dependent, but there's probably a lot of overlap in which transformations are beneficial.
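A sketch of what such a post-processing pass might look like; the postprocess_geostring helper is hypothetical and applies just the two transformations suggested above:

```python
def postprocess_geostring(geostring):
    """Normalize an extracted geostring before sending it to a geocoder.

    Hypothetical helper: drops the 'block of' phrase and appends the
    city/state when missing, per the two transformations proposed above.
    """
    cleaned = geostring.replace('block of ', '')
    if 'chicago' not in cleaned.lower():
        cleaned += ', Chicago, Illinois'
    return cleaned

print(postprocess_geostring('1700 block of S. Halsted Ave.'))
# 1700 S. Halsted Ave., Chicago, Illinois
```

Which additional normalizations pay off would need to be measured per provider.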

Find a good NER corpus to use as training data

Named Entity Recognition describes a superset of what we are trying to do with the location extraction. It is also a well-studied field, so there are probably open corpora with labeled data we could use to augment (or use exclusively as) our training data.

We should do some investigating to try to find such a corpus.

Update documentation - CONTRIBUTING

  • Simplify setup instructions, especially for code dependencies
  • How to get the database exports
  • Brief summary of directory structure
  • Break down the crime tags model
    • Where it is and what it is right now
    • Ways in which it could be improved
    • How to measure performance
  • Break down the geostring model
  • Geocoding
    • What to do if it breaks (do we want a section in the README too?)
    • Ways in which it could be improved (confidence score)
  • Tests
    • Travis CI
    • Running locally
    • Writing tests
  • Documentation
    • How to write documentation
    • How to publish it?
  • How to publish a new version
    • Add section on how to get the models in there!
  • Other, non-coding ways to help
  • Remove/consolidate GEOTAGGING_CONTRIB

Problem with running bag-of-words-count-stemmed.ipynb

After cloning article-tagging locally, if python setup.py install isn't run first, the df = ld.load_data() line in the bag-of-words notebook errors because it can't find the news articles being loaded.

Seems to work once the setup is run.

Error creating 'fpr', 'tpr', 'ppv' dataframes from bench_results

While trying to create the model locally in bag-of-words-count-stemmed-binary.ipynb at this code block:

fpr = pd.DataFrame(bench_results['fpr'], columns=crime_df.columns.values[9:]).T

tpr = pd.DataFrame(bench_results['tpr'], columns=crime_df.columns.values[9:]).T

ppv = pd.DataFrame(bench_results['ppv'], columns=crime_df.columns.values[9:]).T

I get the following error:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/anaconda/envs/cjp-ap/lib/python3.6/site-packages/pandas/core/internals.py in create_block_manager_from_blocks(blocks, axes)
   4293             blocks = [make_block(values=blocks[0],
-> 4294                                  placement=slice(0, len(axes[0])))]
   4295

~/anaconda/envs/cjp-ap/lib/python3.6/site-packages/pandas/core/internals.py in make_block(values, placement, klass, ndim, dtype, fastpath)
   2718
-> 2719     return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
   2720

~/anaconda/envs/cjp-ap/lib/python3.6/site-packages/pandas/core/internals.py in __init__(self, values, placement, ndim, fastpath)
    114                              'implies %d' % (len(self.values),
--> 115                                              len(self.mgr_locs)))
    116

ValueError: Wrong number of items passed 38, placement implies 41

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 fpr = pd.DataFrame(bench_results['fpr'], columns=crime_df.columns.values[9:]).T
      2
      3 tpr = pd.DataFrame(bench_results['tpr'], columns=crime_df.columns.values[9:]).T
      4
      5 ppv = pd.DataFrame(bench_results['ppv'], columns=crime_df.columns.values[9:]).T

~/anaconda/envs/cjp-ap/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    304             else:
    305                 mgr = self._init_ndarray(data, index, columns, dtype=dtype,
--> 306                                          copy=copy)
    307         elif isinstance(data, (list, types.GeneratorType)):
    308             if isinstance(data, types.GeneratorType):

~/anaconda/envs/cjp-ap/lib/python3.6/site-packages/pandas/core/frame.py in _init_ndarray(self, values, index, columns, dtype, copy)
    481             values = maybe_infer_to_datetimelike(values)
    482
--> 483         return create_block_manager_from_blocks([values], [columns, index])
    484
    485     @property

~/anaconda/envs/cjp-ap/lib/python3.6/site-packages/pandas/core/internals.py in create_block_manager_from_blocks(blocks, axes)
   4301         blocks = [getattr(b, 'values', b) for b in blocks]
   4302         tot_items = sum(b.shape[0] for b in blocks)
-> 4303         construction_error(tot_items, blocks[0].shape[1:], axes, e)
   4304
   4305

~/anaconda/envs/cjp-ap/lib/python3.6/site-packages/pandas/core/internals.py in construction_error(tot_items, block_shape, axes, e)
   4278         raise ValueError("Empty data passed with indices specified.")
   4279     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 4280         passed, implied))
   4281
   4282

ValueError: Shape of passed values is (38, 4), indices imply (41, 4)

I did manage to run the model successfully before using the new database files. However, even when I try running the previous database subsequently, I get the exact same problem.

Any ideas as to how to resolve this?

Before submitting this issue, I deleted and re-cloned this repo, ran the setup, and re-ran this notebook three times. I'm working on macOS. Finally, running tagnews.test() from the command line also fails, if that helps.

Model Loading Fails When There is No Model

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-2bcf5cf913b8> in <module>()
      1 import tagnews
----> 2 tagger = tagnews.crimetype.tag.Tagger()
      3 article_text = 'A short article. About drugs and police.'
      4 tagger.relevant(article_text, prob_thresh=0.1)

/Users/nathancooperjones/GitHub/article-tagging/lib/tagnews/crimetype/tag.py in __init__(self, model_directory, clf, vectorizer)
     67         """
     68         if clf is None and vectorizer is None:
---> 69             self.clf, self.vectorizer = load_model(model_directory)
     70         elif clf is None or vectorizer is None:
     71             raise ValueError(('clf and vectorizer must both be None,'

/Users/nathancooperjones/GitHub/article-tagging/lib/tagnews/crimetype/tag.py in load_model(location)
     33     """
     34     models = glob.glob(os.path.join(location, 'model*.pkl'))
---> 35     model = models.pop()
     36     while models:
     37         model_time = time.strptime(model[-19:-4], '%Y%m%d-%H%M%S')

IndexError: pop from empty list

Investigate NER and CRF

Investigate Named Entity Recognition and the related field of Conditional Random Fields for the purpose of location extraction from articles.

Make standard benchmarking code

Benchmarker should...

  • accept a pre-processed NxM matrix of numeric data
  • accept a classifier that has .fit and .predict_proba methods
  • perform k-fold cross validation (with a fixed random seed!), making sure the classifier is reset after each run
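A sketch of the proposed benchmarking loop under these requirements; the kfold_indices and benchmark names are hypothetical, and the fixed seed makes the folds reproducible:

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Shuffle indices with a fixed random seed and deal them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def benchmark(make_clf, X, y, k=5, seed=0):
    """K-fold cross-validate; make_clf() builds a fresh classifier for each
    fold, guaranteeing the classifier is reset after each run."""
    all_probs = []
    for fold in kfold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        clf = make_clf()  # fresh instance per fold
        clf.fit([X[i] for i in train], [y[i] for i in train])
        all_probs.append(clf.predict_proba([X[i] for i in fold]))
    return all_probs
```

Passing a factory (make_clf) rather than a fitted instance is one simple way to enforce the reset-between-runs requirement.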

Open question: How to handle other languages?

Better test coverage

The constant struggle to cover every possible path through your code with every possible state.

CLI from stdin "broken" on windows

Perhaps a better description would blame Windows for refusing to follow conventions, but it appears that "ctrl-d" does not send an EOF signal in the command prompt as it (kind of) does in bash. This means it is impossible to gracefully exit from the stdin-based CLI:

>python -m tagnews.crimetype.cli
Go ahead and start typing. Hit ctrl-d when done.
This is a test article, attempting to get reading in from the stdin working.
^D^D^D

It would be great to either figure out how to send EOF from command prompt and document that (preferred), or modify the CLI to correctly handle the situation.
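For what it's worth, cmd.exe's own convention is Ctrl+Z followed by Enter rather than Ctrl+D, so documenting that may be the smallest fix. A sketch of the reading side, written against an injectable stream so it can be tested without a terminal (read_article is a hypothetical helper, not the actual CLI code):

```python
import io
import sys

def read_article(stream=None):
    """Read article text until EOF: ctrl-d in bash, ctrl-z then Enter in cmd.exe."""
    stream = stream if stream is not None else sys.stdin
    return stream.read()

# Demo with an in-memory stream standing in for stdin.
print(read_article(io.StringIO('A test article.')))  # A test article.
```

Keeping the stream injectable like this also makes the EOF behavior unit-testable on any platform.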

Capturing incident times

Would the project find any value in capturing incident times associated with location information in news articles? I'm working on address extraction and it would be a small effort to annotate related time data in the test corpus.

Simplify example geocoding notebook

Right now the geocoding example notebook uses the NER dataset from Kaggle. Current best results are achieved using only our own training data, downloadable from https://geo-extract-tester.herokuapp.com. To make things as easy as possible, the example notebook should be updated to download our training.txt file instead of Kaggle's.

Notebook in question: https://github.com/chicago-justice-project/article-tagging/blob/master/lib/notebooks/extract-geostring-example.ipynb

CLI from stdin and printing

Currently, when no arguments are passed to the CLI then it reads from stdin. A message is printed out to guide the user. However, this is just annoying when using it by piping something else to it, e.g.

$ cat article.txt | python -m tagnews.crimetype.cli

I think it would be better to

  • update the CLI so that there's a flag for reading from stdin; in that case no guidance message is needed, because the user already chose to use a flag, so they must know what they're doing
  • when no arguments are passed, print the help and quit.
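Those two behaviors can be sketched with argparse; the --stdin flag name and build_parser helper are hypothetical, not the CLI's current interface:

```python
import argparse
import sys

def build_parser():
    """Hypothetical argument parser for the crimetype CLI."""
    parser = argparse.ArgumentParser(prog='tagnews.crimetype.cli')
    parser.add_argument('files', nargs='*', help='article files to tag')
    parser.add_argument('--stdin', action='store_true',
                        help='read article text from stdin (no guidance message printed)')
    return parser

def main(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.files and not args.stdin:
        # No input source chosen: print help and quit rather than
        # silently blocking on stdin.
        parser.print_help()
        raise SystemExit(2)
    text = sys.stdin.read() if args.stdin else '\n'.join(
        open(f).read() for f in args.files)
    return text  # tagging of `text` would happen here
```

With an explicit --stdin flag, piping (cat article.txt | python -m tagnews.crimetype.cli --stdin) stays quiet, and bare invocation fails fast with usage information.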

Update documentation - README

  • Remove/update out-of-date info
  • Simplify README.md to make landing page more manageable
  • Clearly state the goals of the project
  • Make explicit the limitations of the project/using machine learning for something like this
  • Clarify the connections to the other CJP projects
  • Investigate other projects' READMEs, are there other sections we're missing?

Create a function to get lat/long values from list of geo-strings.

Related to #72.

Function signature should approximate

>>> lat_longs = lat_longs_from_geo_strings(['1700 block of south halsted', 'merchandise mart'])
>>> print(lat_longs)
[[12.345, 67.890], [98.76, 54.321]]

Some installation instructions for whatever API is being used will also be needed.
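A sketch of the proposed signature, with a stand-in geocode_one callable where a real provider call (e.g. to the CJP-hosted Pelias instance) would go; both names are hypothetical:

```python
def lat_longs_from_geo_strings(geo_strings, geocode_one):
    """Geocode each geostring to a [lat, long] pair.

    geocode_one(s) -> (lat, long) is a stand-in for a call to a real
    geocoding provider; it is injected here so the sketch stays offline.
    """
    return [list(geocode_one(s)) for s in geo_strings]

# Stand-in provider for illustration only; these lookups would really be
# network calls, and the merchandise mart coordinates are placeholders.
fake_provider = {
    '1700 block of south halsted': (41.859021, -87.646934),
    'merchandise mart': (41.888500, -87.635500),
}
lat_longs = lat_longs_from_geo_strings(list(fake_provider), fake_provider.get)
print(lat_longs)
```

Injecting the provider call also keeps the function testable without hitting whichever geocoding API is eventually chosen.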

Add documentation on the data

Probably as a section in CONTRIBUTING.md.

  • Versions
  • Where to extract it
  • How to load it
  • Memory usage and ways to mitigate it
  • Column descriptions (including the new locations column, and the one-hot-encoded type-of-crime tags)

(These can be chunked off and handled one-by-one depending on each contributor's comfort with each bullet point.)
