casmlab / get-nytimes-articles

Python tools for getting data from the New York Times Article API. Retrieves JSON from the API, stores it, parses it into a TSV file.

License: MIT License

Python 100.00%

get-nytimes-articles

Python tools for getting data from the New York Times Article API. Retrieves JSON from the API, stores it, parses it into a TSV file.

New York Times Article API Docs: http://developer.nytimes.com/docs/read/article_search_api_v2

Requesting an API Key for the Times API: http://developer.nytimes.com/docs/reference/keys

Recent Updates

use config file instead of manually editing lines in main .py file
check whether file exists before trying to parse it (See Issue #1)
changed references to CSV to TSV since that's what really gets produced
make script smart about whether or not to keep fetching for that day (i.e., stop when no more articles)
solve KeyError issues in parse module
get better info from API calls with errors

Dependencies

Python v2.7 (not tested on any others)

Modules:

urllib2 (HTTPError)
json
datetime
time
sys
ConfigParser
logging

Why store the JSON files? Why not just parse them?

The New York Times is nice enough to allow programmatic access to its articles, but that doesn't mean I should query the API every time I want data. Instead, I query it once and cache the raw data, lessening the burden on the Times API. Then, I parse that raw data into whatever format I need - in this case a tab-delimited file with only some of the fields - and leave the raw data alone. Next time I have a research question that relies on the same articles, I can just re-parse the stored JSON files into whatever format helps me answer my new question.
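The cache-then-parse idea can be sketched roughly as below. This is a minimal illustration, not the code in this repository: the function names, the `date_page.json` file naming, and the chosen fields are assumptions, and it uses Python 3 syntax even though the project itself targets 2.7. It also assumes each stored response keeps its articles under `response` and then `docs`.

```python
import csv
import json
import os


def cache_response(json_dir, date, page, payload):
    """Store one raw API response on disk, unless it is already cached."""
    # Hypothetical naming scheme: one file per day and page.
    path = os.path.join(json_dir, "%s_%d.json" % (date, page))
    if not os.path.exists(path):
        with open(path, "w") as f:
            json.dump(payload, f)
    return path


def parse_cached(json_dir, tsv_path, fields):
    """Re-parse every cached JSON file into a tab-delimited file."""
    rows = 0
    with open(tsv_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(fields)
        for name in sorted(os.listdir(json_dir)):
            if not name.endswith(".json"):
                continue
            with open(os.path.join(json_dir, name)) as f:
                data = json.load(f)
            # Assumed layout: {"response": {"docs": [ {...}, ... ]}}
            for doc in data.get("response", {}).get("docs", []):
                # Missing fields become empty cells instead of a KeyError.
                writer.writerow([doc.get(field, "") for field in fields])
                rows += 1
    return rows
```

A new research question then only needs another call to `parse_cached` with a different `fields` list; the API is not queried again.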

Usage

Set your variables in the config file (copy settings_example.cfg to settings.cfg).

python getTimesArticles.py
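For orientation, reading such a config file might look like the sketch below. The section name `main` and the keys in the usage lines are hypothetical; check settings_example.cfg for the real ones. The module is called `ConfigParser` in Python 2.7 and `configparser` in Python 3, which is what this sketch uses.

```python
import configparser  # "ConfigParser" in Python 2.7


def load_settings(path="settings.cfg"):
    """Return the config file as {section: {key: value}}."""
    # RawConfigParser does no "%" interpolation, so URL-encoded
    # query strings such as %22...%22 are read back unchanged.
    parser = configparser.RawConfigParser()
    if not parser.read(path):
        raise IOError("Could not read config file: %s" % path)
    return {section: dict(parser.items(section))
            for section in parser.sections()}


# Usage (section and key names are assumptions):
# settings = load_settings("settings.cfg")
# api_key = settings["main"]["api_key"]
```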

Planned improvements

capture and re-request page after intermittent "504: Gateway Timeout" errors
make script smart about running multi-day processes (i.e., respect the API limit and wait when more than 10K calls are needed)

get-nytimes-articles's Issues

Error At End of News Day

Your NYT api wrapper seems like it would work perfectly for part of my own work but I keep getting a weird error when I try to run it. I've attached the testing log that it outputs but it is basically this again and again: ERROR:root:IOError in 19700107 page 63: 2 No such file or directory.

I have modified the search query to the following:

request_string = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=%22Earned+Income+Tax+Credit%22" + date + "&end_date=" + date + "&page=" + str(page) + "&api-key=" + api_key
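For comparison, here is a sketch of the same request assembled with `urlencode`, so that every value is paired with an explicit parameter name. The parameter names (`q`, `begin_date`, `end_date`, `page`, `api-key`) follow the request URL that appears in this project's error logs; the function name is hypothetical, the import is the Python 3 one (`urllib.urlencode` in 2.7), and whether this has any bearing on the error above is not confirmed.

```python
from urllib.parse import urlencode  # urllib.urlencode in Python 2.7

BASE_URL = "http://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_request_string(query, date, page, api_key):
    """Build a single-day search URL with named, URL-encoded parameters."""
    params = [
        ("q", query),            # quotes and spaces are encoded for us
        ("begin_date", date),    # YYYYMMDD
        ("end_date", date),      # same day: one day per request
        ("page", page),
        ("api-key", api_key),
    ]
    return BASE_URL + "?" + urlencode(params)
```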

For reference, this is the string I am running at the CLI with API key included:

python pullarticles.py -j /vagrant/nyt/articles -c /vagrant/nyt/file.csv -k APIKEY

I am running Ubuntu through a vagrant instance of a virtual machine. I don't think this is the problem though as I sometimes get a few hits. I am moderately familiar with Python and have played around with the Twitter API before but I can't figure this out.

That query works in the NYT API console and spits out similar results in the archive search as well.

Thanks!

Handle 429 errors gracefully

The API occasionally returns HTTP 429 (Too Many Requests) errors. These can usually be resolved by resending the request. The script needs to:

pause
resend request that generated 429
resume normal operation
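The three steps above could be sketched as follows. This is an illustration under assumptions, not the fix that was merged: the function name, the default of three attempts, and the five-second wait are made up, and the imports are the Python 3 equivalents of `urllib2`. The `opener` and `sleep` parameters exist only so the function can be exercised without touching the network.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, max_attempts=3, wait_seconds=5,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, pausing and resending when the API answers 429."""
    for attempt in range(1, max_attempts + 1):
        try:
            return opener(url).read()
        except urllib.error.HTTPError as err:
            # Anything other than 429, or a 429 on the last attempt,
            # is passed on to the caller unchanged.
            if err.code != 429 or attempt == max_attempts:
                raise
            sleep(wait_seconds)  # pause, then loop to resend
```

The caller's paging loop does not change: a successful retry returns the response body as usual, so normal operation resumes on the next page.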

Retry calls that failed

API still returns the occasional 403, 429 or 504 error. Need to capture a single list of pages that failed, check if they eventually worked (429 fix in #5 has a wait + retry feature), and retry if they didn't.
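One possible shape for this, as a rough sketch: keep a single list of (date, page) pairs that failed, give each a limited number of further tries, and report whatever is still failing at the end. The function name, the `fetch_page` callback, and the default of two retry rounds are assumptions for illustration; it catches `OSError`, which in Python 3 covers `HTTPError` and `IOError`.

```python
def fetch_pages_with_retries(pages, fetch_page, retry_rounds=2):
    """Fetch every (date, page) pair; return the pairs that never worked.

    fetch_page(date, page) is expected to raise on failure.
    """
    pending = list(pages)
    for _ in range(1 + retry_rounds):  # first pass plus the retries
        still_failing = []
        for date, page in pending:
            try:
                fetch_page(date, page)
            except OSError:
                still_failing.append((date, page))
        pending = still_failing
        if not pending:
            break
    return pending
```

The returned list can be written to the log or to a file so that a later run can pick up only the missing pages.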

Empty or malformed queries don't quit script

Here's an example line from the logs:
ERROR:root:HTTPError on page 92 on 20140106 (err no. 400: Bad Request) Here's the URL of the call: http://api.nytimes.com/svc/search/v2/articlesearch.json?q=&begin_date=20140106&end_date=20140106&page=92&api-key=[hidden]

Basically, if the settings.cfg file isn't properly completed, the script just keeps calling the API with bad requests. Need to institute checks of the settings so the script fails gracefully and helpfully if they're wrong.
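A minimal sketch of such a check, run once before the first API call. The setting names (`api_key`, `query`, `begin_date`, `end_date`) are assumptions and may not match the keys in settings.cfg; the date format is taken from the YYYYMMDD values in the logged URL above.

```python
import re
import sys


def validate_settings(settings):
    """Return a list of human-readable problems; empty means OK."""
    problems = []
    for name in ("api_key", "query"):
        if not str(settings.get(name) or "").strip():
            problems.append("%s is empty" % name)
    for name in ("begin_date", "end_date"):
        if not re.match(r"^\d{8}$", str(settings.get(name) or "")):
            problems.append("%s must be in YYYYMMDD form" % name)
    return problems


def exit_if_invalid(settings):
    """Stop with a helpful message instead of sending bad requests."""
    problems = validate_settings(settings)
    if problems:
        sys.exit("Please fix settings.cfg:\n  " + "\n  ".join(problems))
```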

Script Stops Before Time Parameters

Hi,

I'm hoping you can help with this. I am trying to use this script to run a search over a number of years. The output file seems to stop before the first day of the first month, with about 200 results. Any idea of what could be going wrong?

Thanks,

Megan

Handle 403 errors gracefully

HTTP 403 Forbidden errors occur, probably because of API call limits. Script should fail gracefully - tell user what's happening and why. Currently does so in logs, but people aren't always looking there.
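A sketch of what "tell the user" could mean in practice: write the explanation to the console as well as to the log. The function name and the wording of the hint are assumptions, and the call-limit explanation is only the probable cause suggested above, so the message hedges accordingly.

```python
import logging
import sys

# Hypothetical hints, keyed by HTTP status code.
HINTS = {
    403: ("HTTP 403 Forbidden: the API refused the request, probably "
          "because the call limit for this key has been reached. "
          "Try again later."),
}


def report_http_error(code, stream=None):
    """Log the error and also show it to the user on the console."""
    stream = stream or sys.stderr
    message = HINTS.get(code, "HTTP %d returned by the API." % code)
    logging.error(message)        # still recorded in the log
    stream.write(message + "\n")  # visible without opening the log
    return message
```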