get-nytimes-articles's Introduction

get-nytimes-articles

Python tools for getting data from the New York Times Article API. They retrieve JSON from the API, store it, and parse it into a TSV file.

New York Times Article API Docs: http://developer.nytimes.com/docs/read/article_search_api_v2

Requesting an API Key for the Times API: http://developer.nytimes.com/docs/reference/keys

Recent Updates

  • use config file instead of manually editing lines in main .py file
  • check whether file exists before trying to parse it (See Issue #1)
  • changed references to CSV to TSV since that's what really gets produced
  • make script smart about whether or not to keep fetching for that day (i.e., stop when no more articles)
  • solve KeyError issues in parse module
  • get better info from API calls with errors

Dependencies

Python v2.7 (not tested on any others)

Modules:

  • urllib2 (HTTPError)
  • json
  • datetime
  • time
  • sys
  • ConfigParser
  • logging

Why store the JSON files? Why not just parse them?

The New York Times is nice enough to allow programmatic access to its articles, but that doesn't mean I should query the API every time I want data. Instead, I query it once and cache the raw data, lessening the burden on the Times API. Then, I parse that raw data into whatever format I need - in this case a tab-delimited file with only some of the fields - and leave the raw data alone. Next time I have a research question that relies on the same articles, I can just re-parse the stored JSON files into whatever format helps me answer my new question.
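The cache-then-parse workflow described above can be sketched roughly as follows. This is an illustration, not the project's actual code: the function names, the `DATE_PAGE.json` file naming, and the chosen fields are assumptions, and the `response` / `docs` layout is assumed from the Article Search API v2 format. It is written for Python 3, whereas the project itself targets Python 2.7.

```python
import csv
import json
import os


def cache_response(raw_json, json_dir, date, page):
    """Write one raw API response to disk, untouched, and return its path."""
    os.makedirs(json_dir, exist_ok=True)
    # Hypothetical naming scheme: one file per date and page.
    path = os.path.join(json_dir, "%s_%d.json" % (date, page))
    with open(path, "w", encoding="utf-8") as f:
        f.write(raw_json)
    return path


def parse_cached(json_dir, tsv_path, fields=("pub_date", "web_url")):
    """Re-parse every cached file into a TSV holding only the chosen fields.

    Returns the number of article rows written.
    """
    rows = 0
    with open(tsv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(fields)
        for name in sorted(os.listdir(json_dir)):
            if not name.endswith(".json"):
                continue
            with open(os.path.join(json_dir, name), encoding="utf-8") as f:
                data = json.load(f)
            # .get() avoids a KeyError when a response or article lacks a field.
            for doc in data.get("response", {}).get("docs", []):
                writer.writerow([doc.get(k, "") for k in fields])
                rows += 1
    return rows
```

Answering a new research question then only means calling `parse_cached` again with a different `fields` tuple; no further API calls are needed.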

Usage

Set your variables in the config file (copy settings_example.cfg to settings.cfg).

python getTimesArticles.py

Planned improvements

  • capture and re-request page after intermittent HTTP 504 (Gateway Timeout) errors
  • make script smart about running multi-day processes (i.e., respect the API limit and wait when more than 10K calls are needed)
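One way the planned call-limit handling might look is a small budget counter that sleeps once the quota is used up. This is only a sketch under stated assumptions: the `CallBudget` class is hypothetical, the limit is assumed to be 10,000 calls per 24-hour window, and the window is assumed to start when the script starts rather than at a fixed time of day. It is written for Python 3.

```python
import time


class CallBudget(object):
    """Count API calls and wait for the next window once the limit is hit."""

    def __init__(self, limit=10000, period=24 * 60 * 60,
                 clock=time.time, sleep=time.sleep):
        self.limit = limit      # assumed quota per window
        self.period = period    # assumed window length in seconds
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.used = 0

    def spend(self):
        """Record one call, sleeping first if this window's budget is gone."""
        now = self.clock()
        if now - self.window_start >= self.period:
            # The window has already rolled over; start counting afresh.
            self.window_start, self.used = now, 0
        if self.used >= self.limit:
            # Wait out the rest of the window, then start a new one.
            self.sleep(self.window_start + self.period - now)
            self.window_start, self.used = self.clock(), 0
        self.used += 1
```

Calling `budget.spend()` immediately before each request would let a multi-day run pause on its own instead of collecting errors once the quota is exhausted.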

get-nytimes-articles's People

Contributors

libbyh

get-nytimes-articles's Issues

Error At End of News Day

Your NYT API wrapper seems like it would work perfectly for part of my own work, but I keep getting a weird error when I try to run it. I've attached the testing log that it outputs, but it is basically this line again and again: ERROR:root:IOError in 19700107 page 63: 2 No such file or directory.

I have modified the search query to the following:

request_string = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=%22Earned+Income+Tax+Credit%22" + date + "&end_date=" + date + "&page=" + str(page) + "&api-key=" + api_key

For reference, this is the string I am running at the CLI with API key included:

python pullarticles.py -j /vagrant/nyt/articles -c /vagrant/nyt/file.csv -k APIKEY

I am running Ubuntu through a vagrant instance of a virtual machine. I don't think this is the problem though as I sometimes get a few hits. I am moderately familiar with Python and have played around with the Twitter API before but I can't figure this out.

That query works in the NYT API console and spits out similar results in the archive search as well.

Thanks!

Handle 429 errors gracefully

The API occasionally returns HTTP 429 (Too Many Requests) errors. Resending the request usually resolves them. The script needs to:

  1. pause
  2. resend request that generated 429
  3. resume normal operation
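The three steps above could be sketched as a small wrapper around the request. This is not the fix that was actually committed: `fetch_with_retry` and its parameters are hypothetical, the pause lengths are arbitrary, and it uses Python 3's `urllib.request` rather than the `urllib2` module the project depends on.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, max_retries=3, pause=5.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url; on HTTP 429, pause and resend up to max_retries times.

    Any other HTTP error, or a 429 on the final attempt, is re-raised.
    """
    for attempt in range(max_retries + 1):
        try:
            # Success returns the body, so the caller resumes normal operation.
            return opener(url).read()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_retries:
                raise
            # Pause a little longer after each consecutive 429, then resend.
            sleep(pause * (attempt + 1))
```

The `opener` and `sleep` parameters exist only so the function can be exercised without touching the network or actually waiting.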

Retry calls that failed

API still returns the occasional 403, 429 or 504 error. Need to capture a single list of pages that failed, check if they eventually worked (429 fix in #5 has a wait + retry feature), and retry if they didn't.
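A minimal sketch of the bookkeeping this issue asks for: keep one list of the (date, page) pairs that failed, and go back over it for a few rounds. The `fetch_all` function and the `fetch(date, page)` callable are hypothetical stand-ins for whatever the script uses to request a single page. The sketch is written for Python 3, where `urllib.error.HTTPError` is a subclass of `OSError`.

```python
def fetch_all(pages, fetch, max_rounds=3):
    """Fetch every (date, page) pair, re-queueing failures for later rounds.

    Returns (results, failed): results maps each pair to its payload, and
    failed lists the pairs that still had not succeeded after max_rounds.
    """
    results = {}
    pending = list(pages)
    for _ in range(max_rounds):
        if not pending:
            break
        still_failing = []
        for key in pending:
            try:
                results[key] = fetch(*key)
            except OSError:
                # Covers HTTP errors such as 403, 429 and 504 in Python 3.
                still_failing.append(key)
        pending = still_failing
    return results, pending
```

Whatever remains in `failed` at the end could be written to the log or to a file, so a later run can request just those pages.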

Empty or malformed queries don't quit script

Here's an example line from the logs:
ERROR:root:HTTPError on page 92 on 20140106 (err no. 400: Bad Request) Here's the URL of the call: http://api.nytimes.com/svc/search/v2/articlesearch.json?q=&begin_date=20140106&end_date=20140106&page=92&api-key=[hidden]

Basically, if the settings.cfg file isn't properly completed, the script just keeps calling the API with bad requests. Need to institute checks of the settings so the script fails gracefully and helpfully if they're wrong.
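A settings check along these lines might catch the empty query shown in that log before any request is sent. Treat it as a sketch: the section name `main` and the option names in `REQUIRED` are hypothetical (the real names are whatever settings_example.cfg defines), and it uses Python 3's `configparser` rather than the Python 2.7 `ConfigParser` module.

```python
import configparser

# Hypothetical option names; adjust to match settings_example.cfg.
REQUIRED = ("api_key", "query", "begin_date", "end_date")


def load_settings(path, section="main"):
    """Return the settings as a dict, or raise ValueError saying what is wrong."""
    # interpolation=None lets values such as %22 in a query pass through as-is.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise ValueError("could not read config file: %s" % path)
    if not parser.has_section(section):
        raise ValueError("missing [%s] section in %s" % (section, path))
    settings = dict(parser.items(section))
    empty = [name for name in REQUIRED if not settings.get(name, "").strip()]
    if empty:
        raise ValueError("empty or missing settings: %s" % ", ".join(empty))
    return settings
```

The main script could call `load_settings("settings.cfg")` once at startup, catch the `ValueError`, print its message, and exit before making any API calls.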

Script Stops Before Time Parameters

Hi,

I'm hoping you can help with this. I am trying to use this script to run a search over a number of years. The output file seems to stop before the first day of the first month, with about 200 results. Any idea of what could be going wrong?

Thanks,

Megan

Handle 403 errors gracefully

HTTP 403 Forbidden errors occur, probably because of API call limits. Script should fail gracefully - tell user what's happening and why. Currently does so in logs, but people aren't always looking there.
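One possible shape for this: when a 403 arrives, keep the log entry but also print a plain explanation to stderr and stop. The helper name `stop_on_forbidden` and the message wording are assumptions, and the explanation repeats the issue's own guess that call limits are the probable cause. Written for Python 3.

```python
import logging
import sys


def stop_on_forbidden(code, date, page, stream=None):
    """If code is 403, log it, tell the user on stderr, and exit with status 1.

    Returns False for any other code so the caller can carry on.
    """
    if code != 403:
        return False
    message = (
        "Stopped at %s page %d: the API answered 403 Forbidden, "
        "which probably means the call limit was reached. "
        "Wait for the limit to reset, then restart from this date."
        % (date, page)
    )
    logging.error(message)
    # Also print to the terminal, since people are not always reading the log.
    print(message, file=stream if stream is not None else sys.stderr)
    sys.exit(1)
```

The `stream` argument is there only to make the output easy to capture when testing; in normal use the message goes to stderr.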
