propublica / sunlight-congress

The Sunlight Foundation's Congress API. Shut down on Oct. 1, 2017.

Home Page: https://www.propublica.org/nerds/item/congress-api-bill-subjects-personal-explanations-and-sunsetting-sunlight

License: Other

Ruby 88.24% Shell 1.20% Python 10.56%

sunlight-congress's Issues

Add Amendments

Add an "amendments" endpoint, using GovTrack's amendment XML:

Example:
http://www.govtrack.us/data/us/111/bills.amdt/h234.xml

Have a task, amendments_archive, that loads in all amendments to the table, and then goes over each bill (perhaps by Amendment.distinct(:bill_id) or the like) and adds an array of amendments to the bill. Each amendment on a bill should have only the basic fields (everything but the actions).
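
A rough sketch of the second half of that task, assuming amendments are already loaded as plain hashes. The field names ("bill_id", "actions") follow the description above; the helper names are made up for illustration and are not existing code.

```ruby
# Hypothetical helpers: build the per-bill amendment arrays from loaded amendments.

# Everything but the actions, as described above.
def basic_fields(amendment)
  amendment.reject { |field, _value| field == "actions" }
end

# Returns a hash of bill_id => array of basic amendment hashes.
def amendments_by_bill(amendments)
  amendments
    .group_by { |amendment| amendment["bill_id"] }
    .transform_values { |group| group.map { |amendment| basic_fields(amendment) } }
end
```

Each value in the returned hash is what would be written to the matching bill's "amendments" array.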

Link votes and amendments

votes_archive should look for references to amendments (do they exist?) and add an amendment_id field and amendment subobject if it's there.
Same for rolls_live_house and rolls_live_senate.

Make an index-creating capistrano task

All it does is, for models + sunlight_services + hit + api_key, run the Model.create_indexes method on each one.

Support operators on datetime fields

Greater than and less than. Interpret any datetime-ish query value ("2010-09-29") against the actual datetime fields.
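
One possible way to do the interpretation, as a sketch only: date-only values become midnight UTC, full ISO 8601 values are parsed as-is, and anything else is passed through untouched. The method name is hypothetical.

```ruby
require "time"

DATE_ONLY = /\A(\d{4})-(\d{2})-(\d{2})\z/

# Hypothetical helper: turn a datetime-ish query value into a Time,
# or return the original string if it does not look like one.
def datetime_value(value)
  if (match = DATE_ONLY.match(value))
    Time.utc(*match.captures.map(&:to_i))
  else
    Time.iso8601(value).utc
  end
rescue ArgumentError
  value
end
```

The resulting Time can then be compared against the stored datetime field with the greater than and less than operators.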

Rename pubDate to pubdate

On house videos.

Record hits in the database for analytics

Be wary, as this caused issues in Drumbone when the volume got too high, but keep something.

Perhaps a task that runs monthly that offloads the month's hits into a dump and removes them from the database.

Support greater/less than or equal to

For example, show me all the bills with at least 5 cosponsors:

bills.json?cosponsors_count>=5

<= for less than or equal to.

It's not possible that keys will have > or < in them. Not allowing "less than" or "greater than" without "or equal to" will only be problematic in the case of floating point numbers, of which we don't have any now. If we end up having them in the future, we can invent some special syntax for them (>>= and <<=, perhaps).
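
A minimal sketch of the parsing, assuming the query string has already been split on the first "=", so that "cosponsors_count>=5" arrives as the key "cosponsors_count>" with the value "5". The helper names are hypothetical; "$gte" and "$lte" are the Mongo operators the conditions would map to.

```ruby
COMPARATORS = { ">" => "$gte", "<" => "$lte" }.freeze

# Integers are compared as numbers, everything else as strings.
def number_or_string(value)
  value.match?(/\A-?\d+\z/) ? value.to_i : value
end

# Hypothetical helper: build Mongo-style conditions from parsed params.
def comparison_conditions(params)
  params.each_with_object({}) do |(key, value), conditions|
    operator = COMPARATORS[key[-1]]
    if operator
      field = key[0..-2]
      range = conditions[field].is_a?(Hash) ? conditions[field] : {}
      conditions[field] = range.merge(operator => number_or_string(value))
    else
      conditions[key] = number_or_string(value)
    end
  end
end
```

Giving both bounds for the same field produces a single range condition on that field.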

Sync with central API key service

Receive keys from central, as Drumbone does.

Include count and page keys for plural endpoints

As peer to the array key (i.e. "bills"), include: "count", "page", and "per_page". "page" and "per_page" are the (possibly adjusted) pagination params, and "count" is the total number of items for that search.
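
Roughly what the response hash could look like, sketched against an in-memory array. The default of 20 and the cap of 50 per page are assumptions, not settled values, and in the real thing "count" would come from a count query rather than from an array.

```ruby
# Hypothetical helper: wrap one page of results with count/page/per_page.
def paginated_response(items, key, page: 1, per_page: 20, max_per_page: 50)
  page = [page.to_i, 1].max                        # adjust bad page numbers
  per_page = per_page.to_i.clamp(1, max_per_page)  # adjust bad page sizes
  {
    key => items.slice((page - 1) * per_page, per_page) || [],
    "count" => items.size,
    "page" => page,
    "per_page" => per_page
  }
end
```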

Senate Floor Updates

We only have the House in there right now, which is coming from HouseLive.gov. For the Senate, use the one on republican.senate.gov:

http://republican.senate.gov/public/index.cfm?FuseAction=FloorUpdates.Home

Put party breakdown inside vote_breakdown

vote_breakdown: {
    total: {ayes: ..., nays: ..., ...},
    party: {R: {ayes: ..., nays: ..., ...}, D: {...}, ...},
}

Leaves room for us to easily expand on any other ways it could be broken down.
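
A sketch of how the nested breakdown could be computed, assuming each voter is a hash with a party and a vote. The :party and :vote keys are hypothetical names, not the real voter schema.

```ruby
# Count how many voters cast each kind of vote.
def tally_votes(voters)
  voters.each_with_object(Hash.new(0)) { |voter, counts| counts[voter[:vote]] += 1 }
end

# Hypothetical helper: totals at the top level, per-party tallies under :party.
def vote_breakdown(voters)
  {
    total: tally_votes(voters),
    party: voters.group_by { |voter| voter[:party] }
                 .transform_values { |members| tally_votes(members) }
  }
end
```

Another way of slicing the vote would just be one more key next to :total and :party.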

Require an API key

Restore functionality from Drumbone.

Dates should be dates, not timestamps

For most bill dates, we store a full timestamp at midnight UTC, which is incorrect. They should be limited to the date only, with no time component.

Since America is west of UTC, these timestamps rendered in any American time zone fall on the day before the actual date, which is a serious inaccuracy.
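
To make the problem concrete, a small sketch. The helper names are made up; "-04:00" is US Eastern daylight time.

```ruby
require "date"

# The calendar date a stored Time lands on when viewed at a given UTC offset.
def local_date(time, utc_offset)
  time.getlocal(utc_offset).to_date
end

# A plain date string has no time zone to shift it.
def date_only(time)
  time.getutc.strftime("%Y-%m-%d")
end
```

With a bill date stored as Time.utc(2010, 9, 29), local_date at "-04:00" gives September 28, while date_only gives "2010-09-29".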

Report analytics nightly

Report nightly, as Drumbone does.

File local reports, and make sure broad exception handling is covered.

Expand videos to include White House videos

Real Time "Congress" be damned:

update house_live script to add a "chamber" field with a value of "house"
update house_live script to rename "timestamp_id" to "video_id" and prepend "house", e.g. "house-123456789"
make two whitehouse_live scripts that pull archival and live videos. Use a "chamber" value of "whitehouse", and a "video_id" value of "whitehouse-" followed by the date and slug, e.g. "whitehouse-2010-11-23-new-start-treaty".
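
The ID formats above, sketched as two small helpers. The method names are hypothetical; the formats come straight from the examples.

```ruby
require "date"

# "house-" followed by what used to be the timestamp_id.
def house_video_id(timestamp_id)
  "house-#{timestamp_id}"
end

# "whitehouse-" followed by the date and the slug.
def whitehouse_video_id(date, slug)
  "whitehouse-#{date.strftime('%Y-%m-%d')}-#{slug}"
end
```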

Support "in" and "nin" operators

For example, to support queries such as "give me all bills that are actually bills and not resolutions":

bills.json?bill_type__in=hr|hjres|s|sjres

Pipes seem unlikely to occur in filterable fields, and if we found some source data that uses pipes, we could always swap those pipes out for something else before syndicating it.

Use an "in" query though, not an actual "or" query:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24in

Finally, when "not" is supported, support the idea of "not in" searches, like this one for "anything but simple resolutions":

bills.json?bill_type!=hres|sres

This would map to the "nin" operator in Mongo.
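
A sketch of the mapping, assuming the query string is split on the first "=" so that "bill_type!=hres|sres" arrives as the key "bill_type!" with the value "hres|sres". The method name is hypothetical; "$in", "$nin", and "$ne" are the Mongo operators.

```ruby
# Hypothetical helper: map one key/value pair to a Mongo-style condition.
def membership_condition(key, value)
  values = value.split("|")
  if key.end_with?("__in")
    { key.chomp("__in") => { "$in" => values } }
  elsif key.end_with?("!")
    field = key.chomp("!")
    # Several piped values mean "not in"; a single value is a plain "not".
    values.size > 1 ? { field => { "$nin" => values } } : { field => { "$ne" => value } }
  else
    { key => value }
  end
end
```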

Fetch some roll call vote data in real time

The idea here is to have a separate task that can run every X minutes. It can create new roll call votes that are missing some fields (for example, "required" will be missing, and a related bill might not even exist yet). These fields will be filled in later by the twice-daily roll call vote task (which goes over all THOMAS-provided roll call votes and passage voice votes).

Example of House XML (view source, it's actually XML, and they use Bioguide IDs):
http://clerk.house.gov/evs/2010/roll518.xml

Example of Senate XML (uses some internal ID, will have to parse names out):
http://www.senate.gov/legislative/LIS/roll_call_votes/vote1112/vote_111_2_00229.xml

Sadly, in both cases we'll have to monitor an HTML table to see whether there's new stuff:
http://clerk.house.gov/evs/2010/index.asp
http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_111_2.htm

And the URLs for both tables depend on the year, Congress number, and session number. Not trivial, but: possible.
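
A sketch of building those URLs. The patterns are inferred from the four example URLs above. The Congress and session arithmetic assumes sessions line up with calendar years, and the three-digit padding of House roll numbers is an assumption the example cannot confirm. All method names are hypothetical.

```ruby
# 2010 falls in the 111th Congress, 2nd session.
def congress_for(year)
  (year - 1789) / 2 + 1
end

def session_for(year)
  year.odd? ? 1 : 2
end

def house_index_url(year)
  "http://clerk.house.gov/evs/#{year}/index.asp"
end

def house_roll_url(year, number)
  format("http://clerk.house.gov/evs/%d/roll%03d.xml", year, number)
end

def senate_index_url(year)
  "http://www.senate.gov/legislative/LIS/roll_call_lists/" \
    "vote_menu_#{congress_for(year)}_#{session_for(year)}.htm"
end

def senate_roll_url(year, number)
  congress = congress_for(year)
  session = session_for(year)
  format("http://www.senate.gov/legislative/LIS/roll_call_votes/vote%d%d/vote_%d_%d_%05d.xml",
         congress, session, congress, session, number)
end
```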

Investigate Hudson for monitoring

The Open States project uses this, and it would be worth checking how applicable it is to our own tasks, especially since the most awkward part of supporting multi-language data loaders is the reporting.

Add exception catching to whitehouse_live

Add failure/exception reporting to whitehouse_live.

Add XML output option

For clients which need it.

Link committees and amendments

Use the committee's basic fields as the "sponsor" field. Also add the committee ID as the "sponsor_id" field.

Add a Vote model and populate it

Port over the roll call fetching code from Drumbone, into a model named Vote. Add a vote_type field that's either "roll" or whatever it is.

As part of the get_votes task, following roll call loading, iterate through each bill and go through each one's votes array. For any voice votes, create them (and include a "bill" object on them). For any roll call votes, update them with anything worth doing (perhaps nothing, refer to notes).

If the Vote table is empty, fill it from scratch. Otherwise, you can just worry about the roll call votes in the Senate and House whose numbers are higher than the last recorded, since old roll call votes never change. Then, go over bills and add voice votes and link roll call votes as normal.

Once we're pulling in partial roll call vote data in real time, this logic can be updated to be: if the Vote table is empty, fill it from scratch. Otherwise, just fill in the un-filled in roll call votes, then go over bills and add voice votes and link roll call votes as normal.

Support "not" for fields

For example, to find any vote that was not a roll call:

votes.json?vote_type!=roll

Keys are extremely unlikely to use exclamation points, though this makes parsing out the conditions a little trickier, of course.

Add lots more filter fields

Now that we support regexes and comparators.

Don't include "filename" field in roll call vote

It's in Drumbone, don't repeat it in RTC.

Link committees and bills

For any bill that has an associated committee, see if we can add the committee and its relationship to the bill, as committee_ids and committee subobjects.

If we need to keep the relationship, then it should work like voter_ids and voters do on the vote object - {committee_id: [id], relationship: "..."} and {committee: {obj}, relationship: "..."}.

Filter keys with dots don't work

Example:
/votes.json?apikey=sunlight9&per_page=1&vote_breakdown.ayes%3E=200

It breaks upon storing a hit in the analytics db.

Forbidden fields on models

Document and try to enforce them somewhere. Anything used in params, basically: captures, callback, sections, apikey, per_page, page, order, sort. I think that's it.
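
One way the enforcement could look, as a sketch. The list is the one above; the method names are hypothetical, and where the check gets called (model load, test suite) is left open.

```ruby
RESERVED_PARAMS = %w[captures callback sections apikey per_page page order sort].freeze

# Field names that would clash with a request parameter.
def reserved_field_names(field_names)
  field_names.map(&:to_s) & RESERVED_PARAMS
end

# Hypothetical check: raise if a model defines a forbidden field.
def check_fields!(model_name, field_names)
  clashes = reserved_field_names(field_names)
  return if clashes.empty?

  raise ArgumentError, "#{model_name} uses reserved field names: #{clashes.join(', ')}"
end
```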

Return 204 for favicon.ico at the nginx level

It's the "right" way to do it, and I'm sick of 404 errors in the logs.

Remove "clip_id", "full_length", and "offset" from top-level video object

All are remnants from when videos were structured differently.

Set up cronjobs on staging and backend

For get_legislators, once a day.
For get_bills, twice a day.
For get_rolls, twice a day.
For house_live, every 5 minutes (consult Kaitlin).

Imminent:
For get_amendments, twice a day.
For rolls_live, every 10 minutes.

Down the line:
For floor_updates, every minute.
For various docserver scrapers, consult Josh.

Whip Date/Notices

Links to the latest whip packs and notices for Democrats and Republicans, as modeled here:

http://realtimecongress.org/whip_dates.json

I'd like to change the modeling, but I'm not sure how.

Add /crossdomain.xml support

Use the template in Drumbone, and in its nginx configuration, to set this up at the root.

Handle arbitrary error messages returned to the user in the right format

Example of "expected" error:
/votes.json?apikey=sunlight9&per_page=10&sections=number,question&question~~=(

Allow options to pass from rake to individual task

Especially for the house_live script.

Switch hpricot code to Nokogiri

Should be a drop-in replacement, and amendments_archive is a fine example of how to do it.

Bring in committees for internal use

Write a committees task that imports them from the Congress API similarly to legislators.

Emails not sending from production box

Connection refused

Publish data in bulk

Have it nightly dump the tables to compressed JSON, at a publicly available address.

Pull in House videos and floor events

Work with Kaitlin to pull in floor events.

I'm not sure yet how to reconcile the floor events from this feed with the floor_events that Josh already picked up in the old RTC.

Directory structure for tasks

Give each task a folder that supports running unit tests (i.e. link to environment.rb correctly), or any other files the task needs.

Have the loader that governs making the rake tasks use the folder names. Have each task load in the [task_name].rb file in the root of the task's folder, and assume that a camelized class name is in there.
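
A sketch of that loader convention, without the rake wiring. The camelize here is a simplified stand-in for the ActiveSupport one, and the method names are hypothetical.

```ruby
# "amendments_archive" -> "AmendmentsArchive"
def camelize(name)
  name.split("_").map(&:capitalize).join
end

# A folder counts as a task if it has a [task_name].rb file in its root.
def task_names(tasks_dir)
  Dir.children(tasks_dir)
     .select { |name| File.file?(File.join(tasks_dir, name, "#{name}.rb")) }
     .sort
end

# Load the task's file and return the class it is assumed to define.
def load_task(tasks_dir, name)
  require File.expand_path(File.join(tasks_dir, name, "#{name}.rb"))
  Object.const_get(camelize(name))
end
```

The rake loader would then define one rake task per entry in task_names.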

Set up Python on backend and api boxes

Get all scrapers working.

Document the developer perspective in a README

As stuff stabilizes, document what it takes to add a new model, and a new task.

Committee Hearings

What we currently have in RTC:
http://realtimecongress.org/hearings_upcoming.json

For the Senate, they have nice XML:
http://www.senate.gov/general/committee_schedules/hearings.xml

For the House, GovTrack's feed:
http://www.govtrack.us/users/events-rss2.xpd?monitors=misc:allcommittee

Look into introspecting filter_keys on defined Mongoid fields with types

Maybe I don't need an extra method at all; it's quite cumbersome anyway.

Email-time for failure and warning reports should occur post-task

Instead of occurring as a report is filed, in the middle of a task, have tasks file reports marked as unread (as the default value). After the task is done running, go through all unread reports, mark them as read and send emails for any warnings or failures. Surround this in exception handling as well, and file a local report with a special flag set if it fails.

This is good not just so that reports can be filed from other languages and still reported on, but also so that tasks do not potentially hang in the middle of their job while trying to send an email. It's just sensible.
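
A sketch of the post-task step using in-memory reports. The Report struct, the status names, and the email_failed flag are stand-ins for whatever the real report model uses, and mailer is anything that responds to call.

```ruby
Report = Struct.new(:status, :message, :read, :email_failed)
NOTIFY_STATUSES = %w[WARNING FAILURE].freeze

# Hypothetical post-task hook: mark unread reports as read and email
# the warnings and failures. If sending blows up, file a local report
# with a special flag set instead of letting the exception escape.
def deliver_unread(reports, mailer)
  reports.reject(&:read).each do |report|
    report.read = true
    mailer.call(report) if NOTIFY_STATUSES.include?(report.status)
  end
rescue StandardError => error
  reports << Report.new("FAILURE", "Could not send report email: #{error.message}", true, true)
end
```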

House Whip Notices

Democratic and Republican whip notices for the House, using the code or algorithms in the old RTC API.

Pluck out legislator_names and bioguide_ids from clip description

Add a legislator_names array with raw extracted names ("Mr. Price (GA)", "Mr. Stevens", etc.) for each clip, and one aggregated one for the top-level object that has all names mentioned in the clips.

Add a bioguide_ids array with matched bioguide IDs ("L000551", etc.) for each clip, that are determined by the extracted names. Err on the side of including too many bioguide IDs - so if the clip mentions "Mr. Smith" and that matches 3 people, add all 3 of their bioguide IDs to the array, to be safe. As you said, false positives are better than not matching at all. Add an array to the top-level object as well, that has the unique bioguide_ids for all clips.

I'll make sure there's an index on all 4 array fields - "bioguide_ids", "legislator_names", "clips.bioguide_ids", and "clips.legislator_names". Mongo takes care of indexing arrays and fields inside of arrays.

You can scope matching for particular names by chamber, so you only need to look for "Mr. Price" among legislators whose chamber field is "house".

But bear in mind that we can't just match on legislators whose in_office field is true, as legislators may go in and out of office mid-session, and as we transition to the 112th session our database will have multiple sessions.

(It's my hope that eventually our Congress API will evolve to maintain a range of when people were in office, which would help us make more precise choices in our other projects, too.)
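
A rough sketch of the extraction and matching for one clip. The pattern only handles the "Mr./Mrs./Ms. Name (ST)" shapes from the examples, and the legislator keys (:last_name, :state, :chamber, :bioguide_id) are assumed names, not the real schema.

```ruby
NAME_PATTERN = /(\b(?:Mr|Mrs|Ms)\. ([A-Z][A-Za-z'-]+)(?: \(([A-Z]{2})\))?)/

# Returns [raw_name, last_name, state_or_nil] for each mention.
def extract_names(description)
  description.scan(NAME_PATTERN)
end

# Errs on the side of too many matches: without a state,
# every legislator in the chamber with that last name is included.
def bioguide_ids_for(names, legislators, chamber)
  names.flat_map do |_raw, last_name, state|
    matches = legislators.select do |legislator|
      legislator[:chamber] == chamber &&
        legislator[:last_name] == last_name &&
        (state.nil? || legislator[:state] == state)
    end
    matches.map { |legislator| legislator[:bioguide_id] }
  end.uniq
end
```

The first element of each extracted entry is what goes into legislator_names, and the top-level arrays are the unique union across all clips.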

Comma separated values not working for sections parameter

It seems to only bring back the first section listed, for videos anyway:

http://api.realtimecongress.org/api/v1/videos.xml?per_page=7&sections=duration,clip-id,video-id,legislative-day,clip-urls&apikey=&order=legislative_day&sort=desc
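
For reference, a sketch of the behavior I'd expect from the sections parameter. Treating hyphens and underscores as interchangeable is an assumption based on the hyphenated names in the URL above; the method names are hypothetical and this is not the current implementation.

```ruby
# Split the comma-separated list into field names.
def requested_sections(param)
  param.to_s.split(",").map { |name| name.strip.tr("-", "_") }.reject(&:empty?).uniq
end

# Keep only the requested fields; no sections means the whole document.
def select_sections(document, sections)
  sections.empty? ? document : document.slice(*sections)
end
```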

Support $all operator

So if someone wants a bill cosponsored by a pair of people:
/bills.json?cosponsor_ids__all=1|2

that'd match any bills where the cosponsor_ids array contains both "1" and "2".
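
To spell out the matching rule, a plain Ruby sketch of what the query should return (Mongo's $all would do this on the database side). The helper names are hypothetical.

```ruby
# "1|2" -> ["1", "2"]
def all_param(value)
  value.split("|")
end

# True when the array contains every required value, in any order.
def contains_all?(values, required)
  (required - values).empty?
end

def bills_cosponsored_by_all(bills, param)
  required = all_param(param)
  bills.select { |bill| contains_all?(bill["cosponsor_ids"], required) }
end
```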
