propublica / sunlight-congress
The Sunlight Foundation's Congress API. Shut down on Oct. 1, 2017.
License: Other
Instead of sending email as a report is filed, in the middle of a task, have tasks file reports marked as unread (the default value). After the task has finished running, go through all unread reports, mark them as read, and send emails for any warnings or failures. Surround this in exception handling as well, and file a local report with a special flag set if it fails.
This is good not just so that reports can be filed from other languages and still reported on, but also so that tasks do not potentially hang in the middle of their job while trying to send an email. It's just sensible.
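A minimal sketch of that flow, assuming an in-memory list in place of the real reports collection. The Report struct, the ReportQueue class, the status strings, and the email_failed flag are all hypothetical names chosen for illustration; the mailer is any callable.

```ruby
# Hypothetical in-memory stand-in for a reports collection.
Report = Struct.new(:status, :source, :message, :read, :email_failed)

class ReportQueue
  attr_reader :reports

  def initialize(mailer)
    @mailer = mailer # any object responding to call(report)
    @reports = []
  end

  # Tasks only file reports (unread by default); nothing is emailed here.
  def file(status, source, message)
    report = Report.new(status, source, message, false, false)
    @reports << report
    report
  end

  # Run once the task is done: mark unread reports as read and
  # email the warnings and failures.
  def flush
    @reports.reject(&:read).each do |report|
      report.read = true
      next unless %w[WARNING FAILURE].include?(report.status)
      begin
        @mailer.call(report)
      rescue StandardError => e
        # Emailing failed: file a local report with a special flag set,
        # already marked as read so it is never emailed itself.
        failed = file("FAILURE", "ReportQueue", "Email failed: #{e.message}")
        failed.read = true
        failed.email_failed = true
      end
    end
  end
end
```

Because filing and emailing are separate steps, a loader written in another language only has to insert an unread report; the Ruby side picks it up on the next flush.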
Example:
/votes.json?apikey=sunlight9&per_page=1&vote_breakdown.ayes%3E=200
It breaks upon storing a hit in the analytics db.
Report nightly, as Drumbone does.
File local reports, and make sure broad exception handling is covered.
For example, show me all the bills with at least 5 cosponsors:
bills.json?cosponsors_count>=5
<= for less than or equal to.
It's not possible that keys will have > or < in them. Not allowing "less than" or "greater than" without "or equal to" will only be problematic in the case of floating point numbers, of which we don't have any now. If we end up having them in the future, we can invent some special syntax for them (>>= and <<=, perhaps).
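One way the parsing could look, as a sketch only: it works on the raw query string, decodes each pair, and maps the operator to Mongo's $gte and $lte. The method name comparison_conditions is made up here, and treating all-digit values as integers is an assumption.

```ruby
require "uri"

# Map the operator found in the query string to its MongoDB operator.
OPERATORS = { ">=" => "$gte", "<=" => "$lte" }.freeze

# Turn "cosponsors_count>=5" style pairs into a Mongo conditions hash.
# Pairs without a comparison operator are ignored by this sketch.
def comparison_conditions(query_string)
  conditions = {}
  query_string.split("&").each do |pair|
    decoded = URI.decode_www_form_component(pair) # handles %3E and %3C
    match = decoded.match(/\A([^<>=]+)(>=|<=)(.+)\z/)
    next unless match
    field, operator, value = match.captures
    # Assumption: all-digit values are meant as integers.
    value = Integer(value, 10) if value.match?(/\A-?\d+\z/)
    (conditions[field] ||= {})[OPERATORS[operator]] = value
  end
  conditions
end
```

For the example above, comparison_conditions("cosponsors_count>=5") yields {"cosponsors_count" => {"$gte" => 5}}, and passing both >= and <= on the same field produces a range.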
Receive keys from central, as Drumbone does.
Should be a drop-in replacement, and amendments_archive is a fine example of how to do it.
As peer to the array key (i.e. "bills"), include: "count", "page", and "per_page". "page" and "per_page" are the (possibly adjusted) pagination params, and "count" is the total number of items for that search.
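A rough sketch of the response shape. The method name, the default of 20 per page, and the cap of 50 are assumptions for illustration, not the API's actual limits.

```ruby
# Build a response hash with pagination metadata next to the array key.
# The default per_page (20) and max_per_page (50) are assumed values.
def paginated_response(key, all_items, page, per_page, max_per_page: 50)
  page = [page.to_i, 1].max            # adjust missing or invalid pages to 1
  per_page = per_page.to_i
  per_page = 20 if per_page < 1        # fall back to the assumed default
  per_page = [per_page, max_per_page].min
  items = all_items.drop((page - 1) * per_page).first(per_page)
  {
    key => items,
    "count" => all_items.size,         # total for the search, not this page
    "page" => page,
    "per_page" => per_page
  }
end
```

Echoing back the adjusted values lets a client notice when its per_page request was capped.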
Greater than and less than. Interpret any datetime-ish query value ("2010-09-29") against the actual datetime fields.
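A sketch of the coercion step, assuming values arrive as strings and that only plain dates and ISO 8601 timestamps count as "datetime-ish". The method name coerce_datetime is hypothetical.

```ruby
require "time"

# Convert datetime-ish strings to UTC Time objects so they compare
# against real datetime fields; leave everything else untouched.
def coerce_datetime(value)
  case value
  when /\A\d{4}-\d{2}-\d{2}\z/
    Time.utc(*value.split("-").map(&:to_i)) # date only: midnight UTC
  when /\A\d{4}-\d{2}-\d{2}T/
    Time.iso8601(value).getutc              # full ISO 8601 timestamp
  else
    value
  end
end
```

The coerced value can then be used as the operand of a greater-than or less-than condition instead of the raw string.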
Add an "amendments" endpoint, using GovTrack's amendment XML:
Example:
http://www.govtrack.us/data/us/111/bills.amdt/h234.xml
Have a task, amendments_archive, that loads in all amendments to the table, and then goes over each bill (perhaps by Amendment.distinct(:bill_id) or the like) and adds an array of amendments to the bill. Each amendment on a bill should have only the basic fields (everything but the actions).
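The second half of that task might look like the sketch below, which uses plain hashes instead of the real models. The field names other than bill_id and actions are made up for the example.

```ruby
# Attach each bill's amendments to the bill, keeping only the basic
# fields (everything except :actions).
def attach_amendments(bills, amendments)
  by_bill = amendments.group_by { |amendment| amendment[:bill_id] }
  bills.each do |bill|
    related = by_bill.fetch(bill[:bill_id], [])
    bill[:amendments] = related.map do |amendment|
      amendment.reject { |field, _value| field == :actions }
    end
  end
  bills
end
```

Grouping once up front avoids a separate lookup per bill, which matters when the archive task walks every bill in the table.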
Give each task a folder that supports running unit tests (i.e. link to environment.rb correctly), or any other files the task needs.
Have the loader that governs making the rake tasks use the folder names. Have each task load in the [task_name].rb file in the root of the task's folder, and assume that a camelized class name is in there.
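A sketch of such a loader, leaving out the rake wiring itself. The helper names are hypothetical, and camelize is written out by hand here rather than taken from a library.

```ruby
# "amendments_archive" -> "AmendmentsArchive"
def camelize(name)
  name.split("_").map(&:capitalize).join
end

# Every folder directly under tasks_dir is treated as one task.
def task_names(tasks_dir)
  Dir.children(tasks_dir)
     .select { |entry| File.directory?(File.join(tasks_dir, entry)) }
     .sort
end

# Load [task_name]/[task_name].rb and return the class it is assumed
# to define under the camelized task name.
def load_task(tasks_dir, task_name)
  require File.expand_path(File.join(tasks_dir, task_name, "#{task_name}.rb"))
  Object.const_get(camelize(task_name))
end
```

A rake file could then iterate over task_names and define one task per folder, each calling whatever entry point the loaded class exposes.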
It's in Drumbone, don't repeat it in RTC.
Add failure/exception reporting to whitehouse_live.
It seems to only bring back the first section listed, for videos anyway:
Democratic and Republican whip notices for the House, using the code or algorithms in the old RTC API.
Port over the roll call fetching code from Drumbone, into a model named Vote. Add a vote_type field that's either "roll" or whatever it is.
As part of the get_votes task, following roll call loading, iterate through each bill and go through each one's votes array. For any voice votes, create them (and include a "bill" object on them). For any roll call votes, update them with anything worth doing (perhaps nothing, refer to notes).
If the Vote table is empty, fill it from scratch. Otherwise, you can just worry about the roll call votes in the Senate and House whose numbers are higher than the last recorded, since old roll call votes never change. Then, go over bills and add voice votes and link roll call votes as normal.
Once we're pulling in partial roll call vote data in real time, this logic can be updated to be: if the Vote table is empty, fill it from scratch. Otherwise, just fill in the un-filled in roll call votes, then go over bills and add voice votes and link roll call votes as normal.
Especially for the house_live script.
We only have the House in there right now, which is coming from HouseLive.gov. For the Senate, use the one on republican.senate.gov:
http://republican.senate.gov/public/index.cfm?FuseAction=FloorUpdates.Home
Restore functionality from Drumbone.
Document and try to enforce them somewhere. Anything used in params, basically: captures, callback, sections, apikey, per_page, page, order, sort. I think that's it.
For get_legislators, once a day.
For get_bills, twice a day.
For get_rolls, twice a day.
For house_live, every 5 minutes (consult Kaitlin).
Imminent:
For get_amendments, twice a day.
For rolls_live, every 10 minutes.
Down the line:
For floor_updates, every minute.
For various docserver scrapers, consult Josh.
Add a legislator_names array with raw extracted names ("Mr. Price (GA)", "Mr. Stevens", etc.) for each clip, and one aggregated one for the top-level object that has all names mentioned in the clips.
Add a bioguide_ids array with matched bioguide IDs ("L000551", etc.) for each clip, that are determined by the extracted names. Err on the side of including too many bioguide IDs - so if the clip mentions "Mr. Smith" and that matches 3 people, add all 3 of their bioguide IDs to the array, to be safe. As you said, false positives are better than not matching at all. Add an array to the top-level object as well, that has the unique bioguide_ids for all clips.
I'll make sure there's an index on all 4 array fields - "bioguide_ids", "legislator_names", "clips.bioguide_ids", and "clips.legislator_names". Mongo takes care of indexing arrays and fields inside of arrays.
You can scope matching for particular names by chamber, so you only need to look for "Mr. Price" among legislators whose chamber field is "house".
But bear in mind that we can't just match on legislators whose in_office field is true, as legislators may go in and out of office mid-session, and as we transition to the 112th session our database will have multiple sessions.
(It's my hope that eventually our Congress API will evolve to maintain a range of when people were in office, which would help us make more precise choices in our other projects, too.)
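A sketch of the matching described above, assuming names have already been extracted in the "Mr. Price (GA)" form. The Legislator struct, the method names, and the IDs in any usage are placeholders; matching is on last name, narrowed by state when the clip gives one, and deliberately does not filter on in_office.

```ruby
Legislator = Struct.new(:bioguide_id, :last_name, :state, :chamber)

# Return every bioguide ID that could match the raw name within a
# chamber. Over-matching is intended: false positives beat misses.
def match_bioguide_ids(raw_name, legislators, chamber)
  match = raw_name.match(/\A(?:Mr|Mrs|Ms)\.\s+(.+?)(?:\s+\(([A-Z]{2})\))?\z/)
  return [] unless match
  last_name, state = match[1], match[2]
  legislators.select do |legislator|
    legislator.chamber == chamber &&
      legislator.last_name == last_name &&
      (state.nil? || legislator.state == state)
  end.map(&:bioguide_id).uniq
end

# Fill in bioguide_ids per clip, then aggregate unique names and IDs
# onto the top-level object.
def annotate(video, legislators, chamber)
  video[:clips].each do |clip|
    clip[:bioguide_ids] = clip[:legislator_names].flat_map do |name|
      match_bioguide_ids(name, legislators, chamber)
    end.uniq
  end
  video[:legislator_names] = video[:clips].flat_map { |c| c[:legislator_names] }.uniq
  video[:bioguide_ids] = video[:clips].flat_map { |c| c[:bioguide_ids] }.uniq
  video
end
```

With two House members named Price, "Mr. Price" alone returns both IDs, while "Mr. Price (GA)" narrows it to one.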
So if someone wants a bill cosponsored by a pair of people:
/bills.json?cosponsor_ids__all=1|2
that'd match any bills where the cosponsor_ids array contains both "1" and "2".
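As a sketch, the __all suffix could translate to Mongo's $all operator like this. The method names are hypothetical, and matches_all? only mirrors the semantics in memory for illustration.

```ruby
# "cosponsor_ids__all" with "1|2" -> {"cosponsor_ids" => {"$all" => ["1", "2"]}}
def all_condition(key, value)
  return nil unless key.end_with?("__all")
  { key.chomp("__all") => { "$all" => value.split("|") } }
end

# In-memory equivalent of $all: every wanted value must be present.
def matches_all?(array, wanted)
  (wanted - array).empty?
end
```

A bill with cosponsor_ids ["1", "2", "3"] matches, while one with only ["1", "3"] does not.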
Use the template in Drumbone, and in its nginx configuration, to set this up at the root.
Example of "expected" error:
/votes.json?apikey=sunlight9&per_page=10&sections=number,question&question~~=(
Real Time "Congress" be damned:
Links to the latest whip packs and notices for Democrats and Republicans, as modeled here:
http://realtimecongress.org/whip_dates.json
I'd like to change the modeling, but I'm not sure how.
All it does is, for models + sunlight_services + hit + api_key, run the Model.create_indexes method on each one.
As stuff stabilizes, document what it takes to add a new model, and a new task.
As the "sponsor" field, the basic fields for a committee. Also add the committee ID as the "sponsor_id" field.
Maybe I don't need an extra method at all, it's quite cumbersome besides.
For example, to find any vote that was not a roll call:
votes.json?vote_type!=roll
Keys are very unlikely to use exclamation points, though this makes parsing out the conditions a little trickier, of course.
Be wary, as this caused issues in Drumbone when it got too high, but - keep something.
Perhaps a task that runs monthly that offloads the month's hits into a dump and removes them from the database.
The idea here is to have a separate task that can run every X minutes. It can create new roll call votes that are missing some fields (for example, "required" will be missing, and a related bill might not even exist yet). These fields will be filled in later by the twice-daily roll call vote task (which goes over all THOMAS-provided roll call votes and passage voice votes).
Example of House XML (view source, it's actually XML, and they use Bioguide IDs):
http://clerk.house.gov/evs/2010/roll518.xml
Example of Senate XML (uses some internal ID, will have to parse names out):
http://www.senate.gov/legislative/LIS/roll_call_votes/vote1112/vote_111_2_00229.xml
Sadly, in both cases we'll have to monitor an HTML table to see whether there's new stuff:
http://clerk.house.gov/evs/2010/index.asp
http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_111_2.htm
And the URLs for both tables depend on the year, Congress number, and session number. Not trivial, but: possible.
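A sketch of building those URLs, with the patterns inferred only from the example links above. The zero-padding widths are assumptions drawn from those examples, and the year-to-Congress arithmetic ignores the fact that a Congress starts in early January rather than on January 1.

```ruby
# 2009-2010 is the 111th Congress; each Congress spans two years.
def congress_for(year)
  (year - 1789) / 2 + 1
end

# Odd years are session 1, even years session 2.
def session_for(year)
  year.odd? ? 1 : 2
end

def house_roll_url(year, roll)
  format("http://clerk.house.gov/evs/%d/roll%03d.xml", year, roll)
end

def house_index_url(year)
  format("http://clerk.house.gov/evs/%d/index.asp", year)
end

def senate_roll_url(year, roll)
  congress, session = congress_for(year), session_for(year)
  format("http://www.senate.gov/legislative/LIS/roll_call_votes/vote%d%d/vote_%d_%d_%05d.xml",
         congress, session, congress, session, roll)
end

def senate_index_url(year)
  format("http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_%d_%d.htm",
         congress_for(year), session_for(year))
end
```

With year 2010, these reproduce the four example URLs listed above for House roll 518 and Senate roll 229.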
Most bill dates are stored as a full timestamp at midnight UTC, which is incorrect. They should be limited to the date only, with no time component.
Since America is west of UTC, these dates represented in any American timezone would be the day before they actually are, which is a serious inaccuracy.
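To illustrate the problem with plain Ruby Time objects (the helper names are made up, and the fixed "-04:00" offset stands in for US Eastern daylight time):

```ruby
# The calendar day a timestamp falls on at a fixed UTC offset.
def local_day(time, utc_offset)
  time.getlocal(utc_offset).strftime("%Y-%m-%d")
end

# The fix: publish just the date string, taken from the UTC value.
def date_only(time)
  time.getutc.strftime("%Y-%m-%d")
end

introduced_at = Time.utc(2010, 9, 29) # stored as midnight UTC
puts local_day(introduced_at, "-04:00") # the day before, in Eastern time
puts date_only(introduced_at)           # the intended date
```

A bare date string carries no timezone, so clients cannot shift it by accident.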
Work with Kaitlin to pull in floor events.
I'm not sure yet how to reconcile the floor events from this feed with the floor_events that Josh already picked up in the old RTC.
Get all scrapers working.
Now that we support regexes and comparators.
All are remnants from when videos were structured differently.
The Open States project uses this, and it would be worth checking how applicable it is to our own tasks, especially since the most awkward part of supporting multi-language data loaders is the reporting.
For clients which need it.
vote_breakdown: {
total: {ayes: ..., nays: ..., ...},
party: {R: {ayes: ..., nays: ..., ...}, D: {...}, ...},
}
Leaves room for us to easily expand on any other ways it could be broken down.
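A sketch of computing that structure from a list of voters. The voter hash shape, with :party and :vote keys and votes already normalized to :ayes and :nays, is an assumption for the example.

```ruby
# Tally votes overall and per party into the nested breakdown shape.
def vote_breakdown(voters)
  breakdown = { total: Hash.new(0), party: {} }
  voters.each do |voter|
    position = voter[:vote]
    breakdown[:total][position] += 1
    (breakdown[:party][voter[:party]] ||= Hash.new(0))[position] += 1
  end
  breakdown
end
```

Adding another dimension later (by state, say) would mean one more top-level key next to total and party, without touching the existing ones.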
On house videos.
Connection refused
Have it nightly dump the tables to compressed JSON, at a publicly available address.
For example, to support queries such as "give me all bills that are actually bills and not resolutions":
bills.json?bill_type__in=hr|hjres|s|sjres
Pipes seem unlikely to occur in filterable fields, and if we found some source data that uses pipes, we could always swap those pipes out for something else before syndicating it.
Use an "in" query though, not an actual "or" query:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24in
Finally, when "not" is supported, support the idea of "not in" searches, like this one for "anything but simple resolutions":
bills.json?bill_type!=hres|sres
This would map to the "nin" operator in Mongo.
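A sketch of how these keys might translate, assuming the params arrive as already-split key and value strings (so "bill_type!=hres|sres" shows up as key "bill_type!" with value "hres|sres"). The method name is hypothetical; $in, $nin, and $ne are the Mongo operators involved.

```ruby
# Translate one key/value pair into a Mongo condition:
#   field__in=a|b  -> $in
#   field!=a|b     -> $nin
#   field!=a       -> $ne
def filter_condition(key, value)
  values = value.split("|")
  if key.end_with?("__in")
    { key.chomp("__in") => { "$in" => values } }
  elsif key.end_with?("!")
    field = key.chomp("!")
    if values.size > 1
      { field => { "$nin" => values } }
    else
      { field => { "$ne" => value } }
    end
  else
    { key => value } # plain equality
  end
end
```

Using $in rather than a general "or" keeps the condition on a single field.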
What we currently have in RTC:
http://realtimecongress.org/hearings_upcoming.json
For the Senate, they have nice XML:
http://www.senate.gov/general/committee_schedules/hearings.xml
For the House, GovTrack's feed:
http://www.govtrack.us/users/events-rss2.xpd?monitors=misc:allcommittee
Write a committees task that imports them from the Congress API similarly to legislators.
It's the "right" way to do it, and I'm sick of 404 errors in the logs.
Any bill which has a committee associated, see if we can add on the committee and its relationship to the bill. committee_ids and committee subobjects.
If we need to keep the relationship, then it should work like voter_ids and voters do on the vote object - {committee_id: [id], relationship: "..."} and {committee: {obj}, relationship: "..."}.