sunlight-congress's People

Contributors

annetheagile, ben-zen, crdunwel, dwillis, jcarbaugh, kaitlin, konklone, lindsayyoung, luigi, mtigas, philosoralphter, plantfansam, rshorey, sbai

sunlight-congress's Issues

Email-time for failure and warning reports should occur post-task

Instead of sending email as each report is filed, in the middle of a task, have tasks file reports marked as unread (the default value). After the task is done running, go through all unread reports, mark them as read, and send emails for any warnings or failures. Surround this in exception handling as well, and file a local report with a special flag set if it fails.

This is good not just so that reports can be filed from other languages and still reported on, but also so that tasks do not potentially hang in the middle of their job while trying to send an email. It's just sensible.
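
A minimal sketch of the post-task pass described above, using an in-memory array in place of the real reports collection. The `Report` struct, its `email_failed` flag, the status strings, and the `mailer` callable are all assumptions made for illustration.

```ruby
# Hypothetical stand-in for a stored report; field names are assumptions.
Report = Struct.new(:status, :message, :read, :email_failed) do
  def needs_email?
    %w[WARNING FAILURE].include?(status)
  end
end

# Run once the task has finished: mark every unread report as read and
# email the warnings and failures. `mailer` is any callable taking a report.
def deliver_unread_reports(reports, mailer)
  # `reject` returns a copy, so appending to `reports` below is safe.
  reports.reject(&:read).each do |report|
    report.read = true
    next unless report.needs_email?

    begin
      mailer.call(report)
    rescue StandardError => e
      # Emailing failed: file a local report with a special flag set.
      reports << Report.new("FAILURE", "Email failed: #{e.message}", true, true)
    end
  end
  reports
end
```

Because the pass runs after the task, a slow or broken mail server can only delay the reporting step, never the task itself.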

Filter keys with dots don't work

Example:
/votes.json?apikey=sunlight9&per_page=1&vote_breakdown.ayes%3E=200

It breaks upon storing a hit in the analytics db.

Report analytics nightly

Report nightly, as Drumbone does.

File local reports, and make sure broad exception handling is covered.

Support greater/less than or equal to

For example, show me all the bills with at least 5 cosponsors:

bills.json?cosponsors_count>=5

<= for less than or equal to.

It's not possible that keys will have > or < in them. Not allowing "less than" or "greater than" without "or equal to" will only be problematic in the case of floating point numbers, of which we don't have any now. If we end up having them in the future, we can invent some special syntax for them (>>= and <<=, perhaps).
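
A minimal sketch of how the trailing `>` or `<` could become a Mongo range condition. It assumes the query string is split on the first `=`, so `cosponsors_count>=5` arrives as the key `cosponsors_count>` with the value `5`. `RANGE_OPERATORS` and `range_condition` are illustrative names, not existing code.

```ruby
# Trailing character on the key => Mongo operator.
RANGE_OPERATORS = { ">" => "$gte", "<" => "$lte" }.freeze

# Turns one key/value pair from the query string into a Mongo condition.
def range_condition(key, value)
  operator = RANGE_OPERATORS[key[-1]]
  return { key => value } unless operator

  # Compare numerically when the value parses as an integer.
  number = Integer(value, exception: false) || value
  { key[0..-2] => { operator => number } }
end

range_condition("cosponsors_count>", "5")
# => {"cosponsors_count"=>{"$gte"=>5}}
```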

Include count and page keys for plural endpoints

As peer to the array key (i.e. "bills"), include: "count", "page", and "per_page". "page" and "per_page" are the (possibly adjusted) pagination params, and "count" is the total number of items for that search.
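
As a rough illustration of the response shape, the sketch below normalizes the pagination params and wraps the results. The default of 20 and the cap of 50 per page are assumptions, as are the helper names.

```ruby
DEFAULT_PER_PAGE = 20 # assumed default
MAX_PER_PAGE = 50     # assumed cap

# Returns the (possibly adjusted) page and per_page values.
def normalize_pagination(params)
  page = [params.fetch("page", 1).to_i, 1].max
  per_page = params.fetch("per_page", DEFAULT_PER_PAGE).to_i
  per_page = DEFAULT_PER_PAGE if per_page < 1
  per_page = MAX_PER_PAGE if per_page > MAX_PER_PAGE
  [page, per_page]
end

# Wraps one page of results with the count and pagination keys as peers.
def plural_response(key, items, total, page, per_page)
  { key => items, "count" => total, "page" => page, "per_page" => per_page }
end
```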

Add Amendments

Add an "amendments" endpoint, using GovTrack's amendment XML:

Example:
http://www.govtrack.us/data/us/111/bills.amdt/h234.xml

Have a task, amendments_archive, that loads in all amendments to the table, and then goes over each bill (perhaps by Amendment.distinct(:bill_id) or the like) and adds an array of amendments to the bill. Each amendment on a bill should have only the basic fields (everything but the actions).

Directory structure for tasks

Give each task a folder that supports running unit tests (i.e. link to environment.rb correctly), or any other files the task needs.

Have the loader that governs making the rake tasks use the folder names. Have each task load in the [task_name].rb file in the root of the task's folder, and assume that a camelized class name is in there.
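
One way the loader could work, as a sketch: walk the task folders, require `[task_name]/[task_name].rb`, and look up the camelized class. The `camelize` helper is a simplified stand-in for the ActiveSupport method, and `load_tasks` is a hypothetical name.

```ruby
# Simplified camelize: "get_bills" => "GetBills".
def camelize(name)
  name.split("_").map(&:capitalize).join
end

# Returns a hash of task name => task class for every folder in tasks_dir
# that contains a file named after the folder.
def load_tasks(tasks_dir)
  Dir.children(tasks_dir).sort.each_with_object({}) do |name, tasks|
    entry = File.join(tasks_dir, name, "#{name}.rb")
    next unless File.file?(entry)

    require File.expand_path(entry)
    tasks[name] = Object.const_get(camelize(name))
  end
end
```

A rake loader could then define one rake task per key in the returned hash.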

House Whip Notices

Democratic and Republican whip notices for the House, using the code or algorithms in the old RTC API.

Add a Vote model and populate it

Port over the roll call fetching code from Drumbone, into a model named Vote. Add a vote_type field that's either "roll" or whatever it is.

As part of the get_votes task, following roll call loading, iterate through each bill and go through each one's votes array. For any voice votes, create them (and include a "bill" object on them). For any roll call votes, update them with anything worth doing (perhaps nothing, refer to notes).

If the Vote table is empty, fill it from scratch. Otherwise, you can just worry about the roll call votes in the Senate and House whose numbers are higher than the last recorded, since old roll call votes never change. Then, go over bills and add voice votes and link roll call votes as normal.

Once we're pulling in partial roll call vote data in real time, this logic can be updated to be: if the Vote table is empty, fill it from scratch. Otherwise, just fill in the un-filled in roll call votes, then go over bills and add voice votes and link roll call votes as normal.

Forbidden fields on models

Document and try to enforce them somewhere. Anything used in params, basically: captures, callback, sections, apikey, per_page, page, order, sort. I think that's it.
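
A small sketch of one way to enforce this: keep the list in a constant and check each model's field names against it. `forbidden_fields_in` is a hypothetical helper, and how a model exposes its field names is left open.

```ruby
# Names reserved for request params, taken from the list above.
FORBIDDEN_FIELDS = %w[captures callback sections apikey per_page page order sort].freeze

# Returns any field names that collide with a reserved param name.
def forbidden_fields_in(field_names)
  field_names.map(&:to_s) & FORBIDDEN_FIELDS
end

forbidden_fields_in([:bill_id, :page, :sort])
# => ["page", "sort"]
```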

Set up cronjobs on staging and backend

For get_legislators, once a day.
For get_bills, twice a day.
For get_rolls, twice a day.
For house_live, every 5 minutes (consult Kaitlin).

Imminent:
For get_amendments, twice a day.
For rolls_live, every 10 minutes.

Down the line:
For floor_updates, every minute.
For various docserver scrapers, consult Josh.

Pluck out legislator_names and bioguide_ids from clip description

Add a legislator_names array with raw extracted names ("Mr. Price (GA)", "Mr. Stevens", etc.) for each clip, and one aggregated one for the top-level object that has all names mentioned in the clips.

Add a bioguide_ids array with matched bioguide IDs ("L000551", etc.) for each clip, that are determined by the extracted names. Err on the side of including too many bioguide IDs - so if the clip mentions "Mr. Smith" and that matches 3 people, add all 3 of their bioguide IDs to the array, to be safe. As you said, false positives are better than not matching at all. Add an array to the top-level object as well, that has the unique bioguide_ids for all clips.

I'll make sure there's an index on all 4 array fields - "bioguide_ids", "legislator_names", "clips.bioguide_ids", and "clips.legislator_names". Mongo takes care of indexing arrays and fields inside of arrays.

You can scope matching for particular names by chamber, so you only need to look for "Mr. Price" among legislators whose chamber field is "house".

But bear in mind that we can't just match on legislators whose in_office field is true, as legislators may go in and out of office mid-session, and as we transition to the 112th session our database will have multiple sessions.
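
A rough sketch of the extraction and matching, assuming names appear as a title, a last name, and an optional two-letter state in parentheses. The regular expressions, the legislator hash keys, and the helper names are assumptions; real descriptions may need looser patterns. Note that matching is scoped by chamber only, not by `in_office`.

```ruby
# Pulls raw names such as "Mr. Price (GA)" or "Mr. Stevens" from a description.
def extract_names(description)
  description.scan(/\b(?:Mr|Mrs|Ms)\.\s+[A-Z][A-Za-z'-]+(?:\s+\([A-Z]{2}\))?/)
end

# Returns every bioguide ID in the chamber that could match the name,
# erring on the side of too many matches.
def bioguide_ids_for(name, legislators, chamber)
  match = name.match(/\.\s+(?<last>[^\s(]+)(?:\s+\((?<state>[A-Z]{2})\))?/)
  return [] unless match

  legislators.select do |legislator|
    legislator[:chamber] == chamber &&
      legislator[:last_name] == match[:last] &&
      (match[:state].nil? || legislator[:state] == match[:state])
  end.map { |legislator| legislator[:bioguide_id] }
end

# Unique IDs across all clips, for the top-level object.
def all_bioguide_ids(clips)
  clips.flat_map { |clip| clip[:bioguide_ids] }.uniq
end
```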

(It's my hope that eventually our Congress API will evolve to maintain a range of when people were in office, which would help us make more precise choices in our other projects, too.)

Support $all operator

So if someone wants a bill cosponsored by a pair of people:
/bills.json?cosponsor_ids__all=1|2

that'd match any bills where the cosponsor_ids array contains both "1" and "2".
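
A minimal sketch of the translation, assuming the `__all` suffix and pipe-separated values shown above; `all_condition` is a hypothetical helper name.

```ruby
# Builds a Mongo $all condition from a key ending in "__all".
# Returns nil for keys without the suffix.
def all_condition(key, value)
  field = key.delete_suffix("__all")
  return nil if field == key

  { field => { "$all" => value.split("|") } }
end

all_condition("cosponsor_ids__all", "1|2")
# => {"cosponsor_ids"=>{"$all"=>["1", "2"]}}
```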

Expand videos to include White House videos

Real Time "Congress" be damned:

  • update house_live script to add a "chamber" field with a value of "house"
  • update house_live script to rename "timestamp_id" to "video_id" and prepend "house", e.g. "house-123456789"
  • make two whitehouse_live scripts that pull archival and live videos. Use a "chamber" value of "whitehouse", and a "video_id" value of "whitehouse-" followed by the date and slug, e.g. "whitehouse-2010-11-23-new-start-treaty".
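
The two ID formats above could be built like this; the helper names are made up for the sketch.

```ruby
require "date"

# "house-" followed by the old timestamp_id.
def house_video_id(timestamp_id)
  "house-#{timestamp_id}"
end

# "whitehouse-" followed by the date and the slug.
def whitehouse_video_id(date, slug)
  "whitehouse-#{date.strftime('%Y-%m-%d')}-#{slug}"
end

house_video_id(123456789)
# => "house-123456789"
whitehouse_video_id(Date.new(2010, 11, 23), "new-start-treaty")
# => "whitehouse-2010-11-23-new-start-treaty"
```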

Support "not" for fields

For example, to find any vote that was not a roll call:

votes.json?vote_type!=roll

Keys are very unlikely to use exclamation points, though this makes parsing out the conditions a little trickier, of course.
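
A sketch of the parsing, assuming the query string is split on `=` so that `vote_type!=roll` arrives as the key `vote_type!` with the value `roll`. `not_condition` is a hypothetical helper name.

```ruby
# Maps a key ending in "!" to a Mongo $ne condition; other keys pass through.
def not_condition(key, value)
  return { key => value } unless key.end_with?("!")

  { key.chomp("!") => { "$ne" => value } }
end

not_condition("vote_type!", "roll")
# => {"vote_type"=>{"$ne"=>"roll"}}
```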

Record hits in the database for analytics

Be wary, as this caused issues in Drumbone when it got too high, but - keep something.

Perhaps a task that runs monthly that offloads the month's hits into a dump and removes them from the database.

Link votes and amendments

  1. votes_archive should look for references to amendments (do they exist?) and add an amendment_id field and amendment subobject if it's there.
  2. Same for rolls_live_house and rolls_live_senate.

Fetch some roll call vote data in real time

The idea here is to have a separate task that can run every X minutes. It can make new roll call votes, that are missing fields (for example, "required" will be missing, and a related bill might not even exist yet). These fields will be filled in later by the twice-daily roll call vote task (that goes over all THOMAS-provided roll call votes and passage voice votes).

Example of House XML (view source, it's actually XML, and they use Bioguide IDs):
http://clerk.house.gov/evs/2010/roll518.xml

Example of Senate XML (uses some internal ID, will have to parse names out):
http://www.senate.gov/legislative/LIS/roll_call_votes/vote1112/vote_111_2_00229.xml

Sadly, in both cases we'll have to monitor an HTML table to see whether there's new stuff:
http://clerk.house.gov/evs/2010/index.asp
http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_111_2.htm

And the URLs for both tables depend on the year, Congress number, and session number. Not trivial, but: possible.
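
A sketch of building those URLs from a year and a roll number, following the patterns in the examples above. Deriving the Congress and session from the calendar year is an approximation that ignores the first days of January, and the zero-padding of House roll numbers below 100 is an assumption.

```ruby
# Approximate Congress number and session for a calendar year.
def congress_and_session(year)
  [(year - 1789) / 2 + 1, year.odd? ? 1 : 2]
end

def house_roll_url(year, number)
  format("http://clerk.house.gov/evs/%d/roll%03d.xml", year, number)
end

def house_index_url(year)
  "http://clerk.house.gov/evs/#{year}/index.asp"
end

def senate_roll_url(year, number)
  congress, session = congress_and_session(year)
  format("http://www.senate.gov/legislative/LIS/roll_call_votes/vote%d%d/vote_%d_%d_%05d.xml",
         congress, session, congress, session, number)
end

def senate_index_url(year)
  congress, session = congress_and_session(year)
  "http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_#{congress}_#{session}.htm"
end
```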

Dates should be dates, not timestamps

For most bill dates, the stored value is a full timestamp at midnight UTC, which is incorrect. It should be limited to the date only, with no time component.

Since America is west of UTC, these dates represented in any American timezone would be the day before they actually are, which is a serious inaccuracy.
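
The shift is easy to reproduce. This sketch uses a fixed UTC-5 offset to stand in for an American timezone; the date itself is just an example.

```ruby
stored = Time.utc(2010, 11, 23)       # a date stored as midnight UTC
eastern = stored.getlocal("-05:00")   # the same instant at UTC-5

shifted = eastern.strftime("%Y-%m-%d")
# => "2010-11-22" (the day before)

date_only = stored.strftime("%Y-%m-%d")
# => "2010-11-23" (a plain date string cannot shift)
```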

Pull in House videos and floor events

Work with Kaitlin to pull in floor events.

I'm not sure yet how to reconcile the floor events from this feed with the floor_events that Josh already picked up in the old RTC.

Investigate Hudson for monitoring

The Open States project uses this, and it would be worth checking how applicable it is to our own tasks, especially since the most awkward part of supporting multi-language data loaders is the reporting.

Put party breakdown inside vote_breakdown

vote_breakdown: {
    total: {ayes: ..., nays: ..., ...},
    party: {R: {ayes: ..., nays: ..., ...}, D: {...}, ...},
}

Leaves room for us to easily expand on any other ways it could be broken down.
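
A sketch of computing that structure from a list of individual votes. The shape of each voter record (`:party` and `:vote` keys) is assumed for illustration.

```ruby
# Tallies votes overall and per party into the nested structure above.
def vote_breakdown(voters)
  breakdown = {
    total: Hash.new(0),
    party: Hash.new { |parties, party| parties[party] = Hash.new(0) }
  }
  voters.each do |voter|
    breakdown[:total][voter[:vote]] += 1
    breakdown[:party][voter[:party]][voter[:vote]] += 1
  end
  breakdown
end
```

Another breakdown (by state, say) would be one more top-level key alongside `total` and `party`.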

Publish data in bulk

Have it nightly dump the tables to compressed JSON, at a publicly available address.
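
A minimal sketch of the dump step using Ruby's standard `json` and `zlib` libraries, writing one JSON document per line. `dump_collection` is a hypothetical name, and how the documents are read from the database is left out.

```ruby
require "json"
require "zlib"

# Writes each document as one line of JSON into a gzip file at `path`.
def dump_collection(documents, path)
  Zlib::GzipWriter.open(path) do |gz|
    documents.each { |doc| gz.puts(JSON.generate(doc)) }
  end
  path
end
```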

Support "in" and "nin" operators

For example, to support queries such as "give me all bills that are actually bills and not resolutions":

bills.json?bill_type__in=hr|hjres|s|sjres

Pipes seem unlikely to occur in filterable fields, and if we found some source data that uses pipes, we could always swap those pipes out for something else before syndicating it.

Use an "in" query though, not an actual "or" query:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24in

Finally, when "not" is supported, support the idea of "not in" searches, like this one for "anything but simple resolutions":

bills.json?bill_type!=hres|sres

This would map to the "nin" operator in Mongo.
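
A sketch of both translations, assuming the `__in` suffix, the trailing `!` for negation, and pipe-separated values as in the examples above. `membership_condition` is a hypothetical helper and returns nil when neither form applies.

```ruby
# Maps "field__in=a|b" to $in and "field!=a|b" to $nin.
def membership_condition(key, value)
  values = value.split("|")
  if key.end_with?("__in")
    { key.delete_suffix("__in") => { "$in" => values } }
  elsif key.end_with?("!") && values.size > 1
    { key.chomp("!") => { "$nin" => values } }
  end
end

membership_condition("bill_type__in", "hr|hjres|s|sjres")
# => {"bill_type"=>{"$in"=>["hr", "hjres", "s", "sjres"]}}
membership_condition("bill_type!", "hres|sres")
# => {"bill_type"=>{"$nin"=>["hres", "sres"]}}
```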

Link committees and bills

Any bill which has a committee associated, see if we can add on the committee and its relationship to the bill. committee_ids and committee subobjects.

If we need to keep the relationship, then it should work like voter_ids and voters do on the vote object - {committee_id: [id], relationship: "..."} and {committee: {obj}, relationship: "..."}.
