
lobbyist-lookup's Introduction

Unified Congress Lobbyist Disclosure Scraper and Lookup


  • Record Retrieval

    - The latest current-year House lobbying disclosure filings are available on [House.gov](http://disclosures.house.gov/).
    • Using the browser-based search may fail with:

      Cannot download more than 2000 records. Please refine search.

    • The past filings download link ultimately leads here, where filings can be downloaded in XML format.

      • The house.gov site uses a form input with the POST method to an ASP page to serve the archive files. The site runs on ASP, which enforces ViewState and EventValidation to prevent CSRF. ViewState and EventValidation make programmatic POST requests more complicated, because a POST request is only accepted if it carries valid ViewState and EventValidation values.
        • This Go program first retrieves the page from the ASP server with a GET request. After parsing the hidden ViewState and EventValidation input values, it constructs a valid POST request, to which the ASP server replies with a file stream. The program writes that file stream to a defined file.
          • houseRetrieve.go uses the code.google.com/p/go.net/html package to tokenize the HTML.
          • houseRetrieve.go contains the archive downloading portion of the code and can be repurposed to send/receive requests with other ASP sites using CSRF protection.
    • XXXX Registration archives contain new registrations for year XXXX. XXXX N Quarter archives contain the filings due for quarter N.

      • This program will download all archives for the current year.
    • Senate filings on Senate.gov are retrieved using a predictable file naming convention.

      • The Senate provides XML files with up to 1000 filings per file.
        • The XML files are in UTF-16, while Go expects UTF-8.
          • Used code.google.com/p/go-charset/charset to convert UTF-16 to UTF-8.
    • Interesting Info

      • The House has ~90k filings versus the Senate's ~130k filings.
      • Each House filing is in its own XML file, whereas Senate filings come up to 1000 per file.
      • Senate filings therefore parse faster, funnily enough.
    • Retrieves lobbyist filings every day.

      • Heroku cycles dynos every 24 hrs, which also refreshes the list ;)
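
The UTF-16 handling mentioned above for the Senate files can be sketched as follows. This is a minimal sketch that swaps in the Go standard library (unicode/utf16) for the go-charset package the project uses. The function names, the little-endian default when no byte order mark is present, and the `Filing` element name in the sample are assumptions made for illustration, not details taken from the project.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"unicode/utf16"
)

// utf16ToUTF8 converts UTF-16 bytes to UTF-8. A byte order mark selects the
// endianness; without one, little-endian is assumed (an assumption of this sketch).
func utf16ToUTF8(b []byte) ([]byte, error) {
	if len(b)%2 != 0 {
		return nil, errors.New("odd number of bytes in UTF-16 input")
	}
	var order binary.ByteOrder = binary.LittleEndian
	if len(b) >= 2 {
		switch {
		case b[0] == 0xFE && b[1] == 0xFF:
			order, b = binary.BigEndian, b[2:]
		case b[0] == 0xFF && b[1] == 0xFE:
			b = b[2:]
		}
	}
	units := make([]uint16, len(b)/2)
	for i := range units {
		units[i] = order.Uint16(b[2*i:])
	}
	return []byte(string(utf16.Decode(units))), nil
}

// countElements counts start tags called name in already-converted XML.
// The XML declaration may still say encoding="UTF-16", so the decoder gets a
// pass-through CharsetReader instead of a real converter.
func countElements(utf8XML []byte, name string) (int, error) {
	dec := xml.NewDecoder(bytes.NewReader(utf8XML))
	dec.CharsetReader = func(_ string, in io.Reader) (io.Reader, error) {
		return in, nil
	}
	n := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == name {
			n++
		}
	}
}

// encodeUTF16LE builds little-endian UTF-16 with a byte order mark.
// It exists only to create sample input for the demo below.
func encodeUTF16LE(s string) []byte {
	out := []byte{0xFF, 0xFE}
	for _, u := range utf16.Encode([]rune(s)) {
		out = append(out, byte(u), byte(u>>8))
	}
	return out
}

func main() {
	// Hypothetical sample document; the real Senate schema may differ.
	raw := encodeUTF16LE(`<?xml version="1.0" encoding="UTF-16"?>` +
		`<PublicFilings><Filing ID="a"/><Filing ID="b"/></PublicFilings>`)
	converted, err := utf16ToUTF8(raw)
	if err != nil {
		panic(err)
	}
	n, err := countElements(converted, "Filing")
	if err != nil {
		panic(err)
	}
	fmt.Println("filings in file:", n)
}
```

Converting the whole file up front keeps the parsing code unaware of the source encoding, at the cost of holding each file in memory twice.
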
| Parameter | Comment |
| --- | --- |
| __VIEWSTATE | extracted token |
| __EVENTVALIDATION | extracted token |
| selFilesXML | requested archive filename from the page's HTML input element |
| btnDownloadXML | needed to tell ASP to serve file? |
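
The GET-then-POST flow described for house.gov, using the POST parameters listed above, can be sketched roughly as follows. This is a minimal sketch and not the code in houseRetrieve.go: it uses standard-library regular expressions instead of the HTML tokenizer the project uses, and the function names (`inputValues`, `downloadForm`, `fetchArchive`), the `"Download"` button value, and the sample page are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"html"
	"io"
	"net/http"
	"net/url"
	"os"
	"regexp"
	"strings"
)

var (
	inputTag = regexp.MustCompile(`(?is)<input\b[^>]*>`)
	attrPair = regexp.MustCompile(`(?is)\b([a-z_:][-a-z0-9_:.]*)\s*=\s*"([^"]*)"`)
)

// inputValues maps the name of every <input> tag in page to its value.
// Only double-quoted attributes are recognised (a limitation of this sketch).
func inputValues(page string) map[string]string {
	fields := make(map[string]string)
	for _, tag := range inputTag.FindAllString(page, -1) {
		attrs := make(map[string]string)
		for _, m := range attrPair.FindAllStringSubmatch(tag, -1) {
			attrs[strings.ToLower(m[1])] = html.UnescapeString(m[2])
		}
		if name, ok := attrs["name"]; ok {
			fields[name] = attrs["value"]
		}
	}
	return fields
}

// downloadForm builds the POST body from the tokens found in page.
func downloadForm(page, archive string) (url.Values, error) {
	fields := inputValues(page)
	form := url.Values{}
	for _, name := range []string{"__VIEWSTATE", "__EVENTVALIDATION"} {
		value, ok := fields[name]
		if !ok {
			return nil, fmt.Errorf("hidden field %s not found", name)
		}
		form.Set(name, value)
	}
	form.Set("selFilesXML", archive)
	form.Set("btnDownloadXML", "Download") // assumed button value
	return form, nil
}

// fetchArchive GETs the page, POSTs the form back, and writes the reply to dest.
func fetchArchive(pageURL, archive, dest string) error {
	resp, err := http.Get(pageURL)
	if err != nil {
		return err
	}
	body, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		return err
	}
	form, err := downloadForm(string(body), archive)
	if err != nil {
		return err
	}
	post, err := http.PostForm(pageURL, form)
	if err != nil {
		return err
	}
	defer post.Body.Close()
	out, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, post.Body)
	return err
}

func main() {
	// Hypothetical page fragment; the demo makes no network request.
	page := `<form method="post">
<input type="hidden" name="__VIEWSTATE" value="dDwtMTIz/abc=" />
<input type="hidden" name="__EVENTVALIDATION" value="xyz+123" />
</form>`
	form, err := downloadForm(page, "example-archive")
	if err != nil {
		panic(err)
	}
	fmt.Println(form.Encode())
}
```

The tokens are tied to the page that issued them, so the sketch fetches a fresh page before every POST rather than caching them. A site that also relies on session cookies would need an `http.Client` with a cookie jar, which this sketch leaves out.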
