
Richmond Sunlight


This is the front-end of the website. See also: rs-machine, the collection of scrapers and parsers that provide the site's third-party data, rs-api, the API that powers (some of) the website, and rs-video-processor, the on-demand legislative-video-processing system.

History

Richmond Sunlight started in 2005 as a little RSS-based bill tracker, updating every few hours. In 2006 it was built out as Richmond Sunlight, launching publicly in January of 2007. It has remained a hobby site ever since. The code base hasn't been overhauled in all that time, and it shows: the site's tech stack bears the growth rings of being developed over the course of many years. But it continues to function, and it has been modernized in some ways, such as adding a CI/CD pipeline and moving toward a service-oriented architecture.

Branches

Local development

The site can be run locally, in Docker:

  1. Install Docker.
  2. Clone this repository. Make sure you’re using the branch that you want.
  3. Run ./docker-run.sh.
  4. In your browser, open http://localhost:5000.

When you are done, run ./docker-stop.sh (or quit Docker).

Architecture

Network diagram


Issues

Integrate campaign finance data

  • move code and data over from Slicehost
  • integrate into Richmond Sunlight's design and URL structure
  • generate Finance home page based on a call to ElasticSearch
  • generate each committee page based on a call to ElasticSearch (make an exact match)
    • put contributions and expenses in different tabs
    • sort chronologically
    • make table sortable
    • group by election cycle
  • generate a page for each transaction
  • create a unique page for each contribution and expense
    • base the URL on the incremented integers that we use when indexing
  • provide JSON / XML / PDF links on every level of page
  • add a cron job to run SaberVA on Richmond Sunlight each night
  • index the data with ElasticSearch
  • display some relevant data on each legislator's page

Move screenshots to S3

  • Set up an S3 bucket with a reasonable name.
  • Move the screenshot files.
  • Update every place where we generate the screenshot URLs to call them from S3 instead.
  • Delete files from EC2
  • Update the screenshot-generation script to use s3tools to move those images onto S3 after generating them (a sketch follows this list).
  • Start generating thumbnails of the screenshots, rather than just the enormous versions, since mod_pagespeed can't automatically generate thumbnails of images being called from a remote server.
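A minimal sketch of the upload step, shelling out to s3cmd (the s3tools client); the bucket name here is a placeholder, not the real one:

<?php
// Hypothetical sketch: push a freshly generated screenshot to S3 with s3cmd.
// Assumes s3cmd is installed and configured; the bucket name is a placeholder.
function upload_screenshot($local_path)
{
    $destination = 's3://richmondsunlight-screenshots/' . basename($local_path);
    exec(
        's3cmd put --acl-public ' . escapeshellarg($local_path) . ' ' . escapeshellarg($destination),
        $output,
        $status
    );
    return $status === 0; // s3cmd exits non-zero on failure
}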

Automate tagging

We can use prior bills and VA Decoded laws to derive tags; content analysis via OpenCalais is another option.

Move search to ElasticSearch

Now that ElasticSearch is installed, index legislation with it, rather than Sphinx.

Done right, indexing legislation means exporting it as JSON. This should be as simple as generating a list of new legislation, requesting each one from the API, and submitting the resulting JSON to Elasticsearch (a sketch follows the list below).

  • get a list of all bills
  • connect to the API for each one
  • output the resulting JSON to the filesystem
  • set up an Elasticsearch index
  • have Elasticsearch index the JSON files
  • set up a script to automatically index the files
  • move Elasticsearch to a different server
  • only export the files every 24 hours
  • set up an API endpoint for search
  • create a new search page that queries against that endpoint
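A hedged sketch of the export-and-index loop; the API URL pattern and the Elasticsearch index name are assumptions, not known endpoints:

<?php
// Fetch each bill's JSON from the API and PUT it into a local Elasticsearch
// instance. URL paths and the index name are guesses for illustration.
$bill_numbers = ['hb1', 'hb2', 'sb1']; // in practice, SELECTed from MySQL

foreach ($bill_numbers as $number) {
    $json = file_get_contents('https://api.richmondsunlight.com/bill/' . $number . '.json');
    if ($json === false) {
        continue;
    }
    $ch = curl_init('http://localhost:9200/legislation/_doc/' . urlencode($number));
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
    curl_setopt($ch, CURLOPT_POSTFIELDS, $json);
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);
}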

Add Twitter handles

Drop the Twitter RSS column and add a Twitter handle column (a migration sketch follows the list of handles below).

"Jennifer Wexton","@SenatorWexton"
"Nick Rush","@electnickrush"
"Patrick Hope","@HopeforVirginia"
"Donald McEachin","@Donald_McEachin"
"Jennifer McClellan","@JennMcClellanVA"
"Michael Webert","@MichaelWebert"
"Chris Head","@ChrisHead4Del"
"Randy Minchew","@RandyMinchew"
"Joseph Yost","@yostfordelegate"
"Dave Marsden","@SenDaveMarsden"
"Jill H. Vogel","@JillHVogel"
"David Bulova","@DavidBulova"
"Bryce Reeves","@ReevesVA"
"David Ramadan","@DavidIRamadan"
"Lionell Spruill, Sr.","@Del_LSpruill_Sr"
"Mark Keam","@DelegateKeam"
"Ron Villanueva","@DelRVillanueva"
"Dave Albo","@DaveAlbo"
"Terry Kilgore","@delterrykilgore"
"Christopher K. Peace","@DelCPeace"
"Tom Gear","@DelegateTomGear"
"Jeff McWaters","@JeffMcWaters"
"Glenn Oder","@GlennOder"
"Tag Greason","@TagGreason"
"Tim Hugo","@TimHugo"
"Chap Petersen","@ChapPetersen"
"David J. Toscano","@deltoscano"
"Ben Cline","@DelBenCline"
"Bill Stanley","@BillStanley"
"Barbara Comstock","@BarbaraComstock"
"Chris Stolle","@chrisstolle"
"Charniele Herring","@C_Herring"
"Michael Futrell","@michaelfutrell"
"Rich Anderson","@DelRichAnderson"
"Robert G Marshall","@RobertGMarshall"
"Mark Obenshain","@MarkObenshain"
"Jim LeMunyon","@JimLeMunyon"
"Scott Surovell","@ssurovell"
"Mark L. Keam","@MarkKeam"
"Sam Rasoul","@Sam_Rasoul"

Cache bill IDs, numbers in Memcached

Every bill request requires two lookups—one to convert the bill number to an ID, and then one to look up the data for that ID. And we also have the problem with #41 in which we have a bill number, but not an ID, and don't want to have to check MySQL again.

The solution here is to use Memcached to store bill numbers and their IDs, one record per bill number. I'm thinking that we only store the current session's bill numbers in Memcached, to keep the namespace simple and small. cleanup.php can select a list of all bill IDs and numbers and load them into Memcached once an hour. Then Bill::get_id and history.php can look at the Memcached records instead of having to check with MySQL. A sketch follows.
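A minimal sketch of the hourly warm-up, assuming the PHP Memcached extension; the key prefix and schema details are assumptions:

<?php
// Hourly warm-up in cleanup.php: load the current session's bill numbers.
$mc = new Memcached();
$mc->addServer('localhost', 11211);

$db = new PDO('mysql:host=localhost;dbname=richmondsunlight', 'user', 'password');
$stmt = $db->query('SELECT id, number FROM bills WHERE session_id = 1');
foreach ($stmt as $bill) {
    // One record per bill number, e.g. bill_id-hb1 => 12345, cached for an hour.
    $mc->set('bill_id-' . strtolower($bill['number']), (int) $bill['id'], 3600);
}

// Then Bill::get_id() can check the cache before falling back to MySQL:
$id = $mc->get('bill_id-' . strtolower('HB1'));
if ($id === false) {
    // Cache miss: query MySQL as before.
}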

Delete Memcached entry upon status history update

When a bill's entry in bills.csv is updated, we delete its entry from Memcached (which is stored with a key of bills-[id]). But we don't do the same thing with history.csv updates, because we never actually obtain the bill ID during that update process. (We update that with a subselect.) Figure out how to erase that cache entry; one approach is sketched below.
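One approach, assuming PDO and the PHP Memcached extension: resolve the bill ID that the subselect would have found, then purge its cache entry.

<?php
$mc = new Memcached();
$mc->addServer('localhost', 11211);
$db = new PDO('mysql:host=localhost;dbname=richmondsunlight', 'user', 'password');

$number = 'hb1';  // placeholders for the values the history update already has
$session_id = 1;

// Look up the ID the subselect would otherwise find, then delete its entry.
$stmt = $db->prepare('SELECT id FROM bills WHERE number = :number AND session_id = :session');
$stmt->execute([':number' => $number, ':session' => $session_id]);
$bill_id = $stmt->fetchColumn();
if ($bill_id !== false) {
    $mc->delete('bills-' . $bill_id);
}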

Replace ereg_replace() with preg_replace()

ereg_replace() is deprecated as of PHP 5.3 and must be swapped out in the following files (an example conversion follows the list):

  • index.php
  • legislator.php
  • process-comments.php
  • process-tags.php
  • tag.php
  • includes/class.Legislator.php
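A typical conversion: preg_replace() takes much the same pattern, wrapped in delimiters. The variable here is just an example value.

<?php
$bill_number = 'HB1234';

// Before (deprecated):
// $digits = ereg_replace('[^0-9]', '', $bill_number);

// After: the same character class, wrapped in / delimiters.
$digits = preg_replace('/[^0-9]/', '', $bill_number);

// Case-insensitive eregi_replace() calls become preg_replace() with the i modifier:
$chamber = preg_replace('/^(hb|sb).*/i', '$1', $bill_number);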

Convert MyISAM tables to InnoDB

There are a few remaining MyISAM tables. These are inefficient and, I'm pretty sure, unnecessary at this point. Worse still, they impact RDS's ability to do point-in-time restores. Convert them to InnoDB.
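A sketch of the conversion: find the remaining MyISAM tables in the current database and convert each one. The credentials are placeholders.

<?php
$db = new PDO('mysql:host=localhost;dbname=richmondsunlight', 'user', 'password');

// information_schema lists each table's storage engine.
$tables = $db->query(
    "SELECT TABLE_NAME FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = DATABASE() AND ENGINE = 'MyISAM'"
)->fetchAll(PDO::FETCH_COLUMN);

foreach ($tables as $table) {
    $db->exec('ALTER TABLE `' . $table . '` ENGINE = InnoDB');
}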

Create pages for each video clip

We have a measurable, finite, indexed list of video clips. Each of those should have a unique URL, to facilitate discoverability and sharing.

  • Create a page to display each clip.
  • Create a page that lists all clips. This is more for search engines than for humans. Group by date.
  • On each clip page, provide some data about the legislator and/or the bill in question.
  • Provide recommendations for similar clips, presumably using bill tags.
  • Add a "play" icon on top of the placeholder image (though also set them to autoplay).

Move the site to EC2

  • Migrate files to EC2
  • Migrate database to EC2 MySQL installation
  • Review cron jobs and exec() statements for programs to install and install them
  • Set up a site user (ricsun)
  • Set up beta.richmondsunlight.com
  • Establish a static IP (54.209.110.70)
  • Move the site to Ubuntu 12, to step down from PHP 5.5
  • Test the site
  • Set up automounting of the EBS
  • Move and test m.richmondsunlight.com and api.richmondsunlight.com
  • Reboot the server and make sure that all services start up properly
  • Figure out why Apache isn't starting at boot time
  • Fix file ownership, which is weird
  • Set up a crontab for the site user
  • Drop the DNS TTL to 1 minute
  • Sync recent changes within the filesystem
  • Close down the old site with a "site down" notice
  • Sync the database to the EC2 server
  • Enable the crontab
  • Update DNS to use EC2 as the live site
  • Install Sphinx
  • Upgrade to MySQL 5.6 (5.5 handles subqueries inefficiently)
  • Move EC2 key to ricsun account
  • Switch to a non-privileged MySQL user (for WP and Sphinx, too)
  • Use a better RDS password (functions, settings, WP, Mint)
  • Take the DNS TTL back up to 24 hours
  • Decommission the old server

Prevent cron jobs from bogging down Apache

The server is DoSing itself with the cron jobs. For whatever reason, some of them never complete, leaving that Apache instance running permanently. (This was a problem on the old server, too, but it took weeks of such hung processes to actually affect the server.)

I think the easiest thing to do will be to pipe all requests through alarmlimit, to ensure that they get cut off after some point. That's not the proper solution, but it'll work.

Use GovernorNotes.txt

There is a pair of files on the legislature's FTP server, 2012GovernorNotes.txt and 2013GovernorNotes.txt, that look potentially useful. It's a list of bills by number, patron, catch line, and description, and then notes. These notes include entries like these four (unrelated) examples:

This bill is identical to HB 1468. HB 1444 (pending), HB 1672 (pending), and HB 1988 amend § 8.01-225, but there are no conflicts. HB 1444 (pending), HB 1499, HB 1564, HB 1672 (pending), HB 1759, HB 2161, SB 773, and SB 807 amend § 54.1-3408, but there are no conflicts.

Please note that § 22.1-279.3:1, which is not amended in this bill, relies on the definition of "firearm" in § 22.1-277.07 that this bill amends. Consequently the reporting requirements in § 22.1-279.3:1 would be changed by this bill.

A technical amendment is suggested. The new language added by the bill to subsection K of § 3.2-6540 refers to a violation of this subsection but it clearly was intended to refer to a violation of the section as a whole. (Subsection K is a penalty provision that provides little substantive law.) Clarifying amendments adopted in the House Agriculture, Chesapeake and Natural Resources Committee were rejected by the Senate.

"This bill is similar to SB 1097 but contains some stylistic and substantive differences: I. Stylistic: On lines 3, 8, 9, 27, 31, and 36, this bill references children identified as deaf or hard of hearing"; on those same lines" SB 1097 references "hearing-impaired children, a child who is deaf or hard of hearing, and deaf or hard-of-hearing children."

This seems like awfully useful data. At the moment, I'm only seeing files for prior years. Based on the name, I suspect that this is a list of all legislation that has been passed by the legislature and is awaiting the governor's signature, but I'm not sure.

Use WordPress' API to ID articles for a given bill

Right now, we're using the per-tag RSS feed to find out whether a given bill has any articles (in WordPress) about it. This is really slow compared to using WordPress' API. Instead, every hour in cache.php we should get https://www.richmondsunlight.com/blog/wp-json/taxonomies/post_tag/terms and cache the resulting data in Memcached. Then we can check that cache for each bill, and only check the RSS feed if there's a tag using that bill number. A sketch:
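A hedged sketch of the cache.php step; the response shape (an array of term objects with a "name" field) is an assumption about this version of the WP API:

<?php
$mc = new Memcached();
$mc->addServer('localhost', 11211);

// Fetch the blog's full tag list and cache it for an hour.
$json = file_get_contents('https://www.richmondsunlight.com/blog/wp-json/taxonomies/post_tag/terms');
if ($json !== false) {
    $terms = json_decode($json, true);
    $tags = array_map(function ($term) {
        return strtolower($term['name']);
    }, (array) $terms);
    $mc->set('blog-tags', $tags, 3600);
}

// Then, per bill, only hit the RSS feed if the bill number appears as a tag:
$tags = $mc->get('blog-tags');
$check_rss = is_array($tags) && in_array('hb1234', $tags, true);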

Comments queries are slow

It takes 80–150 ms to get a MySQL response to a comments query. The problem can be seen when EXPLAINing the query:

SELECT comments.name, comments.date_created, comments.email, comments.url, comments.comment, UNIX_TIMESTAMP(comments.date_created) AS TIMESTAMP, comments.editors_pick, users.representative_id
FROM comments
LEFT JOIN users
   ON comments.user_id = users.id
LEFT JOIN bills
   ON comments.bill_id = bills.id
WHERE (
   comments.bill_id = 12345
   OR (
      bills.summary_hash = "541fccef54fca0f8339f9dac2e50f70f"
      AND bills.session_id = 1
   )
)
AND comments.status = "published"
ORDER BY comments.date_created ASC

MySQL's EXPLAIN output (a screenshot in the original issue) shows that no keys are used.
Not using any keys doesn't make any sense. I even created a new index, publishable, on the combination of bill_id and status. I thought it might be a result of a mismatch between the integer lengths of user_id between tables, but after making those all identical, the problem persists. One likely explanation: MySQL generally can't use an index when an OR condition spans columns from two different tables, as this one does; rewriting the query as a UNION of two separately indexable queries (sketched below) may help.
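A hedged rewrite, splitting the OR into a UNION so that each half can use an index; the column list is abbreviated and parameters are bound via PDO:

<?php
$db = new PDO('mysql:host=localhost;dbname=richmondsunlight', 'user', 'password');

// Each half of the UNION has a simple, indexable WHERE clause.
$sql = '
(SELECT comments.*, users.representative_id
    FROM comments
    LEFT JOIN users ON comments.user_id = users.id
    WHERE comments.bill_id = :bill_id
    AND comments.status = "published")
UNION
(SELECT comments.*, users.representative_id
    FROM comments
    LEFT JOIN users ON comments.user_id = users.id
    JOIN bills ON comments.bill_id = bills.id
    WHERE bills.summary_hash = :hash
    AND bills.session_id = :session
    AND comments.status = "published")
ORDER BY date_created ASC';

$stmt = $db->prepare($sql);
$stmt->execute([
    ':bill_id' => 12345,
    ':hash'    => '541fccef54fca0f8339f9dac2e50f70f',
    ':session' => 1,
]);
$comments = $stmt->fetchAll(PDO::FETCH_ASSOC);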

Use website archives

  • create a database to track legislator / date / URL
  • bulk add all existing legislator / date / URL records into the database (find /vol/www/richmondsunlight.com/html/mirror -maxdepth 2)
  • move mirrored contents to S3
  • modify the archival script to add new records to the database
  • query the database as part of the standard API response with the Legislator method
  • display links to the archived versions on the public-facing legislator page

Automatically assign tags

Using Sphinx and our existing tag corpus, we should be able to automatically assign tags to legislation, perhaps simply by propagating the tags already applied to similar bills.

Move to PDO, prepared queries

Much of the site still uses mysql_query() and friends. Those functions were deprecated, rightly so, as of PHP 5.5, and are removed entirely in PHP 7. Rewrite all of these queries to use PDO. For example:
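A minimal before/after sketch; the query and credentials are placeholders:

<?php
// Before: the old mysql_* style, with hand-rolled escaping:
// $result = mysql_query("SELECT * FROM bills WHERE number = '" . $number . "'");

// After: PDO with a prepared query, which handles escaping itself.
$db = new PDO('mysql:host=localhost;dbname=richmondsunlight;charset=utf8', 'user', 'password');
$stmt = $db->prepare('SELECT * FROM bills WHERE number = :number AND session_id = :session');
$stmt->execute([':number' => 'hb1', ':session' => 1]);
$bills = $stmt->fetchAll(PDO::FETCH_ASSOC);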

Create a committee parser

Committee data is available via FTP. Parse that file into something useful. But make sure that parser stands alone, because that could be useful to others.

Move website mirrors to S3

That is, the mirrors of legislators' websites.

  • Have cron/mirror.php query S3, rather than the filesystem, to determine the recency of mirrors for each legislator.
  • Have cron/mirror.php store its files on S3, rather than the filesystem.

Check and flag legislation that was written by a third party

The PDF of legislation that was written by a third party is flagged as such. For instance, this bill contains this text at the top:

LEGISLATION NOT PREPARED BY DLS

The process is easy (a sketch follows the list):

  • Modify the database to add a column to track whether the legislation was prepared by DLS. Allow that column to be null.
  • Write a PHP script to randomly select X bills for which we have no data for the origin of the text.
  • Pipe the already-archived PDF through pdftohtml (part of Poppler, installable with sudo apt-get install poppler-utils).
  • Convert the output HTML's entities to UTF-8 characters with Recode (recode html..utf8)
  • Look for the string LEGISLATION NOT PREPARED BY DLS (keeping in mind that those spaces are no-break spaces, which PCRE's \s does not match by default, so match them explicitly rather than with a literal space).
  • Update the record in the database to indicate whether the legislation originated externally or not.
  • Include this data within the API.
  • Include this data within the UI, saying nothing if the bill was written internally, but providing a notice if it was written externally.
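A hedged sketch of the detection step; pdftohtml's -stdout flag and Recode's filter mode are believed correct but worth verifying locally, and the path handling is a placeholder:

<?php
function bill_prepared_by_dls($pdf_path)
{
    // Convert the archived PDF to text via pdftohtml and Recode.
    $text = shell_exec(
        'pdftohtml -stdout ' . escapeshellarg($pdf_path) . ' | recode html..utf8'
    );
    if ($text === null) {
        return null; // conversion failed; leave the database column NULL
    }

    // The marker's spaces are no-break spaces (U+00A0), which PCRE's \s does
    // not match by default, so allow both explicitly.
    $pattern = '/LEGISLATION(?:\s|\x{00A0})+NOT(?:\s|\x{00A0})+PREPARED(?:\s|\x{00A0})+BY(?:\s|\x{00A0})+DLS/u';

    // true = prepared by DLS (no marker found); false = third-party text.
    return preg_match($pattern, $text) === 0;
}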

Account creation fails infuriatingly for existing quasi-users

If somebody has posted comments before, and we have a quasi-account for them, they cannot create an account, because one exists. So they try to get their password reset, but they can't, because they don't have an account. Once upon a time, they could create an account, and the existing data was merged into the new account, but not anymore. Figure out what's going on and fix it.

Generate related bills with Sphinx

Right now, we're generating the list of related bills with a MySQL query. Rewrite this to use Sphinx, preferably within the API rather than within the UI. A sketch:
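A hedged sketch using the sphinxapi PHP client; the index name ("bills") and the idea of querying on the bill's catch line are assumptions:

<?php
require 'sphinxapi.php';

$catch_line = 'Absentee voting; eligibility.'; // example input

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);
$cl->SetMatchMode(SPH_MATCH_ANY); // match any of the catch line's words
$cl->SetLimits(0, 5);             // five related bills

$result = $cl->Query($cl->EscapeString($catch_line), 'bills');
$related_ids = ($result !== false && !empty($result['matches']))
    ? array_keys($result['matches'])
    : [];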

Move downloads to S3

CSV, JSON, HTML, etc.

  • Modify the downloads-generating cron job to copy files to S3 at the completion of the script.
  • Create a downloads page.

Start using memcached

APC was eliminated without a replacement. Re-add the same basic functionality (caching legislators, bills, and the home page) in memcached; a sketch follows the list.

  • Port per-page caching to memcached
  • Port per-user recommendation caching to memcached
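A sketch of per-page caching with the PHP Memcached extension; keying on the request URI is an assumption about how the APC version worked:

<?php
$mc = new Memcached();
$mc->addServer('localhost', 11211);

$key = 'page-' . md5($_SERVER['REQUEST_URI']);
$html = $mc->get($key);
if ($html === false) {
    // Cache miss: render the page and store the HTML for ten minutes.
    ob_start();
    // ...render the page as usual...
    echo '<html>rendered page</html>';
    $html = ob_get_clean();
    $mc->set($key, $html, 600);
}
echo $html;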

Create an open source parser

Right now Richmond Sunlight parses the CSV natively (and nastily). Create a standalone parser to turn that CSV into JSON, republish both the parser and the resulting JSON, and then use that JSON as the raw data for Richmond Sunlight. A minimal sketch:
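A minimal sketch of the standalone parser, reading a CSV keyed by its header row and emitting JSON; the filenames are placeholders:

<?php
$fh = fopen('bills.csv', 'r');
$header = fgetcsv($fh); // first row names the columns

$bills = [];
while (($row = fgetcsv($fh)) !== false) {
    $bills[] = array_combine($header, $row);
}
fclose($fh);

file_put_contents('bills.json', json_encode($bills, JSON_PRETTY_PRINT));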

Republish legislature's files

The legislature provides useful CSV files, locked behind a password-protected FTP interface. We're downloading them anyway—share them with folks via the downloads section.

Fix video clips

Video clips aren't playing—the link just goes to archive.org.
