
Richmond Sunlight


This is the front-end of the website. See also: rs-machine, the collection of scrapers and parsers that provide the site's third-party data, rs-api, the API that powers (some of) the website, and rs-video-processor, the on-demand legislative-video-processing system.

History

Richmond Sunlight started in 2005 as a little RSS-based bill tracker, updating every few hours. In 2006 it was built out as Richmond Sunlight, launching publicly in January of 2007. It has remained a hobby site ever since. The code base hasn't been overhauled in all that time, and it shows: the site's tech stack bears the growth rings of being developed over the course of many years. But it continues to function, and it has been modernized in some ways, such as adding a CI/CD pipeline and moving toward a service-oriented architecture.

Branches

Local development

The site can be run locally, in Docker:

  1. Install Docker.
  2. Clone this repository. Make sure you’re using the branch that you want.
  3. Run ./docker-run.sh.
  4. In your browser, open http://localhost:5000.

When you are done, run ./docker-stop.sh (or quit Docker).

Architecture

Network diagram


Issues

Integrate campaign finance data

  • move code and data over from Slicehost
  • integrate into Richmond Sunlight's design and URL structure
  • generate Finance home page based on a call to ElasticSearch
  • generate each committee page based on a call to ElasticSearch (make an exact match)
    • put contributions and expenses in different tabs
    • sort chronologically
    • make table sortable
    • group by election cycle
  • generate a page for each transaction
  • create a unique page for each contribution and expense
    • base the URL on the incremented integers that we use when indexing
  • provide JSON / XML / PDF links on every level of page
  • add a cron job to run SaberVA on Richmond Sunlight each night
  • index the data with ElasticSearch
  • display some relevant data on each legislator's page

Move screenshots to S3

  • Set up an S3 bucket with a reasonable name.
  • Move the screenshot files.
  • Update every place where we generate the screenshot URLs to call them from S3 instead.
  • Delete files from EC2
  • Update the screenshot-generation script to use s3tools to move those images onto S3 after generating them (a sketch follows this list).
  • Start generating thumbnails of the screenshots, rather than just the enormous versions, since mod_pagespeed can't automatically generate thumbnails of images being called from a remote server.
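A minimal sketch of the upload step, shelling out to s3cmd (the s3tools client); the bucket name here is a placeholder, not the real one:

<?php
// Hypothetical sketch: push a freshly generated screenshot to S3 with s3cmd.
// Assumes s3cmd is installed and configured; the bucket name is a placeholder.
function upload_screenshot($local_path)
{
    $destination = 's3://richmondsunlight-screenshots/' . basename($local_path);
    exec(
        's3cmd put --acl-public ' . escapeshellarg($local_path) . ' ' . escapeshellarg($destination),
        $output,
        $status
    );
    return $status === 0; // s3cmd exits non-zero on failure
}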

Automate tagging

We can use prior bills and VA Decoded laws to derive tags; content analysis via OpenCalais is another option.

Move search to ElasticSearch

Now that ElasticSearch is installed, index legislation with it, rather than Sphinx.

Done right, indexing legislation means exporting it as JSON. This should be as simple as generating a list of new legislation, requesting each one from the API, and submitting the resulting JSON to Elasticsearch (a sketch follows the list below).

  • get a list of all bills
  • connect to the API for each one
  • output the resulting JSON to the filesystem
  • set up an Elasticsearch index
  • have Elasticsearch index the JSON files
  • set up a script to automatically index the files
  • move Elasticsearch to a different server
  • only export the files every 24 hours
  • set up an API endpoint for search
  • create a new search page that queries against that endpoint
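A hedged sketch of the export-and-index loop; the API URL pattern and the Elasticsearch index name are assumptions, not known endpoints:

<?php
// Fetch each bill's JSON from the API and PUT it into a local Elasticsearch
// instance. URL paths and the index name are guesses for illustration.
$bill_numbers = ['hb1', 'hb2', 'sb1']; // in practice, SELECTed from MySQL

foreach ($bill_numbers as $number) {
    $json = file_get_contents('https://api.richmondsunlight.com/bill/' . $number . '.json');
    if ($json === false) {
        continue;
    }
    $ch = curl_init('http://localhost:9200/legislation/_doc/' . urlencode($number));
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
    curl_setopt($ch, CURLOPT_POSTFIELDS, $json);
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);
}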

Add Twitter handles

Drop the Twitter RSS column and add a Twitter handle column (a migration sketch follows the list of handles below).

"Jennifer Wexton","@SenatorWexton"
"Nick Rush","@electnickrush"
"Patrick Hope","@HopeforVirginia"
"Donald McEachin","@Donald_McEachin"
"Jennifer McClellan","@JennMcClellanVA"
"Michael Webert","@MichaelWebert"
"Chris Head","@ChrisHead4Del"
"Randy Minchew","@RandyMinchew"
"Joseph Yost","@yostfordelegate"
"Dave Marsden","@SenDaveMarsden"
"Jill H. Vogel","@JillHVogel"
"David Bulova","@DavidBulova"
"Bryce Reeves","@ReevesVA"
"David Ramadan","@DavidIRamadan"
"Lionell Spruill, Sr.","@Del_LSpruill_Sr"
"Mark Keam","@DelegateKeam"
"Ron Villanueva","@DelRVillanueva"
"Dave Albo","@DaveAlbo"
"Terry Kilgore","@delterrykilgore"
"Christopher K. Peace","@DelCPeace"
"Tom Gear","@DelegateTomGear"
"Jeff McWaters","@JeffMcWaters"
"Glenn Oder","@GlennOder"
"Tag Greason","@TagGreason"
"Tim Hugo","@TimHugo"
"Chap Petersen","@ChapPetersen"
"David J. Toscano","@deltoscano"
"Ben Cline","@DelBenCline"
"Bill Stanley","@BillStanley"
"Barbara Comstock","@BarbaraComstock"
"Chris Stolle","@chrisstolle"
"Charniele Herring","@C_Herring"
"Michael Futrell","@michaelfutrell"
"Rich Anderson","@DelRichAnderson"
"Robert G Marshall","@RobertGMarshall"
"Mark Obenshain","@MarkObenshain"
"Jim LeMunyon","@JimLeMunyon"
"Scott Surovell","@ssurovell"
"Mark L. Keam","@MarkKeam"
"Sam Rasoul","@Sam_Rasoul"

Cache bill IDs, numbers in Memcached

Every bill request requires two lookups—one to convert the bill number to an ID, and then one to look up the data for that ID. And we also have the problem with #41 in which we have a bill number, but not an ID, and don't want to have to check MySQL again.

The solution here is to use Memcached to store bill numbers and their IDs, one record per bill number. I'm thinking that we only store the current session's bill numbers in Memcached, to keep the namespace simple and small. cleanup.php can select a list of all bill IDs and numbers and load them into Memcached once an hour. Then Bill::get_id and history.php can look at the Memcached records instead of having to check with MySQL. A sketch follows.
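A minimal sketch of the hourly warm-up, assuming the PHP Memcached extension; the key prefix and schema details are assumptions:

<?php
// Hourly warm-up in cleanup.php: load the current session's bill numbers.
$mc = new Memcached();
$mc->addServer('localhost', 11211);

$db = new PDO('mysql:host=localhost;dbname=richmondsunlight', 'user', 'password');
$stmt = $db->query('SELECT id, number FROM bills WHERE session_id = 1');
foreach ($stmt as $bill) {
    // One record per bill number, e.g. bill_id-hb1 => 12345, cached for an hour.
    $mc->set('bill_id-' . strtolower($bill['number']), (int) $bill['id'], 3600);
}

// Then Bill::get_id() can check the cache before falling back to MySQL:
$id = $mc->get('bill_id-' . strtolower('HB1'));
if ($id === false) {
    // Cache miss: query MySQL as before.
}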

Delete Memcached entry upon status history update

When a bill's entry in bills.csv is updated, we delete its entry from Memcached (which is stored with a key of bills-[id]). But we don't do the same thing with history.csv updates, because we never actually obtain the bill ID during that update process. (We update that with a subselect.) Figure out how to erase that cache entry; one approach is sketched below.
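One approach, assuming PDO and the PHP Memcached extension: resolve the bill ID that the subselect would have found, then purge its cache entry.

<?php
$mc = new Memcached();
$mc->addServer('localhost', 11211);
$db = new PDO('mysql:host=localhost;dbname=richmondsunlight', 'user', 'password');

$number = 'hb1';  // placeholders for the values the history update already has
$session_id = 1;

// Look up the ID the subselect would otherwise find, then delete its entry.
$stmt = $db->prepare('SELECT id FROM bills WHERE number = :number AND session_id = :session');
$stmt->execute([':number' => $number, ':session' => $session_id]);
$bill_id = $stmt->fetchColumn();
if ($bill_id !== false) {
    $mc->delete('bills-' . $bill_id);
}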

Replace ereg_replace() with preg_replace()

ereg_replace() is deprecated as of PHP 5.3 and must be swapped out in the following files (an example conversion follows the list):

  • index.php
  • legislator.php
  • process-comments.php
  • process-tags.php
  • tag.php
  • includes/class.Legislator.php
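A typical conversion: preg_replace() takes much the same pattern, wrapped in delimiters. The variable here is just an example value.

<?php
$bill_number = 'HB1234';

// Before (deprecated):
// $digits = ereg_replace('[^0-9]', '', $bill_number);

// After: the same character class, wrapped in / delimiters.
$digits = preg_replace('/[^0-9]/', '', $bill_number);

// Case-insensitive eregi_replace() calls become preg_replace() with the i modifier:
$chamber = preg_replace('/^(hb|sb).*/i', '$1', $bill_number);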

Convert MyISAM tables to InnoDB

There are a few remaining MyISAM tables. These are inefficient and, I'm pretty sure, unnecessary at this point. Worse still, they impact RDS's ability to do point-in-time restores. Convert them to InnoDB.
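A sketch of the conversion: find the remaining MyISAM tables in the current database and convert each one. The credentials are placeholders.

<?php
$db = new PDO('mysql:host=localhost;dbname=richmondsunlight', 'user', 'password');

// information_schema lists each table's storage engine.
$tables = $db->query(
    "SELECT TABLE_NAME FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = DATABASE() AND ENGINE = 'MyISAM'"
)->fetchAll(PDO::FETCH_COLUMN);

foreach ($tables as $table) {
    $db->exec('ALTER TABLE `' . $table . '` ENGINE = InnoDB');
}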

Create pages for each video clip

We have a measurable, finite, indexed list of video clips. Each of those should have a unique URL, to facilitate discoverability and sharing.

  • Create a page to display each clip.
  • Create a page that lists all clips. This is more for search engines than for humans. Group by date.
  • On each clip page, provide some data about the legislator and/or the bill in question.
  • Provide recommendations for similar clips, presumably using bill tags.
  • Add a "play" icon on top of the placeholder image (though also set them to autoplay).

Move the site to EC2

  • Migrate files to EC2
  • Migrate database to EC2 MySQL installation
  • Review cron jobs and exec() statements for programs to install and install them
  • Set up a site user (ricsun)
  • Set up beta.richmondsunlight.com
  • Establish a static IP (54.209.110.70)
  • Move the site to Ubuntu 12, to step down from PHP 5.5
  • Test the site
  • Set up automounting of the EBS
  • Move and test m.richmondsunlight.com and api.richmondsunlight.com
  • Reboot the server and make sure that all services start up properly
  • Figure out why Apache isn't starting at boot time
  • Fix file ownership, which is weird
  • Set up a crontab for the site user
  • Drop the DNS TTL to 1 minute
  • Sync recent changes within the filesystem
  • Close down the old site with a "site down" notice
  • Sync the database to the EC2 server
  • Enable the crontab
  • Update DNS to use EC2 as the live site
  • Install Sphinx
  • Upgrade to MySQL 5.6 (5.5 handles subqueries inefficiently)
  • Move EC2 key to ricsun account
  • Switch to a non-privileged MySQL user (for WP and Sphinx, too)
  • Use a better RDS password (functions, settings, WP, Mint)
  • Take the DNS TTL back up to 24 hours
  • Decommission the old server

Prevent cron jobs from bogging down Apache

The server is DoSing itself with the cron jobs. For whatever reason, some of them never complete, leaving that Apache instance running permanently. (This was a problem on the old server, too, but it took weeks of such hung processes to actually affect the server.)

I think the easiest thing to do will be to pipe all requests through alarmlimit, to ensure that they get cut off after some point. That's not the proper solution, but it'll work.

Use GovernorNotes.txt

There is a pair of files on the legislature's FTP server, 2012GovernorNotes.txt and 2013GovernorNotes.txt, that look potentially useful. It's a list of bills by number, patron, catch line, and description, and then notes. These notes include entries like these four (unrelated) examples:

This bill is identical to HB 1468. HB 1444 (pending), HB 1672 (pending), and HB 1988 amend § 8.01-225, but there are no conflicts. HB 1444 (pending), HB 1499, HB 1564, HB 1672 (pending), HB 1759, HB 2161, SB 773, and SB 807 amend § 54.1-3408, but there are no conflicts.

Please note that § 22.1-279.3:1, which is not amended in this bill, relies on the definition of "firearm" in § 22.1-277.07 that this bill amends. Consequently the reporting requirements in § 22.1-279.3:1 would be changed by this bill.

A technical amendment is suggested. The new language added by the bill to subsection K of § 3.2-6540 refers to a violation of this subsection but it clearly was intended to refer to a violation of the section as a whole. (Subsection K is a penalty provision that provides little substantive law.) Clarifying amendments adopted in the House Agriculture, Chesapeake and Natural Resources Committee were rejected by the Senate.

"This bill is similar to SB 1097 but contains some stylistic and substantive differences: I. Stylistic: On lines 3, 8, 9, 27, 31, and 36, this bill references children identified as deaf or hard of hearing"; on those same lines" SB 1097 references "hearing-impaired children, a child who is deaf or hard of hearing, and deaf or hard-of-hearing children."

This seems like awfully useful data. At the moment, I'm only seeing files for prior years. Based on the name, I suspect that this is a list of all legislation that has been passed by the legislature and is awaiting the governor's signature, but I'm not sure.

Use WordPress' API to ID articles for a given bill

Right now, we're using the per-tag RSS feed to find out whether a given bill has any articles (in WordPress) about it. This is really slow compared to using WordPress' API. Instead, every hour in cache.php we should get https://www.richmondsunlight.com/blog/wp-json/taxonomies/post_tag/terms and cache the resulting data in Memcached. Then we can check that cache for each bill, and only check the RSS feed if there's a tag using that bill number. A sketch:
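A hedged sketch of the cache.php step; the response shape (an array of term objects with a "name" field) is an assumption about this version of the WP API:

<?php
$mc = new Memcached();
$mc->addServer('localhost', 11211);

// Fetch the blog's full tag list and cache it for an hour.
$json = file_get_contents('https://www.richmondsunlight.com/blog/wp-json/taxonomies/post_tag/terms');
if ($json !== false) {
    $terms = json_decode($json, true);
    $tags = array_map(function ($term) {
        return strtolower($term['name']);
    }, (array) $terms);
    $mc->set('blog-tags', $tags, 3600);
}

// Then, per bill, only hit the RSS feed if the bill number appears as a tag:
$tags = $mc->get('blog-tags');
$check_rss = is_array($tags) && in_array('hb1234', $tags, true);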

Comments queries are slow

It takes 80–150 ms to get a MySQL response to a comments query. The problem can be seen when EXPLAINing the query:

SELECT comments.name, comments.date_created, comments.email, comments.url, comments.comment, UNIX_TIMESTAMP(comments.date_created) AS TIMESTAMP, comments.editors_pick, users.representative_id
FROM comments
LEFT JOIN users
   ON comments.user_id = users.id
LEFT JOIN bills
   ON comments.bill_id = bills.id
WHERE (
   comments.bill_id = 12345
   OR (
      bills.summary_hash = "541fccef54fca0f8339f9dac2e50f70f"
      AND bills.session_id = 1
   )
)
AND comments.status = "published"
ORDER BY comments.date_created ASC

MySQL's EXPLAIN output (a screenshot in the original issue) shows that no keys are used.
Not using any keys doesn't make any sense. I even created a new index, publishable, on the combination of bill_id and status. I thought it might be a result of a mismatch between the integer lengths of user_id between tables, but after making those all identical, the problem persists. One likely explanation: MySQL generally can't use an index when an OR condition spans columns from two different tables, as this one does; rewriting the query as a UNION of two separately indexable queries (sketched below) may help.
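A hedged rewrite, splitting the OR into a UNION so that each half can use an index; the column list is abbreviated and parameters are bound via PDO:

<?php
$db = new PDO('mysql:host=localhost;dbname=richmondsunlight', 'user', 'password');

// Each half of the UNION has a simple, indexable WHERE clause.
$sql = '
(SELECT comments.*, users.representative_id
    FROM comments
    LEFT JOIN users ON comments.user_id = users.id
    WHERE comments.bill_id = :bill_id
    AND comments.status = "published")
UNION
(SELECT comments.*, users.representative_id
    FROM comments
    LEFT JOIN users ON comments.user_id = users.id
    JOIN bills ON comments.bill_id = bills.id
    WHERE bills.summary_hash = :hash
    AND bills.session_id = :session
    AND comments.status = "published")
ORDER BY date_created ASC';

$stmt = $db->prepare($sql);
$stmt->execute([
    ':bill_id' => 12345,
    ':hash'    => '541fccef54fca0f8339f9dac2e50f70f',
    ':session' => 1,
]);
$comments = $stmt->fetchAll(PDO::FETCH_ASSOC);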

Use website archives

  • create a database to track legislator / date / URL
  • bulk add all existing legislator / date / URL records into the database (find /vol/www/richmondsunlight.com/html/mirror -maxdepth 2)
  • move mirrored contents to S3
  • modify the archival script to add new records to the database
  • query the database as part of the standard API response with the Legislator method
  • display links to the archived versions on the public-facing legislator page

Automatically assign tags

Using Sphinx and our existing tag corpus, we should be able to automatically assign tags to legislation, perhaps simply by propagating the tags already applied to similar bills.

Move to PDO, prepared queries

Much of the site still uses mysql_query() and friends. Those functions were deprecated, rightly so, as of PHP 5.5, and are removed entirely in PHP 7. Rewrite all of these queries to use PDO. For example:
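A minimal before/after sketch; the query and credentials are placeholders:

<?php
// Before: the old mysql_* style, with hand-rolled escaping:
// $result = mysql_query("SELECT * FROM bills WHERE number = '" . $number . "'");

// After: PDO with a prepared query, which handles escaping itself.
$db = new PDO('mysql:host=localhost;dbname=richmondsunlight;charset=utf8', 'user', 'password');
$stmt = $db->prepare('SELECT * FROM bills WHERE number = :number AND session_id = :session');
$stmt->execute([':number' => 'hb1', ':session' => 1]);
$bills = $stmt->fetchAll(PDO::FETCH_ASSOC);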

Create a committee parser

Committee data is available via FTP. Parse that file into something useful. But make sure that parser stands alone, because that could be useful to others.

Move website mirrors to S3

That is, the mirrors of legislators' websites.

  • Have cron/mirror.php query S3, rather than the filesystem, to determine the recency of mirrors for each legislator.
  • Have cron/mirror.php store its files on S3, rather than the filesystem.

Check and flag legislation that was written by a third party

The PDF of legislation that was written by a third party is flagged as such. For instance, this bill contains this text at the top:

LEGISLATION NOT PREPARED BY DLS

The process is easy (a sketch follows the list):

  • Modify the database to add a column to track whether the legislation was prepared by DLS. Allow that column to be null.
  • Write a PHP script to randomly select X bills for which we have no data for the origin of the text.
  • Pipe the already-archived PDF through pdftohtml (part of Poppler, installable with sudo apt-get install poppler-utils).
  • Convert the output HTML's entities to UTF-8 characters with Recode (recode html..utf8)
  • Look for the string LEGISLATION NOT PREPARED BY DLS (keeping in mind that those spaces are no-break spaces, which PCRE's \s does not match by default, so match them explicitly rather than with a literal space).
  • Update the record in the database to indicate whether the legislation originated externally or not.
  • Include this data within the API.
  • Include this data within the UI, saying nothing if the bill was written internally, but providing a notice if it was written externally.
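A hedged sketch of the detection step; pdftohtml's -stdout flag and Recode's filter mode are believed correct but worth verifying locally, and the path handling is a placeholder:

<?php
function bill_prepared_by_dls($pdf_path)
{
    // Convert the archived PDF to text via pdftohtml and Recode.
    $text = shell_exec(
        'pdftohtml -stdout ' . escapeshellarg($pdf_path) . ' | recode html..utf8'
    );
    if ($text === null) {
        return null; // conversion failed; leave the database column NULL
    }

    // The marker's spaces are no-break spaces (U+00A0), which PCRE's \s does
    // not match by default, so allow both explicitly.
    $pattern = '/LEGISLATION(?:\s|\x{00A0})+NOT(?:\s|\x{00A0})+PREPARED(?:\s|\x{00A0})+BY(?:\s|\x{00A0})+DLS/u';

    // true = prepared by DLS (no marker found); false = third-party text.
    return preg_match($pattern, $text) === 0;
}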

Account creation fails infuriatingly for existing quasi-users

If somebody has posted comments before, and we have a quasi-account for them, they cannot create an account, because one exists. So they try to get their password reset, but they can't, because they don't have an account. Once upon a time, they could create an account, and the existing data was merged into the new account, but not anymore. Figure out what's going on and fix it.

Generate related bills with Sphinx

Right now, we're generating the list of related bills with a MySQL query. Rewrite this to use Sphinx, preferably within the API rather than within the UI. A sketch:
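A hedged sketch using the sphinxapi PHP client; the index name ("bills") and the idea of querying on the bill's catch line are assumptions:

<?php
require 'sphinxapi.php';

$catch_line = 'Absentee voting; eligibility.'; // example input

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);
$cl->SetMatchMode(SPH_MATCH_ANY); // match any of the catch line's words
$cl->SetLimits(0, 5);             // five related bills

$result = $cl->Query($cl->EscapeString($catch_line), 'bills');
$related_ids = ($result !== false && !empty($result['matches']))
    ? array_keys($result['matches'])
    : [];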

Move downloads to S3

CSV, JSON, HTML, etc.

  • Modify the downloads-generating cron job to copy files to S3 at the completion of the script.
  • Create a downloads page.

Start using memcached

APC was eliminated without a replacement. Re-add the same basic functionality (caching legislators, bills, and the home page) in memcached; a sketch follows the list.

  • Port per-page caching to memcached
  • Port per-user recommendation caching to memcached
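A sketch of per-page caching with the PHP Memcached extension; keying on the request URI is an assumption about how the APC version worked:

<?php
$mc = new Memcached();
$mc->addServer('localhost', 11211);

$key = 'page-' . md5($_SERVER['REQUEST_URI']);
$html = $mc->get($key);
if ($html === false) {
    // Cache miss: render the page and store the HTML for ten minutes.
    ob_start();
    // ...render the page as usual...
    echo '<html>rendered page</html>';
    $html = ob_get_clean();
    $mc->set($key, $html, 600);
}
echo $html;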

Create an open source parser

Right now Richmond Sunlight parses the CSV natively (and nastily). Create a standalone parser to turn that CSV into JSON, republish both the parser and the resulting JSON, and then use that JSON as the raw data for Richmond Sunlight. A minimal sketch:
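A minimal sketch of the standalone parser, reading a CSV keyed by its header row and emitting JSON; the filenames are placeholders:

<?php
$fh = fopen('bills.csv', 'r');
$header = fgetcsv($fh); // first row names the columns

$bills = [];
while (($row = fgetcsv($fh)) !== false) {
    $bills[] = array_combine($header, $row);
}
fclose($fh);

file_put_contents('bills.json', json_encode($bills, JSON_PRETTY_PRINT));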

Republish legislature's files

The legislature provides useful CSV files, locked behind a password-protected FTP interface. We're downloading them anyway—share them with folks via the downloads section.

Fix video clips

Video clips aren't playing—the link just goes to archive.org.
