codeforamerica / snap-it-up Goto Github PK

View Code? Open in Web Editor NEW

3.0 3.0 5.0 31.54 MB

Super-simple dashboard showing the status of SNAP-related web services.

Home Page: http://status.citizenonboard.com/

License: BSD 3-Clause "New" or "Revised" License

Ruby 75.96% CSS 2.88% HTML 21.16%

snap-it-up's People

Contributors

Stargazers

Watchers

Forkers

codeforasia isabella232

snap-it-up's Issues

Resolve issues with Indiana monitor

The "FSSA" in "FSSA Benefits Portal" has been removed from the page, so we are showing continuous
downtime erroneously.

Rate SSL support?

Ran across this site today, which will rate a domain’s SSL support. Might be interesting to run this for all the sites: https://www.ssllabs.com/ssltest/

Here’s Indiana: https://www.ssllabs.com/ssltest/analyze.html?d=ifcem.com

Add a README.

Which hopefully gives some overview to what we’re doing here.

Some sites have regular planned downtime. Even if we aren’t doing any useful analysis with it yet, it would be good to collect that data—it could be useful to people trying to use the services, it almost certainly has an interesting place in the narrative about service availability, and we could of course do interesting analysis with it later.

Maybe start by posting data here, but probably move it into a file in the repo. That way we can display it on pages or make it available in a machine readable format for others to work with.

(Inspired by #22)

Extract snapshot behavior into own class

Weekly QA Checklist

Stub

Purpose: ensure a durable level of value from the work we've completed to date, with realistic bounds on the amount of future work to be invested.

Tech docs/description in the readme or wiki

Kinda like a mini methods section for the article. Hopefully we can include the full primary source data here for the target month. Assigning to @alanjosephwilliams because he brought it up but probably everyone will have to contribute.

Determine how to calculate uptime

After merging #51, I did some more general sanity checks and noticed some oddities, which I talked with @pingometer support about. Our calculations are now somewhat similar to Pingometer’s, but not the same—and they never will be:

It turns out Pingometer calculates uptime based on each check of a monitor. That is, every check is factored into Pingometer’s “uptime” calculation, regardless of whether it passed the threshold needed to trigger an event. A monitor’s sensitivity setting determines how many consecutive failed checks result in an event. There is no sensitivity level that results in a single failed check leading to an incident: http://support.pingometer.com/knowledge_base/topics/what-is-the-sensitivity

We’re now calculating uptime based on events, which leaves us with slightly different results. That’s not a bad thing—whether uptime is calculated based on times we actually classify the service as down vs. every individual unsuccessful check is pretty subjective. In some contexts or in some philosophies, what we’re no doing is more correct. In others, Pingometer’s approach is more right.

Either way, Pingometer gives us all the data we need to choose our approach. HOWEVER, because checks are frequent across their platform, Pingometer (quite reasonably) only stores individual check data for a few days. So if we want to change the way we calculate things, we can do it going forward, but can’t get historical data.

This probably means I should also be capturing checks in addition to events, but we should also figure out the appropriate approach to calculations here.

Write an awesome article

@bengolder let's start dropping notes here and build out an outline.

Could we get gov to take over a status page?

Maybe an awesome concrete outcome. Stub for now. I'll expand later!

Check OR, IN, and VA monitors

(If we care at the moment). They all show down but I think they're all up.

Set up appropriate monitoring for Louisiana

Per #46, we'll need to identify the proper URL and keyword to effectively monitor Louisiana's SNAP webservice.

Pingometer shows downtime blips every night

Synchronized downtime on two separate systems suggests Pingometer artifacts. Ideas @pingometer?

Pingometer is slow

@pingometer just wanted to let you know that loading times (especially for the monitors page) are getting pretty rough.

Checklist for Published Write-up

BLUF:

It's time to ship! We are presenting our work in Boston on April 1st at Health Refactored, a health and technology conference, where Code for America is included on a panel about equity in health.

In parallel with that presentation, we want to publish our narrative write up of the work completed to date and the human impact of the problem we are investigating, as eloquently outlined by @bengolder in #6.

We have a good amount of work left to do, so let's use this thread as a way to coordinate that work across contributors over the next week.

The write up itself (as in the content), and the presentation, need to ship by March 25th. We want to ship the full experience by the presentation date, which is April 1st.

Cc'ing all the current and future contributors. Please edit, amend or improve the list below. We can also talk about trimming the scope based on our time and availability.
@lippytak @Mr0grog @bengolder @davidrleonard @bensheldon

Narrative

Graphics without data

Portrait of Liliana @alanjosephwillliams
Portrait of Xiao in fetal position @alanjosephwilliams
Handwritten/drawn excerpts of hilariously bad copywriting on error pages @alanjosephwilliams
Text Message exchanges @alanjosephwilliams

Data Visualizations

Map of current downtime
Map of downtime over the last week
Implement better color scheme for the map per #45
Exploration of how to display screenshots of error/down pages. @Mr0grog
Animated Map of downtime over our best month of data @Mr0grog

Presentation

Presentation Outline
Adapt narrative and graphics for 20 minute presentation.
Discuss and confirm overarching calls to action.

Research

Historical uptime for comparables like Google, Banks, Credit Unions, the SEC, CNN, etc.

Front-end

Call to action - Send a letter customized for your area code or state. @davidrleonard
Dynamic headline for the article based on current or recent downtime. @Mr0grog
Subdomain on codeforamerica.org @davidrleonard
Twitter/FB card configuration
Typography

Technical write-up

First draft....@alanjosephwilliams @Mr0grog @bensheldon @lippytak

Other explorations

Visualizing downtime over a year
Visualizing downtime over a day

Confirm keyword for Louisiana is correct

http://dcfs.la.gov/index.cfm?md=pagebuilder&tmp=home&pid=407

Figure out how to keyword check Indiana

www.ifcem.com

How is the page even rendering?

Should be tracking checks, not just events

Persuant to #55, we should probably be capturing a record of every check (which will get big fast). This will give us more flexibility to talk about what “down” means in the future.

Map of February uptime from local data

@Mr0grog My understanding is that we are now visualizing our maps from local data per #51

For the presentation (which I need to turn in tomorrow AM at the latest), I want to include at least one map.

I could do a static shot of the current "Uptime over the past week" map, which has data good enough to include with generalized statements about uptime. However, I was thinking that a map for the month of February might make a more compelling point.

Would generating that map—with existing styles, labels, etc—have a low enough LOE to fit on your plate today?

Snapshots triggered by event hook should happen in a worker thread/process/dyno

Now that @bensheldon’s set up Sentry, I’m getting occasional notices about the event hook timing out. I remember seeing this occasionally in logs in the past, as well. I’m 99.9% certain this is caused by sites being slow to load when we screenshot them (not exactly surprising if the site reports as “down”).

In order to make sure we give the snapshot more time to complete and don’t cause errors, we should probably que the snapshots and perform them in some sort of worker process or thread. @bensheldon suggests using Que.

Add link to incident message

From our friend, @monfresh:

Convert to Rails

I think we've identified some tech (postgres, background jobs) and process benefits (me helping) to changing the architecture. This is just a brief summary of what I'm planning to do this weekend:

Re-implement Monitors, Incidents and MonitorEvents data model in Postgres
Re-implement the webhook process data catcher from Pingometer
Screenshotting: not sure yet if I'll use an existing image persistence library or just reuse the current code

I think that functionality is sufficient to deploy alongside the Sinatra app (snap-status-rails.herokuapp.com ... until it has full partiy). Can we point Pingometer to a second webhook?

Once I have that up and ready to catch webhooks, I'll work on pulling in the existing front-end reports.

I haven't gone through the Rake tasks fully. Is there any functionality in those that is being actively used?

Please don't let my re-architecting block any feature work. I take full responsibility for backporting any work on the Sinatra app until the Rails piece has full parity. But I suggest we put any of the backend or code cleanup improvements on hold.

Map monitors to states by domain name?

Or by ID. Basically, we need to not do it by name, since, even though it was convenient, it’s now broken for California, where all the names recently changed to not be in the format “state | name-of-site”.

Get screenshots of lots of sites when they go down

It will make for a more compelling essay.

(Some?) Transaction monitors have nameless alerts

Just logging a small @pingometer bug that some/all transaction monitor alerts don't include the monitor name:

We're only using these alerts internally so it's not a big deal for us. Check Indiana (monitor ID 54d05a90be653d2e76ff9ce7) as an example.

Provide public access to underlying monitoring data?

As I noted in #13, it seems like this site/page/essay/whatever could/should also be a place for public access to the monitoring data. That could be links to files on S3 or all kinds of other things:

CSV/JSON dumps of checks/events/report data
RSS/ATOM feeds
JSON API of data

We could do all of this as a pass-through to Pingometer for “report” data (avg. uptime, response time per day) because @pingometer support tells me the full history of that data is included in their API, but for checks or events, we’d have to do live aggregating into our own database (totally do-able, but another thing).

Need to declare a “good after” date for each monitor

It looks like when we set up monitoring for Louisiana, we grabbed the redirect URL (dcfs.la.gov:80/index.cfm?md=pagebuilder&tmp=home&pid=407) and began monitoring their error page, rather than the webservice itself.

In any display of the data, we should present Louisiana as having "no data", or something to that effect.

Rename heroku app to “snap-it-up”?

The heroku app is currently named “snap-status.” Should it be changed to “snap-it-up” to match the new repo name?

Also, @lippytak, @alanjosephwilliams, @daguar, @bengolder drop your Heroku e-mails here or send them to me—[email protected]—so I can add you to the app. Or we can move the app to a different account. Or whatever.

Distinguish between monitors with "no data" and those that are "not down" for a given timeframe

Our monitor has been showing that Vermont has been down since 03/24/15 at 6:15 AM. It's a basic HTTPS monitor with no transaction.

For the past two days, we've manually verified that the same URL (with the port removed) is available and loading without significant latency. (https://mybenefits.ahs.state.vt.us/Login.aspx).

@pingometer could you help us investigate?

Media plan

Stub for now. Let's just put together a list of who would be interested and send some loving tweets/emails. Simple.

Wrap Pingometer API up into something nice

The current method of hitting it is, obviously, horribly hacky. Should be easiest to wrap it up in a nice HTTParty class.

Figure out how to keyword check New Jersey

see details in old issue: codeforamerica/citizen-onboard#34

improve map color scheme

decide on appropriate breaks
adopt a better color scheme that:
- still communicates a scale of stable <--> unstable
- avoids conflation with common political maps
- complements other visual elements of the write up
- isn't garish

Sentry/Airbrake exception monitoring

Not sure what your favorite flavor of exception monitoring is (or what you're willing to pay for... I spend on Sentey and it's worth it) but you should add one before it goes into production.

upgrade to two dynos

Let's upgrade to two dynos on heroku so users don't encounter the slight lag in loading we currently experience.

Get alert when service comes back up

I want it and a user asked for it 👍

Seriously really refactor the database insanity

The code in app.rb and in the tasks right now is just absolutely completely nuts. So much duplicated crap. Need models. Need consolidated logic. (Need time.) Ooof.

Data Caching?

It seems like every front page load requires an API fetch. That seems less than ideal. Could I suggest tossing results into mencached with an time-based expiration of whatever granularity the uptime monitor has?

I suggest memcached because most libs have breaker behavior which makes it pretty resilient and doesn't require setting up a db for local development.

Erratic Ohio data

Hoping @pingometer can look into this:

I've tried to confirm manually and as far as I can tell these are all false positives.

A couple (2?) data visualizations

@Mr0grog you've made a ton of progress on this so far...would you like to work with @bengolder to figure out exactly what 1-2 visualizations we should include in the article? Static or dynamic? Real time or summaries of prior data? All that jaz...You seem to have the best sense of what's feasible and the best skills̶z to make it happen.

Show day/week aggregations

Per @alanjosephwilliams, it would be really useful to highlight, specifically, whether a site/state has been down in the past week. There are a few different data points that might be interesting here, and we can probably easily prototype them all:

Uptime over a given period
Uptime met a certain threshold (say, 95%) over a given period
Down at all (i.e. uptime below 100%) over a given period (note this might be interesting but ultimately kind of pointless as, if the period is more than a day, we’d probably wind up showing almost all sites as failing)

And of course aggregating over differing periods:

Weekly
Daily
Hourly
Arbitray?

Pingometer API returning `last_event` as JSON string, not object

I don’t even.

Don’t know when this change happened in their /monitors API. Totally breaks us. I assume it’s a bug on their end (can’t see why it would be intentional), but I guess we should be robust against it?

@pingometer Any chance you can shed some light on this?

Snapshots broken on Browserstack

CfA's Browserstack account no longer has API access (I guess it was trialed in?), so our screenshots are no longer working. Need to fix this.

/cc @migurski?

Chart downtime vs. time of day

For each site/state, it would be interesting to see whether they are regularly down at certain times, e.g. middle of the night, weekends, no pattern at all, etc.

Chart over 24 hours
Chart over a week
Chart over a month? Not super useful right now since we don’t have good data going back very far yet.

(Inspired by #22)

Extract pingometer connection into before filter

Would de-duplicate pingometer connection/raise code.

Better keywording for MyBenifts Calwin

It’s down in maintenance mode, but shows as up:

https://s3-us-west-2.amazonaws.com/snap-snapshots/CA-54bc8944be653d3f86065dc5-2015-02-17T03%3A38%3A20%2B00%3A00

Figure out how to keyword check Kansas

https://kscapportalp.dcf.ks.gov/client/start.swe?

Then look at the source :/

Supporting data vs. supporting analysis/viz

This is branched off a discussion w/ @bensheldon about yak-shaving, immediate goals, what actually needs doing to craft an effective presentation and narrative here: #39

Quick recap on the near-term goals here:

Author a narrative that illustrates the impact of downtime through some real stories and a high-level overview monitoring data.
Author a technical write up for government and civic tech folks looking to implement monitoring on services they care about.
Present the story at a health & technology conference in Boston on April 1.

(Longer term goals, which might benefit more from more squeaky-clean code, left out for now.)

To do all that, need to feel confident we are writing and building on reliable data. What’s the data/what do we need to support narrative and visualization?

Data that we feel confident in. Can we pull it reliably from backing services like Pingometer? Do we feel confident that what we record in the event hook is accurate? (e.g. doesn’t randomly have holes because the process crashed or timed out or something.) etc.
Snapshots of sites when down (also up). As many as possible, in all the varying ways they can be down. Seeing a sad, broken site is compelling. Seeing a wall of screenshots of sad, broken sites is especially so. (This issue has been kind of a crazy iceberg.)
If possible, our reliability/availability shouldn’t be hindered by that of supporting services (e.g. Pingometer) or the sites we’re monitoring. This is basically #17.
Easy ability to query the data for:
- Screenshots
- Times sites were down
- Current status of sites (up/down, maybe load time?)
- Down/uptime over various timeframes (today, past week, past month, etc) or as time series. Might not need this to be backend; could probably calculate easily enough in JS given a easy-to-analyze blob of JSON data from the server. See also #3.
- Basic info about the sites we are monitoring: Name, state, host/url, etc. Would be super-cool to have more meta-info about who built and who hosts the sites, when they were made, what services they support (e.g. just info? just application? balance checking?), planned downtimes, and so on. Getting the actual data is more research (looking here at @bengolder, @lippytak, @alanjosephwilliams), but making a place for it or making it possible to add easily should the above people feel it’s useful in the narrative is more what I mean.
API or daily data dumps or something in service of #15.
Place to put the various write-ups and transform them into a web page. Maybe a folder full of markdown that gets read or compiled and presented on a page? Maybe its just a static page and the work happens on Google Docs or in the repo wiki or somewhere else. See #6, I guess.
Place to write/track tech docs/description/methods stuff from #12. As above, maybe not technical at all. Maybe this is just the wiki for now.

That’s what I’ve got for now. It’s a little high-level and a little stream-of-consciousness and not very thought through. Anyone should feel free add to the list, pare it down, clarify, and ask for more details.