codeforamerica / snap-it-up
Super-simple dashboard showing the status of SNAP-related web services.
Home Page: http://status.citizenonboard.com/
License: BSD 3-Clause "New" or "Revised" License
The "FSSA" in "FSSA Benefits Portal" has been removed from the page, so we are showing continuous downtime erroneously.
Ran across this site today, which will rate a domain’s SSL support. Might be interesting to run this for all the sites: https://www.ssllabs.com/ssltest/
Here’s Indiana: https://www.ssllabs.com/ssltest/analyze.html?d=ifcem.com
Which hopefully gives some overview to what we’re doing here.
Some sites have regular planned downtime. Even if we aren’t doing any useful analysis with it yet, it would be good to collect that data—it could be useful to people trying to use the services, it almost certainly has an interesting place in the narrative about service availability, and we could of course do interesting analysis with it later.
Maybe start by posting data here, but probably move it into a file in the repo. That way we can display it on pages or make it available in a machine readable format for others to work with.
(Inspired by #22)
Stub
Purpose: ensure a durable level of value from the work we've completed to date, with realistic bounds on the amount of future work to be invested.
Kinda like a mini methods section for the article. Hopefully we can include the full primary source data here for the target month. Assigning to @alanjosephwilliams because he brought it up but probably everyone will have to contribute.
After merging #51, I did some more general sanity checks and noticed some oddities, which I talked with @pingometer support about. Our calculations are now somewhat similar to Pingometer’s, but not the same—and they never will be:
It turns out Pingometer calculates uptime based on each check of a monitor. That is, every check is factored into Pingometer’s “uptime” calculation, regardless of whether it passed the threshold needed to trigger an event. A monitor’s sensitivity setting determines how many consecutive failed checks result in an event. There is no sensitivity level that results in a single failed check leading to an incident: http://support.pingometer.com/knowledge_base/topics/what-is-the-sensitivity
We’re now calculating uptime based on events, which leaves us with slightly different results. That’s not a bad thing: whether uptime should be calculated from the times we actually classify the service as down or from every individual unsuccessful check is pretty subjective. In some contexts or in some philosophies, what we’re now doing is more correct. In others, Pingometer’s approach is more right.
Either way, Pingometer gives us all the data we need to choose our approach. HOWEVER, because checks are frequent across their platform, Pingometer (quite reasonably) only stores individual check data for a few days. So if we want to change the way we calculate things, we can do it going forward, but can’t get historical data.
This probably means I should also be capturing checks in addition to events, but we should also figure out the appropriate approach to calculations here.
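To make the difference concrete, here is a minimal sketch of both calculations over the same list of checks. The `Check` struct, the method names, and the rule that an event starts at the check that trips the sensitivity threshold are all assumptions for illustration; they are not Pingometer's actual data model or algorithm.

```ruby
# Hypothetical check record: when it ran and whether it passed.
Check = Struct.new(:at, :ok)

# Check-based uptime: the fraction of individual checks that passed.
def uptime_by_checks(checks)
  return nil if checks.empty?
  checks.count(&:ok).to_f / checks.size
end

# Event-based uptime: a "down" event opens only after `sensitivity`
# consecutive failed checks (assumed rule) and lasts until the next
# passing check, or until `period_end` if the service never recovers.
def uptime_by_events(checks, sensitivity:, period_end:)
  return nil if checks.empty?
  sorted = checks.sort_by(&:at)
  total = period_end - sorted.first.at
  return nil unless total > 0

  down_seconds = 0.0
  failures = 0
  down_since = nil
  sorted.each do |check|
    if check.ok
      down_seconds += check.at - down_since if down_since
      down_since = nil
      failures = 0
    else
      failures += 1
      down_since ||= check.at if failures >= sensitivity
    end
  end
  down_seconds += period_end - down_since if down_since
  1.0 - down_seconds / total
end
```

With a sensitivity of 2, a single isolated failed check lowers the check-based figure but leaves the event-based figure untouched, which is the kind of gap described above.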
@bengolder let's start dropping notes here and build out an outline.
Maybe an awesome concrete outcome. Stub for now. I'll expand later!
(If we care at the moment). They all show down but I think they're all up.
Per #46, we'll need to identify the proper URL and keyword to effectively monitor Louisiana's SNAP webservice.
Synchronized downtime on two separate systems suggests Pingometer artifacts. Ideas @pingometer?
@pingometer just wanted to let you know that loading times (especially for the monitors page) are getting pretty rough.
It's time to ship! We are presenting our work in Boston on April 1st at Health Refactored, a health and technology conference, where Code for America is included on a panel about equity in health.
In parallel with that presentation, we want to publish our narrative write up of the work completed to date and the human impact of the problem we are investigating, as eloquently outlined by @bengolder in #6.
We have a good amount of work left to do, so let's use this thread as a way to coordinate that work across contributors over the next week.
The write up itself (as in the content), and the presentation, need to ship by March 25th. We want to ship the full experience by the presentation date, which is April 1st.
Cc'ing all the current and future contributors. Please edit, amend or improve the list below. We can also talk about trimming the scope based on our time and availability.
@lippytak @Mr0grog @bengolder @davidrleonard @bensheldon
How is the page even rendering?
Pursuant to #55, we should probably be capturing a record of every check (which will get big fast). This will give us more flexibility to talk about what “down” means in the future.
@Mr0grog My understanding is that we are now visualizing our maps from local data per #51
For the presentation (which I need to turn in tomorrow AM at the latest), I want to include at least one map.
I could do a static shot of the current "Uptime over the past week" map, which has data good enough to include with generalized statements about uptime. However, I was thinking that a map for the month of February might make a more compelling point.
Would generating that map—with existing styles, labels, etc—have a low enough LOE to fit on your plate today?
Now that @bensheldon’s set up Sentry, I’m getting occasional notices about the event hook timing out. I remember seeing this occasionally in logs in the past, as well. I’m 99.9% certain this is caused by sites being slow to load when we screenshot them (not exactly surprising if the site reports as “down”).
In order to make sure we give the snapshot more time to complete and don’t cause errors, we should probably queue the snapshots and perform them in some sort of worker process or thread. @bensheldon suggests using Que.
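As a rough illustration of the idea, the sketch below uses an in-process thread and Ruby's built-in `Queue` as a stand-in for a real job queue such as Que (which persists jobs in Postgres). The `SnapshotWorker` class and its methods are hypothetical names, not code from this repo.

```ruby
require 'thread'

# The webhook handler enqueues a job and returns immediately;
# a separate thread does the slow screenshot work.
class SnapshotWorker
  # The block receives (monitor_id, url) and takes the snapshot.
  def initialize(&take_snapshot)
    @jobs = Queue.new
    @take_snapshot = take_snapshot
    @thread = Thread.new { run }
  end

  # Called from the request handler; never blocks on the screenshot.
  def enqueue(monitor_id, url)
    @jobs << [monitor_id, url]
  end

  # Finish any queued jobs, then stop the worker thread.
  def shutdown
    @jobs << :stop
    @thread.join
  end

  private

  def run
    loop do
      job = @jobs.pop
      break if job == :stop
      begin
        @take_snapshot.call(*job)
      rescue StandardError => e
        # One slow or failing site should not kill the worker.
        warn "snapshot failed for #{job.first}: #{e.message}"
      end
    end
  end
end
```

Unlike a database-backed queue, jobs held in memory like this are lost if the process restarts, so this only shows the shape of the hand-off, not a production setup.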
From our friend, @monfresh:
I think we've identified some tech (postgres, background jobs) and process benefits (me helping) to changing the architecture. This is just a brief summary of what I'm planning to do this weekend:
I think that functionality is sufficient to deploy alongside the Sinatra app (snap-status-rails.herokuapp.com ... until it has full parity). Can we point Pingometer to a second webhook?
Once I have that up and ready to catch webhooks, I'll work on pulling in the existing front-end reports.
I haven't gone through the Rake tasks fully. Is there any functionality in those that is being actively used?
Please don't let my re-architecting block any feature work. I take full responsibility for backporting any work on the Sinatra app until the Rails piece has full parity. But I suggest we put any of the backend or code cleanup improvements on hold.
Or by ID. Basically, we need to not do it by name, since, even though it was convenient, it’s now broken for California, where all the names recently changed to not be in the format “state | name-of-site”.
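One possible shape for an ID-based lookup is sketched below. The table, the method names, and the monitor hash layout (`'id'`, `'name'`) are assumptions. The Indiana ID is the one quoted in the alert bug report; the California ID is inferred from a snapshot URL in another issue and may not be a monitor ID at all.

```ruby
# Hypothetical ID-to-state table, maintained by hand.
MONITOR_STATES = {
  '54bc8944be653d3f86065dc5' => 'CA', # inferred from a snapshot URL
  '54d05a90be653d2e76ff9ce7' => 'IN'  # quoted in the alert bug report
}.freeze

# Look up a monitor's state by ID, ignoring its display name.
def state_for(monitor)
  MONITOR_STATES[monitor['id']]
end

# Returns [monitors grouped by state, monitors with no known state].
def group_by_state(monitors)
  known, unknown = monitors.partition { |m| state_for(m) }
  [known.group_by { |m| state_for(m) }, unknown]
end
```

Returning the unmapped monitors separately makes a newly added or renamed monitor visible instead of silently dropping it.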
It will make for a more compelling essay.
Just logging a small @pingometer bug that some/all transaction monitor alerts don't include the monitor name:
We're only using these alerts internally so it's not a big deal for us. Check Indiana (monitor ID 54d05a90be653d2e76ff9ce7) as an example.
As I noted in #13, it seems like this site/page/essay/whatever could/should also be a place for public access to the monitoring data. That could be links to files on S3 or all kinds of other things:
We could do all of this as a pass-through to Pingometer for “report” data (avg. uptime, response time per day) because @pingometer support tells me the full history of that data is included in their API, but for checks or events, we’d have to do live aggregating into our own database (totally do-able, but another thing).
It looks like when we set up monitoring for Louisiana, we grabbed the redirect URL (dcfs.la.gov:80/index.cfm?md=pagebuilder&tmp=home&pid=407) and began monitoring their error page, rather than the webservice itself.
In any display of the data, we should present Louisiana as having "no data", or something to that effect.
The heroku app is currently named “snap-status.” Should it be changed to “snap-it-up” to match the new repo name?
Also, @lippytak, @alanjosephwilliams, @daguar, @bengolder drop your Heroku e-mails here or send them to me—[email protected]—so I can add you to the app. Or we can move the app to a different account. Or whatever.
Our monitor has been showing that Vermont has been down since 03/24/15 at 6:15 AM. It's a basic HTTPS monitor with no transaction.
For the past two days, we've manually verified that the same URL (with the port removed) is available and loading without significant latency. (https://mybenefits.ahs.state.vt.us/Login.aspx).
@pingometer could you help us investigate?
Stub for now. Let's just put together a list of who would be interested and send some loving tweets/emails. Simple.
The current method of hitting it is, obviously, horribly hacky. Should be easiest to wrap it up in a nice HTTParty class.
see details in old issue: codeforamerica/citizen-onboard#34
Not sure what your favorite flavor of exception monitoring is (or what you're willing to pay for... I spend on Sentry and it's worth it) but you should add one before it goes into production.
Let's upgrade to two dynos on heroku so users don't encounter the slight lag in loading we currently experience.
I want it and a user asked for it 👍
The code in app.rb and in the tasks right now is just absolutely completely nuts. So much duplicated crap. Need models. Need consolidated logic. (Need time.) Ooof.
It seems like every front page load requires an API fetch. That seems less than ideal. Could I suggest tossing results into memcached with a time-based expiration of whatever granularity the uptime monitor has?
I suggest memcached because most libs have breaker behavior which makes it pretty resilient and doesn't require setting up a db for local development.
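The caching pattern itself is small. The sketch below uses an in-memory hash as a stand-in for memcached so it runs without any server; `TimedCache`, its `fetch` method, and the injectable `clock` are hypothetical names chosen for illustration.

```ruby
# Time-based cache: a value is reused until its TTL has passed,
# then the block is run again to refresh it.
class TimedCache
  # `clock` is injectable so expiry can be tested without sleeping.
  def initialize(ttl_seconds, clock: -> { Time.now })
    @ttl = ttl_seconds
    @clock = clock
    @entries = {}
  end

  # Return the cached value for `key` if still fresh; otherwise
  # run the block (e.g. the API fetch), store the result, return it.
  def fetch(key)
    now = @clock.call
    entry = @entries[key]
    return entry[:value] if entry && now < entry[:expires_at]

    value = yield
    @entries[key] = { value: value, expires_at: now + @ttl }
    value
  end
end
```

A front page handler would wrap its API call in something like `cache.fetch(:monitors) { ... }`, with the TTL set to the monitor's check interval. This version is per-process and not thread-safe, which is part of why a shared store like memcached was suggested.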
Hoping @pingometer can look into this:
I've tried to confirm manually and as far as I can tell these are all false positives.
@Mr0grog you've made a ton of progress on this so far... would you like to work with @bengolder to figure out exactly what 1-2 visualizations we should include in the article? Static or dynamic? Real time or summaries of prior data? All that jazz... You seem to have the best sense of what's feasible and the best skillz to make it happen.
Per @alanjosephwilliams, it would be really useful to highlight, specifically, whether a site/state has been down in the past week. There are a few different data points that might be interesting here, and we can probably easily prototype them all:
And of course aggregating over differing periods:
I don’t even.
Don’t know when this change happened in their /monitors API. Totally breaks us. I assume it’s a bug on their end (can’t see why it would be intentional), but I guess we should be robust against it?
@pingometer Any chance you can shed some light on this?
CfA's Browserstack account no longer has API access (I guess it was trialed in?), so our screenshots are no longer working. Need to fix this.
/cc @migurski?
For each site/state, it would be interesting to see whether they are regularly down at certain times, e.g. middle of the night, weekends, no pattern at all, etc.
(Inspired by #22)
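A first pass at spotting time-of-day patterns could total up downtime per clock hour. This is only a sketch: the `Outage` struct and `downtime_by_hour` are hypothetical, and hours are in UTC, so a real analysis would need to convert to each state's local time first.

```ruby
# Hypothetical downtime event with start and end times.
Outage = Struct.new(:started_at, :ended_at)

# Total seconds of downtime falling in each UTC hour of the day.
# An outage spanning several hours is split at each hour boundary.
def downtime_by_hour(outages)
  totals = Hash.new(0.0)
  outages.each do |outage|
    cursor = outage.started_at.getutc
    finish = outage.ended_at.getutc
    while cursor < finish
      # Start of the next clock hour after `cursor`.
      next_hour = Time.at(cursor.to_i - cursor.to_i % 3600 + 3600).utc
      slice_end = [next_hour, finish].min
      totals[cursor.hour] += slice_end - cursor
      cursor = slice_end
    end
  end
  totals
end
```

The same splitting approach would work for weekdays by keying on `cursor.wday` instead of `cursor.hour`.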
Would de-duplicate pingometer connection/raise code.
It’s down in maintenance mode, but shows as up:
https://s3-us-west-2.amazonaws.com/snap-snapshots/CA-54bc8944be653d3f86065dc5-2015-02-17T03%3A38%3A20%2B00%3A00
https://kscapportalp.dcf.ks.gov/client/start.swe?
Then look at the source :/
This is branched off a discussion w/ @bensheldon about yak-shaving, immediate goals, what actually needs doing to craft an effective presentation and narrative here: #39
Quick recap on the near-term goals here:
(Longer term goals, which might benefit more from more squeaky-clean code, left out for now.)
To do all that, need to feel confident we are writing and building on reliable data. What’s the data/what do we need to support narrative and visualization?
That’s what I’ve got for now. It’s a little high-level and a little stream-of-consciousness and not very thought through. Anyone should feel free to add to the list, pare it down, clarify, and ask for more details.