snap-it-up's People

Contributors

bensheldon, daguar, mr0grog

snap-it-up's Issues

Add a README.

Which hopefully gives some overview to what we’re doing here.

Track known/planned downtimes

Some sites have regular planned downtime. Even if we aren’t doing any useful analysis with it yet, it would be good to collect that data—it could be useful to people trying to use the services, it almost certainly has an interesting place in the narrative about service availability, and we could of course do interesting analysis with it later.

Maybe start by posting data here, but probably move it into a file in the repo. That way we can display it on pages or make it available in a machine readable format for others to work with.

(Inspired by #22)

Weekly QA Checklist

Stub

Purpose: ensure a durable level of value from the work we've completed to date, with realistic bounds on the amount of future work to be invested.

Determine how to calculate uptime

After merging #51, I did some more general sanity checks and noticed some oddities, which I talked with @pingometer support about. Our calculations are now somewhat similar to Pingometer’s, but not the same—and they never will be:

It turns out Pingometer calculates uptime based on each check of a monitor. That is, every check is factored into Pingometer’s “uptime” calculation, regardless of whether it passed the threshold needed to trigger an event. A monitor’s sensitivity setting determines how many consecutive failed checks result in an event. There is no sensitivity level that results in a single failed check leading to an incident: http://support.pingometer.com/knowledge_base/topics/what-is-the-sensitivity

We’re now calculating uptime based on events, which leaves us with slightly different results. That’s not a bad thing: whether uptime is calculated from the times we actually classify the service as down or from every individual unsuccessful check is pretty subjective. In some contexts or philosophies, what we’re now doing is more correct; in others, Pingometer’s approach is.

Either way, Pingometer gives us all the data we need to choose our approach. HOWEVER, because checks are frequent across their platform, Pingometer (quite reasonably) only stores individual check data for a few days. So if we want to change the way we calculate things, we can do it going forward, but can’t get historical data.

This probably means I should also be capturing checks in addition to events, but we should also figure out the appropriate approach to calculations here.
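
For the record, a minimal sketch of what the event-based calculation amounts to, assuming each recorded down event has a start timestamp and an end timestamp that may be missing if the site is still down (the field names here are placeholders, not our schema):

    # Event-based uptime over a window: 1 - (seconds down / seconds in window).
    # Field names (:down_at, :up_at) are hypothetical.
    def uptime_from_events(events, window_start, window_end)
      window_length = window_end - window_start
      downtime = events.inject(0) do |total, event|
        down_start = [event[:down_at], window_start].max
        down_end   = [event[:up_at] || window_end, window_end].min  # still down => clamp to window end
        total + [down_end - down_start, 0].max
      end
      1.0 - (downtime.to_f / window_length)
    end

A check-based calculation would instead be passed checks divided by total checks over the same window, which is presumably what Pingometer reports.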

Checklist for Published Write-up

BLUF:

It's time to ship! We are presenting our work in Boston on April 1st at Health Refactored, a health and technology conference, where Code for America is included on a panel about equity in health.

In parallel with that presentation, we want to publish our narrative write up of the work completed to date and the human impact of the problem we are investigating, as eloquently outlined by @bengolder in #6.

We have a good amount of work left to do, so let's use this thread as a way to coordinate that work across contributors over the next week.

The write up itself (as in the content), and the presentation, need to ship by March 25th. We want to ship the full experience by the presentation date, which is April 1st.

Cc'ing all the current and future contributors. Please edit, amend or improve the list below. We can also talk about trimming the scope based on our time and availability.
@lippytak @Mr0grog @bengolder @davidrleonard @bensheldon

Narrative

Graphics without data

Data Visualizations

  • Map of current downtime
  • Map of downtime over the last week
  • Implement better color scheme for the map per #45
  • Exploration of how to display screenshots of error/down pages. @Mr0grog
  • Animated Map of downtime over our best month of data @Mr0grog

Presentation

  • Presentation Outline
  • Adapt narrative and graphics for 20 minute presentation.
  • Discuss and confirm overarching calls to action.

Research

  • Historical uptime for comparables like Google, Banks, Credit Unions, the SEC, CNN, etc.

Front-end

  • Call to action - Send a letter customized for your area code or state. @davidrleonard
  • Dynamic headline for the article based on current or recent downtime. @Mr0grog
  • Subdomain on codeforamerica.org @davidrleonard
  • Twitter/FB card configuration
  • Typography

Technical write-up

Other explorations

  • Visualizing downtime over a year
  • Visualizing downtime over a day

Should be tracking checks, not just events

Pursuant to #55, we should probably be capturing a record of every check (which will get big fast). This will give us more flexibility to talk about what “down” means in the future.
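
If we do this, something along these lines would be enough to start capturing them (a sketch only; the column names are guesses, not a spec):

    # Hypothetical migration for storing every individual check alongside events.
    class CreateChecks < ActiveRecord::Migration
      def change
        create_table :checks do |t|
          t.references :monitor, null: false
          t.boolean    :success, null: false
          t.integer    :response_time_ms
          t.datetime   :checked_at, null: false
          t.timestamps null: false
        end
        add_index :checks, [:monitor_id, :checked_at]
      end
    end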

Map of February uptime from local data

@Mr0grog My understanding is that we are now visualizing our maps from local data per #51

For the presentation (which I need to turn in tomorrow AM at the latest), I want to include at least one map.

I could do a static shot of the current "Uptime over the past week" map, which has data good enough to include with generalized statements about uptime. However, I was thinking that a map for the month of February might make a more compelling point.

Would generating that map—with existing styles, labels, etc—have a low enough LOE to fit on your plate today?

Snapshots triggered by event hook should happen in a worker thread/process/dyno

Now that @bensheldon’s set up Sentry, I’m getting occasional notices about the event hook timing out. I remember seeing this occasionally in logs in the past, as well. I’m 99.9% certain this is caused by sites being slow to load when we screenshot them (not exactly surprising if the site reports as “down”).

In order to make sure we give the snapshot more time to complete and don’t cause errors, we should probably queue the snapshots and perform them in some sort of worker process or thread. @bensheldon suggests using Que.
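
Roughly what I have in mind with Que; the class and method names here are placeholders, not actual code:

    # Sketch of moving screenshots out of the request cycle with Que.
    class SnapshotJob < Que::Job
      def run(monitor_id)
        monitor = Monitor.find(monitor_id)
        Snapshotter.capture(monitor.url)  # stand-in for the current screenshot code
      end
    end

    # In the event hook, enqueue instead of screenshotting inline:
    SnapshotJob.enqueue(monitor.id)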

Convert to Rails

I think we've identified some tech (postgres, background jobs) and process benefits (me helping) to changing the architecture. This is just a brief summary of what I'm planning to do this weekend:

  1. Re-implement the Monitors, Incidents and MonitorEvents data model in Postgres (a rough model sketch follows this list)
  2. Re-implement the webhook process data catcher from Pingometer
  3. Screenshotting: not sure yet if I'll use an existing image persistence library or just reuse the current code
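
A rough sketch of what item 1 could look like; the associations (especially an Incident grouping MonitorEvents) are my assumptions about how these relate:

    class Monitor < ActiveRecord::Base
      has_many :monitor_events
      has_many :incidents
    end

    class MonitorEvent < ActiveRecord::Base
      belongs_to :monitor
    end

    class Incident < ActiveRecord::Base
      belongs_to :monitor
      has_many :monitor_events  # assumption: an incident groups consecutive failing events
    end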

I think that functionality is sufficient to deploy alongside the Sinatra app (snap-status-rails.herokuapp.com ... until it has full parity). Can we point Pingometer to a second webhook?

Once I have that up and ready to catch webhooks, I'll work on pulling in the existing front-end reports.

I haven't gone through the Rake tasks fully. Is there any functionality in those that is being actively used?

Please don't let my re-architecting block any feature work. I take full responsibility for backporting any work on the Sinatra app until the Rails piece has full parity. But I suggest we put any of the backend or code cleanup improvements on hold.

Map monitors to states by domain name?

Or by ID. Basically, we need to stop mapping by name: it was convenient, but it’s now broken for California, where all the monitor names recently changed and no longer follow the “state | name-of-site” format.
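
Something like this is what I mean by mapping on the hostname instead; the lookup table is hypothetical and assumes the monitor URLs include a scheme:

    require 'uri'

    # One entry per monitored site; dcfs.la.gov is the only real example here.
    HOST_TO_STATE = {
      'dcfs.la.gov' => 'Louisiana',
      # ...
    }

    def state_for(monitor_url)
      HOST_TO_STATE[URI.parse(monitor_url).host]
    end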

(Some?) Transaction monitors have nameless alerts

Just logging a small @pingometer bug that some/all transaction monitor alerts don't include the monitor name:
[screenshot: screen shot 2015-02-03 at 7 55 47 pm]

We're only using these alerts internally so it's not a big deal for us. Check Indiana (monitor ID 54d05a90be653d2e76ff9ce7) as an example.

Provide public access to underlying monitoring data?

As I noted in #13, it seems like this site/page/essay/whatever could/should also be a place for public access to the monitoring data. That could be links to files on S3 or all kinds of other things:

  • CSV/JSON dumps of checks/events/report data
  • RSS/ATOM feeds
  • JSON API of data

We could do all of this as a pass-through to Pingometer for “report” data (avg. uptime, response time per day) because @pingometer support tells me the full history of that data is included in their API, but for checks or events, we’d have to do live aggregating into our own database (totally do-able, but another thing).
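
As a strawman for the simplest of these, a JSON dump route in the current Sinatra app could look roughly like this, where recorded_events stands in for however we end up storing events/checks locally:

    require 'sinatra'
    require 'json'

    # Dump recorded events as machine-readable JSON for anyone to pull.
    get '/data/events.json' do
      content_type :json
      recorded_events.map { |event|
        { monitor: event[:monitor_id], status: event[:status], at: event[:at] }
      }.to_json
    end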

Need to declare a “good after” date for each monitor

It looks like when we set up monitoring for Louisiana, we grabbed the redirect URL (dcfs.la.gov:80/index.cfm?md=pagebuilder&tmp=home&pid=407) and began monitoring their error page, rather than the webservice itself.

In any display of the data, we should present Louisiana as having "no data", or something to that effect.
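
One way to handle this generally, sketched against the Rails branch (the good_after column is an assumption, not something that exists yet):

    class Monitor < ActiveRecord::Base
      has_many :monitor_events

      # Anything recorded before good_after (e.g. while Louisiana pointed at the
      # wrong URL) is excluded and should surface as "no data" in displays.
      def reliable_events
        good_after ? monitor_events.where('created_at >= ?', good_after) : monitor_events
      end
    end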

Media plan

Stub for now. Let's just put together a list of who would be interested and send some loving tweets/emails. Simple.

improve map color scheme

  • decide on appropriate breaks
  • adopt a better color scheme that:
    • still communicates a scale of stable <--> unstable
    • avoids conflation with common political maps
    • complements other visual elements of the write up
    • isn't garish

Sentry/Airbrake exception monitoring

Not sure what your favorite flavor of exception monitoring is (or what you're willing to pay for... I pay for Sentry and it's worth it) but you should add one before it goes into production.
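
If we go with Sentry, the setup is small; something like this, assuming the sentry-raven gem and a SENTRY_DSN config var on Heroku:

    # Gemfile
    gem 'sentry-raven'

    # e.g. config/initializers/sentry.rb (Rails) or near the top of the Sinatra app
    require 'raven'

    Raven.configure do |config|
      config.dsn = ENV['SENTRY_DSN']
    end

    # For the Sinatra app, also mount the Rack middleware:
    # use Raven::Rack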

upgrade to two dynos

Let's upgrade to two dynos on Heroku so users don't encounter the slight lag in loading we currently experience.

Data Caching?

It seems like every front page load requires an API fetch. That seems less than ideal. Could I suggest tossing results into memcached with a time-based expiration of whatever granularity the uptime monitor has?

I suggest memcached because most libs have breaker behavior which makes it pretty resilient and doesn't require setting up a db for local development.
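
A sketch of what that could look like with the Dalli client; the 5-minute TTL, the env var, and fetch_monitors_from_pingometer are placeholders:

    require 'dalli'

    CACHE = Dalli::Client.new(ENV['MEMCACHED_SERVERS'] || 'localhost:11211')

    def monitor_data
      # Serve from memcached when possible; otherwise fall through to the
      # Pingometer API and cache the result for 5 minutes.
      CACHE.fetch('pingometer-monitors', 300) do
        fetch_monitors_from_pingometer
      end
    end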

Erratic Ohio data

Hoping @pingometer can look into this:
[screenshot: screen shot 2015-02-09 at 8 11 14 am]

I've tried to confirm manually and as far as I can tell these are all false positives.

A couple (2?) data visualizations

@Mr0grog you've made a ton of progress on this so far... would you like to work with @bengolder to figure out exactly which 1-2 visualizations we should include in the article? Static or dynamic? Real time or summaries of prior data? All that jazz... You seem to have the best sense of what's feasible and the best skills to make it happen.

Show day/week aggregations

Per @alanjosephwilliams, it would be really useful to highlight, specifically, whether a site/state has been down in the past week. There are a few different data points that might be interesting here, and we can probably easily prototype them all (rough sketch after the lists below):

  • Uptime over a given period
  • Uptime met a certain threshold (say, 95%) over a given period
  • Down at all (i.e. uptime below 100%) over a given period (note this might be interesting but ultimately kind of pointless as, if the period is more than a day, we’d probably wind up showing almost all sites as failing)

And of course aggregating over differing periods:

  • Weekly
  • Daily
  • Hourly
  • Arbitrary?
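
A self-contained sketch of the bucketing (plain Ruby, no ActiveSupport; event field names are placeholders) that all three variants above could be prototyped on. The per-bucket uptime percentage and threshold checks would reuse whatever uptime calculation we settle on in #55.

    DAY = 24 * 60 * 60

    # Group down events into fixed-size buckets between range_start and range_end.
    def down_events_by_period(events, range_start, range_end, step_seconds = DAY)
      buckets = Hash.new { |hash, key| hash[key] = [] }
      events.each do |event|
        next if event[:down_at] < range_start || event[:down_at] >= range_end
        offset = ((event[:down_at] - range_start) / step_seconds).floor * step_seconds
        buckets[range_start + offset] << event
      end
      buckets
    end

    # e.g., with `events` being the down events pulled from wherever we store them:
    daily = down_events_by_period(events, Time.now - 7 * DAY, Time.now)
    down_at_all_this_week = !daily.empty?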

Chart downtime vs. time of day

For each site/state, it would be interesting to see whether they are regularly down at certain times, e.g. middle of the night, weekends, no pattern at all, etc.

  • Chart over 24 hours
  • Chart over a week
  • Chart over a month? Not super useful right now since we don’t have good data going back very far yet.
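
A minimal sketch of the 24-hour version, assuming each down event carries a timestamp (the field name is a placeholder); the weekly version would group on the timestamp's weekday the same way.

    # Count down events per hour of day to spot overnight/weekend patterns.
    def down_events_by_hour(events)
      counts = Hash.new(0)
      events.each { |event| counts[event[:down_at].hour] += 1 }
      (0..23).map { |hour| [hour, counts[hour]] }
    end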

(Inspired by #22)

Supporting data vs. supporting analysis/viz

This is branched off a discussion w/ @bensheldon about yak-shaving, immediate goals, what actually needs doing to craft an effective presentation and narrative here: #39

Quick recap on the near-term goals here:

  • Author a narrative that illustrates the impact of downtime through some real stories and a high-level overview of the monitoring data.
  • Author a technical write up for government and civic tech folks looking to implement monitoring on services they care about.
  • Present the story at a health & technology conference in Boston on April 1.

(Longer term goals, which might benefit more from more squeaky-clean code, left out for now.)

To do all that, we need to feel confident we are writing and building on reliable data. What is that data, and what do we need to support the narrative and visualizations?

  1. Data that we feel confident in. Can we pull it reliably from backing services like Pingometer? Do we feel confident that what we record in the event hook is accurate? (e.g. doesn’t randomly have holes because the process crashed or timed out or something.) etc.
  2. Snapshots of sites when down (also up). As many as possible, in all the varying ways they can be down. Seeing a sad, broken site is compelling. Seeing a wall of screenshots of sad, broken sites is especially so. (This issue has been kind of a crazy iceberg.)
  3. If possible, our reliability/availability shouldn’t be hindered by that of supporting services (e.g. Pingometer) or the sites we’re monitoring. This is basically #17.
  4. Easy ability to query the data for:
    • Screenshots
    • Times sites were down
    • Current status of sites (up/down, maybe load time?)
    • Down/uptime over various timeframes (today, past week, past month, etc) or as time series. Might not need this on the backend; could probably calculate it easily enough in JS given an easy-to-analyze blob of JSON data from the server. See also #3.
    • Basic info about the sites we are monitoring: name, state, host/url, etc. Would be super-cool to have more meta-info about who built and who hosts the sites, when they were made, what services they support (e.g. just info? just application? balance checking?), planned downtimes, and so on. Getting the actual data is more research (looking here at @bengolder, @lippytak, @alanjosephwilliams); what I mean here is making a place for that data, or making it easy to add, should the above people feel it’s useful in the narrative.
  5. API or daily data dumps or something in service of #15.
  6. Place to put the various write-ups and transform them into a web page. Maybe a folder full of markdown that gets read or compiled and presented on a page? Maybe it’s just a static page and the work happens on Google Docs or in the repo wiki or somewhere else. See #6, I guess.
  7. Place to write/track tech docs/description/methods stuff from #12. As above, maybe not technical at all. Maybe this is just the wiki for now.

That’s what I’ve got for now. It’s a little high-level and a little stream-of-consciousness and not very thought through. Anyone should feel free to add to the list, pare it down, clarify, and ask for more details.

Add better Arizona keyword

Right now AZ is getting away with a ton of undetected downtime! So sneaky...

UP looks like this:
[screenshot: az-up]

Down looks like this:
[screenshot: screen shot 2015-02-03 at 10 27 38 pm]

So we need to look for some content around that 'get started' button.
