
Cloud Autoscaling

This project highlights the use of StackStorm in a generic autoscaling pipeline. It is built on top of st2-workroom and can be deployed in a local Vagrant environment or on various cloud providers.

Getting Started

To try this out, you will need to make a few configuration changes. These are listed below:

  • Step 1: Configure Packs
    • All of the necessary configuration exists in the hieradata/common.yaml file. Take a peek in there, and fill in any API keys that are left blank.
  • Step 2: Setup SSH Keys
    • The SSH Key used in this example is pulled from the Key/Value store. You will need to set the ssh_public_key parameter. This SSH key should be the same key used by StackStorm to log into remote hosts.
    • Set via WebUI: Head to http://<hostname>:9101/webui. Navigate to the st2.kv.set action, and enter ssh_public_key for the key and your SSH key for the value.
    • Set via CLI: From the StackStorm server, run the command st2 key set ssh_public_key "<ssh_key>"
  • Step 3: Point New Relic to StackStorm
    • This can only be done via the New Relic web UI. Log into the website, navigate to Webhooks, and enter http://<hostname>:10001/st2/nrhook.
  • Step 4: Create some AutoScale Groups and Rules!
  • Step 5: Profit
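If you prefer to script Step 2 rather than use the WebUI or CLI, the key/value pair can also be set through StackStorm's REST API. A minimal sketch; the endpoint path and payload shape here are assumptions to verify against the API reference for your StackStorm version:

```python
# Sketch: set the ssh_public_key datastore entry over the StackStorm API.
# The /v1/keys endpoint path and payload shape are assumptions; check the
# API reference for your st2 version before relying on them.

def build_kv_request(host: str, key: str, value: str):
    """Build the URL and JSON payload for a datastore write."""
    url = f"http://{host}:9101/v1/keys/{key}"
    payload = {"name": key, "value": value}
    return url, payload

url, payload = build_kv_request(
    "stackstorm.example.com", "ssh_public_key", "ssh-rsa AAAA... user@host")
# An HTTP client (e.g. requests.put(url, json=payload)) would then send it.
```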

Trying it out

If you would like to try this demonstration, the following commands are available to you via ChatOps. You will need to invite your bot user to the room you plan on testing in. (By default, the room is #bot-testing.)

  • !asg create name=XXX domain=YYY - Creates a new autoscaling group.
  • !asg node add asg=XXX - Add a node to an ASG
  • !asg expand asg=XXX - Manually expand an ASG
  • !asg node delete name=ZZZ asg=XXX - Delete a node and its autoscaling group association
  • !asg deflate asg=XXX - Manually deflate an ASG
  • !asg delete name=XXX - Delete an ASG and all resources belonging to it

Overview

The goal of this project is to have a workspace that allows you to develop infrastructure in conjunction with StackStorm, or even work on the StackStorm product itself. This project also serves as a template for building out and deploying infrastructure using your favorite configuration management tool in conjunction with StackStorm.

Autoscaling Process

There are several reasons to leverage an autoscaling cloud. One of the more common use cases is adding capacity, whether due to a surge in demand or the failure of existing resources. This is where we set our sights: how could StackStorm help manage additional capacity when needed? We broke the problem down into its smallest components and explored potential solutions. In short, it broke down into a few phases...

  • Phase 0: Set up an autoscaling group
  • Phase 1: Respond to failing systems
  • Phase 2: Monitor the situation
  • Phase 3: Recover and stand down
  • Phase 4: Decommission an autoscaling group

In Phase 1, StackStorm would receive an event from a monitoring system, in this case New Relic. The monitoring system should tell us what application or infrastructure component is impacted, and how it is impacted (is it a warning alert, meaning the system has some time to respond before things go poorly, or are we already in a critical scenario where immediate action is needed?). From there, systems are provisioned in order to alleviate pressure. This phase may also include escalation policies to let folks know about the situation.
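The severity triage described above can be sketched as a small classifier. The payload field names below are hypothetical stand-ins, not the documented New Relic webhook schema:

```python
# Sketch: triage an incoming monitoring payload before scaling.
# Field names ("severity", "application") are hypothetical examples,
# not the documented New Relic webhook schema.

def classify_alert(payload: dict) -> str:
    """Map an alert payload to a response: act now, warn, or ignore."""
    severity = payload.get("severity", "unknown")
    if severity == "critical":
        return "scale_now"      # immediate action needed
    if severity == "warning":
        return "notify_admins"  # the system has time to respond
    return "ignore"

print(classify_alert({"application": "web", "severity": "critical"}))  # scale_now
```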

Phase 2 deals with attempting to quantify the recovery state of an application. A critical incident may still be underway, but at this point additional resources have been allocated to manage the load. During this phase, StackStorm needs to stay on top of things to make sure that, if another resource tipping point is reached, it is ready to provide additional relief as necessary. Likewise, StackStorm needs to be smart enough to know when an event has ceased, and when things can start cooling down.

Phase 3 is all about cleanup. The event is over, and now it's time to return to normal. StackStorm needs to have an understanding of what normal means, and how to safely get there with minimal or no disruption to users.

We started our exploration by detailing how we imagined the autoscaling workflow would be executed, and added creation and deletion actions on both ends of the process to ensure completeness. In the interest of brevity, many details have been omitted; those inclined to dig deeper into our thought process and how we put this together can take a look at https://gist.github.com/jfryman/2345a6c6b1abb312d8cb. The key takeaway is that we were able to discuss the logic of how we expected the workflow to run without ever discussing tooling, which in turn allowed us to better understand what data we might need from our tools while integrating with the different parts of the stack.

Architecture and Integrations

At an abstract level, the workflow is simple. But the devil is always in the details, and with autoscaling this is doubly so. We needed to break down all the individual components involved in creating a new system ready to process requests, and start building integrations for each of them. Considering the full lifecycle of a machine, we needed to:

  • Provision new VMs
  • Register a VM with DNS
  • Apply configuration profiles to new machines
  • Receive notifications that an application is misbehaving
  • Receive notifications that an application has recovered
  • Add nodes to the load balancer
  • Remove nodes from the load balancer
  • Remove a VM from DNS
  • Destroy a VM
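The scale-up half of the list above is essentially a sequential workflow, where each step assumes the previous ones succeeded. A minimal sketch with stubbed-out integrations (every function name here is illustrative, not an actual pack action):

```python
# Sketch: the sequential add-node lifecycle with stub integrations.
# Each function below stands in for a real action (VM provider, DNS,
# configuration management, load balancer); none are actual pack actions.

def provision_vm(name: str) -> dict:
    return {"name": name, "ip": "10.0.0.10"}  # pretend the provider answered

def register_dns(node: dict) -> None:
    node["dns"] = f"{node['name']}.example.com"

def apply_config(node: dict) -> None:
    node["configured"] = True  # e.g. a Chef run

def add_to_load_balancer(node: dict) -> None:
    node["in_lb"] = True

def scale_up(name: str) -> dict:
    """Run the full add-node lifecycle; each step assumes the previous ones."""
    node = provision_vm(name)
    register_dns(node)
    apply_config(node)
    add_to_load_balancer(node)
    return node

node = scale_up("web-003")  # node now has DNS, config, and LB membership
```

The reverse (deflate) path runs the same steps in the opposite order, which is why each integration needs both an add and a remove action.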

So, let's walk through how it all works.

[Architecture diagram]

To begin with, we have a set of actions responsible for Phase 0: setting up a new autoscale group. This process creates a new association within StackStorm and decides what flavor/size of cloud compute nodes will be set up. These values are all stored in StackStorm's internal datastore. See https://github.com/StackStorm/st2incubator/blob/master/packs/autoscale/actions/workflows/asg_create.yaml
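The Phase 0 idea of recording each group's parameters in the datastore can be sketched with a plain dict standing in for StackStorm's key/value store; the key layout and default values here are illustrative, not the showcase's actual scheme:

```python
# Sketch: record an autoscale group's parameters under namespaced keys.
# A dict stands in for StackStorm's datastore; the key layout and the
# default flavor/limits are illustrative, not the showcase's actual values.

datastore: dict = {}

def asg_create(name: str, domain: str, flavor: str = "2GB",
               min_nodes: int = 1, max_nodes: int = 5) -> None:
    """Store the group's scaling parameters for later workflows to read."""
    for param, value in [("domain", domain), ("flavor", flavor),
                         ("min", min_nodes), ("max", max_nodes),
                         ("alert_state", "inactive")]:
        datastore[f"asg.{name}.{param}"] = value

asg_create("web", "example.com")
```

Later workflows (add node, governor, delete) can then look up everything they need about a group by name alone.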

Then, we wait. At some point, our application will fail. In our case, we even developed a fun new application that allows us to simulate app and server errors. There are four New Relic events we keep an eye out for: whether an application or server has entered a critical state, and the corresponding recovery events. These events are sent to StackStorm via New Relic's webhook API, processed as triggers, and then matched to rules like this one: https://raw.githubusercontent.com/StackStorm/st2incubator/master/packs/autoscale/rules/newrelic_failure_alert.yaml.
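Conceptually, a rule like the one linked above compares fields of the incoming trigger payload against criteria before firing an action. A toy matcher; real StackStorm rule criteria support many operators, and only straight equality is modeled here:

```python
# Sketch: match a trigger payload against simple equality criteria,
# roughly how a rule decides whether to fire its action. Real StackStorm
# criteria support many operators; only equality is modeled here.

def rule_matches(criteria: dict, payload: dict) -> bool:
    """True when every criterion equals the corresponding payload field."""
    return all(payload.get(field) == expected
               for field, expected in criteria.items())

failure_rule = {"event_type": "alert", "severity": "critical"}

print(rule_matches(failure_rule, {"event_type": "alert", "severity": "critical"}))  # True
print(rule_matches(failure_rule, {"event_type": "recovery"}))                       # False
```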

Depending on the received event (alert or recovery), things go into action. In the event of an alert, StackStorm sets the alert state for the given application to 'Active'. This is used by the governor, which I'll touch upon in a moment. StackStorm then jumps into action, adding as many new nodes to our autoscale group as we specified at creation. This workflow is responsible for adding the additional nodes, making sure they have been provisioned with Chef, and adding them to DNS and the load balancer. Finally, as all of these events fire, we send ChatOps notifications to Slack to keep the admins informed about what is happening within StackStorm. This workflow is articulated at https://github.com/StackStorm/st2incubator/blob/master/packs/autoscale/actions/workflows/asg_add_node.yaml.

All the while, an internal sensor that we call a TimerSensor is running, polling every 30 seconds. Each interval, the governor looks at the alert status of every autoscale group to decide whether additional nodes need to be created and added. It does this by looking for any autoscale groups in an alert state, and attempts to add capacity if the right conditions are met. A deliberately blunt throttle is in place for this first pass: the governor evaluates the time since the last scale event and responds accordingly. The same logic happens in reverse, but at a much slower rate (a longer duration between deletions, and fewer machines destroyed at a time).
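The governor's throttle boils down to comparing the time since the last scale event against per-direction cooldowns, with scale-downs deliberately slower than scale-ups. A sketch; the cooldown values are made up, not the showcase's actual settings:

```python
import time

# Sketch: governor throttle. The cooldown values are illustrative, not the
# showcase's actual settings; scale-down is deliberately more conservative.
EXPAND_COOLDOWN = 60     # seconds between scale-up events
DEFLATE_COOLDOWN = 600   # much longer between scale-down events

def governor_decision(alert_active: bool, last_expand: float,
                      last_deflate: float, now: float = None) -> str:
    """Decide, once per 30s poll, whether the group may scale."""
    now = now if now is not None else time.time()
    if alert_active:
        if now - last_expand >= EXPAND_COOLDOWN:
            return "expand"
    elif now - last_deflate >= DEFLATE_COOLDOWN:
        return "deflate"
    return "wait"

print(governor_decision(True, last_expand=0, last_deflate=0, now=120))    # expand
print(governor_decision(True, last_expand=100, last_deflate=0, now=120))  # wait
print(governor_decision(False, last_expand=0, last_deflate=0, now=700))   # deflate
```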

Contributors

jfryman, manasdk, kami, tobijb
