kanatohodets commented on June 4, 2024

For context, here's the design doc for rollout blocks ported from the internal repo:

Rollout Blocks

During an outage, changes to the system should be reduced as much as possible.
This reduces cognitive burden for the incident responders. Rollouts are
expected to be the riskiest and most frequent source of change;[1] therefore, the
deployment system should support administrative rollout blocks.

Considerations

Easy to globally apply

Rollout blocks are primarily a tool to reduce entropy during incidents. As
such, applying a global rollout block should be treated as a critical
operational task: it must be easy to do very quickly in a reliable way without
special access levels.

This requirement means that editing each Application object using kubectl in
a for loop is probably not what we want: that's prone to partial failure, and
requires elevated access rights to poke into all of the namespaces.

Easy to globally un-apply

Once the incident is contained, we want to enable teams to get moving again as
fast as we can: any delay in unblocking directly hurts innovation. This
means we need to be able to remove a global rollout block quickly and reliably.

Overridable

Rollout blocks should be overridable so that the incident responders can roll
out changes in order to fix or mitigate the problem. However, this override
should have a limited lifetime: we do not want objects to accidentally remain
in "override" mode long after the bypassed block has been lifted.

Possible to apply per-namespace

In addition to global blocks, there are a number of use-cases for
per-namespace blocks: for example, when there's a known-bad commit on master
that nobody has had time to debug yet. As such, it should be possible to block
Applications in a single namespace from rolling out.

Local blocks should not be cleared when a global block is removed.

Leave the system in a safe state

The block should bring the system to a pause at the nearest available safe
state. For example, if we're in the middle of adjusting load balancer
membership, we should finish that action before halting.

This property helps to ensure that the system is in a 'normal' state in the
state machine, not mid-transition: we believe this simplifies reasoning about
problems and ensures that recovery actions can use the standard tooling.

This likely means the block should prevent new rollouts and pause existing
rollouts once they reach their current target waypoint.

Isolated RBAC scope from normal rollouts

The block mechanisms -- global and local -- should be accessible to most/all
developers: escalations are cheap. However, we do not want all developers to be
able to roll out all applications. Therefore we need to design blocks such that
the RBAC scope for adding or removing rollout blocks is distinct from the scope
for normal rollouts.

We would also prefer the blocks to be automatable without a high-power service
account: for example, it is not good if the block automation requires a service
account capable of rolling out any application.

This restriction does not apply to block overrides: you need to be able to edit
Applications or Releases in a given namespace in order to take advantage of
a block override. As a consequence, it is ok if overriding a rollout block
requires editing an Application or Release.

Global blocks can be made available to everyone via a fine-grained RBAC
permission.

Implementation

Introduce a new, namespaced object called RolloutBlock. The RolloutBlock
object represents a rollout block in a specific namespace. When the object is
deleted, the block is lifted.

The semantics of the block are the following:

  • When a block is in place, edits to the .spec.template for any Application in
    the namespace will be rejected, and no new release object created.

  • When a block is in place, edits to the .spec for any Release in the
    namespace will be rejected: all Release objects will freeze on their current
    .spec.targetStep. This means that the system will converge on the current
    .spec.targetStep and then halt, which grants the 'stop in a safe state'
    property.

For global blocks, the RolloutBlock object is created in a designated special
namespace, such as shipper-system or global-rollout-block. The name of this
namespace is passed to Shipper as configuration on startup.

RolloutBlock Object

The RolloutBlock object is a namespaced object with the following spec:

apiVersion: shipper.booking.com/v1
kind: RolloutBlock
metadata:
  name: dns-outage
  namespace: shipper-system
  creationTimestamp: 2018-05-11T14:23:13Z
spec:
  message: DNS issues, troubleshooting in progress
  author:
    type: user
    name: jdoe

This indicates that a rollout block was put in place by user 'jdoe' on May
11th, 2018, at 14:23 UTC.

Overrides

Rollout blocks may be overridden with an annotation applied to the
Application or Release object which needs to bypass the block. This annotation
should list each rollout block that it overrides with a fully-qualified name
(namespace + name). For example:

apiVersion: shipper.booking.com/v1
kind: Application
metadata:
  name: reviewsapi
  annotations:
    shipper.booking.com/block.override: shipper-system/dns-outage
spec:
  revisionHistoryLimit: 10
  template:
  # ... rest of template omitted here

The annotation may reference multiple blocks:

annotations:
  shipper.booking.com/block.override: shipper-system/dns-outage,frontend/demo-to-investors-in-progress

The value of the block override annotation is a comma-separated list.

The override annotation MUST reference specific, fully-qualified RolloutBlock
objects by name. This ensures that overrides expire when that specific block
expires. References to nonexistent blocks in this annotation should be
rejected.
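
As a rough illustration of the rules above, the following Python sketch parses
the annotation value and rejects references to blocks that do not exist. The
function names and the representation of existing blocks as a set of
(namespace, name) pairs are assumptions made for this example, not part of
Shipper. Skipping empty entries (such as a trailing comma) is also an
assumption; the design does not say how they should be treated.

```python
def parse_block_overrides(annotation_value):
    """Split a comma-separated override annotation into (namespace, name) pairs."""
    refs = []
    for raw in annotation_value.split(","):
        item = raw.strip()
        if not item:
            continue  # assumption: tolerate empty entries such as a trailing comma
        namespace, sep, name = item.partition("/")
        # A fully-qualified reference has exactly one "/" with text on both sides.
        if not sep or not namespace or not name or "/" in name:
            raise ValueError(f"override {item!r} is not of the form namespace/name")
        refs.append((namespace, name))
    return refs


def validate_block_overrides(annotation_value, existing_blocks):
    """Return the parsed references, rejecting any that name a nonexistent block.

    existing_blocks is a set of (namespace, name) pairs (hypothetical input).
    """
    refs = parse_block_overrides(annotation_value)
    missing = [f"{ns}/{name}" for ns, name in refs if (ns, name) not in existing_blocks]
    if missing:
        raise ValueError("override references nonexistent blocks: " + ", ".join(missing))
    return refs
```

With the two blocks from the examples above present, the value
shipper-system/dns-outage,frontend/demo-to-investors-in-progress parses into
two pairs, while a bare dns-outage (no namespace) raises an error.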

Additionally, referencing blocks by name enables a controller to add a
.status.overrides field to the RolloutBlock object so that operators can
understand which changes may still be going out during the block.

This might look like this:

apiVersion: shipper.booking.com/v1
kind: RolloutBlock
metadata:
  name: dns-outage
  namespace: shipper-system
# ... spec omitted
status:
  # associated because 'shipper-system/dns-outage' is referenced in override annotation
  overrides:
    application:
    - frontend/ui
    release:
    - frontend/ui-24eaeeb-0

RolloutBlock Controller

The RolloutBlock controller should watch for the deletion of RolloutBlock
objects and remove any override annotations which reference the deleted block
object from Application or Release objects.

Additionally, it should populate the .status.overrides of each RolloutBlock by
inspecting Application and Release objects for an override annotation which
references that particular RolloutBlock.
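
The two controller duties described above could be sketched in Python as
follows. This is only an illustration over plain dictionaries: the helper
names and the dictionary shape of the objects are assumptions for this
example, and a real controller would work against the Kubernetes API instead.

```python
OVERRIDE_ANNOTATION = "shipper.booking.com/block.override"


def remove_override(annotations, deleted_block):
    """Return a copy of annotations without deleted_block in the override list.

    deleted_block is a fully-qualified "namespace/name" string.
    """
    updated = dict(annotations)
    value = updated.get(OVERRIDE_ANNOTATION)
    if value is None:
        return updated
    remaining = [
        ref.strip()
        for ref in value.split(",")
        if ref.strip() and ref.strip() != deleted_block
    ]
    if remaining:
        updated[OVERRIDE_ANNOTATION] = ",".join(remaining)
    else:
        # Nothing left to override: drop the annotation entirely.
        del updated[OVERRIDE_ANNOTATION]
    return updated


def collect_overrides(block, objects):
    """Build the .status.overrides content for block from Applications and Releases."""
    overrides = {"application": [], "release": []}
    for obj in objects:
        meta = obj["metadata"]
        value = meta.get("annotations", {}).get(OVERRIDE_ANNOTATION, "")
        refs = {ref.strip() for ref in value.split(",")}
        kind = obj["kind"].lower()
        if block in refs and kind in overrides:
            overrides[kind].append(f'{meta["namespace"]}/{meta["name"]}')
    return overrides
```

Returning a fresh annotation dictionary rather than editing in place keeps the
sketch easy to test; whether a real controller patches or replaces the object
is left open here.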

Application and Release conditions

Application and Release objects should have a .status.conditions entry
which lists all of the blocks which are currently in effect.

For example:

apiVersion: shipper.booking.com/v1
kind: Application
metadata:
  name: ui
  namespace: frontend
spec:
  # ... spec omitted
status:
  conditions:
  - type: Blocked
    status: "True"
    reason: RolloutsBlocked
    message: "rollouts blocked by: shipper-system/dns-outage,frontend/demo-to-investors-in-progress"
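
A condition like the one above could be assembled as in the following Python
sketch. The function name is hypothetical, and emitting a condition with
status "False" when no blocks are active is an assumption: the design only
describes the blocked case.

```python
def blocked_condition(active_blocks):
    """Build a 'Blocked' condition from a list of "namespace/name" block references."""
    refs = list(active_blocks)
    if not refs:
        # Assumption: report an explicit False condition when nothing is blocking.
        return {"type": "Blocked", "status": "False"}
    return {
        "type": "Blocked",
        "status": "True",
        "reason": "RolloutsBlocked",
        # The message lists the blocks in the order they were given.
        "message": "rollouts blocked by: " + ",".join(refs),
    }
```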

Admission controller

The Shipper admission controller is in charge of enforcing rollout blocks. It does the
following:

  • For global blocks, it checks for the existence of one or more RolloutBlock
    objects in the designated 'global block' namespace. If any are present,
    rollouts are blocked. It should examine the Application or Release being
    admitted for an annotation which overrides a global RolloutBlock.

  • For local blocks, it checks for the presence of one or more RolloutBlock
    objects in the namespace of the object being admitted.

If the admission controller determines that the Application or Release object
being admitted is blocked, it should reject any update to the .spec of that
object.
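
The admission decision described above might look roughly like the Python
sketch below. It assumes the global namespace is shipper-system, that blocks
are given as (namespace, name) pairs, and that objects are plain dictionaries;
all function names are hypothetical. It also assumes overrides apply to local
blocks in the same way as to global ones, which the override section implies
but this section does not state.

```python
GLOBAL_NAMESPACE = "shipper-system"  # assumption: the configured global namespace
OVERRIDE_ANNOTATION = "shipper.booking.com/block.override"


def effective_blocks(obj, blocks):
    """Return the blocks that apply to obj and are not overridden by its annotation."""
    namespace = obj["metadata"]["namespace"]
    annotation = obj["metadata"].get("annotations", {}).get(OVERRIDE_ANNOTATION, "")
    overridden = {ref.strip() for ref in annotation.split(",") if ref.strip()}
    # A block applies if it is global or lives in the object's own namespace.
    applicable = [
        f"{ns}/{name}" for ns, name in blocks if ns in (GLOBAL_NAMESPACE, namespace)
    ]
    return [ref for ref in applicable if ref not in overridden]


def admit_update(old, new, blocks):
    """Return (allowed, reason) for an update from old to new."""
    if old.get("spec") == new.get("spec"):
        return True, ""  # only .spec changes are subject to blocks
    active = effective_blocks(new, blocks)
    if active:
        return False, "rollouts blocked by: " + ",".join(active)
    return True, ""
```

Because the override annotation lives in metadata rather than .spec, adding it
to a blocked object is itself admitted in this sketch, after which the .spec
edit can go through.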

Footnotes

  1. "These two data points seem to suggest that when Facebook employees are not actively making changes to infrastructure because they are busy with other things (weekends, holidays, or even performance reviews), the site experiences higher levels of reliability." -- https://queue.acm.org/detail.cfm?id=2839461

from shipper.
