For context, here's the design doc for rollout blocks ported from the internal repo:
Rollout Blocks
During an outage, changes to the system should be reduced as much as possible.
This reduces cognitive burden for the incident responders. Rollouts are
expected to be the riskiest and most frequent source of change;[1] therefore, the
deployment system should support administrative rollout blocks.
Considerations
Easy to globally apply
Rollout blocks are primarily a tool to reduce entropy during incidents. As
such, applying a global rollout block should be treated as a critical
operational task: it must be easy to do very quickly in a reliable way without
special access levels.
This requirement means that editing each Application object using kubectl in
a for loop is probably not what we want: that's prone to partial failure, and
requires elevated access rights to poke into all of the namespaces.
Easy to globally un-apply
Once the incident is contained, we want to enable teams to get moving again as
fast as we can: any delay in unblocking directly hurts innovation. This
means we need to be able to remove a global rollout block quickly and reliably.
Overridable
Rollout blocks should be overridable so that the incident responders can roll
out changes in order to fix or mitigate the problem. However, this override
should have a limited lifetime: we do not want objects to accidentally remain
in "override" mode long after the bypassed block has been lifted.
Possible to apply per-namespace
In addition to global blocks, there are a number of use-cases for
per-namespace blocks: for example, there's a known-bad commit on master, but
there has not been time to debug it. As such, it should be possible to block
Applications in a single namespace from rolling out.
Local blocks should not be cleared when a global block is removed.
Leave the system in a safe state
The block should bring the system to a pause at the nearest available safe
state. For example, if we're in the middle of adjusting load balancer
membership, we should finish that action before halting.
This property helps to ensure that the system is in a 'normal' state in the
state machine, not mid-transition: we believe this simplifies reasoning about
problems and ensures that recovery actions can use the standard tooling.
This likely means the block should: prevent new rollouts, and pause existing
rollouts once the current target waypoint is achieved.
Isolated RBAC scope from normal rollouts
The block mechanisms -- global and local -- should be accessible to most/all
developers: escalations are cheap. However, we do not want all developers to be
able to roll out all applications. Therefore we need to design blocks such that
the RBAC scope for adding or removing rollout blocks is distinct from the scope
for normal rollouts.
We would also prefer the blocks to be automatable without a high-power service
account: for example, it is not good if the block automation requires a service
account capable of rolling out any application.
This restriction does not apply to block overrides: you need to be able to edit
Applications or Releases in a given namespace in order to take advantage of
a block override. As a consequence, it is ok if overriding a rollout block
requires editing an Application or Release.
Global blocks can be made available to everyone via a fine-grained RBAC
permission.
Implementation
Introduce a new, namespaced object called RolloutBlock. The RolloutBlock
object represents a rollout block in a specific namespace. When the object is
deleted, the block is lifted.
The semantics of the block are the following:
- When a block is in place, edits to the .spec.template for any Application in
  the namespace will be rejected, and no new Release object will be created.
- When a block is in place, edits to the .spec for any Release in the
  namespace will be rejected: all Release objects will freeze on their current
  .spec.targetStep. This means that the system will converge on the current
  .spec.targetStep and then halt, which grants the 'stop in a safe state'
  property.
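A minimal sketch of the two rules above, assuming objects are plain dictionaries shaped like the YAML in this document. The function names guarded_fields and edit_is_blocked are hypothetical and are not part of Shipper; the point is only that an Application is frozen on .spec.template while a Release is frozen on its whole .spec.

```python
from typing import Any, Dict


def guarded_fields(kind: str, obj: Dict[str, Any]) -> Any:
    """Return the part of the object that a rollout block freezes."""
    spec = obj.get("spec", {})
    if kind == "Application":
        # Only the template is guarded: editing it would create a new Release.
        return spec.get("template")
    if kind == "Release":
        # The whole spec is guarded, so .spec.targetStep cannot move.
        return spec
    raise ValueError(f"unsupported kind: {kind}")


def edit_is_blocked(kind: str, old: Dict[str, Any], new: Dict[str, Any],
                    block_in_place: bool) -> bool:
    """True if the update touches a guarded field while a block is in place."""
    if not block_in_place:
        return False
    return guarded_fields(kind, old) != guarded_fields(kind, new)
```

Under these assumptions, changing an Application's revisionHistoryLimit during a block would still be accepted, while any change to a Release's targetStep would be rejected.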
For global blocks, the RolloutBlock object is created in a configured special
namespace, such as shipper-system or global-rollout-block. The name of this
special namespace will be passed to Shipper as configuration on startup.
RolloutBlock Object
The RolloutBlock object is a namespaced object with the following spec:
apiVersion: shipper.booking.com/v1
kind: RolloutBlock
metadata:
  name: dns-outage
  namespace: shipper-system
  creationTimestamp: 2018-05-11T14:23:13Z
spec:
  message: DNS issues, troubleshooting in progress
  author:
    type: user
    name: jdoe
This indicates that a rollout block was put in place by user 'jdoe' on May
11th, 2018, at 14:23 UTC.
Overrides
Rollout blocks may be overridden with an annotation applied to the
Application or Release object which needs to bypass the block. This annotation
should list each rollout block that it overrides with a fully-qualified name
(namespace + name). For example:
apiVersion: shipper.booking.com/v1
kind: Application
metadata:
  name: reviewsapi
  annotations:
    shipper.booking.com/block.override: shipper-system/dns-outage
spec:
  revisionHistoryLimit: 10
  template:
    # ... rest of template omitted here
The annotation may reference multiple blocks:
annotations:
  shipper.booking.com/block.override: shipper-system/dns-outage,frontend/demo-to-investors-in-progress
The value of the block override annotation is a comma-separated list (CSV) of
block names.
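The annotation value could be parsed along these lines. This is a sketch only: the function name parse_overrides is hypothetical, and tolerating surrounding whitespace and empty entries is an assumption rather than something the design specifies.

```python
from typing import List, Tuple

OVERRIDE_ANNOTATION = "shipper.booking.com/block.override"


def parse_overrides(value: str) -> List[Tuple[str, str]]:
    """Split a CSV override annotation into (namespace, name) pairs."""
    refs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            # Assumption: ignore empty entries such as a trailing comma.
            continue
        namespace, sep, name = item.partition("/")
        if not sep or not namespace or not name or "/" in name:
            # Every entry must have the form namespace/name.
            raise ValueError(f"not a fully-qualified block name: {item!r}")
        refs.append((namespace, name))
    return refs
```

For the two-block example above, this would yield one pair for shipper-system/dns-outage and one for frontend/demo-to-investors-in-progress, and it would raise an error for a bare name such as dns-outage.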
The override annotation MUST reference specific, fully-qualified RolloutBlock
objects by name. This ensures that overrides expire when that specific block
expires. An annotation that references a non-existent block should be rejected.
Additionally, it enables a controller to add a .status.overrides field to the
RolloutBlock object so that operators can understand which changes may still
be going out during the block.
This might look like this:
apiVersion: shipper.booking.com/v1
kind: RolloutBlock
metadata:
  name: dns-outage
  namespace: shipper-system
# ... spec omitted
status:
  # associated because 'shipper-system/dns-outage' is referenced in override annotation
  overrides:
    application:
    - frontend/ui
    release:
    - frontend/ui-24eaeeb-0
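One way such a status could be derived is sketched below, assuming Application and Release objects are available as plain dictionaries. The helper name collect_overrides is hypothetical, and sorting the entries is an assumption made only to keep the output stable.

```python
from typing import Any, Dict, List

OVERRIDE_ANNOTATION = "shipper.booking.com/block.override"


def collect_overrides(block_ref: str,
                      objects: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Build the .status.overrides content for the block named block_ref."""
    overrides: Dict[str, List[str]] = {"application": [], "release": []}
    for obj in objects:
        key = obj["kind"].lower()
        if key not in overrides:
            continue  # only Application and Release objects carry overrides
        meta = obj["metadata"]
        value = meta.get("annotations", {}).get(OVERRIDE_ANNOTATION, "")
        refs = [ref.strip() for ref in value.split(",")]
        if block_ref in refs:
            overrides[key].append(f"{meta['namespace']}/{meta['name']}")
    for names in overrides.values():
        names.sort()
    return overrides
```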
RolloutBlock Controller
The RolloutBlock controller should watch for the deletion of RolloutBlock
objects and remove any override annotations which reference the deleted block
object from Application or Release objects.
Additionally, it should populate the .status.overrides field by inspecting
Application and Release objects for an override annotation which references
that particular RolloutBlock.
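The cleanup step on block deletion might look like the following sketch. The function name strip_deleted_block is hypothetical, and removing the annotation entirely once its list is empty is an assumption; the design only says that references to the deleted block are removed.

```python
from typing import Dict

OVERRIDE_ANNOTATION = "shipper.booking.com/block.override"


def strip_deleted_block(annotations: Dict[str, str],
                        deleted_ref: str) -> Dict[str, str]:
    """Return a copy of annotations without any override of deleted_ref."""
    result = dict(annotations)
    value = result.get(OVERRIDE_ANNOTATION)
    if value is None:
        return result
    remaining = [ref.strip() for ref in value.split(",")
                 if ref.strip() and ref.strip() != deleted_ref]
    if remaining:
        result[OVERRIDE_ANNOTATION] = ",".join(remaining)
    else:
        # Assumption: an empty override list is dropped rather than kept.
        del result[OVERRIDE_ANNOTATION]
    return result
```

Because the override is tied to a named block, this is what gives overrides their limited lifetime: once dns-outage is deleted, no object stays in "override" mode for it.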
Application and Release conditions
Application and Release objects should have a .status.conditions entry
which lists all of the blocks which are currently in effect.
For example:
apiVersion: shipper.booking.com/v1
kind: Application
metadata:
  name: ui
  namespace: frontend
spec:
  # ... spec omitted
status:
  conditions:
  - type: Blocked
    status: "True"
    reason: RolloutsBlocked
    message: "rollouts blocked by: shipper-system/dns-outage,frontend/demo-to-investors-in-progress"
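A condition of this shape could be assembled as in the sketch below. The function name blocked_condition is hypothetical, and the "False" variant for the unblocked case is an assumption, since the design only shows the blocked case.

```python
from typing import Dict, List


def blocked_condition(active_blocks: List[str]) -> Dict[str, str]:
    """Build a Blocked condition from fully-qualified block names."""
    if not active_blocks:
        # Assumption: an unblocked object reports Blocked=False.
        return {"type": "Blocked", "status": "False"}
    return {
        "type": "Blocked",
        "status": "True",
        "reason": "RolloutsBlocked",
        "message": "rollouts blocked by: " + ",".join(active_blocks),
    }
```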
Admission controller
The Shipper admission controller is in charge of enforcing rollout blocks. It does the
following:
- For global blocks, it checks for the existence of one or more RolloutBlock
  objects in the designated 'global block' namespace. If any are present,
  rollouts are blocked. It should examine the Application or Release currently
  being admitted for an annotation which overrides a global RolloutBlock.
- For local blocks, it checks for the presence of one or more RolloutBlock
  objects in the namespace of the object being admitted.
If the admission controller determines that the Application or Release object
being admitted is blocked, it should reject any update to the .spec of either
object.
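The decision described above can be sketched as follows. The names effective_blocks and admit_spec_update are hypothetical, shipper-system stands in for whatever global namespace is configured, and the sketch assumes that an object is only unblocked when every applicable block is listed in its override annotation. It does not validate that overridden blocks actually exist.

```python
from typing import Any, Dict, List, Tuple

OVERRIDE_ANNOTATION = "shipper.booking.com/block.override"
GLOBAL_NAMESPACE = "shipper-system"  # assumed value of the configured namespace


def effective_blocks(obj: Dict[str, Any],
                     existing_blocks: List[Tuple[str, str]]) -> List[str]:
    """Return the blocks that apply to obj and are not overridden by it."""
    meta = obj["metadata"]
    value = meta.get("annotations", {}).get(OVERRIDE_ANNOTATION, "")
    overridden = {ref.strip() for ref in value.split(",") if ref.strip()}
    # A block applies if it is global or lives in the object's own namespace.
    applicable = [f"{ns}/{name}" for ns, name in existing_blocks
                  if ns in (GLOBAL_NAMESPACE, meta["namespace"])]
    return [ref for ref in applicable if ref not in overridden]


def admit_spec_update(obj: Dict[str, Any],
                      existing_blocks: List[Tuple[str, str]]) -> Tuple[bool, str]:
    """Return (allowed, reason) for an update to the object's .spec."""
    blocking = effective_blocks(obj, existing_blocks)
    if blocking:
        return False, "rollouts blocked by: " + ",".join(blocking)
    return True, ""
```

With a global dns-outage block and a local block in frontend, an Application in frontend that overrides only the global block would still be rejected because of the local one.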
Footnotes
[1] "These two data points seem to suggest that when Facebook employees are not actively making changes to infrastructure because they are busy with other things (weekends, holidays, or even performance reviews), the site experiences higher levels of reliability." -- https://queue.acm.org/detail.cfm?id=2839461