(ported from the internal repo design docs)
This document describes part of our plan for helping users diagnose how their rollout is going. The plan has two components: a low-level, high-detail view in the status of the CapacityTarget object, and a high-level, low-detail view in the status of the Release object. This document covers the low-level, high-detail CapacityTarget view: we think it'll be easier to start with a domain where we don't need to invent summarization or prioritization schemes.
Reporting Progress
Previously, we introduced the concept of sad pods, which allowed the
user to see the pods that were not ready. There were a few problems
with this approach:
- It was hard to read: we dumped the whole status of the pod into the capacity target for every single pod that was not working.
- We capped the list at 5 pods to keep the objects small, which meant that distinct problems wouldn't necessarily be surfaced.
- The user wouldn't see the positive things (e.g. pods that are running), so it'd be hard to tell whether the release was progressing, or for tooling to show the status of the whole release across multiple clusters.
So, we decided to summarize the status of all the pods per cluster.
Criteria For Summarizing
Owner
The first level of the summary is the owner of the pod.
Several kinds of Kubernetes objects can cause pods to be created: DaemonSets, Deployments, Jobs, ReplicaSets, and StatefulSets. This means that further down the hierarchy we might have container names that clash. To prevent that, we use the owner of a pod as the top-level category of the summary report.
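For illustration, the owner grouping might look like the following Python sketch (the real controller is written in Go; the helper name and pod data here are hypothetical):

```python
from collections import defaultdict

def group_by_owner(pods):
    """Group pods under a top-level key derived from their owner,
    e.g. 'replicaset/reviewsapi-abc'. In the real implementation the
    owner comes from the pod's ownerReferences."""
    report = defaultdict(list)
    for pod in pods:
        owner = pod["owner"]
        key = "%s/%s" % (owner["kind"].lower(), owner["name"])
        report[key].append(pod)
    return report

pods = [
    {"name": "reviewsapi-abc-1", "owner": {"kind": "ReplicaSet", "name": "reviewsapi-abc"}},
    {"name": "reviewsapi-abc-2", "owner": {"kind": "ReplicaSet", "name": "reviewsapi-abc"}},
    {"name": "migrate-1", "owner": {"kind": "Job", "name": "migrate"}},
]
report = group_by_owner(pods)
# Two top-level entries: 'replicaset/reviewsapi-abc' and 'job/migrate'
```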
Pod Condition: Type, Reason and Status
Under each owner, there is a pod status breakdown. This breakdown is
grouped by the following fields, in order:
- Pod condition type (e.g. Ready)
- Pod condition reason (e.g. ContainersNotReady)
- Pod condition status (True, False, or Unknown)
Apart from categorizing pods by their conditions, we also sort the
results with the same criteria to keep the ordering consistent across
multiple updates.
To aid humans in deciding which problem to look into, we also maintain a count of the number of pods with each type + reason + status combination.
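The grouping, counting, and sorting above can be sketched like this in Python (the real controller is Go; the condition tuples here are hypothetical):

```python
from collections import Counter

def condition_breakdown(conditions):
    """Count pods per (type, reason, status) combination, then sort by
    that same key so the ordering is stable across status updates."""
    counts = Counter(conditions)
    return [
        {"type": t, "reason": r, "status": s, "count": n}
        for (t, r, s), n in sorted(counts.items())
    ]

conditions = [
    ("Ready", "", "True"),
    ("Ready", "ContainersNotReady", "False"),
    ("Ready", "", "True"),
]
rows = condition_breakdown(conditions)
# rows[0] -> {'type': 'Ready', 'reason': '', 'status': 'True', 'count': 2}
```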
Container Name
Within the containers field of each type + reason + status combination, there is another level of grouping, this time by container name. This means that we have a report per container name, and that report follows a structure very similar to the pod report.
Container State: Type and Reason
Just as the pod status breakdown works with conditions, the container state breakdown works with container states. The only difference is that container states are not as transparent as pod conditions, so we need to infer the type and reason with logic of our own.
Type
Each container state has three nullable fields, called Waiting, Running, and Terminated. We use these to derive the container state Type: the type is whichever of the three fields is not null.
Reason
Containers keep two states, not one. The field called State is the container's current state, and the one called LastTerminationState is the most recent terminated state, i.e. the state of the container before it was last restarted.
Reason is tricky mostly because it is not always informative. Based on
our experience so far, what users usually want to see is the Reason
of the current state, if the current state is Waiting.
Here are the steps we go through to come up with the Reason for a
container state:
- If the State (i.e. the container's current state) is Waiting, we use its Reason.
- Otherwise, the Reason is empty.
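Both derivations can be sketched in Python as follows, assuming a container state dict shaped like Kubernetes' ContainerState, with waiting/running/terminated as mutually exclusive nullable fields (the real controller is Go; function names are hypothetical):

```python
def state_type(state):
    # The type is whichever of the three nullable fields is set.
    for field in ("waiting", "running", "terminated"):
        if state.get(field) is not None:
            return field.capitalize()
    return ""

def state_reason(state):
    # We only surface a reason when the current state is Waiting.
    waiting = state.get("waiting")
    return waiting.get("reason", "") if waiting else ""

current = {"waiting": {"reason": "CrashLoopBackOff"}}
state_type(current)    # 'Waiting'
state_reason(current)  # 'CrashLoopBackOff'
```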
Constructing Examples
In each pod status breakdown, we have an example that contains a pod
name and a message. At best, this message helps the user know what is
wrong without having to switch to the target cluster. At worst, the
user can use the pod name to look through logs or events after
switching to the application cluster.
The example is picked from the list of pods that fall into that breakdown. To keep it consistent across updates, the pods are sorted and the first one is picked as the example.
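Picking the example deterministically is just a sort-and-take-first, as in this sketch (names hypothetical):

```python
def pick_example(pods):
    """Sort by pod name and take the first one, so the same pod stays
    the example across successive status updates."""
    return sorted(pods, key=lambda pod: pod["name"])[0]

pods = [{"name": "reviewsapi-b"}, {"name": "reviewsapi-a"}]
pick_example(pods)  # {'name': 'reviewsapi-a'}
```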
The example contains only two fields, the pod name and a message.
Pod Name
This is copied, verbatim, from the name of the example pod.
Message
We try to surface useful information to the user through the example's Message. Here is where the message comes from:
- If LastTerminationState.Terminated.Message is set, meaning that the user has written to the termination message path, we use it.
- If it's not set, we construct a message ourselves. The initial proposal is a string like "Terminated with exit code <exitcode>", or "Terminated with signal <signal>" if there is a signal instead of an exit code.
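The steps above can be sketched in Python, with field names mirroring the pod status API (the exact strings are the initial proposal, not final wording; the real controller is Go):

```python
def example_message(container_status):
    """Prefer the user-provided termination message; otherwise build one
    from the signal or exit code of the last terminated state."""
    terminated = container_status.get("lastState", {}).get("terminated")
    if terminated is None:
        return ""
    if terminated.get("message"):
        return terminated["message"]
    if terminated.get("signal") is not None:
        return "Terminated with signal %d" % terminated["signal"]
    return "Terminated with exit code %d" % terminated["exitCode"]

example_message({"lastState": {"terminated": {"exitCode": 1}}})
# 'Terminated with exit code 1'
```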
Example
To bring it all together, here is an example of what a capacity target would look like with a replica set maintaining 20 pods, each with 2 containers (app and envoy):
status:
  clusters:
  - name: us-west1
    report:
    - owner:
        name: replicaset/reviewsapi-$hash-0-$hash
      breakdown:
      - type: Ready
        status: "True"
        count: 12
        containers:
        - name: app
          states:
          - type: Running
            count: 12
            example:
              pod: reviewsapi-$hash-0-$hash-1234
        - name: envoy
          states:
          - type: Running
            count: 12
            example:
              pod: reviewsapi-$hash-0-$hash-1234
      - type: Ready
        status: "False"
        reason: ContainersNotReady
        count: 8
        containers:
        - name: app
          states:
          - type: Waiting
            reason: ImagePullBackOff
            count: 6
            example:
              pod: reviewsapi-$hash-0-$hash-4567
              message: "failed to pull reviewsapi:abcd123"
          - type: Waiting
            reason: ContainerCreating
            count: 1
            example:
              pod: reviewsapi-$hash-0-$hash-4567
          - type: Waiting
            reason: CrashLoopBackOff
            count: 1
            example:
              pod: reviewsapi-$hash-0-$hash-4567
              message: 'Terminated with exit code 1' # constructed by Shipper from `LastState.Terminated.ExitCode`
        - name: envoy
          states:
          - type: Waiting
            reason: CrashLoopBackOff
            count: 8
            example:
              pod: reviewsapi-$hash-0-$hash-4567
              message: 'cannot fetch service mesh config. argh!' # Read from terminationMessagePath
Caveats
Memory impact of pod informer for each cluster
This scheme is predicated on maintaining a pod informer for each cluster. For very large clusters with hundreds of thousands of pods, this may add up to a significant memory footprint. Taking an extreme case, consider a management cluster orchestrating 10 Kubernetes clusters, each with 5,000 nodes and 100 pods per node: that is 5,000,000 pods, or roughly 50GB of heap if each cached pod takes about 10KB of memory.
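The arithmetic behind that estimate, for the record (all figures are the hypothetical extreme case above, not measurements):

```python
clusters = 10
nodes_per_cluster = 5000
pods_per_node = 100
pod_bytes = 10 * 1024  # assume ~10KB per cached pod object

total_pods = clusters * nodes_per_cluster * pods_per_node
heap_gb = total_pods * pod_bytes / 1024**3
# total_pods -> 5,000,000; heap_gb -> ~48, i.e. roughly 50GB
```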
Update rate for informers subscribing to a very large number of pod changes
We're not sure how client-go will handle a very high churn subscription on big clusters.
CPU impact of doing crunchy summarization work
The summarization scheme we're proposing involves a lot of aggregation over the set of pods and their containers. This might add up to significant CPU load when summarizing multiple very large clusters.
API call throttling updating capacity target objects for a high-churn pod fleet
We're likely to run into the client-go throttling limits when attempting to keep a CapacityTarget object up-to-date with a very large pod fleet. In this case, it should be safe to drop updates and re-process at the next resync period, or retry after a certain amount of time. None of the state depends on catching each update.