descheduler's Introduction

descheduler

Descheduler for Kubernetes

Scheduling in Kubernetes is the process of binding pending pods to nodes, and is performed by a component of Kubernetes called kube-scheduler. The scheduler's decisions, whether or where a pod can or cannot be scheduled, are guided by its configurable policy, which comprises a set of rules called predicates and priorities. The scheduler's decisions are influenced by its view of the Kubernetes cluster at the point in time when a new pod appears for scheduling. Because Kubernetes clusters are very dynamic and their state changes over time, there may be a desire to move already-running pods to other nodes for various reasons:

  • Some nodes are under or over utilized.
  • The original scheduling decision no longer holds true, for example because taints or labels were added to or removed from nodes and pod/node affinity requirements are no longer satisfied.
  • Some nodes failed and their pods moved to other nodes.
  • New nodes are added to clusters.

Consequently, there might be several pods scheduled on less desirable nodes in a cluster. Descheduler, based on its policy, finds pods that can be moved and evicts them. Please note that, in the current implementation, the descheduler does not schedule replacements for evicted pods but relies on the default scheduler for that.

⚠️ Documentation Versions by Release

If you are using a published release of Descheduler (such as registry.k8s.io/descheduler/descheduler:v0.26.1), follow the documentation in that version's release branch, as listed below:

Descheduler Version Docs link
v0.29.x release-1.29
v0.28.x release-1.28
v0.27.x release-1.27
v0.26.x release-1.26
v0.25.x release-1.25
v0.24.x release-1.24

The master branch is considered in-development and the information presented in it may not work for previous versions.

Quick Start

The descheduler can be run as a Job, CronJob, or Deployment inside a Kubernetes cluster, which allows it to run repeatedly without user intervention. The descheduler pod is run as a critical pod in the kube-system namespace to avoid being evicted by itself or by the kubelet.

Run As A Job

kubectl create -f kubernetes/base/rbac.yaml
kubectl create -f kubernetes/base/configmap.yaml
kubectl create -f kubernetes/job/job.yaml

Run As A CronJob

kubectl create -f kubernetes/base/rbac.yaml
kubectl create -f kubernetes/base/configmap.yaml
kubectl create -f kubernetes/cronjob/cronjob.yaml

Run As A Deployment

kubectl create -f kubernetes/base/rbac.yaml
kubectl create -f kubernetes/base/configmap.yaml
kubectl create -f kubernetes/deployment/deployment.yaml

Install Using Helm

Starting with release v0.18.0 there is an official helm chart that can be used to install the descheduler. See the helm chart README for detailed instructions.

The descheduler helm chart is also listed on the artifact hub.
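
For reference, a typical Helm-based install looks like the following. This is a minimal sketch: the chart repository URL, release name, and namespace are assumptions based on the chart's documentation and may differ for your setup.

helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm repo update

# Install the chart into kube-system (the chart's default workload kind is assumed to be a CronJob).
helm install descheduler descheduler/descheduler --namespace kube-system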

Install Using Kustomize

You can use kustomize to install descheduler. See the Kustomize resources documentation for detailed instructions.

Run As A Job

kustomize build 'github.com/kubernetes-sigs/descheduler/kubernetes/job?ref=v0.26.1' | kubectl apply -f -

Run As A CronJob

kustomize build 'github.com/kubernetes-sigs/descheduler/kubernetes/cronjob?ref=v0.26.1' | kubectl apply -f -

Run As A Deployment

kustomize build 'github.com/kubernetes-sigs/descheduler/kubernetes/deployment?ref=v0.26.1' | kubectl apply -f -

User Guide

See the user guide in the /docs directory.

Policy, Default Evictor and Strategy plugins

⚠️ v1alpha1 configuration is still supported, but deprecated (and will soon be removed). Please consider migrating to v1alpha2 (described below). For the previous v1alpha1 documentation, go to docs/deprecated/v1alpha1.md ⚠️

The Descheduler Policy is configurable and includes default strategy plugins that can be enabled or disabled. It includes a common eviction configuration at the top level, as well as configuration from the Evictor plugin (Default Evictor, if not specified otherwise). Top-level configuration and Evictor plugin configuration are applied to all evictions.

Top Level configuration

These are top level keys in the Descheduler Policy that you can use to configure all evictions.

Name                           | Type   | Default Value | Description
nodeSelector                   | string | nil           | limiting the nodes which are processed. Only used when nodeFit=true and only by the PreEvictionFilter Extension Point
maxNoOfPodsToEvictPerNode      | int    | nil           | maximum number of pods evicted from each node (summed through all strategies)
maxNoOfPodsToEvictPerNamespace | int    | nil           | maximum number of pods evicted from each namespace (summed through all strategies)

Evictor Plugin configuration (Default Evictor)

The Default Evictor Plugin is used by default for filtering pods before processing them in a strategy plugin, or for applying a PreEvictionFilter to pods before eviction. You can also create your own Evictor Plugin or use the default one provided by Descheduler. Other uses for the Evictor plugin are to sort, filter, validate, or group pods by different criteria, which is why this is handled by a plugin rather than configured in the top-level config.

Name                    | Type                 | Default Value | Description
nodeSelector            | string               | nil           | limiting the nodes which are processed
evictLocalStoragePods   | bool                 | false         | allows eviction of pods with local storage
evictSystemCriticalPods | bool                 | false         | [Warning: will evict Kubernetes system pods] allows eviction of pods with any priority, including system pods like kube-dns
ignorePvcPods           | bool                 | false         | set whether PVC pods should be evicted or ignored
evictFailedBarePods     | bool                 | false         | allow eviction of pods without owner references that are in the Failed phase
labelSelector           | metav1.LabelSelector |               | (see label filtering)
priorityThreshold       | priorityThreshold    |               | (see priority filtering)
nodeFit                 | bool                 | false         | (see node fit filtering)
minReplicas             | uint                 | 0             | ignore eviction of pods where the owner (e.g. ReplicaSet) has fewer replicas than this threshold

Example policy

As part of the policy, you first decide which top-level configuration to use, then which Evictor plugin to use (your own if you have one, the Default Evictor otherwise), followed by the configuration passed to that Evictor plugin. By default, the Default Evictor is enabled for both the filter and preEvictionFilter extension points. After that, you enable or disable the eviction strategy plugins and configure them appropriately.

See each strategy plugin section for details on available parameters.

Policy:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
nodeSelector: "node=node1" # you don't need to set this, if not set all will be processed
maxNoOfPodsToEvictPerNode: 5000 # you don't need to set this, unlimited if not set
maxNoOfPodsToEvictPerNamespace: 5000 # you don't need to set this, unlimited if not set
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "DefaultEvictor"
      args:
        evictSystemCriticalPods: true
        evictFailedBarePods: true
        evictLocalStoragePods: true
        nodeFit: true
        minReplicas: 2
    plugins:
      # DefaultEvictor is enabled for both `filter` and `preEvictionFilter`
      # filter:
      #   enabled:
      #     - "DefaultEvictor"
      # preEvictionFilter:
      #   enabled:
      #     - "DefaultEvictor"
      deschedule:
        enabled:
          - ...
      balance:
        enabled:
          - ...
      [...]

The following diagram provides a visualization of most of the strategies to help categorize how strategies fit together.

Strategies diagram

The following sections provide an overview of the different strategy plugins available. These plugins are grouped based on their implementation of extension points: Deschedule or Balance.

Deschedule Plugins: These plugins process pods one by one, and evict them in a sequential manner.

Balance Plugins: These plugins process all pods, or groups of pods, and determine which pods to evict based on how the group was intended to be spread.

Name Extension Point Implemented Description
RemoveDuplicates Balance Spreads replicas
LowNodeUtilization Balance Spreads pods according to pods resource requests and node resources available
HighNodeUtilization Balance Spreads pods according to pods resource requests and node resources available
RemovePodsViolatingInterPodAntiAffinity Deschedule Evicts pods violating pod anti affinity
RemovePodsViolatingNodeAffinity Deschedule Evicts pods violating node affinity
RemovePodsViolatingNodeTaints Deschedule Evicts pods violating node taints
RemovePodsViolatingTopologySpreadConstraint Balance Evicts pods violating TopologySpreadConstraints
RemovePodsHavingTooManyRestarts Deschedule Evicts pods having too many restarts
PodLifeTime Deschedule Evicts pods that have exceeded a specified age limit
RemoveFailedPods Deschedule Evicts pods with certain failed reasons and exit codes

RemoveDuplicates

This strategy plugin makes sure that there is only one pod associated with a ReplicaSet (RS), ReplicationController (RC), StatefulSet, or Job running on the same node. If there are more, those duplicate pods are evicted for better spreading of pods in a cluster. This issue could happen if some nodes went down for whatever reason and their pods were moved to other nodes, leading to more than one pod associated with, for example, an RS or RC running on the same node. Once the failed nodes are ready again, this strategy can be enabled to evict those duplicate pods.

It provides one optional parameter, excludeOwnerKinds, which is a list of OwnerRef Kinds. If a pod has any of these Kinds listed as an OwnerRef, that pod will not be considered for eviction. Note that pods created by Deployments are considered for eviction by this strategy. The excludeOwnerKinds parameter should include ReplicaSet to have pods created by Deployments excluded.

Parameters:

Name Type
excludeOwnerKinds list(string)
namespaces (see namespace filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemoveDuplicates"
      args:
        excludeOwnerKinds:
          - "ReplicaSet"
    plugins:
      balance:
        enabled:
          - "RemoveDuplicates"

LowNodeUtilization

This strategy finds nodes that are under utilized and evicts pods, if possible, from other nodes in the hope that recreation of evicted pods will be scheduled on these underutilized nodes. The parameters of this strategy are configured under nodeResourceUtilizationThresholds.

Node underutilization is determined by a configurable threshold, thresholds. The threshold thresholds can be configured for cpu, memory, number of pods, and extended resources in terms of percentage (the percentage is calculated as the current resources requested on the node vs total allocatable. For pods, this means the number of pods on the node as a fraction of the pod capacity set for that node).

If a node's usage is below thresholds for all of cpu, memory, number of pods, and extended resources, the node is considered underutilized. Currently, pods' resource requests are used to compute node resource utilization.

There is another configurable threshold, targetThresholds, that is used to compute the potential nodes from which pods could be evicted. If a node's usage is above targetThresholds for any of cpu, memory, number of pods, or extended resources, the node is considered overutilized. Any node between thresholds and targetThresholds is considered appropriately utilized and is not considered for eviction. targetThresholds can also be configured for cpu, memory, and number of pods in terms of percentage.

These thresholds, thresholds and targetThresholds, can be tuned as per your cluster requirements. Note that this strategy evicts pods from overutilized nodes (those with usage above targetThresholds) to underutilized nodes (those with usage below thresholds); it will abort if the number of underutilized nodes or the number of overutilized nodes is zero.

Additionally, the strategy accepts a useDeviationThresholds parameter. If that parameter is set to true, the thresholds are considered as percentage deviations from mean resource usage. thresholds will be deducted from the mean among all nodes and targetThresholds will be added to the mean. A resource consumption above (resp. below) this window is considered as overutilization (resp. underutilization).

NOTE: Node resource consumption is determined by the requests and limits of pods, not actual usage. This approach is chosen in order to maintain consistency with the kube-scheduler, which follows the same design for scheduling pods onto nodes. This means that resource usage as reported by Kubelet (or commands like kubectl top) may differ from the calculated consumption, due to these components reporting actual usage metrics. Implementing metrics-based descheduling is currently TODO for the project.

Parameters:

Name Type
useDeviationThresholds bool
thresholds map(string:int)
targetThresholds map(string:int)
numberOfNodes int
evictableNamespaces (see namespace filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "LowNodeUtilization"
      args:
        thresholds:
          "cpu" : 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu" : 50
          "memory": 50
          "pods": 50
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"

Policy should pass the following validation checks:

  • Three basic native types of resources are supported: cpu, memory and pods. If any of these resource types is not specified, all its thresholds default to 100% to avoid nodes going from underutilized to overutilized.
  • Extended resources are supported. For example, resource type nvidia.com/gpu is specified for GPU node utilization. Extended resources are optional, and will not be used to compute node's usage if it's not specified in thresholds and targetThresholds explicitly.
  • Neither thresholds nor targetThresholds can be nil, and they must configure exactly the same types of resources.
  • The valid range of a resource's percentage value is [0, 100].
  • The percentage value in thresholds cannot be greater than the one in targetThresholds for the same resource.

There is another parameter associated with the LowNodeUtilization strategy, called numberOfNodes. This parameter can be configured to activate the strategy only when the number of underutilized nodes is above the configured value. This can be helpful in large clusters where a few nodes go underutilized frequently or for a short period of time. By default, numberOfNodes is set to zero.
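
As a rough sketch, useDeviationThresholds and numberOfNodes can be combined with the thresholds as follows; the specific values are illustrative, not recommendations.

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "LowNodeUtilization"
      args:
        useDeviationThresholds: true # thresholds become deviations from the mean node usage
        numberOfNodes: 3             # only act when at least 3 nodes are underutilized
        thresholds:
          "cpu" : 10
          "memory": 10
        targetThresholds:
          "cpu" : 10
          "memory": 10
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"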

HighNodeUtilization

This strategy finds nodes that are under utilized and evicts pods from the nodes in the hope that these pods will be scheduled compactly into fewer nodes. Used in conjunction with node auto-scaling, this strategy is intended to help trigger down scaling of under utilized nodes. This strategy must be used with the scheduler scoring strategy MostAllocated. The parameters of this strategy are configured under nodeResourceUtilizationThresholds.

Note: On GKE, it is not possible to customize the default scheduler config. Instead, you can use the optimize-utilization autoscaling strategy, which has the same effect as enabling the MostAllocated scheduler plugin. Alternatively, you can deploy a second custom scheduler and edit that scheduler's config yourself.
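
Where the scheduler configuration can be customized, the MostAllocated scoring strategy is set through the NodeResourcesFit plugin arguments of the kube-scheduler. A minimal sketch of such a configuration (the resource weights are illustrative):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated # score nodes higher the more of their resources are already requested
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1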

Node underutilization is determined by a configurable threshold, thresholds. The threshold thresholds can be configured for cpu, memory, number of pods, and extended resources in terms of percentage. The percentage is calculated as the current resources requested on the node vs total allocatable. For pods, this means the number of pods on the node as a fraction of the pod capacity set for that node.

If a node's usage is below thresholds for all of cpu, memory, number of pods, and extended resources, the node is considered underutilized. Currently, pods' resource requests are used to compute node resource utilization. Any node above thresholds is considered appropriately utilized and is not considered for eviction.

The thresholds param can be tuned as per your cluster requirements. Note that this strategy evicts pods from underutilized nodes (those with usage below thresholds) so that they can be recreated on appropriately utilized nodes. The strategy will abort if the number of underutilized nodes or the number of appropriately utilized nodes is zero.

NOTE: Node resource consumption is determined by the requests and limits of pods, not actual usage. This approach is chosen in order to maintain consistency with the kube-scheduler, which follows the same design for scheduling pods onto nodes. This means that resource usage as reported by Kubelet (or commands like kubectl top) may differ from the calculated consumption, due to these components reporting actual usage metrics. Implementing metrics-based descheduling is currently TODO for the project.

Parameters:

Name Type
thresholds map(string:int)
numberOfNodes int
evictableNamespaces (see namespace filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "HighNodeUtilization"
      args:
        thresholds:
          "cpu" : 20
          "memory": 20
          "pods": 20
        evictableNamespaces:
          exclude:
          - "kube-system"
          - "namespace1"
    plugins:
      balance:
        enabled:
          - "HighNodeUtilization"

Policy should pass the following validation checks:

  • Three basic native types of resources are supported: cpu, memory and pods. If any of these resource types is not specified, all its thresholds default to 100%.
  • Extended resources are supported. For example, resource type nvidia.com/gpu is specified for GPU node utilization. Extended resources are optional, and will not be used to compute node's usage if it's not specified in thresholds explicitly.
  • thresholds can not be nil.
  • The valid range of the resource's percentage value is [0, 100]

There is another parameter associated with the HighNodeUtilization strategy, called numberOfNodes. This parameter can be configured to activate the strategy only when the number of under utilized nodes is above the configured value. This could be helpful in large clusters where a few nodes could go under utilized frequently or for a short period of time. By default, numberOfNodes is set to zero.

RemovePodsViolatingInterPodAntiAffinity

This strategy makes sure that pods violating inter-pod anti-affinity are removed from nodes. For example, if there is podA on a node, and podB and podC (running on the same node) have anti-affinity rules which prohibit them from running on the same node as podA, then podA will be evicted from the node so that podB and podC can run. This issue could happen when the anti-affinity rules for podB and podC are created while they are already running on the node.

Parameters:

Name Type
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsViolatingInterPodAntiAffinity"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingInterPodAntiAffinity"

RemovePodsViolatingNodeAffinity

This strategy makes sure all pods violating node affinity are eventually removed from nodes. Node affinity rules allow a pod to specify requiredDuringSchedulingIgnoredDuringExecution and/or preferredDuringSchedulingIgnoredDuringExecution.

The requiredDuringSchedulingIgnoredDuringExecution type tells the scheduler to respect node affinity when scheduling the pod, but tells the kubelet to ignore it if the node changes over time and no longer satisfies the affinity. When enabled, the strategy serves as a temporary implementation of requiredDuringSchedulingRequiredDuringExecution and evicts pods on nodes that no longer satisfy their node affinity.

For example, podA is scheduled on nodeA, which satisfies the node affinity rule requiredDuringSchedulingIgnoredDuringExecution at the time of scheduling. Over time, nodeA stops satisfying the rule. When the strategy gets executed and there is another node available that satisfies the node affinity rule, podA gets evicted from nodeA.

The preferredDuringSchedulingIgnoredDuringExecution type tells the scheduler to respect node affinity when scheduling if that's possible. If not, the pod gets scheduled anyway. It may happen that, over time, the state of the cluster changes and now the pod can be scheduled on a node that actually fits its preferred node affinity. When enabled, the strategy serves as a temporary implementation of preferredDuringSchedulingPreferredDuringExecution, so the pod will be evicted if it can be scheduled on a "better" node.
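
For context, the node affinity this strategy evaluates is the standard affinity field of the pod spec. A minimal sketch of a pod requiring a node label that the node may later lose (the pod name, label key/value, and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: pod-a # placeholder name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype    # placeholder label key
            operator: In
            values:
            - ssd            # placeholder label value
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9 # placeholder image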

Parameters:

Name Type
nodeAffinityType list(string)
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsViolatingNodeAffinity"
      args:
        nodeAffinityType:
        - "requiredDuringSchedulingIgnoredDuringExecution"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingNodeAffinity"

RemovePodsViolatingNodeTaints

This strategy makes sure that pods violating NoSchedule taints on nodes are removed. For example, suppose pod podA tolerates the taint key=value:NoSchedule and is scheduled and running on the tainted node. If the node's taint is subsequently updated or removed, the taint is no longer matched by the pod's tolerations and podA will be evicted.

Node taints can be excluded from consideration by specifying a list of excludedTaints. If a node taint key or key=value matches an excludedTaints entry, the taint will be ignored.

For example, excludedTaints entry "dedicated" would match all taints with key "dedicated", regardless of value. excludedTaints entry "dedicated=special-user" would match taints with key "dedicated" and value "special-user".

If a list of includedTaints is provided, a taint will be considered if and only if it matches an included key or key=value from the list. Otherwise it will be ignored. Leaving includedTaints unset will include any taint by default.

Parameters:

Name Type
excludedTaints list(string)
includedTaints list(string)
includePreferNoSchedule bool
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

Setting excludedTaints

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsViolatingNodeTaints"
      args:
        excludedTaints:
        - dedicated=special-user # exclude taints with key "dedicated" and value "special-user"
        - reserved # exclude all taints with key "reserved"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingNodeTaints"

Setting includedTaints

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsViolatingNodeTaints"
      args:
        includedTaints:
        - decommissioned=end-of-life # include only taints with key "decommissioned" and value "end-of-life"
        - reserved # include all taints with key "reserved"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingNodeTaints"

RemovePodsViolatingTopologySpreadConstraint

This strategy makes sure that pods violating topology spread constraints are evicted from nodes. Specifically, it tries to evict the minimum number of pods required to balance topology domains to within each constraint's maxSkew. This strategy requires k8s version 1.18 at a minimum.

By default, this strategy only includes hard constraints; you can explicitly set constraints as shown below to include both hard and soft constraints:

constraints:
- DoNotSchedule
- ScheduleAnyway

The topologyBalanceNodeFit arg is used when balancing topology domains while the Default Evictor's nodeFit is used in pre-eviction to determine if a pod can be evicted.

topologyBalanceNodeFit: false

Strategy parameter labelSelector is not utilized when balancing topology domains and is only applied during eviction to determine if the pod can be evicted.

Supported Constraints fields:

Name Supported?
maxSkew Yes
minDomains No
topologyKey Yes
whenUnsatisfiable Yes
labelSelector Yes
matchLabelKeys Yes
nodeAffinityPolicy Yes
nodeTaintsPolicy Yes

Parameters:

Name Type
namespaces (see namespace filtering)
labelSelector (see label filtering)
constraints (see whenUnsatisfiable)
topologyBalanceNodeFit bool

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsViolatingTopologySpreadConstraint"
      args:
        constraints:
          - DoNotSchedule
    plugins:
      balance:
        enabled:
          - "RemovePodsViolatingTopologySpreadConstraint"

RemovePodsHavingTooManyRestarts

This strategy makes sure that pods having too many restarts are removed from nodes. For example, a pod whose EBS/PD volume cannot be attached to the instance should be rescheduled to another node. Its parameters include podRestartThreshold, which is the number of restarts (summed over all eligible containers) at which a pod should be evicted, and includingInitContainers, which determines whether init container restarts should be factored into that calculation.

You can also specify the states parameter to only evict pods in certain states.

If a value for states or podStatusPhases is not specified, pods in any state (even Running) are considered for eviction.

Parameters:

Name Type
podRestartThreshold int
includingInitContainers bool
namespaces (see namespace filtering)
labelSelector (see label filtering)
states list(string)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsHavingTooManyRestarts"
      args:
        podRestartThreshold: 100
        includingInitContainers: true
    plugins:
      deschedule:
        enabled:
          - "RemovePodsHavingTooManyRestarts"

PodLifeTime

This strategy evicts pods that are older than maxPodLifeTimeSeconds.

You can also specify the states parameter to only evict pods matching the following conditions:

  • Pod Phase status of: Running, Pending, Unknown
  • Pod Reason reasons of: NodeAffinity, NodeLost, Shutdown, UnexpectedAdmissionError
  • Container State Waiting condition of: PodInitializing, ContainerCreating, ImagePullBackOff, CrashLoopBackOff, CreateContainerConfigError, ErrImagePull, CreateContainerError, InvalidImageName

If a value for states or podStatusPhases is not specified, Pods in any state (even Running) are considered for eviction.

Parameters:

Name Type Notes
maxPodLifeTimeSeconds int
states list(string) Only supported in v0.25+
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
        states:
        - "Pending"
        - "PodInitializing"
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

RemoveFailedPods

This strategy evicts pods that are in the Failed status phase. You can provide optional parameters to filter by the failed pods' and containers' reasons and exitCodes. exitCodes apply only to failed pods' containers in the terminated state. reasons and exitCodes can be expanded to include those of init containers as well by setting the optional parameter includingInitContainers to true. You can specify an optional parameter minPodLifetimeSeconds to evict only pods that are older than the specified number of seconds. Lastly, you can specify the optional parameter excludeOwnerKinds; if a pod has any of these Kinds listed as an OwnerRef, that pod will not be considered for eviction.

Parameters:

Name Type
minPodLifetimeSeconds uint
excludeOwnerKinds list(string)
reasons list(string)
exitCodes list(int32)
includingInitContainers bool
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemoveFailedPods"
      args:
        reasons:
        - "NodeAffinity"
        exitCodes:
        - 1
        includingInitContainers: true
        excludeOwnerKinds:
        - "Job"
        minPodLifetimeSeconds: 3600
    plugins:
      deschedule:
        enabled:
          - "RemoveFailedPods"

Filter Pods

Namespace filtering

The following strategies accept a namespaces parameter which allows you to specify a list of included or excluded namespaces:

  • PodLifeTime
  • RemovePodsHavingTooManyRestarts
  • RemovePodsViolatingNodeTaints
  • RemovePodsViolatingNodeAffinity
  • RemovePodsViolatingInterPodAntiAffinity
  • RemoveDuplicates
  • RemovePodsViolatingTopologySpreadConstraint
  • RemoveFailedPods

The following strategies accept an evictableNamespaces parameter which allows you to specify a list of excluded namespaces:

  • LowNodeUtilization and HighNodeUtilization (Only filtered right before eviction)

In the following example, PodLifeTime is executed only over namespace1 and namespace2.

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
        namespaces:
          include:
          - "namespace1"
          - "namespace2"
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

The same holds for the exclude field. In the following example, the strategy is executed over all namespaces except namespace1 and namespace2.

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
        namespaces:
          exclude:
          - "namespace1"
          - "namespace2"
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

It is not allowed to combine the include and exclude fields.

Priority filtering

A priority threshold can be configured via the Default Evictor filter, and only pods under the threshold can be evicted. You can specify this threshold by setting the priorityThreshold.name parameter (setting the threshold to the value of the given priority class) or the priorityThreshold.value parameter (directly setting the threshold). By default, this threshold is set to the value of the system-cluster-critical priority class.

Note: Setting evictSystemCriticalPods to true disables priority filtering entirely.

E.g.

Setting priorityThreshold value

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "DefaultEvictor"
      args:
        priorityThreshold:
          value: 10000
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

Setting Priority Threshold Class Name (priorityThreshold.name)

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "DefaultEvictor"
      args:
        priorityThreshold:
          name: "priorityClassName1"
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

Note that you cannot configure both priorityThreshold.name and priorityThreshold.value. If the given priority class does not exist, the descheduler won't create it and will throw an error.

Label filtering

The following strategies can configure a standard kubernetes labelSelector to filter pods by their labels:

  • PodLifeTime
  • RemovePodsHavingTooManyRestarts
  • RemovePodsViolatingNodeTaints
  • RemovePodsViolatingNodeAffinity
  • RemovePodsViolatingInterPodAntiAffinity
  • RemovePodsViolatingTopologySpreadConstraint
  • RemoveFailedPods

This allows restricting strategies to only the pods the descheduler is interested in.

For example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
        labelSelector:
          matchLabels:
            component: redis
          matchExpressions:
            - {key: tier, operator: In, values: [cache]}
            - {key: environment, operator: NotIn, values: [dev]}
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

Node Fit filtering

NodeFit can be configured via the Default Evictor Filter. If set to true the descheduler will consider whether or not the pods that meet eviction criteria will fit on other nodes before evicting them. If a pod cannot be rescheduled to another node, it will not be evicted. Currently the following criteria are considered when setting nodeFit to true:

  • A nodeSelector on the pod
  • Any tolerations on the pod and any taints on the other nodes
  • nodeAffinity on the pod
  • Resource requests made by the pod and the resources available on other nodes
  • Whether any of the other nodes are marked as unschedulable
  • Any podAntiAffinity between the pod and the pods on the other nodes

E.g.

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "DefaultEvictor"
      args:
        nodeFit: true
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

Note that node fit filtering references the current pod spec, and not that of its owner. Thus, if the pod is owned by a ReplicationController (and that ReplicationController was modified recently), the pod may be running with an outdated spec, which the descheduler will reference when determining node fit. This is expected behavior as the descheduler is a "best-effort" mechanism.

Using Deployments instead of ReplicationControllers provides an automated rollout of pod spec changes, therefore ensuring that the descheduler has an up-to-date view of the cluster state.

Pod Evictions

When the descheduler decides to evict pods from a node, it employs the following general mechanism:

  • Critical pods (with priorityClassName set to system-cluster-critical or system-node-critical) are never evicted (unless evictSystemCriticalPods: true is set).
  • Pods (static or mirrored pods or standalone pods) not part of a ReplicationController, ReplicaSet (Deployment), StatefulSet, or Job are never evicted because these pods won't be recreated. (Standalone pods in the Failed status phase can be evicted by setting evictFailedBarePods: true.)
  • Pods associated with DaemonSets are never evicted (unless evictDaemonSetPods: true is set).
  • Pods with local storage are never evicted (unless evictLocalStoragePods: true is set).
  • Pods with PVCs are evicted (unless ignorePvcPods: true is set).
  • In LowNodeUtilization and RemovePodsViolatingInterPodAntiAffinity, pods are evicted by their priority from low to high, and if they have same priority, best effort pods are evicted before burstable and guaranteed pods.
  • All types of pods with the annotation descheduler.alpha.kubernetes.io/evict are eligible for eviction (see the example after this list). This annotation is used to override checks which prevent eviction, letting users select which pods are evicted. Users should know how and whether the pod will be recreated. The annotation only affects internal descheduler checks. The anti-disruption protection provided by the /eviction subresource is still respected.
  • Pods with a non-nil DeletionTimestamp are not evicted by default.
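
For illustration, the eviction override annotation mentioned above is set in the pod's metadata. A minimal sketch (the pod name and image are placeholders; it is the presence of the annotation that the descheduler checks):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod # placeholder name
  annotations:
    # Opt this pod in to descheduler eviction even if internal checks would normally skip it.
    descheduler.alpha.kubernetes.io/evict: "true"
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9 # placeholder image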

Setting --v=4 or greater on the Descheduler will log all reasons why any pod is not evictable.

Pod Disruption Budget (PDB)

Pods subject to a Pod Disruption Budget (PDB) are not evicted if descheduling would violate the PDB. Pods are evicted via the eviction subresource, which handles PDBs.

High Availability

In High Availability mode, the descheduler starts a leader election process in Kubernetes. You can activate HA mode if you choose to deploy the descheduler as a Deployment.

The Deployment starts with 1 replica by default. If you want to use more than 1 replica, you must enable High Availability mode, since multiple descheduler pods should not run their strategies simultaneously.

Configure HA Mode

The leader election process can be enabled by setting --leader-elect in the CLI. You can also set --set=leaderElection.enabled=true flag if you are using Helm.
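
For example, when installing with Helm, leader election can be enabled together with a higher replica count. A minimal sketch; the kind and replicas value names are assumptions about the chart's values and may differ in your chart version:

helm install descheduler descheduler/descheduler \
  --namespace kube-system \
  --set kind=Deployment \
  --set replicas=2 \
  --set leaderElection.enabled=true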

To get the best results from HA mode, some additional configuration may be required:

  • Configure a podAntiAffinity rule if you want to schedule onto a node only if that node is in the same zone as at least one already-running descheduler
  • Set the replica count greater than 1

Metrics

name type description
build_info gauge constant 1
pods_evicted CounterVec total number of pods evicted

The metrics are served through https://localhost:10258/metrics by default. The address and port can be changed by setting --binding-address and --secure-port flags.
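
For a quick check, the metrics endpoint can be reached by port-forwarding to the descheduler. A minimal sketch assuming a Deployment named descheduler in kube-system (names are placeholders, and the secure port may require authentication depending on your configuration):

kubectl -n kube-system port-forward deployment/descheduler 10258:10258
# In another terminal; -k skips verification of the self-signed serving certificate.
curl -k https://localhost:10258/metrics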

Compatibility Matrix

The below compatibility matrix shows the k8s client package (client-go, apimachinery, etc.) versions that descheduler is compiled with. At this time descheduler does not have a hard dependency on a specific k8s release. However, a particular descheduler release is only tested against the three latest k8s minor versions. For example, descheduler v0.18 should work with k8s v1.18, v1.17, and v1.16.

Starting with descheduler release v0.18 the minor version of descheduler matches the minor version of the k8s client packages that it is compiled with.

Descheduler Supported Kubernetes Version
v0.27 v1.27
v0.26 v1.26
v0.25 v1.25
v0.24 v1.24
v0.23 v1.23
v0.22 v1.22
v0.21 v1.21
v0.20 v1.20
v0.19 v1.19
v0.18 v1.18
v0.10 v1.17
v0.4-v0.9 v1.9+
v0.1-v0.3 v1.7-v1.8

Getting Involved and Contributing

Are you interested in contributing to descheduler? We, the maintainers and community, would love your suggestions, contributions, and help! Also, the maintainers can be contacted at any time to learn more about how to get involved.

To get started writing code see the contributor guide in the /docs directory.

In the interest of getting more new people involved, we tag issues with good first issue. These are typically issues that have smaller scope but are good ways to start getting acquainted with the codebase.

We also encourage ALL active community participants to act as if they are maintainers, even if you don't have "official" write permissions. This is a community effort, we are here to serve the Kubernetes community. If you have an active interest and you want to get involved, you have real power! Don't assume that the only people who can get things done around here are the "maintainers".

We also would love to add more "official" maintainers, so show us what you can do!

This repository uses the Kubernetes bots. See a full list of the bot commands in the Prow documentation.

Communicating With Contributors

You can reach the contributors of this project through the Kubernetes community channels. Learn how to engage with the Kubernetes community on the community page.

Roadmap

This roadmap is not in any particular order.

  • Consideration of pod affinity
  • Strategy to consider number of pending pods
  • Integration with cluster autoscaler
  • Integration with metrics providers for obtaining real load metrics
  • Consideration of Kubernetes's scheduler's predicates

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.

descheduler's People

Contributors

a7i, aveshagarwal, binacs, bytetwin, concaf, damemi, dentrax, dongjiang1989, farah, garrybest, harshanarayana, ingvagabund, invidian, janeliul, jelmersnoeck, jklaw90, k8s-ci-robot, kevinz857, knelasevero, lixiang233, pravarag, ravisantoshgudimetla, ryandevlin, seanmalloy, sharkannon, spike-liu, stephan2012, tammert, tioxy, xiaoanyunfei

descheduler's Issues

Serviceaccount descheduler-sa have no permission to evict pod

Go through the README and got

I1130 06:29:15.559480       1 duplicates.go:59] Error when evicting pod: "nginx-1-55kh7" (&errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:""}, Status:"Failure", Message:"pods \"nginx-1-55kh7\" is forbidden: User \"system:serviceaccount:kube-system:descheduler-sa\" cannot create pods/eviction in the namespace \"default\": User \"system:serviceaccount:kube-system:descheduler-sa\" cannot create pods/eviction in project \"default\"", Reason:"Forbidden", Details:(*v1.StatusDetails)(0xc4202d77a0), Code:403}})

Unable to specify only one or two of cpu, memory or pods for LowNodeUtilization

Hi,
I was just playing around with the project.

My policy file looks like -

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           "cpu" : 20
         targetThresholds:
           "cpu" : 50

and I run the descheduler like -

$ _output/bin/descheduler --kubeconfig-file /var/run/kubernetes/admin.kubeconfig --policy-config-file examples/policy.yaml  -v 5
I1123 17:14:37.581631   13825 reflector.go:198] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1123 17:14:37.581785   13825 node.go:50] node lister returned empty list, now fetch directly
I1123 17:14:37.582104   13825 reflector.go:236] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1123 17:14:38.069287   13825 lownodeutilization.go:104] no target resource threshold for pods is configured

The exit code is 0, but I'm not sure if the descheduler actually went ahead and processed the nodes, etc, because it might have stopped while seeing that there is no targetThreshold for pods.

If this is the case, does it make sense to make all three parameters (pods, memory, and cpu) mandatory for the descheduler to take decisions? Why can I not set the parameter to just cpu, or memory, or pods?

Does this make sense?

LowNodeUtilization not working in k8s 1.9 (GKE non-alpha cluster)

I have the following config but the descheduler does not remove any pods

      LowNodeUtilization:
         enabled: true
         params:
           nodeResourceUtilizationThresholds:
             thresholds:
               cpu: 30
               memory: 30
               pods: 30
             targetThresholds:
               cpu: 50
               memory: 50
               pods: 50
I0508 07:06:40.389008       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-2mj6" is over utilized with usage: api.ResourceThresholds{"memory":63.35539318178952, "pods":10, "cpu":29.98741346758968}
I0508 07:06:40.389076       1 lownodeutilization.go:149] allPods:11, nonRemovablePods:5, bePods:1, bPods:2, gPods:3
I0508 07:06:40.389225       1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-1qsh" is appropriately utilized with usage: api.ResourceThresholds{"cpu":31.623662680931403, "memory":38.651985111461904, "pods":10}
I0508 07:06:40.389265       1 lownodeutilization.go:149] allPods:11, nonRemovablePods:5, bePods:1, bPods:4, gPods:1
I0508 07:06:40.389353       1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-8fc1" is appropriately utilized with usage: api.ResourceThresholds{"pods":5.454545454545454, "cpu":33.38577721837634, "memory":46.87748520961707}
I0508 07:06:40.389375       1 lownodeutilization.go:149] allPods:6, nonRemovablePods:4, bePods:0, bPods:1, gPods:1
I0508 07:06:40.389508       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-0s07" is over utilized with usage: api.ResourceThresholds{"cpu":43.14033983637508, "memory":61.846487904599, "pods":8.181818181818182}
I0508 07:06:40.389535       1 lownodeutilization.go:149] allPods:9, nonRemovablePods:4, bePods:0, bPods:2, gPods:3
I0508 07:06:40.389712       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-nq13" is over utilized with usage: api.ResourceThresholds{"cpu":84.36123348017621, "memory":86.35326222824197, "pods":10}
I0508 07:06:40.389744       1 lownodeutilization.go:149] allPods:11, nonRemovablePods:5, bePods:0, bPods:2, gPods:4
I0508 07:06:40.389853       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-sb0v" is over utilized with usage: api.ResourceThresholds{"pods":7.2727272727272725, "cpu":40.308370044052865, "memory":65.15821416455076}
I0508 07:06:40.389879       1 lownodeutilization.go:149] allPods:8, nonRemovablePods:5, bePods:0, bPods:2, gPods:1
I0508 07:06:40.389965       1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-3290" is appropriately utilized with usage: api.ResourceThresholds{"cpu":27.72183763373191, "memory":45.02291850404409, "pods":6.363636363636363}
I0508 07:06:40.389988       1 lownodeutilization.go:149] allPods:7, nonRemovablePods:5, bePods:0, bPods:2, gPods:0
I0508 07:06:40.390079       1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-7w01" is appropriately utilized with usage: api.ResourceThresholds{"cpu":37.161736941472626, "memory":43.96316610085953, "pods":6.363636363636363}
I0508 07:06:40.390114       1 lownodeutilization.go:149] allPods:7, nonRemovablePods:5, bePods:0, bPods:0, gPods:2
I0508 07:06:40.390311       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-pcz2" is over utilized with usage: api.ResourceThresholds{"memory":59.747681387354575, "pods":12.727272727272727, "cpu":67.36941472624292}
I0508 07:06:40.390335       1 lownodeutilization.go:149] allPods:14, nonRemovablePods:6, bePods:1, bPods:5, gPods:2
I0508 07:06:40.390425       1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-rw0t" is appropriately utilized with usage: api.ResourceThresholds{"cpu":33.38577721837634, "memory":38.39946598414058, "pods":6.363636363636363}
I0508 07:06:40.390452       1 lownodeutilization.go:149] allPods:7, nonRemovablePods:4, bePods:0, bPods:2, gPods:1
I0508 07:06:40.390580       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-36ae422e-wnp4" is over utilized with usage: api.ResourceThresholds{"cpu":65.48143486469478, "memory":81.88036194839464, "pods":8.181818181818182}
I0508 07:06:40.390617       1 lownodeutilization.go:149] allPods:9, nonRemovablePods:4, bePods:0, bPods:3, gPods:2
I0508 07:06:40.390701       1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-150f" is appropriately utilized with usage: api.ResourceThresholds{"cpu":27.847702957835118, "memory":39.989094588917425, "pods":5.454545454545454}
I0508 07:06:40.390722       1 lownodeutilization.go:149] allPods:6, nonRemovablePods:4, bePods:0, bPods:1, gPods:1
I0508 07:06:40.390907       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-x9lg" is over utilized with usage: api.ResourceThresholds{"memory":76.15314534759058, "pods":12.727272727272727, "cpu":47.860289490245435}
I0508 07:06:40.390929       1 lownodeutilization.go:149] allPods:14, nonRemovablePods:6, bePods:0, bPods:5, gPods:3
I0508 07:06:40.391013       1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-mcvm" is appropriately utilized with usage: api.ResourceThresholds{"cpu":27.72183763373191, "memory":45.02291850404409, "pods":6.363636363636363}
I0508 07:06:40.391032       1 lownodeutilization.go:149] allPods:7, nonRemovablePods:5, bePods:0, bPods:2, gPods:0
I0508 07:06:40.391230       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-tpck" is over utilized with usage: api.ResourceThresholds{"cpu":79.32662051604783, "memory":86.72583143248654, "pods":15.454545454545455}
I0508 07:06:40.391254       1 lownodeutilization.go:149] allPods:17, nonRemovablePods:8, bePods:1, bPods:6, gPods:2
I0508 07:06:40.391429       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-bftv" is over utilized with usage: api.ResourceThresholds{"cpu":67.0547514159849, "memory":40.020142022604475, "pods":12.727272727272727}
I0508 07:06:40.391454       1 lownodeutilization.go:149] allPods:14, nonRemovablePods:5, bePods:0, bPods:5, gPods:4
I0508 07:06:40.391560       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-zdd4" is over utilized with usage: api.ResourceThresholds{"cpu":41.63624921334173, "memory":81.68473886035868, "pods":7.2727272727272725}
I0508 07:06:40.391579       1 lownodeutilization.go:149] allPods:8, nonRemovablePods:7, bePods:0, bPods:0, gPods:1
I0508 07:06:40.391630       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-ffq9" is over utilized with usage: api.ResourceThresholds{"pods":4.545454545454546, "cpu":28.351164254247955, "memory":58.79969974544339}
I0508 07:06:40.391659       1 lownodeutilization.go:149] allPods:5, nonRemovablePods:5, bePods:0, bPods:0, gPods:0
I0508 07:06:40.391865       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-0plb" is over utilized with usage: api.ResourceThresholds{"cpu":69.57205789804908, "memory":26.510368710913788, "pods":12.727272727272727}
I0508 07:06:40.391891       1 lownodeutilization.go:149] allPods:14, nonRemovablePods:5, bePods:1, bPods:2, gPods:6
I0508 07:06:40.391985       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-36ae422e-32s4" is over utilized with usage: api.ResourceThresholds{"cpu":45.972309628697296, "memory":55.872961663211015, "pods":7.2727272727272725}
I0508 07:06:40.392007       1 lownodeutilization.go:149] allPods:8, nonRemovablePods:4, bePods:0, bPods:3, gPods:1
I0508 07:06:40.392185       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-v103" is over utilized with usage: api.ResourceThresholds{"cpu":88.82945248584015, "memory":65.58252909160707, "pods":11.818181818181818}
I0508 07:06:40.392206       1 lownodeutilization.go:149] allPods:13, nonRemovablePods:7, bePods:0, bPods:3, gPods:3
I0508 07:06:40.392274       1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-290fc974-pwh6" is appropriately utilized with usage: api.ResourceThresholds{"cpu":22.183763373190686, "memory":36.015023076975325, "pods":5.454545454545454}
I0508 07:06:40.392297       1 lownodeutilization.go:149] allPods:6, nonRemovablePods:5, bePods:0, bPods:1, gPods:0
I0508 07:06:40.392450       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-fxpk" is over utilized with usage: api.ResourceThresholds{"cpu":47.23096286972939, "memory":56.65535699212463, "pods":10}
I0508 07:06:40.392479       1 lownodeutilization.go:149] allPods:11, nonRemovablePods:5, bePods:0, bPods:6, gPods:0
I0508 07:06:40.392632       1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-fsr8" is appropriately utilized with usage: api.ResourceThresholds{"pods":11.818181818181818, "cpu":35.08495909376967, "memory":38.59402990191275}
I0508 07:06:40.392652       1 lownodeutilization.go:149] allPods:13, nonRemovablePods:4, bePods:1, bPods:7, gPods:1
I0508 07:06:40.392727       1 lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-290fc974-rq6s" is over utilized with usage: api.ResourceThresholds{"cpu":34.64443045940843, "memory":62.24389505579321, "pods":5.454545454545454}
I0508 07:06:40.392753       1 lownodeutilization.go:149] allPods:6, nonRemovablePods:6, bePods:0, bPods:0, gPods:0
I0508 07:06:40.392759       1 lownodeutilization.go:65] Criteria for a node under utilization: CPU: 30, Mem: 30, Pods: 30
I0508 07:06:40.392782       1 lownodeutilization.go:69] No node is underutilized, nothing to do here, you might tune your thersholds further
kubectl top nodes

NAME                                                  CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%   
gke-asia-northeast1-std--default-pool-36ae422e-32s4   199m         1%        12555Mi         12%       
gke-asia-northeast1-std--default-pool-290fc974-pwh6   101m         0%        10892Mi         11%       
gke-asia-northeast1-std--default-pool-290fc974-2mj6   218m         1%        8947Mi          9%        
gke-asia-northeast1-std--default-pool-290fc974-bftv   372m         2%        17092Mi         17%       
gke-asia-northeast1-std--default-pool-290fc974-pcz2   279m         1%        44959Mi         46%       
gke-asia-northeast1-std--default-pool-36ae422e-mcvm   286m         1%        29233Mi         30%       
gke-asia-northeast1-std--default-pool-290fc974-0s07   120m         0%        9409Mi          9%        
gke-asia-northeast1-std--default-pool-290fc974-ffq9   164m         1%        13839Mi         14%       
gke-asia-northeast1-std--default-pool-290fc974-sb0v   404m         2%        11927Mi         12%       
gke-asia-northeast1-std--default-pool-290fc974-fxpk   211m         1%        30067Mi         31%       
gke-asia-northeast1-std--default-pool-290fc974-v103   1337m        8%        42334Mi         43%       
gke-asia-northeast1-std--default-pool-36ae422e-wnp4   291m         1%        19506Mi         20%       
gke-asia-northeast1-std--default-pool-36ae422e-fsr8   532m         3%        22507Mi         23%       
gke-asia-northeast1-std--default-pool-36ae422e-3290   235m         1%        33359Mi         34%       
gke-asia-northeast1-std--default-pool-290fc974-rq6s   78m          0%        34039Mi         35%       
gke-asia-northeast1-std--default-pool-36ae422e-8fc1   112m         0%        10349Mi         10%       
gke-asia-northeast1-std--default-pool-36ae422e-7w01   185m         1%        10906Mi         11%       
gke-asia-northeast1-std--default-pool-36ae422e-150f   162m         1%        11357Mi         11%       
gke-asia-northeast1-std--default-pool-290fc974-x9lg   333m         2%        13055Mi         13%       
gke-asia-northeast1-std--default-pool-290fc974-0plb   137m         0%        22509Mi         23%       
gke-asia-northeast1-std--default-pool-36ae422e-1qsh   269m         1%        20021Mi         20%       
gke-asia-northeast1-std--default-pool-290fc974-nq13   256m         1%        56451Mi         58%       
gke-asia-northeast1-std--default-pool-290fc974-tpck   435m         2%        40776Mi         42%       
gke-asia-northeast1-std--default-pool-290fc974-zdd4   88m          0%        27627Mi         28%       
gke-asia-northeast1-std--default-pool-36ae422e-rw0t   120m         0%        26677Mi         27%

Stark difference between what is reported in the logs vs reported by kubectl top

lownodeutilization.go:144] Node "gke-asia-northeast1-std--default-pool-36ae422e-32s4" is over utilized with usage: api.ResourceThresholds{"cpu":45.972309628697296, "memory":55.872961663211015, "pods":7.2727272727272725}

vs

gke-asia-northeast1-std--default-pool-36ae422e-32s4   199m         1%        12555Mi         12%       
Server Version: version.Info{Major:"1", Minor:"9+", GitVersion:"v1.9.4-gke.1", GitCommit:"10e47a740d0036a4964280bd663c8500da58e3aa", GitTreeState:"clean", BuildDate:"2018-03-13T18:00:36Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}

Pods do not get evicted while logs say "evicting pods from node"

So, if I understood correctly,

  • any node below the percentages in nodeResourceUtilizationThresholds.thresholds is considered underutilized
  • any node above the percentages in nodeResourceUtilizationThresholds.targetThresholds is considered overutilized
  • any node between the two ranges above is considered appropriately utilized by the descheduler and is not taken into consideration

If this is correct, the following happens -

I have 4 nodes, 1 master node and 3 worker nodes -

$ kubectl get nodes
NAME                           STATUS                     ROLES     AGE       VERSION
kubernetes-master              Ready,SchedulingDisabled   <none>    6h        v1.10.0-alpha.0.456+f85649c6cd2032-dirty
kubernetes-minion-group-1vp4   Ready                      <none>    6h        v1.10.0-alpha.0.456+f85649c6cd2032-dirty
kubernetes-minion-group-frgx   Ready                      <none>    6h        v1.10.0-alpha.0.456+f85649c6cd2032-dirty
kubernetes-minion-group-k7c7   Ready                      <none>    6h        v1.10.0-alpha.0.456+f85649c6cd2032-dirty

I tainted and then uncordoned node kubernetes-minion-group-1vp4, which means there are no pods or Kubernetes resources on that node -

$ kubectl get all -o wide | grep kubernetes-minion-group-1vp4
$

and the allocated resources on this node are -

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  200m (10%)    0 (0%)      200Mi (2%)       300Mi (4%)

while on the other 2 worker nodes the allocated resources are -

--
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  1896m (94%)   446m (22%)  1133952Ki (15%)  1441152Ki (19%)
--
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  1840m (92%)   300m (15%)  1130Mi (15%)     1540Mi (21%)

So with the right DeschedulerPolicy, pods should have been descheduled from the nodes that are over-utilized and scheduled on the fresh node.

I wrote the following DeschedulerPolicy -

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:  # any node below the following percentages is considered underutilized
           "cpu" : 40
           "memory": 40
           "pods": 40
         targetThresholds: # any node above the following percentages is considered overutilized
           "cpu" : 30
           "memory": 2
           "pods": 1

I run the descheduler as follows -

$ _output/bin/descheduler --kubeconfig-file /var/run/kubernetes/admin.kubeconfig --policy-config-file examples/policy.yaml  -v 5             
I1123 22:12:27.298937    9381 reflector.go:198] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1123 22:12:27.299080    9381 node.go:50] node lister returned empty list, now fetch directly
I1123 22:12:27.299230    9381 reflector.go:236] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1123 22:12:31.596854    9381 lownodeutilization.go:115] Node "kubernetes-master" usage: api.ResourceThresholds{"cpu":95, "memory":11.575631035804197, "pods":8.181818181818182}
I1123 22:12:31.597019    9381 lownodeutilization.go:115] Node "kubernetes-minion-group-1vp4" usage: api.ResourceThresholds{"memory":2.764226588836412, "pods":1.8181818181818181, "cpu":10}
I1123 22:12:31.597508    9381 lownodeutilization.go:115] Node "kubernetes-minion-group-frgx" usage: api.ResourceThresholds{"cpu":94.8, "memory":15.305177094063607, "pods":16.363636363636363}
I1123 22:12:31.597910    9381 lownodeutilization.go:115] Node "kubernetes-minion-group-k7c7" usage: api.ResourceThresholds{"cpu":92, "memory":15.617880226925726, "pods":14.545454545454545}
I1123 22:12:31.597955    9381 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-frgx" with usage: api.ResourceThresholds{"cpu":94.8, "memory":15.305177094063607, "pods":16.363636363636363}
I1123 22:12:31.597993    9381 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-k7c7" with usage: api.ResourceThresholds{"cpu":92, "memory":15.617880226925726, "pods":14.545454545454545}
I1123 22:12:31.598017    9381 lownodeutilization.go:163] evicting pods from node "kubernetes-master" with usage: api.ResourceThresholds{"cpu":95, "memory":11.575631035804197, "pods":8.181818181818182}
$

It seems like the descheduler made the decision to evict pods from the over-utilized nodes, but when I check the cluster, nothing on the old nodes was terminated and nothing popped up on the fresh node -

$ kubectl get all -o wide | grep kubernetes-minion-group-1vp4
$

What am I doing wrong? :(

Does max-pods-to-evict-per-node default to 0? And does the log level default to 0?

Hi

I found two things:

  1. When run with no -v option, the descheduler pods have no output, so does the log level default to 0?
    Command:
      /bin/descheduler
    Args:
      --policy-config-file=/policy-dir/policy.yaml
      --dry-run

And I checked the help info and found no such thing mentioned.

  -v, --v Level                          log level for V logs
  2. When run in non-dry-run mode with no --max-pods-to-evict-per-node option, no pod is evicted, so does the flag default to 0? There is also no declaration in the help info.
# oc logs -f descheduler-cronjob-1523411400-z2z8z 
I0411 01:50:37.761513       1 round_trippers.go:436] GET https://172.30.0.1:443/api 200 OK in 124 milliseconds
I0411 01:50:37.882412       1 round_trippers.go:436] GET https://172.30.0.1:443/apis 200 OK in 15 milliseconds
I0411 01:50:37.898564       1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1 200 OK in 15 milliseconds
I0411 01:50:37.903192       1 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0411 01:50:37.903215       1 reflector.go:240] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0411 01:50:37.919025       1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0 200 OK in 15 milliseconds
I0411 01:50:37.963552       1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/nodes?resourceVersion=8694&timeoutSeconds=481&watch=true 200 OK in 25 milliseconds
I0411 01:50:38.011943       1 duplicates.go:50] Processing node: "ip-172-18-7-158.ec2.internal"
I0411 01:50:38.028600       1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-7-158.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 16 milliseconds
I0411 01:50:38.433484       1 duplicates.go:54] "ReplicationController/hello-1"
I0411 01:50:38.433510       1 duplicates.go:50] Processing node: "ip-172-18-14-173.ec2.internal"
I0411 01:50:38.461836       1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-14-173.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 28 milliseconds
I0411 01:50:38.479347       1 duplicates.go:54] "ReplicationController/hello-1"
I0411 01:50:38.495887       1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-7-158.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 16 milliseconds
I0411 01:50:38.568027       1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-14-173.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 15 milliseconds
I0411 01:50:38.569526       1 lownodeutilization.go:141] Node "ip-172-18-7-158.ec2.internal" is under utilized with usage: api.ResourceThresholds{"cpu":30, "memory":14.27776271919991, "pods":5.6}
I0411 01:50:38.569571       1 lownodeutilization.go:149] allPods:14, nonRemovablePods:9, bePods:4, bPods:1, gPods:0
I0411 01:50:38.569603       1 lownodeutilization.go:141] Node "ip-172-18-14-173.ec2.internal" is under utilized with usage: api.ResourceThresholds{"cpu":20, "memory":11.422210175359927, "pods":4.4}
I0411 01:50:38.569616       1 lownodeutilization.go:149] allPods:11, nonRemovablePods:3, bePods:8, bPods:0, gPods:0
I0411 01:50:38.569623       1 lownodeutilization.go:65] Criteria for a node under utilization: CPU: 40, Mem: 40, Pods: 40
I0411 01:50:38.569630       1 lownodeutilization.go:72] Total number of underutilized nodes: 2
I0411 01:50:38.569635       1 lownodeutilization.go:80] all nodes are underutilized, nothing to do here
I0411 01:50:38.569644       1 pod_antiaffinity.go:45] Processing node: "ip-172-18-7-158.ec2.internal"
I0411 01:50:38.585039       1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-7-158.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 15 milliseconds
I0411 01:50:38.595997       1 pod_antiaffinity.go:45] Processing node: "ip-172-18-14-173.ec2.internal"
I0411 01:50:38.635903       1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-18-14-173.ec2.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 39 milliseconds
I0411 01:50:38.659140       1 node_affinity.go:31] Evicted 0 pods
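
If those defaults are indeed the case, passing both flags explicitly removes the ambiguity; something along these lines (the flag names come from this report, the values are only illustrative):

    Command:
      /bin/descheduler
    Args:
      --policy-config-file=/policy-dir/policy.yaml
      --max-pods-to-evict-per-node=10
      -v=4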

`LowNodeUtilization` policy needs all thresholds to be violated

I am testing out the LowNodeUtilization policy with the following values:

           nodeResourceUtilizationThresholds:
             thresholds:
               cpu: 60
               memory: 60
               pods: 5
             targetThresholds:
               cpu: 100
               memory: 100
               pods: 1000

However, all nodes are reported as appropriately utilized.
E.g.:

I0514 20:35:09.877111       1 lownodeutilization.go:147] Node "gke-asia-northeast1-std--default-pool-36ae422e-wnp4" is appropriately utilized with usage: api.ResourceThresholds{"memory":48.0697066631997, "pods":12.727272727272727, "cpu":34.64443045940843}

For the above node
"memory":48.0697066631997, < 60
"cpu":34.64443045940843 < 60
But "pods":12.727272727272727 > 5

I checked the code and it looks like IsNodeWithLowUtilization will return false if any threshold is not violated - https://github.com/kubernetes-incubator/descheduler/blob/master/pkg/descheduler/strategies/lownodeutilization.go#L298

This means that ALL thresholds need to be violated instead of ANY. Is that by design?
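
For reference, the check linked above boils down to something like this simplified Go sketch (not the actual function; names and types are illustrative):

package main

import "fmt"

// isNodeUnderutilized mirrors the ALL-thresholds behaviour described above:
// a node only counts as underutilized when its usage is below the configured
// threshold for every resource (cpu, memory, pods, ...).
func isNodeUnderutilized(usage, thresholds map[string]float64) bool {
    for name, threshold := range thresholds {
        if usage[name] > threshold {
            // a single resource above its threshold disqualifies the node
            return false
        }
    }
    return true
}

func main() {
    usage := map[string]float64{"cpu": 34.6, "memory": 48.1, "pods": 12.7}
    thresholds := map[string]float64{"cpu": 60, "memory": 60, "pods": 5}
    fmt.Println(isNodeUnderutilized(usage, thresholds)) // false: "pods" exceeds its threshold
}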

Add support for node affinity strategy

From the little that I read about node affinity, does adding the following strategy make sense -

For pods with node affinity set using preferredDuringSchedulingIgnoredDuringExecution, it might be possible that the preferred node was unavailable during scheduling and the pod was scheduled on another node. In this case, if the descheduler is run, it does the following -

  1. checks for all the pods with nodeAffinity defined using preferredDuringSchedulingIgnoredDuringExecution
  2. checks if the pod is actually scheduled on the preferred node or not
  3. if not, descheduler checks if the preferred node is available and is schedulable
  4. if such a node is found, descheduler evicts the pod (and hopefully the scheduler schedules it on the preferred node 🎉)

Maybe we could have a policy file describing the strategy like -

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
     enabled: true

@aveshagarwal @ravisantoshgudimetla if this makes sense, can I take a stab at a PoC for this?
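
For context, a pod this strategy would act on carries a preferred node affinity along these lines (a sketch; the disktype=ssd preference is just a placeholder):

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd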

descheduler and devices/latency sensitive pods

If a pod consumes high-value devices (GPUs, hugepages) or has latency benefits via CPU pinning, we should avoid descheduling it, as it's not known whether the new pods will actually get a better fit.

PVC Consideration

Project looks interesting! One item on the "Future Roadmap" that would be worth considering is fault domains and the PVCs that are associated with a pod.

Evicting a pod with a PVC due to LowNodeUtilization on another node would not result in actual re-placement of that pod, so it shouldn't be attempted.

Usage with Openshift

It would be very helpful to get documentation about how to use and set up the descheduler job within an OpenShift environment.

I tried to follow the README within my OpenShift cluster, but when creating the ClusterRole I get the following error:
error: unable to recognize "STDIN": no matches for rbac.authorization.k8s.io/, Kind=ClusterRole

When calling make on my macOS or CentOS machine, the build also fails:

go build -ldflags "-X github.com/kubernetes-incubator/descheduler/cmd/descheduler/app.version=`git describe --tags` -X github.com/kubernetes-incubator/descheduler/cmd/descheduler/app.buildDate=`date +%FT%T%z` -X github.com/kubernetes-incubator/descheduler/cmd/descheduler/app.gitCommit=`git rev-parse HEAD`" -o _output/bin/descheduler github.com/kubernetes-incubator/descheduler/cmd/descheduler

Status of this incubator project?

Hello! This feature seems fundamental to strong bin packing, but it's been months since the last update.

Is this project still active, and is there a timeline to have it merged into an official Kubernetes release?

Confusion with the "RemovePodsViolatingInterPodAntiAffinity" strategy

hi:

I just want to make it work, and I found that this strategy does not work as I expect.

descheduler version
Descheduler version {Major:0 Minor:4+ GitCommit:d3c2f256852874fdca4682c3c94bc30624979036 GitVersion:v0.4.0 BuildDate:2018-01-10T13:23:09+0800 GoVersion:go1.8.5 Compiler:gc Platform:linux/amd64}

My original attempt followed these steps:

  1. keep only one node schedulable
  2. create an rc
    oc run hello --image=openshift/hello-openshift:latest
  3. create another rc with anti-affinity
affinity:
   podAntiAffinity:
     requiredDuringSchedulingIgnoredDuringExecution:
     - labelSelector:
         matchExpressions:
         - key: key
           operator: In
           values: ["value"]
       topologyKey: kubernetes.io/hostname
  4. wait until all pods of the rcs are running
  5. label the pod of the first rc with
    oc label pod <pod_name> key=value
  6. set up the descheduler and try to evict pods

Then I found that no pod had been evicted.

Then I went through all the unit tests to find a demo of this strategy,
and tried to reproduce the test in https://github.com/kubernetes-incubator/descheduler/blob/master/pkg/descheduler/strategies/pod_antiaffinity_test.go

The reproduced steps are:

  1. keep only one node schedulable
  2. create an rc with anti-affinity
affinity:
   podAntiAffinity:
     requiredDuringSchedulingIgnoredDuringExecution:
     - labelSelector:
         matchExpressions:
         - key: key
           operator: In
           values: ["value"]
       topologyKey: kubernetes.io/hostname
  3. create another rc with the same anti-affinity as in step 2
  4. wait until all pods of the rcs are running
  5. label the pod of the first rc with
    oc label pod <pod_name> key=value
  6. set up the descheduler and try to evict pods

Then I found that one pod had been evicted.

So what I want to discuss here is whether only this one scenario (the one in the unit test) is supported by the strategy.
And why do my original steps not work? Is it a bug?

Thanks!

Builds failing.

Not too sure if this is an issue for anyone else, but in order to build and run out of the box I had to update the Dockerfile to build FROM debian:stretch-slim so it can run with /bin/sh.
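
For reference, the change described above amounts to swapping the final base image in the multi-stage Dockerfile, roughly like this (a sketch of the workaround, not the upstream Dockerfile):

# final stage: a slim Debian base instead of scratch, so /bin/sh is available
FROM debian:stretch-slim

COPY --from=0 /go/src/github.com/kubernetes-incubator/descheduler/_output/bin/descheduler /bin/descheduler

CMD ["/bin/descheduler", "--help"]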

Add an Ascii-art [before/after] diagram and fix some typos in the rescheduler README.md

This is just a starter issue to get ramped up on contributing. It's been a while since I've contributed anything to upstream :).

For this issue we'd like to:

  • Update the ASCII diagram of a before/after rescheduling scenario (specifically, one for the low node usage / bin packing case, as that's strategic for us at this time).
  • Fix some minor nits and typos in the README.md.

all nodes are under target utilization, nothing to do here

Given the following policy:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           "cpu" : 50
           "memory": 50
           "pods": 10
         targetThresholds:
           "cpu" : 50
           "memory": 50
           "pods": 50

I am confused by this output:

./_output/bin/descheduler --kubeconfig ~/.kube/config --policy-config-file policy.yaml --node-selector beta.kubernetes.io/instance-type=n1-highmem-4 -v 4
I0814 11:56:02.699491   27948 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0814 11:56:02.699629   27948 reflector.go:240] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0814 11:56:02.799560   27948 node.go:51] node lister returned empty list, now fetch directly
I0814 11:56:04.839125   27948 request.go:480] Throttling request took 122.384366ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-mr9b%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.039313   27948 request.go:480] Throttling request took 82.857788ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-ps8k%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.239153   27948 request.go:480] Throttling request took 65.339548ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-qcpq%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.439252   27948 request.go:480] Throttling request took 126.223138ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-qg9g%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.639253   27948 request.go:480] Throttling request took 128.039815ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-tw62%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.839299   27948 request.go:480] Throttling request took 111.435987ms, request: GET:https://x.x.x.x.x/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-node-bf2a5a1e-w7n7%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
I0814 11:56:05.912036   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-kfh8" is under utilized with usage: api.ResourceThresholds{"cpu":21.428571428571427, "memory":7.093371019678181, "pods":7.2727272727272725}
I0814 11:56:05.912091   27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.912148   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-mr9b" is under utilized with usage: api.ResourceThresholds{"pods":9.090909090909092, "cpu":33.92857142857143, "memory":3.773439230136872}
I0814 11:56:05.912160   27948 lownodeutilization.go:149] allPods:10, nonRemovablePods:7, bePods:0, bPods:3, gPods:0
I0814 11:56:05.912219   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-qcpq" is under utilized with usage: api.ResourceThresholds{"cpu":15.561224489795919, "memory":7.322094578473096, "pods":9.090909090909092}
I0814 11:56:05.912230   27948 lownodeutilization.go:149] allPods:10, nonRemovablePods:6, bePods:0, bPods:4, gPods:0
I0814 11:56:05.912264   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-qg9g" is under utilized with usage: api.ResourceThresholds{"cpu":27.551020408163264, "memory":7.681984428891371, "pods":7.2727272727272725}
I0814 11:56:05.912273   27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.912513   27948 lownodeutilization.go:147] Node "gke-node-bf2a5a1e-9b15" is appropriately utilized with usage: api.ResourceThresholds{"memory":13.213583436348788, "pods":10.909090909090908, "cpu":22.372448979591837}
I0814 11:56:05.912537   27948 lownodeutilization.go:149] allPods:12, nonRemovablePods:6, bePods:0, bPods:6, gPods:0
I0814 11:56:05.912613   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-cs2l" is under utilized with usage: api.ResourceThresholds{"memory":2.9337443495765223, "pods":7.2727272727272725, "cpu":27.551020408163264}
I0814 11:56:05.912631   27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.912749   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-dk7j" is under utilized with usage: api.ResourceThresholds{"cpu":15.051020408163266, "memory":6.403485637657102, "pods":7.2727272727272725}
I0814 11:56:05.912770   27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:1, bPods:1, gPods:0
I0814 11:56:05.912828   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-ggdw" is under utilized with usage: api.ResourceThresholds{"cpu":19.387755102040817, "memory":1.9656336753240788, "pods":7.2727272727272725}
I0814 11:56:05.912855   27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.912897   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-tw62" is under utilized with usage: api.ResourceThresholds{"cpu":20.918367346938776, "memory":2.4510629446719583, "pods":6.363636363636363}
I0814 11:56:05.912909   27948 lownodeutilization.go:149] allPods:7, nonRemovablePods:5, bePods:0, bPods:2, gPods:0
I0814 11:56:05.920135   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-3rt5" is under utilized with usage: api.ResourceThresholds{"cpu":27.551020408163264, "memory":7.68198179540342, "pods":7.2727272727272725}
I0814 11:56:05.920172   27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.920269   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-50rb" is under utilized with usage: api.ResourceThresholds{"cpu":28.316326530612244, "memory":8.43523410990875, "pods":10}
I0814 11:56:05.920288   27948 lownodeutilization.go:149] allPods:11, nonRemovablePods:6, bePods:0, bPods:5, gPods:0
I0814 11:56:05.920354   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-hmq1" is under utilized with usage: api.ResourceThresholds{"cpu":27.806122448979593, "memory":7.93306590023853, "pods":8.181818181818182}
I0814 11:56:05.920370   27948 lownodeutilization.go:149] allPods:9, nonRemovablePods:6, bePods:0, bPods:3, gPods:0
I0814 11:56:05.920444   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-ps8k" is under utilized with usage: api.ResourceThresholds{"cpu":27.040816326530614, "memory":7.1798135857332, "pods":5.454545454545454}
I0814 11:56:05.920467   27948 lownodeutilization.go:149] allPods:6, nonRemovablePods:6, bePods:0, bPods:0, gPods:0
I0814 11:56:05.920580   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-w7n7" is under utilized with usage: api.ResourceThresholds{"cpu":23.724489795918366, "memory":4.948809588699222, "pods":8.181818181818182}
I0814 11:56:05.920632   27948 lownodeutilization.go:149] allPods:9, nonRemovablePods:5, bePods:0, bPods:4, gPods:0
I0814 11:56:05.920674   27948 lownodeutilization.go:147] Node "gke-node-bf2a5a1e-1t8l" is appropriately utilized with usage: api.ResourceThresholds{"memory":0.8776025543719349, "pods":3.6363636363636362, "cpu":7.653061224489796}
I0814 11:56:05.920690   27948 lownodeutilization.go:149] allPods:4, nonRemovablePods:4, bePods:0, bPods:0, gPods:0
I0814 11:56:05.920733   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-fjbd" is under utilized with usage: api.ResourceThresholds{"pods":5.454545454545454, "cpu":27.040816326530614, "memory":7.1798135857332}
I0814 11:56:05.920745   27948 lownodeutilization.go:149] allPods:6, nonRemovablePods:6, bePods:0, bPods:0, gPods:0
I0814 11:56:05.921125   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-kxb8" is under utilized with usage: api.ResourceThresholds{"cpu":27.29591836734694, "memory":7.43089769056831, "pods":6.363636363636363}
I0814 11:56:05.921145   27948 lownodeutilization.go:149] allPods:7, nonRemovablePods:6, bePods:0, bPods:1, gPods:0
I0814 11:56:05.921205   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-l4xf" is under utilized with usage: api.ResourceThresholds{"memory":2.8898642218579256, "pods":7.2727272727272725, "cpu":22.193877551020407}
I0814 11:56:05.921220   27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.921279   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-cvsc" is under utilized with usage: api.ResourceThresholds{"cpu":8.418367346938776, "memory":1.4256836686734968, "pods":7.2727272727272725}
I0814 11:56:05.921297   27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:5, bePods:1, bPods:1, gPods:1
I0814 11:56:05.921388   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-kntr" is under utilized with usage: api.ResourceThresholds{"cpu":12.525510204081632, "memory":3.879292539212722, "pods":6.363636363636363}
I0814 11:56:05.921407   27948 lownodeutilization.go:149] allPods:7, nonRemovablePods:5, bePods:0, bPods:2, gPods:0
I0814 11:56:05.921478   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-5415" is under utilized with usage: api.ResourceThresholds{"cpu":21.428571428571427, "memory":7.093371019678181, "pods":7.2727272727272725}
I0814 11:56:05.921495   27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:6, bePods:0, bPods:2, gPods:0
I0814 11:56:05.921562   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-5dd0" is under utilized with usage: api.ResourceThresholds{"cpu":27.806122448979593, "memory":7.93306590023853, "pods":8.181818181818182}
I0814 11:56:05.921580   27948 lownodeutilization.go:149] allPods:9, nonRemovablePods:6, bePods:0, bPods:3, gPods:0
I0814 11:56:05.921636   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-74z1" is under utilized with usage: api.ResourceThresholds{"cpu":21.1734693877551, "memory":2.7021479758398677, "pods":7.2727272727272725}
I0814 11:56:05.921652   27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:5, bePods:0, bPods:3, gPods:0
I0814 11:56:05.923369   27948 lownodeutilization.go:141] Node "gke-node-bf2a5a1e-bvsh" is under utilized with usage: api.ResourceThresholds{"cpu":21.1734693877551, "memory":2.7021470495070683, "pods":7.2727272727272725}
I0814 11:56:05.923431   27948 lownodeutilization.go:149] allPods:8, nonRemovablePods:5, bePods:0, bPods:3, gPods:0
I0814 11:56:05.923442   27948 lownodeutilization.go:65] Criteria for a node under utilization: CPU: 50, Mem: 50, Pods: 10
I0814 11:56:05.923478   27948 lownodeutilization.go:72] Total number of underutilized nodes: 22
I0814 11:56:05.923493   27948 lownodeutilization.go:85] all nodes are under target utilization, nothing to do here

According to the descheduler, Total number of underutilized nodes: 22 and all nodes are under target utilization, yet nothing to do here. None of my underutilized nodes get drained.

How can I instruct the descheduler to drain the underutilized nodes?
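
For comparison, a policy where targetThresholds sit strictly above thresholds gives the strategy both an under-utilized band to fill and an over-utilized band to evict from; the numbers below are only illustrative:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:        # nodes below all of these are considered underutilized
           "cpu": 30
           "memory": 30
           "pods": 30
         targetThresholds:  # nodes above any of these are considered over-utilized
           "cpu": 60
           "memory": 60
           "pods": 60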

Enable profiling in descheduler

/cc @aveshagarwal - As per our offline discussion, I think the first step would be to enable profiling. I am planning to add flag(s) which enable profiling. I will try to avoid starting HTTP-server-based profiling in the initial stages.

Add support for inter-pod affinity strategy

For pods with podAffinity set using preferredDuringSchedulingIgnoredDuringExecution, it might be possible that at the time of scheduling on the current node, no pod with the matching labels was running, but the pod still got scheduled on the current node since the nature of the affinity was preferred and not required.

In such a case, if the descheduler is run, it can do the following -

  1. finds pods running with podAffinity set using preferredDuringSchedulingIgnoredDuringExecution
  2. checks if the pods found in 1 are scheduled on the desired node or not
  3. if not, the descheduler checks whether there are other schedulable nodes running the desired pods, where this podAffinity condition can be met
  4. if such a node is found, descheduler evicts the pod (and hopefully the scheduler schedules it on the desired node 🎉)

Maybe we could have a policy file describing the strategy like -

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingPodAffinity":
     enabled: true

Does it make sense to support such a strategy?
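
For context, a pod this strategy would act on carries a preferred pod affinity roughly like this (a sketch; the app: cache selector is just a placeholder):

affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: cache
        topologyKey: kubernetes.io/hostname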

RemoveDuplicates should not evict pods when other schedulable nodes are not available

When I run the following policy config file -

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
     enabled: true

with the RemoveDuplicates strategy enabled, and if there is only one schedulable node available (on which the pods are already running), the descheduler still evicts the pods, only for them to be scheduled on the same node again. This would lead to disruption of service without any gain.

$ descheduler --kubeconfig $KUBECONFIG --policy-config-file policy.yaml -v 5
I0120 02:49:28.828911   13141 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I0120 02:49:28.828993   13141 reflector.go:240] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I0120 02:49:28.929080   13141 node.go:50] node lister returned empty list, now fetch directly
I0120 02:49:30.343166   13141 duplicates.go:49] Processing node: "kubernetes-master"
I0120 02:49:30.873924   13141 duplicates.go:49] Processing node: "kubernetes-minion-group-5xwf"
I0120 02:49:31.123129   13141 duplicates.go:53] "ReplicaSet/wordpress-57f4bb46bf"
I0120 02:49:31.367105   13141 duplicates.go:61] Evicted pod: "wordpress-57f4bb46bf-fn8qm" (<nil>)
I0120 02:49:31.607484   13141 duplicates.go:61] Evicted pod: "wordpress-57f4bb46bf-k6tqz" (<nil>)
I0120 02:49:31.865925   13141 duplicates.go:61] Evicted pod: "wordpress-57f4bb46bf-rvwd9" (<nil>)
I0120 02:49:32.155498   13141 duplicates.go:61] Evicted pod: "wordpress-57f4bb46bf-v9bzq" (<nil>)
I0120 02:49:32.155526   13141 duplicates.go:49] Processing node: "kubernetes-minion-group-cxg1"
I0120 02:49:32.433999   13141 duplicates.go:49] Processing node: "kubernetes-minion-group-v738"

How about the descheduler only evicting pods if other schedulable nodes are available?

/bin/sh not found when using this image in kubernetes job

The descheduler will be run as a job in the kube-system namespace, and the command is

    Command:
      /bin/sh
      -ec
      /bin/descheduler --policy-config-file /policy-dir/policy.yaml

So, there should be a /bin/sh binary in the container, but the image was built from scratch and didn't include it. We can see this from the Dockerfile:

FROM scratch

MAINTAINER Avesh Agarwal <[email protected]>

COPY --from=0 /go/src/github.com/kubernetes-incubator/descheduler/_output/bin/descheduler /bin/descheduler

CMD ["/bin/descheduler", "--help"]

And we got the Error:

Error: failed to start container "descheduler": Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: "/bin/sh": stat /bin/sh: no such file or directory"

This makes the pod run into the ContainerCannotRun state and the job creates a new pod immediately;
several minutes later I had hundreds of pods and my small cluster finally went down, unresponsive.
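
Until the image ships a shell, one workaround is to invoke the binary directly in the Job spec instead of going through /bin/sh, along these lines:

    Command:
      /bin/descheduler
    Args:
      --policy-config-file=/policy-dir/policy.yaml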

Eviction of stateful set

Does the descheduler evict pods created by a StatefulSet?

I've got 3 pods from the same StatefulSet on one node, but this is not picked up by the duplicates strategy.

Putting unsupported resource names in thresholds does not throw an error

Currently, the descheduler only supports cpu, memory and pods, but if we put another resource name or an invalid resource name, then we do not get an error -

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           "cpu" : 40
           "memory": 40
           "pods": 40
           "storage": 25 # unsupported value
         targetThresholds:
           "cpu" : 30
           "memory": 2
           "pods": 1

should throw an error, but it does not -

$ _output/bin/descheduler --kubeconfig-file /var/run/kubernetes/admin.kubeconfig --policy-config-file examples/custom.yaml -v 5 
I1124 15:07:19.211499   16232 reflector.go:198] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1124 15:07:19.211643   16232 node.go:50] node lister returned empty list, now fetch directly
I1124 15:07:19.211789   16232 reflector.go:236] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1124 15:07:21.982785   16232 lownodeutilization.go:115] Node "kubernetes-minion-group-j57l" usage: api.ResourceThresholds{"cpu":74.5, "memory":12.991864967531136, "pods":13.636363636363637}
I1124 15:07:21.983109   16232 lownodeutilization.go:115] Node "kubernetes-minion-group-jk62" usage: api.ResourceThresholds{"memory":17.931192353458197, "pods":12.727272727272727, "cpu":87.3}
I1124 15:07:21.983179   16232 lownodeutilization.go:115] Node "kubernetes-minion-group-vkv3" usage: api.ResourceThresholds{"cpu":10, "memory":2.764226588836412, "pods":1.8181818181818181}
I1124 15:07:21.983320   16232 lownodeutilization.go:115] Node "kubernetes-master" usage: api.ResourceThresholds{"pods":8.181818181818182, "cpu":95, "memory":11.575631035804197}
I1124 15:07:21.983374   16232 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-jk62" with usage: api.ResourceThresholds{"cpu":87.3, "memory":17.931192353458197, "pods":12.727272727272727}
I1124 15:07:21.983402   16232 lownodeutilization.go:163] evicting pods from node "kubernetes-master" with usage: api.ResourceThresholds{"cpu":95, "memory":11.575631035804197, "pods":8.181818181818182}
I1124 15:07:21.983485   16232 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-j57l" with usage: api.ResourceThresholds{"cpu":74.5, "memory":12.991864967531136, "pods":13.636363636363637}

Create a SECURITY_CONTACTS file.

As per the email sent to kubernetes-dev[1], please create a SECURITY_CONTACTS
file.

The template for the file can be found in the kubernetes-template repository[2].
A description for the file is in the steering-committee docs[3], you might need
to search that page for "Security Contacts".

Please feel free to ping me on the PR when you make it, otherwise I will see when
you close this issue. :)

Thanks so much, let me know if you have any questions.

(This issue was generated from a tool, apologies for any weirdness.)

[1] https://groups.google.com/forum/#!topic/kubernetes-dev/codeiIoQ6QE
[2] https://github.com/kubernetes/kubernetes-template-project/blob/master/SECURITY_CONTACTS
[3] https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance-template-short.md

Add a strategy for taints and tolerations

Recently one of the users requested a strategy for taints and tolerations. While I don't have cycles to work on this, I would be more than happy to review if anyone in the community is interested in working on it.

Allow descheduling of Pods which have hostDirs

Currently the descheduler only checks for the kubernetes.io/created-by annotation in order to proceed with descheduling. It also ignores every pod which has a hostDir volume mounted.

Would it be possible to allow descheduling of Pods which have hostDirs (maybe configurable, based on the content of kubernetes.io/created-by)?

No Auth Provider found for name "gcp"

Hi,

My k8s cluster is running on GKE.

I tried using the descheduler, but after compiling I get this error:

$ ./bin/descheduler --dry-run --kubeconfig ~/.kube/config
E0521 16:26:53.978025    4226 server.go:46] No Auth Provider found for name "gcp"

Apparently the code in descheduler/pkg/descheduler/client/client.go needs to import _ "k8s.io/client-go/plugin/pkg/client/auth/gcp" (or _ "k8s.io/client-go/plugin/pkg/client/auth" to support other auth providers)

After adding _ "k8s.io/client-go/plugin/pkg/client/auth/gcp" to the import list I was able to authenticate against GKE.

Terminology confusion around utilization

The current configuration types describe NodeResourceUtilizationThresholds.

I think utilization implies observed usage, not what is scheduled or allocated.

If we use the term utilization, it should mean the decision is based on metrics.

If we use the term allocated or node scheduling thresholds, it should mean the decision is based on pod resource requests, and not observed usage.

deschedule pods that fail to start or restart too often

It is not uncommon that pods get scheduled on nodes that are not able to start them.
For example, a node may have network issues and be unable to mount a networked persistent volume, or cannot pull a docker image, or has some docker configuration issue which only shows up on container startup.

Another common issue is when a container gets restarted by its liveness check because of some local node issue (e.g. wrong routing table, slow storage, network latency or packet drop). In that case, a pod is unhealthy most of the time and hangs in a restart state forever without a chance of being migrated to another node.

As of now, there is no possibility to re-schedule pods with faulty containers. It may be helpful to introduce two new Strategies:

  • container-restart-rate: re-schedule a pod if it has been unhealthy for $notReadyPeriod seconds and one of its containers has been restarted $maxRestartCount times.
  • pod-startup-failure: a pod was scheduled on a node, but was unable to start all of its containers within $maxStartupTime seconds. (A hypothetical policy sketch follows below.)

A similar issue is filed against Kubernetes: kubernetes/kubernetes#13385
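
Purely hypothetically, the two proposed strategies might be configured along these lines; neither strategy exists in the descheduler today, and the names and parameters are taken only from this proposal:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "ContainerRestartRate":      # hypothetical, not an implemented strategy
     enabled: true
     params:
       notReadyPeriod: 300     # seconds the pod has been unhealthy
       maxRestartCount: 5      # restarts of any single container
  "PodStartupFailure":         # hypothetical, not an implemented strategy
     enabled: true
     params:
       maxStartupTime: 600     # seconds allowed for all containers to start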

Is it based on kubectl top node CPU & memory?

While using the descheduler, I have noticed that the log shows exactly the same memory utilization number for many nodes. Also, each node shows exactly the same CPU and memory utilization numbers in the log. It seems like the descheduler is calculating the utilization from resource requests and limits?
I was hoping it uses kubectl top nodes to calculate the current utilization (which would be reflected in the log as dynamically changing CPU and memory percentages). Please clarify how it calculates the current node resource utilization.

e.g. here is the data I am talking about: in the log, I see this: Node "172.16.4.3" is appropriately utilized with usage: api.ResourceThresholds{"cpu":52.5, "memory":32.080248132547204, "pods":41.25}, but kubectl top node shows 172.16.4.3 2025m 25% 8699Mi 55% - meaning CPU 25%, memory 55% utilized.
Also, many of my pods are showing exactly the same memory utilization, "memory":4.193744917801392

HighNodeUtilization strategy

I was reading the Rescheduler-Design-Implementation document (https://docs.google.com/document/d/1KXw02Q0cOF1MUrdpPNiug0yGZlixvPg2SwBycrT5DkE/edit) and saw that the descheduler should support also the HighNodeUtilization strategy option.

Meaning that the descheduler should evict pods from nodes that reached high thresholds.

This is what I am trying to achieve: balancing pods from heavily loaded nodes onto low-utilized nodes, but I cannot seem to get that to work :(

Any idea how a policy that balances high-utilization nodes should be defined? Or is it not implemented in the code? Is it a feature that can be added?

Thank you for any kind of help

Roiy

Improve the test coverage

Running go test -cover:

?   	github.com/kubernetes-incubator/descheduler/cmd/descheduler	[no test files]
?   	github.com/kubernetes-incubator/descheduler/cmd/descheduler/app	[no test files]
?   	github.com/kubernetes-incubator/descheduler/cmd/descheduler/app/options	[no test files]
?   	github.com/kubernetes-incubator/descheduler/pkg/api	[no test files]
?   	github.com/kubernetes-incubator/descheduler/pkg/api/install	[no test files]
?   	github.com/kubernetes-incubator/descheduler/pkg/api/v1alpha1	[no test files]
?   	github.com/kubernetes-incubator/descheduler/pkg/apis/componentconfig	[no test files]
?   	github.com/kubernetes-incubator/descheduler/pkg/apis/componentconfig/install	[no test files]
?   	github.com/kubernetes-incubator/descheduler/pkg/apis/componentconfig/v1alpha1	[no test files]
?   	github.com/kubernetes-incubator/descheduler/pkg/descheduler	[no test files]
?   	github.com/kubernetes-incubator/descheduler/pkg/descheduler/client	[no test files]
ok  	github.com/kubernetes-incubator/descheduler/pkg/descheduler/evictions	0.273s	coverage: 50.0% of statements
?   	github.com/kubernetes-incubator/descheduler/pkg/descheduler/evictions/utils	[no test files]
ok  	github.com/kubernetes-incubator/descheduler/pkg/descheduler/node	0.268s	coverage: 72.5% of statements
ok  	github.com/kubernetes-incubator/descheduler/pkg/descheduler/pod	0.144s	coverage: 33.3% of statements
?   	github.com/kubernetes-incubator/descheduler/pkg/descheduler/scheme	[no test files]
ok  	github.com/kubernetes-incubator/descheduler/pkg/descheduler/strategies	0.077s	coverage: 73.6% of statements
?   	github.com/kubernetes-incubator/descheduler/pkg/utils	[no test files]
?   	github.com/kubernetes-incubator/descheduler/test	[no test files]

Some packages are missing tests completely.

Another feature I love about Go :). More about test coverage at https://blog.golang.org/cover
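
For reference, a per-package coverage profile can be generated and browsed with the standard Go tooling, for example:

go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out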

Nodes with scheduling disabled should not be taken into consideration for LowNodeUtilization

I have the following nodes -

$ kubectl get nodes
NAME                           STATUS                     ROLES     AGE       VERSION
kubernetes-master              Ready,SchedulingDisabled   <none>    56m       v1.8.4-dirty
kubernetes-minion-group-5rrh   Ready                      <none>    56m       v1.8.4-dirty
kubernetes-minion-group-fb8c   Ready                      <none>    56m       v1.8.4-dirty
kubernetes-minion-group-t1r3   Ready,SchedulingDisabled   <none>    56m       v1.8.4-dirty

The worker node kubernetes-minion-group-t1r3 was cordoned and marked as unschedulable, however it fulfilled the criteria for being an underutilized node according to the following policy file -

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:  # any node below the following percentages is considered underutilized
           "cpu" : 40
           "memory": 40
           "pods": 40
         targetThresholds: # any node above the following percentages is considered overutilized
           "cpu" : 30
           "memory": 2
           "pods": 1

When I ran the descheduler, kubernetes-minion-group-t1r3 (the cordoned node) was taken into account and marked as underutilized, and multiple pods were evicted from other nodes in the hope that the scheduler would schedule them on kubernetes-minion-group-t1r3, but that never happened since the node was cordoned.

Does it make sense to not take a cordoned node into consideration while looking for underutilized nodes?

I ran the descheduler as follows -

$ _output/bin/descheduler --kubeconfig-file /var/run/kubernetes/admin.kubeconfig --policy-config-file examples/custom.yaml -v 5 
I1125 18:58:46.014381    2813 reflector.go:198] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1125 18:58:46.016167    2813 node.go:50] node lister returned empty list, now fetch directly
I1125 18:58:46.017010    2813 reflector.go:236] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I1125 18:58:47.834184    2813 lownodeutilization.go:115] Node "kubernetes-master" usage: api.ResourceThresholds{"cpu":95, "memory":11.575631035804197, "pods":8.181818181818182}
I1125 18:58:47.834986    2813 lownodeutilization.go:115] Node "kubernetes-minion-group-5rrh" usage: api.ResourceThresholds{"cpu":90.5, "memory":6.932161992316314, "pods":17.272727272727273}
I1125 18:58:47.835701    2813 lownodeutilization.go:115] Node "kubernetes-minion-group-fb8c" usage: api.ResourceThresholds{"cpu":96.5, "memory":14.0975556030657, "pods":17.272727272727273}
I1125 18:58:47.835783    2813 lownodeutilization.go:115] Node "kubernetes-minion-group-t1r3" usage: api.ResourceThresholds{"cpu":10, "memory":2.764226588836412, "pods":1.8181818181818181}
I1125 18:58:47.835819    2813 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-fb8c" with usage: api.ResourceThresholds{"cpu":96.5, "memory":14.0975556030657, "pods":17.272727272727273}
I1125 18:58:48.096681    2813 lownodeutilization.go:194] Evicted pod: "database-6f97f65956-6pxp5" (<nil>)
I1125 18:58:48.098323    2813 lownodeutilization.go:208] updated node usage: api.ResourceThresholds{"cpu":91.5, "memory":14.0975556030657, "pods":16.363636363636363}
I1125 18:58:48.361411    2813 lownodeutilization.go:194] Evicted pod: "wordpress-57f4bb46bf-g27k6" (<nil>)
I1125 18:58:48.361522    2813 lownodeutilization.go:208] updated node usage: api.ResourceThresholds{"cpu":86.5, "memory":14.0975556030657, "pods":15.454545454545455}
I1125 18:58:48.623304    2813 lownodeutilization.go:194] Evicted pod: "wordpress-57f4bb46bf-m62cm" (<nil>)
I1125 18:58:48.623330    2813 lownodeutilization.go:208] updated node usage: api.ResourceThresholds{"cpu":81.5, "memory":14.0975556030657, "pods":14.545454545454547}
I1125 18:58:48.894712    2813 lownodeutilization.go:194] Evicted pod: "wordpress-57f4bb46bf-mblx7" (<nil>)
I1125 18:58:48.894832    2813 lownodeutilization.go:208] updated node usage: api.ResourceThresholds{"cpu":76.5, "memory":14.0975556030657, "pods":13.636363636363638}
I1125 18:58:48.894991    2813 lownodeutilization.go:163] evicting pods from node "kubernetes-master" with usage: api.ResourceThresholds{"cpu":95, "memory":11.575631035804197, "pods":8.181818181818182}
I1125 18:58:48.895063    2813 lownodeutilization.go:163] evicting pods from node "kubernetes-minion-group-5rrh" with usage: api.ResourceThresholds{"cpu":90.5, "memory":6.932161992316314, "pods":17.272727272727273}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.4-dirty", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"dirty", BuildDate:"2017-11-25T12:04:44Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.4-dirty", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"dirty", BuildDate:"2017-11-25T11:54:10Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

pod antiaffinity strategy evicts all pods.

The pod anti-affinity strategy doesn't check the type of pod to be evicted, so it can evict critical and mirror pods. As this is functionality that needs to be respected by all strategies in the descheduler, I am planning to move it to pods.go to avoid code duplication, so that people implementing strategies won't have to think about it.

Evictions found but pods are not deleted

I have this weird issue where the descheduler correctly spots which pods should be evicted, but no pods are actually deleted.

Could it be a permission issue? I'm using RBAC and have set up the roles as described in the README.

I0608 11:47:47.133914       1 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0608 11:47:47.133965       1 reflector.go:240] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0608 11:47:47.234083       1 duplicates.go:50] Processing node: "ip-172-20-34-164.eu-west-2.compute.internal"
I0608 11:47:47.342677       1 duplicates.go:54] "ReplicaSet/notification-service-v1-b66596d89"
I0608 11:47:47.342721       1 duplicates.go:50] Processing node: "ip-172-20-80-152.eu-west-2.compute.internal"
I0608 11:47:47.354665       1 duplicates.go:54] "ReplicaSet/rabbitmq-k8s-0-5b87cfcfbd"
I0608 11:47:47.354688       1 duplicates.go:54] "ReplicaSet/revenue-modeller-data-store-v2-97fcdc568"
I0608 11:47:47.354697       1 duplicates.go:54] "ReplicaSet/entitlement-service-v1-8454ccd585"
I0608 11:47:47.354705       1 duplicates.go:50] Processing node: "ip-172-20-104-255.eu-west-2.compute.internal"
I0608 11:47:47.407949       1 duplicates.go:54] "ReplicaSet/alert-service-v1-7dc6ddcf8d"
I0608 11:47:47.408001       1 duplicates.go:54] "ReplicaSet/hazelcast-k8s-0-7466b7cb4f"
I0608 11:47:47.438606       1 lownodeutilization.go:141] Node "ip-172-20-34-164.eu-west-2.compute.internal" is under utilized with usage: api.ResourceThresholds{"cpu":32.5, "memory":20.6892852865826, "pods":5.454545454545454}
I0608 11:47:47.438649       1 lownodeutilization.go:149] allPods:6, nonRemovablePods:2, bePods:0, bPods:2, gPods:2
I0608 11:47:47.438798       1 lownodeutilization.go:144] Node "ip-172-20-80-152.eu-west-2.compute.internal" is over utilized with usage: api.ResourceThresholds{"cpu":99, "memory":86.26743748074475, "pods":16.363636363636363}
I0608 11:47:47.438821       1 lownodeutilization.go:149] allPods:18, nonRemovablePods:6, bePods:1, bPods:10, gPods:1
I0608 11:47:47.438990       1 lownodeutilization.go:144] Node "ip-172-20-104-255.eu-west-2.compute.internal" is over utilized with usage: api.ResourceThresholds{"cpu":99.5, "memory":92.91303949597655, "pods":15.454545454545455}
I0608 11:47:47.439014       1 lownodeutilization.go:149] allPods:17, nonRemovablePods:8, bePods:0, bPods:6, gPods:3
I0608 11:47:47.439023       1 lownodeutilization.go:65] Criteria for a node under utilization: CPU: 74, Mem: 68, Pods: 12
I0608 11:47:47.439034       1 lownodeutilization.go:72] Total number of underutilized nodes: 1
I0608 11:47:47.439047       1 lownodeutilization.go:89] Criteria for a node above target utilization: CPU: 77, Mem: 75, Pods: 14
I0608 11:47:47.439061       1 lownodeutilization.go:91] Total number of nodes above target utilization: 2
I0608 11:47:47.439077       1 lownodeutilization.go:183] Total capacity to be moved: CPU:1780, Mem:9.083513856e+09, Pods:9.4
I0608 11:47:47.439093       1 lownodeutilization.go:184] ********Number of pods evicted from each node:***********
I0608 11:47:47.439101       1 lownodeutilization.go:191] evicting pods from node "ip-172-20-104-255.eu-west-2.compute.internal" with usage: api.ResourceThresholds{"pods":15.454545454545455, "cpu":99.5, "memory":92.91303949597655}
I0608 11:47:47.439125       1 lownodeutilization.go:202] 0 pods evicted from node "ip-172-20-104-255.eu-west-2.compute.internal" with usage map[cpu:99.5 memory:92.91303949597655 pods:15.454545454545455]
I0608 11:47:47.439152       1 lownodeutilization.go:191] evicting pods from node "ip-172-20-80-152.eu-west-2.compute.internal" with usage: api.ResourceThresholds{"cpu":99, "memory":86.26743748074475, "pods":16.363636363636363}
I0608 11:47:47.439175       1 lownodeutilization.go:202] 0 pods evicted from node "ip-172-20-80-152.eu-west-2.compute.internal" with usage map[cpu:99 memory:86.26743748074475 pods:16.363636363636363]
I0608 11:47:47.439195       1 lownodeutilization.go:94] Total number of pods evicted: 0
I0608 11:47:47.439203       1 pod_antiaffinity.go:45] Processing node: "ip-172-20-34-164.eu-west-2.compute.internal"
I0608 11:47:47.446324       1 pod_antiaffinity.go:45] Processing node: "ip-172-20-80-152.eu-west-2.compute.internal"
I0608 11:47:47.455917       1 pod_antiaffinity.go:45] Processing node: "ip-172-20-104-255.eu-west-2.compute.internal"
I0608 11:47:47.492859       1 node_affinity.go:31] Evicted 0 pods

This is using Kubernetes v1.10.3
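
One thing worth double-checking in this situation is that the ClusterRole bound to the descheduler's service account grants the eviction subresource; roughly (a sketch, not the exact manifest from the repo):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: descheduler-cluster-role   # name is illustrative
rules:
- apiGroups: [""]
  resources: ["nodes", "pods"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]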

Better control for Critical Pods

Right now the control to mark pods as critical is very basic and requires changing the annotations on many pods.

Proposal 1 - Non-critical annotation
If I have 100 pods but I want the descheduler to consider only 20 of them "non-critical", that means I have to add annotations to 80 pods. We could instead have a "non-critical" annotation to mark only those 20 pods. This could be controlled with an argument, --non-critical-pod-matcher=true (default false).

Proposal 2 - Consider current labels as critical
If I already have an annotation on my running applications that I know identifies a set of critical pods, it would be nice to be able to say "Pods with this custom annotation and value are considered critical". With this, no changes would have to be applied at all to make the descheduler run. Personally, I have an annotation called "layer" with values (backend|monitoring|data|frontend). I consider my data and monitoring Pods critical; if I already have this annotation, why add another?

It could be done with --extra-critical-annotations="layer=data,layer=monitoring,k8s-app=prometheus". And if --non-critical-pod-matcher is set to true, then --extra-non-critical-annotations="....".
