
external-resizer's Introduction

CSI Resizer

The CSI external-resizer is a sidecar container that watches the Kubernetes API server for PersistentVolumeClaim updates and triggers ControllerExpandVolume operations against a CSI endpoint when a user requests more storage on a PersistentVolumeClaim object.

Overview

A storage provider that allows volume expansion after creation may choose to implement it either via a control-plane CSI RPC call, via a node CSI RPC call, or via both as a two-step process. The external-resizer is an external controller that watches the Kubernetes API server for PersistentVolumeClaim modifications and triggers CSI calls for control-plane volume expansion. More details can be found in the CSI Volume expansion documentation.

Compatibility

This information reflects the head of this branch.

  • Compatible with CSI Version: CSI Spec v1.5.0
  • Container Image: k8s.gcr.io/sig-storage/csi-resizer
  • Min K8s Version: 1.16
  • Recommended K8s Version: 1.28

Feature status

Various external-resizer releases come with different alpha / beta features.

The following table reflects the head of this branch.

  • VolumeExpansion (Stable, on by default): Support for expanding CSI volumes.
  • ReadWriteOncePod (Stable, on by default): Single pod access mode for PersistentVolumes.
  • VolumeAttributesClass (Alpha, off by default): Volume Attributes Classes.

Usage

It is necessary to create a new service account and give it enough privileges to run the external-resizer; see deploy/kubernetes/rbac.yaml. The resizer is then deployed as a single Deployment as illustrated below:

kubectl create -f deploy/kubernetes/deployment.yaml

The external-resizer may run in the same pod with other external CSI controllers such as the external-attacher, external-snapshotter and/or external-provisioner.
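
A minimal sketch of that wiring, with the resizer sharing the CSI driver's socket over an emptyDir volume (names, image tag, and socket path are illustrative; deploy/kubernetes/deployment.yaml is the canonical manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-controller
spec:
  replicas: 1
  selector:
    matchLabels:
      app: csi-controller
  template:
    metadata:
      labels:
        app: csi-controller
    spec:
      serviceAccountName: csi-resizer          # bound to the rules in rbac.yaml
      containers:
        - name: csi-resizer
          image: k8s.gcr.io/sig-storage/csi-resizer:v1.9.0   # tag illustrative
          args:
            - "--csi-address=/run/csi/socket"
            - "--leader-election"
          volumeMounts:
            - name: socket-dir
              mountPath: /run/csi
        # - name: my-csi-driver ...            # the CSI driver container sharing the socket
      volumes:
        - name: socket-dir
          emptyDir: {}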

Note that the external-resizer does not scale with more replicas. Only one external-resizer is elected as leader and running; the others wait for the leader to die and elect a new active leader within ~15 seconds after the death of the old leader.

Command line options

Recommended optional arguments

  • --csi-address <path to CSI socket>: This is the path to the CSI driver socket inside the pod that the external-resizer container will use to issue CSI operations (/run/csi/socket is used by default).

  • --leader-election: Enables leader election. This is mandatory when there are multiple replicas of the same external-resizer running for one CSI driver. Only one of them may be active (=leader). A new leader will be elected when the current leader dies or becomes unresponsive for ~15 seconds.

  • --leader-election-namespace: Namespace where the leader election resource lives. Defaults to the pod namespace if not set.

  • --leader-election-lease-duration <duration>: Duration, in seconds, that non-leader candidates will wait to force acquire leadership. Defaults to 15 seconds.

  • --leader-election-renew-deadline <duration>: Duration, in seconds, that the acting leader will retry refreshing leadership before giving up. Defaults to 10 seconds.

  • --leader-election-retry-period <duration>: Duration, in seconds, the LeaderElector clients should wait between tries of actions. Defaults to 5 seconds.

  • --timeout <duration>: Timeout of all calls to the CSI driver. It should be set to a value that accommodates the majority of ControllerExpandVolume calls. 10 seconds is used by default.

  • --kube-api-burst <int>: Burst to use while communicating with the Kubernetes API server. Defaults to 10.

  • --kube-api-qps <float>: QPS to use while communicating with the Kubernetes API server. Defaults to 5.0.

  • --retry-interval-start: The starting value of the exponential backoff for failures. 1 second is used by default.

  • --retry-interval-max: The exponential backoff maximum value. 5 minutes is used by default.

  • --workers <num>: Number of simultaneously running ControllerExpandVolume operations. Default value is 10.

  • --http-endpoint: The TCP network address where the HTTP server for diagnostics, including metrics and the leader election health check, will listen (example: :8080, which corresponds to port 8080 on the local host). The default is an empty string, which means the server is disabled.

  • --metrics-path: The HTTP path where Prometheus metrics will be exposed. Default is /metrics.

  • --handle-volume-inuse-error <true/false>: Enable or disable volume-in-use error handling in the external-resizer. Defaults to true, in which case the resize controller watches all pods in all namespaces to check whether a PVC being expanded is in use by a pod before retrying volume expansion after the CSI driver returns a volume-in-use error. Setting this to false causes the external-resizer to ignore volume-in-use errors and retry volume expansion even if the volume is already in use by a pod and the CSI driver does not support expansion of in-use volumes. If the CSI driver supports online expansion, it may be desirable to set handle-volume-inuse-error to false to save the cost of watching all pods in the cluster.

  • --feature-gates <gate=true|false,...>: A set of comma-separated key/value pairs that describe alpha/experimental features of the external-resizer; see the example after this list.

    • AnnotateFsResize=true|false (ALPHA - default=false): Store the current size of a PVC in the PV's annotations, so that if the PVC is deleted while expansion is pending on the node, its size can be restored to the old value. This permits expansion on the node in case the PVC was deleted while node expansion was pending (but controller expansion had completed). Use of this feature requires Kubernetes 1.21.

    • RecoverVolumeExpansionFailure=true|false (ALPHA - default=false): Allow users to reduce the size of a PVC if expansion to the current size is failing. When this feature gate is enabled and expansion has failed for a PVC, you can retry expansion with a smaller size than the previously requested value: edit .spec.resources for that PVC and choose a value less than the one you previously tried. This is useful if expansion to a higher value did not succeed because of a capacity constraint. If that has happened, or you suspect that it might have, retry expansion by specifying a size within the capacity limits of the underlying storage provider. You can monitor the status of the resize operation by watching .status.resizeStatus and events on the PVC. Use of this feature gate requires Kubernetes 1.28.
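
For example, enabling one of the gates above is just an additional argument on the resizer container (values shown for illustration):

args:
  - "--csi-address=/run/csi/socket"
  - "--feature-gates=RecoverVolumeExpansionFailure=true"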

Other recognized arguments

  • --kubeconfig <path>: Path to the Kubernetes client configuration that the external-resizer uses to connect to the Kubernetes API server. When omitted, the default token provided by Kubernetes will be used. This option is useful only when the external-resizer does not run as a Kubernetes pod, e.g. for debugging. Either this or --master needs to be set if the external-resizer is being run out of cluster.

  • --master <url>: Master URL to build a client config from. When omitted, the default token provided by Kubernetes will be used. This option is useful only when the external-resizer does not run as a Kubernetes pod, e.g. for debugging. Either this or --kubeconfig needs to be set if the external-resizer is being run out of cluster.

  • --metrics-address: (deprecated) The TCP network address where the Prometheus metrics endpoint will run (example: :8080, which corresponds to port 8080 on the local host). The default is an empty string, which means the metrics endpoint is disabled.

  • --version: Prints current external-resizer version and quits.

  • All glog / klog arguments are supported, such as -v <log level> or -alsologtostderr.

HTTP endpoint

The external-resizer optionally exposes an HTTP endpoint at the address:port specified by the --http-endpoint argument. When set, these two paths are exposed:

  • Metrics path, as set by --metrics-path argument (default is /metrics).
  • Leader election health check at /healthz/leader-election. It is recommended to run a liveness probe against this endpoint when leader election is used, to kill an external-resizer leader that fails to connect to the API server to renew its leadership; see kubernetes-csi/csi-lib-utils#66 for details. A probe sketch follows this list.
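
A sketch of such a probe on the resizer container, assuming the sidecar was started with --http-endpoint=:8080 (port and timings illustrative):

livenessProbe:
  httpGet:
    path: /healthz/leader-election
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 20
  failureThreshold: 3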

Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the community page.

You can reach the maintainers of this project at:

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.


external-resizer's Issues

How to resize a volume with filesystem from cloning?

When cloning a volume from an existing PVC, it's allowed to specify a bigger size for the new PVC, which means clone + resize in one step.

If they're both block volumes, it's possible to do the resize work at the end of the CSI CreateVolume call. This is okay.

However, when they're filesystem volumes (volumeMode=Filesystem), I can still do the resize work at the end of CreateVolume, but I can only resize the volume, not the filesystem in it. The filesystem still has the old size. The filesystem expansion should be done by the kubelet calling NodeExpandVolume, but how can I get the kubelet to do that from within the CSI driver code?

For a plain expansion, the external-resizer adds a status condition FileSystemResizePending to the PVC, so the kubelet will call NodeExpandVolume when the PVC is attached to a pod. Is there a way to add this condition when cloning a volume?

I hope the CSI call ControllerExpandVolume can be invoked automatically in such a case.
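
For reference, this is roughly what that condition looks like in the PVC status while a node-side resize is pending (message text illustrative):

status:
  conditions:
    - type: FileSystemResizePending
      status: "True"
      message: Waiting for user to (re-)start a pod to finish file system resize of volume on node.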

P.S. my CSI driver capabilities:

var DefaultControllerServiceCapability = []csi.ControllerServiceCapability_RPC_Type{
	csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
	csi.ControllerServiceCapability_RPC_CREATE_DELETE_SNAPSHOT,
	csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
	csi.ControllerServiceCapability_RPC_CLONE_VOLUME,
}

var DefaultNodeServiceCapability = []csi.NodeServiceCapability_RPC_Type{
	csi.NodeServiceCapability_RPC_STAGE_UNSTAGE_VOLUME,
	csi.NodeServiceCapability_RPC_EXPAND_VOLUME,
	csi.NodeServiceCapability_RPC_GET_VOLUME_STATS,
}

var DefaultPluginCapability = []*csi.PluginCapability{
	{
		Type: &csi.PluginCapability_Service_{
			Service: &csi.PluginCapability_Service{
				Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
			},
		},
	},
	{
		Type: &csi.PluginCapability_VolumeExpansion_{
			VolumeExpansion: &csi.PluginCapability_VolumeExpansion{
				Type: csi.PluginCapability_VolumeExpansion_OFFLINE,
			},
		},
	},
}

resize failed and can't recover due to pvc rejection "Forbidden: field can not be less than previous value "

How to reproduce:

  1. create a pvc with size 1Gi

  2. resize to 10Pi, kubectl edit pvc xxx and update spec.resources.requests.storage to 10Pi

  3. Assume the new requested size 10Pi is too large; the csi-driver and backend storage refuse the update and return an error to the ControllerExpandVolume call

  4. The PVC status changes to Resizing

  5. When you realize 10Pi is too large for the storage backend, try re-editing (kubectl edit pvc xxx) and set a lower value, for example 1Ti

  6. The PVC edit is rejected: spec.resources.requests.storage: Forbidden: field can not be less than previous value

That means we can never update the size again: the CSI driver (storage backend) only accepts a smaller size,
but the PVC doesn't allow resizing to a capacity less than the previous value.
Since the external-resizer has received a failed RPC response, can it do something to recover the PVC,
for example reset the size?

Block volumes are marked with FileSystemResizeRequired

The in-tree controller, after finishing controller expansion, checks not only whether the plugin reports that an FS resize is required but also whether the volume's mode is Filesystem (https://github.com/kubernetes/kubernetes/blob/1dac5fd14a54ac4972339dbe55f9f03688fd7542/pkg/volume/util/operationexecutor/operation_generator.go#L1578) before it marks volumes FileSystemResizeRequired.

The external-resizer should do the same.
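
A minimal sketch of the equivalent check on the external-resizer side (field names from the core/v1 API; the surrounding control flow is illustrative):

package main

import (
	v1 "k8s.io/api/core/v1"
)

// nodeResizeNeeded mirrors the in-tree check: a node-side filesystem resize is
// only required when the plugin asked for one AND the volume is not a raw
// block device (raw block volumes have no filesystem to expand).
func nodeResizeNeeded(pv *v1.PersistentVolume, nodeExpansionRequired bool) bool {
	if !nodeExpansionRequired {
		return false
	}
	if pv.Spec.VolumeMode != nil && *pv.Spec.VolumeMode == v1.PersistentVolumeBlock {
		return false
	}
	return true
}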

This was caught by the e2e test
[Driver: aws] [Testpattern: Dynamic PV (block volmode)(allowExpansion)] volume-expand [It] Verify if offline PVC expansion works

/assign

Respect volume-in-use error when calling ControllerExpand volume

If a CSI driver throws a volume-in-use error when ControllerExpandVolume is called, the external-resizer should not retry expansion until it can verify that the volume is not in use. This will be a best-effort check and will only be performed after a plugin has thrown a volume-in-use error. This is different from enforcing online and offline plugin capabilities.

cc #62

OFFLINE resizing woes

There seems to be a problem with operation ordering in the OFFLINE resize situation.
First of all, I indicate the supported plugin resize capability with PluginCapability_VolumeExpansion_OFFLINE and implement a ControllerExpandVolume method.

First naïve solution

Just hope that the external-resizer somehow understands the state of a given PVC (volume) and calls resize only when the volume is unpublished.

Does not work; it calls ControllerExpandVolume as soon as the PVC in the Kubernetes API is resized.

Return a gRPC error solution

There is an option to send back gRPC error 9 FAILED_PRECONDITION, which should be interpreted by the caller (external-resizer) as "Caller SHOULD ensure that volume is not published and retry with exponential back off". It kind of works, but there is another problem: a Pod may be scheduled and ControllerPublishVolume may be called before the backoff expires. Perhaps there is a way to hold ControllerPublishVolume until the resize completes? But is there a way to know that a volume has a resize pending?
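
For reference, a minimal sketch of the second approach (isPublished is a hypothetical helper; the actual backend resize is elided):

package main

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type controllerServer struct{}

// isPublished is a hypothetical helper that would ask the storage backend
// whether the volume is currently published to any node.
func (s *controllerServer) isPublished(volumeID string) bool { return false }

func (s *controllerServer) ControllerExpandVolume(ctx context.Context, req *csi.ControllerExpandVolumeRequest) (*csi.ControllerExpandVolumeResponse, error) {
	if s.isPublished(req.GetVolumeId()) {
		// gRPC code 9 FAILED_PRECONDITION: the caller should ensure the volume
		// is not published and retry with exponential backoff.
		return nil, status.Errorf(codes.FailedPrecondition,
			"volume %s is still published; offline expansion requires it to be unpublished", req.GetVolumeId())
	}
	// ... perform the actual backend resize here ...
	return &csi.ControllerExpandVolumeResponse{
		CapacityBytes:         req.GetCapacityRange().GetRequiredBytes(),
		NodeExpansionRequired: true,
	}, nil
}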

Multi-arch image support

Hello,

Would like to put in a request for support of multi-arch images (Arm). Current external-resizer images only support x86.

nfs-resizer container crashes with CSI driver neither supports controller resize nor node resize

Hello,

I am using nfs-resizer as part of a manila-csi (1.22.0) and csi-nfs-driver (mcr.microsoft.com/k8s/csi/nfs-csi:latest) configuration.
The manila-csi-openstack-manila-csi-controllerplugin-0 pod is crashing because of the following error thrown by nfs-resizer:

I1116 14:06:34.639310       1 main.go:90] Version : v1.2.0
I1116 14:06:34.645487       1 common.go:111] Probing CSI driver for readiness
F1116 14:06:34.656982       1 main.go:158] CSI driver neither supports controller resize nor node resize
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00013a001, 0xc00053e000, 0x69, 0xa0)
    /workspace/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x27fdaa0, 0xc000000003, 0x0, 0x0, 0xc0005b4000, 0x208b4d3, 0x7, 0x9e, 0x40e000)
    /workspace/vendor/k8s.io/klog/v2/klog.go:970 +0x191
k8s.io/klog/v2.(*loggingT).printDepth(0x27fdaa0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc0000275f0, 0x1, 0x1)
    /workspace/vendor/k8s.io/klog/v2/klog.go:733 +0x16f
k8s.io/klog/v2.(*loggingT).print(...)
    /workspace/vendor/k8s.io/klog/v2/klog.go:715
k8s.io/klog/v2.Fatal(...)
    /workspace/vendor/k8s.io/klog/v2/klog.go:1489
main.main()
    /workspace/cmd/csi-resizer/main.go:158 +0x123d

As far as I can see, csi-nfs-driver does not support resizing, but shouldn't nfs-resizer just ignore the resize requests in this case? Or at least not fail until a resize request is issued?

In summary, I would like to know if anyone was able to make resize work (I've seen some comments that indicate it is possible). If it is not possible, how can I make nfs-resizer not crash?

Many thanks in advance

Handle per-pvc secrets for resizing

Similar to attach/detach operations, we should be able to handle per-PVC secrets for resizing. This will also require an API change in CSIVolumeSource and a corresponding change in external-provisioner to ensure those secrets are set correctly in CSIVolumeSource.
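
For reference, per-PVC expansion secrets are typically wired up via StorageClass parameters along these lines (driver name hypothetical; the template variables resolve per PVC):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-expand-secret
provisioner: example.csi.vendor.io             # hypothetical driver
allowVolumeExpansion: true
parameters:
  csi.storage.k8s.io/controller-expand-secret-name: ${pvc.name}-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ${pvc.namespace}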

Not able to Docker Pull Image

While going through the README, the container image specified in the compatibility section is not working in my environment. When I docker pull this image, I get this error:

$ docker pull k8s.gcr.io/sig-storage/csi-provisioner
Using default tag: latest
Error response from daemon: manifest for k8s.gcr.io/sig-storage/csi-provisioner:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/sig-storage/csi-provisioner/manifests/latest"

while pulling this image works fine:

$ docker pull quay.io/k8scsi/csi-provisioner:canary
canary: Pulling from k8scsi/csi-provisioner
e59bd8947ac7: Pull complete
2ff1188e8e73: Pull complete
Digest: sha256:7af768c615f33eb644ade6ef65c0bda64b0a4411d58dca459c168f812c6dff4f
Status: Downloaded newer image for quay.io/k8scsi/csi-provisioner:canary
quay.io/k8scsi/csi-provisioner:canary

Any suggestions on this?
Thanks

only online volume expansion not supported

What I am doing in my csi driver:

Step 1 - Added the following capabilities in the identity server, assuming that only online expansion will be supported:

csi.GetPluginCapabilitiesResponse{
		Capabilities: []*csi.PluginCapability{
			{
				Type: &csi.PluginCapability_Service_{
					Service: &csi.PluginCapability_Service{
						Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
					},
				},
			},
			{
				Type: &csi.PluginCapability_Service_{
					Service: &csi.PluginCapability_Service{
						Type: csi.PluginCapability_Service_VOLUME_ACCESSIBILITY_CONSTRAINTS,
					},
				},
			},
			{
				Type: &csi.PluginCapability_VolumeExpansion_{
					VolumeExpansion: &csi.PluginCapability_VolumeExpansion{
						Type: csi.PluginCapability_VolumeExpansion_ONLINE,
					},
				},
			},
		},
	}

Step 2 - The following are the other capabilities added in the driver to support volume operations, including expansion:

**ControllerServiceCapability:**
[]csi.ControllerServiceCapability_RPC_Type{
		csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
		csi.ControllerServiceCapability_RPC_PUBLISH_UNPUBLISH_VOLUME,
		csi.ControllerServiceCapability_RPC_LIST_VOLUMES,
		csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
	} 

**NodeServiceCapability:**
[]csi.NodeServiceCapability_RPC_Type{
		csi.NodeServiceCapability_RPC_STAGE_UNSTAGE_VOLUME,
		csi.NodeServiceCapability_RPC_GET_VOLUME_STATS,
		csi.NodeServiceCapability_RPC_EXPAND_VOLUME,
	}

**VolumeCapability:**
[]csi.VolumeCapability_AccessMode_Mode{
		csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
	}

Step 3 - Created a PVC with 10 GB and then tried to expand it from 10 to 20 GB by editing the PVC.

Result: The PVC is not expanded and the following events appear in the PVC describe output:

  Normal   Resizing            49m    external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c
  Normal   Resizing            41m    external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c
  Normal   Resizing            33m    external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c
  Warning  VolumeResizeFailed  25m    external-resizer vpc.block.csi.ibm.io  resize volume "pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c" by resizer "vpc.block.csi.ibm.io" failed: rpc error: code = Unavailable desc = transport is closing
  Normal   Resizing            23m    external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c
  Warning  VolumeResizeFailed  15m    external-resizer vpc.block.csi.ibm.io  resize volume "pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c" by resizer "vpc.block.csi.ibm.io" failed: rpc error: code = Unavailable desc = transport is closing
  Normal   Resizing            12m    external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c
  Normal   Resizing            2m24s  external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c

What is expected:

Offline volume expansion should surface an info/error message to the user that it's not supported, and Kubernetes should not retry.

How to skip file system resize required

When I update a PVC's size, it gets a FileSystemResizePending condition, but I don't need to resize the file system. How can I skip it?


driver:

func NewDriver(nodeID, endpoint string, clusterName string, parentDir string) *yrDriver {
	glog.Infof("Driver: %v version: %v", driverName, version)
	glog.Infof("cluster namespace: %s", clusterName)
	d := &yrDriver{}

	d.endpoint = endpoint

	csiDriver := csicommon.NewCSIDriver(driverName, version, nodeID, replace(clusterName), replace(parentDir))
	csiDriver.AddVolumeCapabilityAccessModes([]csi.VolumeCapability_AccessMode_Mode{
		csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
	})
	csiDriver.AddControllerServiceCapabilities([]csi.ControllerServiceCapability_RPC_Type{
		csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
		csi.ControllerServiceCapability_RPC_PUBLISH_UNPUBLISH_VOLUME,
		csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
	})
	d.csiDriver = csiDriver

	return d
}

identity:

func (ids *identityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error) {
	return &csi.GetPluginCapabilitiesResponse{
		Capabilities: []*csi.PluginCapability{
			{
				Type: &csi.PluginCapability_Service_{
					Service: &csi.PluginCapability_Service{
						Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
					},
				},
			},
			{
				Type: &csi.PluginCapability_VolumeExpansion_{
					VolumeExpansion: &csi.PluginCapability_VolumeExpansion{
						Type: csi.PluginCapability_VolumeExpansion_ONLINE,
					},
				},
			},
		},
	}, nil
}

Resizer may run without initializing

resizeController.Run() waits for the informer caches to sync before spawning goroutines for syncPVCs - https://github.com/kubernetes-csi/external-resizer/blob/master/pkg/controller/controller.go#L250

If the cache syncing errors out, the error message is simply logged and the resizer container never restarts.

root@422f45813bff2a241fdfeda9996a783b [ ~ ]# kubectl logs vsphere-csi-controller-5594766d5b-p6q2z -n kube-system -c csi-resizer -f
I1007 22:13:51.988959       1 main.go:79] Version : v1.0.0-rc2
I1007 22:13:51.995046       1 connection.go:153] Connecting to unix:///csi/csi.sock
I1007 22:13:52.000134       1 common.go:111] Probing CSI driver for readiness
I1007 22:13:52.013139       1 csi_resizer.go:77] CSI driver name: "csi.vsphere.vmware.com"
W1007 22:13:52.013753       1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I1007 22:13:52.022003       1 controller.go:114] Register Pod informer for resizer csi.vsphere.vmware.com
I1007 22:13:52.026512       1 main.go:136] 1
I1007 22:13:52.027627       1 leaderelection.go:243] attempting to acquire leader lease  kube-system/external-resizer-csi-vsphere-vmware-com...
I1007 22:13:52.079012       1 leader_election.go:172] new leader detected, current leader: vsphere-csi-controller-5594766d5b-5vh99
I1007 22:14:11.410440       1 leaderelection.go:253] successfully acquired lease kube-system/external-resizer-csi-vsphere-vmware-com
I1007 22:14:11.410624       1 leader_election.go:172] new leader detected, current leader: vsphere-csi-controller-5594766d5b-p6q2z
I1007 22:14:11.411511       1 leader_election.go:165] became leader, starting
I1007 22:14:11.411683       1 controller.go:238] Starting external resizer csi.vsphere.vmware.com
I1007 22:14:11.612176       1 controller.go:248] Cannot sync pod, pv or pvc caches
I1007 22:14:11.612201       1 controller.go:253] Shutting down external resizer csi.vsphere.vmware.com
^C

root@422f45813bff2a241fdfeda9996a783b [ ~ ]# kubectl get pod -n kube-system
NAME                                      READY   STATUS    RESTARTS   AGE
vsphere-csi-controller-5594766d5b-p6q2z   6/6     Running   0          6m30s
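
A minimal sketch of the kind of fix being suggested, assuming the controller's Run() has access to the informer sync functions (names illustrative): treat a failed cache sync as fatal so the container restarts instead of idling.

package main

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// waitForCachesOrDie aborts the process when the informer caches fail to sync,
// so the kubelet restarts the container instead of leaving it running
// uninitialized. The sync functions would come from the pod/PV/PVC informers.
func waitForCachesOrDie(stopCh <-chan struct{}, syncs ...cache.InformerSynced) {
	if !cache.WaitForCacheSync(stopCh, syncs...) {
		klog.Fatal("Cannot sync pod, pv or pvc caches; exiting")
	}
}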

PVC used by a job doesn't get resize after the pod of the job completed

Summary:
We have a setup in which the external-resizer is used with a storage provider that only supports offline expansion (i.e., only PluginCapability_VolumeExpansion_OFFLINE). We deployed a Job that uses a PVC provisioned by the storage provider. While the Job pod is running, we resize the PVC by modifying spec.resources.requests.storage. The PVC cannot be resized while the pod is running, as expected. However, after the Job pod has completed, the PVC still doesn't get resized; the external-resizer doesn't send the resizing gRPC call to the storage provider. The PVC is stuck in this state forever until we manually delete the Job pod.

Reproduce steps:

  1. Deploy external-resizer together with a storage provider (we use Longhorn)

  2. Don't set the --handle-volume-inuse-error flag for the external-resizer. This means that, by default, the external-resizer will handle volume-in-use errors in the resize controller, link

  3. Deploy a job that uses a PVC as below. The job creates a pod that sleeps for 2 minutes and then completes.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-job-pvc
      namespace: default
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: longhorn
      resources:
        requests:
          storage: 1Gi
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: test-job
      namespace: default
    spec:
      backoffLimit: 1
      template:
        metadata:
          name: test-job
        spec:
          containers:
            - name: test-job
              image: ubuntu:latest
              imagePullPolicy: IfNotPresent
              securityContext:
                privileged: true
              command: ["/bin/sh"]
              args: ["-c", "echo 'sleep for 120s then exit'; sleep 120"]
              volumeMounts:
                - mountPath: /data
                  name: vol
          restartPolicy: OnFailure
          volumes:
            - name: vol
              persistentVolumeClaim:
                claimName: test-job-pvc
    
  4. While the job pod is running, try to expand the PVC by editing spec.resources.requests.storage

  5. Observe that the resizing fails

  6. Wait for the job pod to complete

  7. Observe that the PVC stays stuck in its current state forever. It doesn't get resized because the external-resizer doesn't attempt the gRPC expand call to the storage provider.

Expected Behavior:

Once the job pod has completed, the PVC is no longer considered to be in use. Therefore the external-resizer should attempt the gRPC expand call to the storage provider.

Proposal:
We dug into the source code and saw that:

  1. This check prevents the external-resizer from retrying if the PVC previously hit an in-use error AND it is still in the ctrl.usedPVCs map
  2. The problem is that the PVC is never removed from the ctrl.usedPVCs map when a pod moves to a completed phase; the PVC is only removed when the pod is deleted, link
  3. We think that the logic over here should be changed to handle the case when the pod becomes completed, i.e.:
    func (ctrl *resizeController) updatePod(oldObj, newObj interface{}) {
        pod := parsePod(newObj)
        if pod == nil {
    	    return
        }
        
        if isPodTerminated(pod) {
    	    ctrl.usedPVCs.removePod(pod)
        } else {
    	    ctrl.usedPVCs.addPod(pod)
        }
    }
    

Environment:

  • external-resizer v1.2.0
  • Longhorn v1.2.2

Update Klog to v2

External-resizer is still using klog v1, while the other sidecars are using klog v2, so this should be in sync with the rest of the sidecars. I think it should be upgraded to klog v2.

external-resizer crashing in azuredisk-csi-driver and azurefile-csi-driver

From kubernetes-sigs/azurefile-csi-driver#495

What happened:
After installing azurefile-csi-driver and azuredisk-csi-driver in a Kubernetes cluster, the csi-resizer container, inside the csi-azurefile-controller and csi-azuredisk-controller pods, is crashing every 1 or 2 minutes with the following message:

csi-resizer log:

...
I1211 12:27:26.339777       1 leaderelection.go:283] successfully renewed lease kube-system/external-resizer-file-csi-azure-com
I1211 12:27:31.349381       1 leaderelection.go:283] successfully renewed lease kube-system/external-resizer-file-csi-azure-com
runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed

runtime stack:
runtime.throw(0x15d27f3, 0xc)
    /usr/lib/go-1.14/src/runtime/panic.go:1112 +0x72
runtime.mlockGsignal(0xc000682a80)
    /usr/lib/go-1.14/src/runtime/os_linux_x86.go:72 +0x107
runtime.mpreinit(0xc000079180)
    /usr/lib/go-1.14/src/runtime/os_linux.go:341 +0x78
runtime.mcommoninit(0xc000079180)
    /usr/lib/go-1.14/src/runtime/proc.go:630 +0x108
runtime.allocm(0xc00004f800, 0x1672e98, 0x14f676dd7e26c)
    /usr/lib/go-1.14/src/runtime/proc.go:1390 +0x14e
runtime.newm(0x1672e98, 0xc00004f800)
    /usr/lib/go-1.14/src/runtime/proc.go:1704 +0x39
runtime.startm(0x0, 0xc000103201)
    /usr/lib/go-1.14/src/runtime/proc.go:1869 +0x12a
runtime.wakep(...)
    /usr/lib/go-1.14/src/runtime/proc.go:1953
runtime.resetspinning()
    /usr/lib/go-1.14/src/runtime/proc.go:2415 +0x93
runtime.schedule()
    /usr/lib/go-1.14/src/runtime/proc.go:2527 +0x2de
runtime.park_m(0xc000103200)
    /usr/lib/go-1.14/src/runtime/proc.go:2690 +0x9d
runtime.mcall(0x0)
    /usr/lib/go-1.14/src/runtime/asm_amd64.s:318 +0x5b

goroutine 1 [select, 2 minutes]:
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1.1(0x13763e0, 0x0, 0xc0000eee40)
...

What you expected to happen:
The container should not fail so frequently.

How to reproduce it:
The failure started right after installing v0.7.0 of azurefile-csi-driver. I upgraded to v0.9.0 (for both azurefile and azuredisk) with the same results. The Kubernetes cluster is composed of 3 master nodes and 3 workers running on Azure VMs (not AKS).

Anything else we need to know?:
Found a couple of issues in the golang/go repository that seem to be related.

Possibly upgrading the Go version from 1.14 to 1.15 will solve the problem.

Environment:

  • CSI Driver version: v0.7.0 and v0.9.0
  • Kubernetes version (use kubectl version): v1.19.14
  • OS (e.g. from /etc/os-release): Ubuntu v20.04.1 LTS
  • Kernel (e.g. uname -a): 5.4.0-1032-azure #33-Ubuntu SMP Fri Nov 13 14:23:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: Helm v3.4.2
  • Others:
    • Master node size: Standard D2ds_v4 (2 vcpus, 8 GiB memory)
    • Worker node size: Standard D16ds_v4 (16 vcpus, 64 GiB memory)

Complete log file: csi-resizer.log

csi-resizer logs a warning every time a new PVC is created

Starting with v1.2.0, csi-resizer outputs the following warning message every time a new PVC is created.

I0816 07:50:11.380497       1 controller.go:291] Started PVC processing "default/sample-volume-jtdql"
W0816 07:50:11.380594       1 controller.go:318] PV "" bound to PVC default/sample-volume-jtdql not found

Until v1.1.0, it output the following info-level messages when a new PVC was created.

I0816 07:24:52.092956       1 controller.go:281] Started PVC processing "default/sample-volume-fffx6"
I0816 07:24:52.092986       1 controller.go:304] No need to resize PVC "default/sample-volume-fffx6"

This is a side effect of this commit.
The case was handled by this block until v1.1.0, but it has been handled here since v1.2.0.

I think the log level should be changed to info, or an empty check of pvc.Spec.VolumeName should be added before this point.
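
A minimal sketch of the suggested guard (placement and surrounding control flow are illustrative):

package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// shouldProcessPVC skips PVCs that are not bound yet: an empty VolumeName just
// means provisioning has not finished, not that a PV went missing, so this
// case should not be logged as a warning.
func shouldProcessPVC(pvc *v1.PersistentVolumeClaim) bool {
	if pvc.Spec.VolumeName == "" {
		klog.V(4).Infof("PVC %s/%s is not bound to a PV yet, skipping", pvc.Namespace, pvc.Name)
		return false
	}
	return true
}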

Resizer sidecar doesn't consistently increase the time interval during exponential backoff on errors

csi-resizer v0.5 retries ControllerExpandVolume after random intervals of time instead of an increasing interval during exponential backoff on an error.
Refer to the event logs I got when I tried to perform online expansion while the CSI driver supports offline expansion only: https://gist.github.com/shalini-b/e4c8256cd46eb096e0efc3ae322ba644. The time logged for each event is not consistently increasing.

Steps to reproduce -

  1. Use a CSI driver which uses csi-resizer v0.5.0
  2. Perform any operation which raises an error in ControllerExpandVolume call. For example, try executing an online volume expansion in a driver which only supports offline expansion. This will ensure csi-resizer keeps retrying ControllerExpandVolume till it succeeds.
  3. Check the events logged in csi-resizer while it is retrying with exponential backoff.
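
For context, the behavior one would expect comes from the standard client-go exponential-failure rate limiter, roughly as in this sketch (queue name illustrative): each item's requeue delay should double from the base delay up to the cap rather than vary randomly.

package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newResizeQueue returns a work queue whose per-item retry delay doubles from
// baseDelay up to maxDelay, matching the --retry-interval-start and
// --retry-interval-max semantics described in the README.
func newResizeQueue(baseDelay, maxDelay time.Duration) workqueue.RateLimitingInterface {
	return workqueue.NewNamedRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(baseDelay, maxDelay),
		"resize-pvc")
}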

VolumeExpansion ONLINE/OFFLINE capability is not checked

When a CSI driver reports only PluginCapability_VolumeExpansion_OFFLINE, the external-resizer still calls ControllerExpandVolume when the PVC is used by a Pod.

Instead, the external-resizer should check the VolumeExpansion plugin capability and only invoke ControllerExpandVolume when PluginCapability_VolumeExpansion_ONLINE is supported by the CSI driver.
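
A minimal sketch of the capability check being asked for (response plumbing elided):

package main

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

// supportsOnlineExpansion reports whether the plugin advertised ONLINE volume
// expansion in its GetPluginCapabilities response; only then should
// ControllerExpandVolume be called for a volume that is in use by a pod.
func supportsOnlineExpansion(caps []*csi.PluginCapability) bool {
	for _, c := range caps {
		exp := c.GetVolumeExpansion()
		if exp != nil && exp.GetType() == csi.PluginCapability_VolumeExpansion_ONLINE {
			return true
		}
	}
	return false
}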

Add resizing support in csi-mock driver

  • Add general resize support
  • Add configurable support for node expansion
  • Add configurable support for controller expansion
  • Add configuration support for online/offline expansion

Problem with volumes that support node only expansion

It looks like if a CSI volume supports node-only expansion and does not have the controller EXPAND_VOLUME capability, this resize controller will exit.

But the problem is that in that case the PV will never be updated, and node-side expansion currently expects the PV to have the correct size before node-side expansion is called. Also, even if node-side expansion handled updating the PV, there would be resistance to updating the PV from the node.

I am thinking this controller should not exit if a CSI volume does not support control-plane volume expansion, but should instead do a no-op expansion and update the PV regardless. This is how flexvolume resizing has been implemented.

cc @mlmhl @msau42 @chakri-nelluri

csi-resizer:v1.0.0 image having VA issues

The following are the vulnerability assessment (VA) issues in the csi-resizer:v1.0.0 image.

Vulnerable packages found:

  • DLA-2542-1 (Policy Status: Active): affected package tzdata; upgrade tzdata to >= 2021a-0+deb9u1
  • DLA-2509-1 (Policy Status: Active): affected package tzdata; upgrade tzdata to >= 2021a-0+deb9u1

Hostpath PV is resized but PVC is not when resizing offline

When trying to expand a volume that is offline using the CSI hostpath driver, the PV is expanded but the PVC is not. There is probably a bug in the CSI hostpath driver implementation, but the external-resizer should do the check and prevent this from happening.

How to reproduce:

  1. I deployed csi-driver-host-path for k8s 1.14 using the script in the following location.
    https://github.com/kubernetes-csi/csi-driver-host-path/tree/master/deploy/kubernetes-1.14

  2. kubectl create -f https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/examples/csi-storageclass.yaml

  3. kubectl apply -f https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/examples/csi-pvc.yaml

  4. kubectl apply -f https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/examples/csi-app.yaml

  5. Modified csi-pvc.yaml to increase the PVC size to 2Gi.
    external-resizer worked properly and expanded the online PVC/PV size to 2Gi.

  6. Deleted the pod created using csi-app.yaml. PVC is now offline.

  7. Modified csi-pvc.yaml to change the PVC size to 3Gi.

  8. PV was increased to 3Gi, but PVC remained at 2Gi.

root@k8s_ubuntu1:~/go/src/github.com/kubernetes-csi/csi-driver-host-path/examples# kubectl get pvc
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
csi-pvc   Bound    pvc-67c942df-df41-11e9-a872-000c296bb855   2Gi        RWO            csi-hostpath-sc   15h
root@k8s_ubuntu1:~/go/src/github.com/kubernetes-csi/csi-driver-host-path/examples# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS      REASON   AGE
pvc-67c942df-df41-11e9-a872-000c296bb855   3Gi        RWO            Delete           Bound    default/csi-pvc   csi-hostpath-sc            15h

Not able to pull v1.3.0 release image

$ docker pull k8s.gcr.io/sig-storage/csi-resizer:v1.3.0
Error response from daemon: manifest for k8s.gcr.io/sig-storage/csi-resizer:v1.3.0 not found: manifest unknown: Failed to fetch "v1.3.0" from request "/v2/sig-storage/csi-resizer/manifests/v1.3.0".

Not sure; it may just be a sync delay, though.

cc @gnufied
