
external-resizer's Introduction

CSI Resizer

The CSI external-resizer is a sidecar container that watches the Kubernetes API server for PersistentVolumeClaim updates and triggers ControllerExpandVolume operations against a CSI endpoint when a user requests more storage on a PersistentVolumeClaim object.

Overview

A storage provider that allows volume expansion after creation may choose to implement it either via a control-plane CSI RPC call, via a node CSI RPC call, or via both as a two-step process. The external-resizer is an external controller that watches the Kubernetes API server for PersistentVolumeClaim modifications and triggers CSI calls for control-plane volume expansion. More details can be found in the CSI Volume expansion documentation.

Compatibility

This information reflects the head of this branch.

  • Compatible with CSI Version: CSI Spec v1.5.0
  • Container Image: k8s.gcr.io/sig-storage/csi-resizer
  • Min K8s Version: 1.16
  • Recommended K8s Version: 1.28

Feature status

Various external-resizer releases come with different alpha / beta features.

The following table reflects the head of this branch.

  • VolumeExpansion (Stable, on by default): Support for expanding CSI volumes.
  • ReadWriteOncePod (Stable, on by default): Single pod access mode for PersistentVolumes.
  • VolumeAttributesClass (Alpha, off by default): Volume Attributes Classes.

Usage

It is necessary to create a new service account and give it enough privileges to run the external-resizer; see deploy/kubernetes/rbac.yaml. The resizer is then deployed as a single Deployment as illustrated below:

kubectl create -f deploy/kubernetes/deployment.yaml

The external-resizer may run in the same pod with other external CSI controllers such as the external-attacher, external-snapshotter and/or external-provisioner.
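
A minimal sketch of that wiring, with the resizer sharing the CSI driver's socket over an emptyDir volume (names, image tag, and socket path are illustrative; deploy/kubernetes/deployment.yaml is the canonical manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-controller
spec:
  replicas: 1
  selector:
    matchLabels:
      app: csi-controller
  template:
    metadata:
      labels:
        app: csi-controller
    spec:
      serviceAccountName: csi-resizer          # bound to the rules in rbac.yaml
      containers:
        - name: csi-resizer
          image: k8s.gcr.io/sig-storage/csi-resizer:v1.9.0   # tag illustrative
          args:
            - "--csi-address=/run/csi/socket"
            - "--leader-election"
          volumeMounts:
            - name: socket-dir
              mountPath: /run/csi
        # - name: my-csi-driver ...            # the CSI driver container sharing the socket
      volumes:
        - name: socket-dir
          emptyDir: {}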

Note that the external-resizer does not scale with more replicas. Only one external-resizer is elected as leader and running; the others wait for the leader to die and elect a new active leader within ~15 seconds after the death of the old leader.

Command line options

Recommended optional arguments

  • --csi-address <path to CSI socket>: This is the path to the CSI driver socket inside the pod that the external-resizer container will use to issue CSI operations (/run/csi/socket is used by default).

  • --leader-election: Enables leader election. This is mandatory when there are multiple replicas of the same external-resizer running for one CSI driver. Only one of them may be active (=leader). A new leader will be elected when the current leader dies or becomes unresponsive for ~15 seconds.

  • --leader-election-namespace: Namespace where the leader election resource lives. Defaults to the pod namespace if not set.

  • --leader-election-lease-duration <duration>: Duration, in seconds, that non-leader candidates will wait to force acquire leadership. Defaults to 15 seconds.

  • --leader-election-renew-deadline <duration>: Duration, in seconds, that the acting leader will retry refreshing leadership before giving up. Defaults to 10 seconds.

  • --leader-election-retry-period <duration>: Duration, in seconds, the LeaderElector clients should wait between tries of actions. Defaults to 5 seconds.

  • --timeout <duration>: Timeout of all calls to the CSI driver. It should be set to a value that accommodates the majority of ControllerExpandVolume calls. 10 seconds is used by default.

  • --kube-api-burst <int>: Burst to use while communicating with the Kubernetes API server. Defaults to 10.

  • --kube-api-qps <float>: QPS to use while communicating with the Kubernetes API server. Defaults to 5.0.

  • --retry-interval-start: The starting value of the exponential backoff for failures. 1 second is used by default.

  • --retry-interval-max: The exponential backoff maximum value. 5 minutes is used by default.

  • --workers <num>: Number of simultaneously running ControllerExpandVolume operations. Default value is 10.

  • --http-endpoint: The TCP network address where the HTTP server for diagnostics, including metrics and the leader election health check, will listen (example: :8080, which corresponds to port 8080 on the local host). The default is an empty string, which means the server is disabled.

  • --metrics-path: The HTTP path where Prometheus metrics will be exposed. Default is /metrics.

  • --handle-volume-inuse-error <true/false>: Enable or disable volume-in-use error handling in the external-resizer. Defaults to true, in which case the resize controller watches all pods in all namespaces to check whether a PVC being expanded is in use by a pod before retrying volume expansion after the CSI driver returns a volume-in-use error. Setting this to false causes the external-resizer to ignore volume-in-use errors and retry volume expansion even if the volume is already in use by a pod and the CSI driver does not support expansion of in-use volumes. If the CSI driver supports online expansion, it may be desirable to set handle-volume-inuse-error to false to save the cost of watching all pods in the cluster.

  • --feature-gates <gate=true|false,...>: A set of comma-separated key/value pairs that describe alpha/experimental features of the external-resizer; see the example after this list.

    • AnnotateFsResize=true|false (ALPHA - default=false): Store the current size of a PVC in the PV's annotations, so that if the PVC is deleted while expansion is pending on the node, its size can be restored to the old value. This permits expansion on the node in case the PVC was deleted while node expansion was pending (but controller expansion had completed). Use of this feature requires Kubernetes 1.21.

    • RecoverVolumeExpansionFailure=true|false (ALPHA - default=false): Allow users to reduce the size of a PVC if expansion to the current size is failing. When this feature gate is enabled and expansion has failed for a PVC, you can retry expansion with a smaller size than the previously requested value: edit .spec.resources for that PVC and choose a value less than the one you previously tried. This is useful if expansion to a higher value did not succeed because of a capacity constraint. If that has happened, or you suspect that it might have, retry expansion by specifying a size within the capacity limits of the underlying storage provider. You can monitor the status of the resize operation by watching .status.resizeStatus and events on the PVC. Use of this feature gate requires Kubernetes 1.28.
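
For example, enabling one of the gates above is just an additional argument on the resizer container (values shown for illustration):

args:
  - "--csi-address=/run/csi/socket"
  - "--feature-gates=RecoverVolumeExpansionFailure=true"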

Other recognized arguments

  • --kubeconfig <path>: Path to the Kubernetes client configuration that the external-resizer uses to connect to the Kubernetes API server. When omitted, the default token provided by Kubernetes will be used. This option is useful only when the external-resizer does not run as a Kubernetes pod, e.g. for debugging. Either this or --master needs to be set if the external-resizer is being run out of cluster.

  • --master <url>: Master URL to build a client config from. When omitted, the default token provided by Kubernetes will be used. This option is useful only when the external-resizer does not run as a Kubernetes pod, e.g. for debugging. Either this or --kubeconfig needs to be set if the external-resizer is being run out of cluster.

  • --metrics-address: (deprecated) The TCP network address where the Prometheus metrics endpoint will run (example: :8080, which corresponds to port 8080 on the local host). The default is an empty string, which means the metrics endpoint is disabled.

  • --version: Prints current external-resizer version and quits.

  • All glog / klog arguments are supported, such as -v <log level> or -alsologtostderr.

HTTP endpoint

The external-resizer optionally exposes an HTTP endpoint at the address:port specified by the --http-endpoint argument. When set, these two paths are exposed:

  • Metrics path, as set by --metrics-path argument (default is /metrics).
  • Leader election health check at /healthz/leader-election. It is recommended to run a liveness probe against this endpoint when leader election is used, to kill an external-resizer leader that fails to connect to the API server to renew its leadership; see kubernetes-csi/csi-lib-utils#66 for details. A probe sketch follows this list.
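
A sketch of such a probe on the resizer container, assuming the sidecar was started with --http-endpoint=:8080 (port and timings illustrative):

livenessProbe:
  httpGet:
    path: /healthz/leader-election
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 20
  failureThreshold: 3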

Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the community page.

You can reach the maintainers of this project at:

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.


external-resizer's Issues

How to resize a volume with filesystem from cloning?

When cloning a volume from an existing PVC, it's allowed to specify a bigger size for the new PVC, which means clone + resize in one step.

If they're both block volumes, it's possible to do the resize work at the end of the CSI CreateVolume call. This is okay.

However, when they're filesystem volumes (volumeMode=Filesystem), I can still do the resize work at the end of CreateVolume, but I can only resize the volume, not the filesystem in it. The filesystem still has the old size. The filesystem expansion should be done by the kubelet calling NodeExpandVolume, but how can I get the kubelet to do that from within the CSI driver code?

For a plain expansion, the external-resizer adds a status condition FileSystemResizePending to the PVC, so the kubelet will call NodeExpandVolume when the PVC is attached to a pod. Is there a way to add this condition when cloning a volume?

I hope the CSI call ControllerExpandVolume can be invoked automatically in such a case.
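
For reference, this is roughly what that condition looks like in the PVC status while a node-side resize is pending (message text illustrative):

status:
  conditions:
    - type: FileSystemResizePending
      status: "True"
      message: Waiting for user to (re-)start a pod to finish file system resize of volume on node.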

P.S. my CSI driver capabilities:

var DefaultControllerServiceCapability = []csi.ControllerServiceCapability_RPC_Type{
	csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
	csi.ControllerServiceCapability_RPC_CREATE_DELETE_SNAPSHOT,
	csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
	csi.ControllerServiceCapability_RPC_CLONE_VOLUME,
}

var DefaultNodeServiceCapability = []csi.NodeServiceCapability_RPC_Type{
	csi.NodeServiceCapability_RPC_STAGE_UNSTAGE_VOLUME,
	csi.NodeServiceCapability_RPC_EXPAND_VOLUME,
	csi.NodeServiceCapability_RPC_GET_VOLUME_STATS,
}

var DefaultPluginCapability = []*csi.PluginCapability{
	{
		Type: &csi.PluginCapability_Service_{
			Service: &csi.PluginCapability_Service{
				Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
			},
		},
	},
	{
		Type: &csi.PluginCapability_VolumeExpansion_{
			VolumeExpansion: &csi.PluginCapability_VolumeExpansion{
				Type: csi.PluginCapability_VolumeExpansion_OFFLINE,
			},
		},
	},
}

resize failed and can't recover due to pvc rejection "Forbidden: field can not be less than previous value "

How to reproduce:

  1. create a pvc with size 1Gi

  2. resize to 10Pi, kubectl edit pvc xxx and update spec.resources.requests.storage to 10Pi

  3. Assume the new requested size 10Pi is too large; the csi-driver and backend storage refuse the update and return an error to the ControllerExpandVolume call

  4. The PVC status changes to Resizing

  5. When you realize 10Pi is too large for the storage backend, try re-editing (kubectl edit pvc xxx) and set a lower value, for example 1Ti

  6. The PVC edit is rejected: spec.resources.requests.storage: Forbidden: field can not be less than previous value

That means we can never update the size again: the CSI driver (storage backend) only accepts a smaller size,
but the PVC doesn't allow resizing to a capacity less than the previous value.
Since the external-resizer has received a failed RPC response, can it do something to recover the PVC,
for example reset the size?

Block volumes are marked with FileSystemResizeRequired

The in-tree controller, after finishing controller expansion, checks not only whether the plugin reports that an FS resize is required but also whether the volume's mode is Filesystem (https://github.com/kubernetes/kubernetes/blob/1dac5fd14a54ac4972339dbe55f9f03688fd7542/pkg/volume/util/operationexecutor/operation_generator.go#L1578) before it marks volumes FileSystemResizeRequired.

The external-resizer should do the same.
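
A minimal sketch of the equivalent check on the external-resizer side (field names from the core/v1 API; the surrounding control flow is illustrative):

package main

import (
	v1 "k8s.io/api/core/v1"
)

// nodeResizeNeeded mirrors the in-tree check: a node-side filesystem resize is
// only required when the plugin asked for one AND the volume is not a raw
// block device (raw block volumes have no filesystem to expand).
func nodeResizeNeeded(pv *v1.PersistentVolume, nodeExpansionRequired bool) bool {
	if !nodeExpansionRequired {
		return false
	}
	if pv.Spec.VolumeMode != nil && *pv.Spec.VolumeMode == v1.PersistentVolumeBlock {
		return false
	}
	return true
}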

This was caught by the e2e test
[Driver: aws] [Testpattern: Dynamic PV (block volmode)(allowExpansion)] volume-expand [It] Verify if offline PVC expansion works

/assign

Respect volume-in-use error when calling ControllerExpand volume

If a CSI driver throws a volume-in-use error when ControllerExpandVolume is called, the external-resizer should not retry expansion until it can verify that the volume is not in use. This will be a best-effort check and will only be performed after a plugin has thrown a volume-in-use error. This is different from enforcing online and offline plugin capabilities.

cc #62

OFFLINE resizing woes

There seems to be a problem with operation ordering in the OFFLINE resize situation.
First of all, I indicate the supported plugin resize capability with PluginCapability_VolumeExpansion_OFFLINE and implement a ControllerExpandVolume method.

First naïve solution

Just hope that the external-resizer somehow understands the state of a given PVC (volume) and calls resize only when the volume is unpublished.

Does not work; it calls ControllerExpandVolume as soon as the PVC in the Kubernetes API is resized.

Return a gRPC error solution

There is an option to send back gRPC error 9 FAILED_PRECONDITION, which should be interpreted by the caller (external-resizer) as "Caller SHOULD ensure that volume is not published and retry with exponential back off". It kind of works, but there is another problem: a Pod may be scheduled and ControllerPublishVolume may be called before the backoff expires. Perhaps there is a way to hold ControllerPublishVolume until the resize completes? But is there a way to know that a volume has a resize pending?
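
For reference, a minimal sketch of the second approach (isPublished is a hypothetical helper; the actual backend resize is elided):

package main

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type controllerServer struct{}

// isPublished is a hypothetical helper that would ask the storage backend
// whether the volume is currently published to any node.
func (s *controllerServer) isPublished(volumeID string) bool { return false }

func (s *controllerServer) ControllerExpandVolume(ctx context.Context, req *csi.ControllerExpandVolumeRequest) (*csi.ControllerExpandVolumeResponse, error) {
	if s.isPublished(req.GetVolumeId()) {
		// gRPC code 9 FAILED_PRECONDITION: the caller should ensure the volume
		// is not published and retry with exponential backoff.
		return nil, status.Errorf(codes.FailedPrecondition,
			"volume %s is still published; offline expansion requires it to be unpublished", req.GetVolumeId())
	}
	// ... perform the actual backend resize here ...
	return &csi.ControllerExpandVolumeResponse{
		CapacityBytes:         req.GetCapacityRange().GetRequiredBytes(),
		NodeExpansionRequired: true,
	}, nil
}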

Multi-arch image support

Hello,

Would like to put in a request for support of multi-arch images (Arm). Current external-resizer images only support x86.

nfs-resizer container crashes with CSI driver neither supports controller resize nor node resize

Hello,

I am using nfs-resizer as part of a manila-csi (1.22.0) and csi-nfs-driver (mcr.microsoft.com/k8s/csi/nfs-csi:latest) configuration.
The manila-csi-openstack-manila-csi-controllerplugin-0 pod is crashing because of the following error thrown by nfs-resizer:

I1116 14:06:34.639310       1 main.go:90] Version : v1.2.0
I1116 14:06:34.645487       1 common.go:111] Probing CSI driver for readiness
F1116 14:06:34.656982       1 main.go:158] CSI driver neither supports controller resize nor node resize
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00013a001, 0xc00053e000, 0x69, 0xa0)
    /workspace/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x27fdaa0, 0xc000000003, 0x0, 0x0, 0xc0005b4000, 0x208b4d3, 0x7, 0x9e, 0x40e000)
    /workspace/vendor/k8s.io/klog/v2/klog.go:970 +0x191
k8s.io/klog/v2.(*loggingT).printDepth(0x27fdaa0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc0000275f0, 0x1, 0x1)
    /workspace/vendor/k8s.io/klog/v2/klog.go:733 +0x16f
k8s.io/klog/v2.(*loggingT).print(...)
    /workspace/vendor/k8s.io/klog/v2/klog.go:715
k8s.io/klog/v2.Fatal(...)
    /workspace/vendor/k8s.io/klog/v2/klog.go:1489
main.main()
    /workspace/cmd/csi-resizer/main.go:158 +0x123d

As far as I can see, csi-nfs-driver does not support resizing, but shouldn't nfs-resizer just ignore the resize requests in this case? Or at least not fail until a resize request is issued?

In summary, I would like to know if anyone was able to make resize work (I've seen some comments that indicate it is possible). If it is not possible, how can I make nfs-resizer not crash?

Many thanks in advance

Handle per-pvc secrets for resizing

Similar to attach/detach operations, we should be able to handle per-PVC secrets for resizing. This will also require an API change in CSIVolumeSource and a corresponding change in external-provisioner to ensure those secrets are set correctly in CSIVolumeSource.
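
For reference, per-PVC expansion secrets are typically wired up via StorageClass parameters along these lines (driver name hypothetical; the template variables resolve per PVC):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-expand-secret
provisioner: example.csi.vendor.io             # hypothetical driver
allowVolumeExpansion: true
parameters:
  csi.storage.k8s.io/controller-expand-secret-name: ${pvc.name}-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ${pvc.namespace}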

Not able to Docker Pull Image

While going through the README, the container image specified in the compatibility section is not working in my environment. When I docker pull this image, I get this error:

$ docker pull k8s.gcr.io/sig-storage/csi-provisioner
Using default tag: latest
Error response from daemon: manifest for k8s.gcr.io/sig-storage/csi-provisioner:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/sig-storage/csi-provisioner/manifests/latest"

while pulling this image works fine:

$ docker pull quay.io/k8scsi/csi-provisioner:canary
canary: Pulling from k8scsi/csi-provisioner
e59bd8947ac7: Pull complete
2ff1188e8e73: Pull complete
Digest: sha256:7af768c615f33eb644ade6ef65c0bda64b0a4411d58dca459c168f812c6dff4f
Status: Downloaded newer image for quay.io/k8scsi/csi-provisioner:canary
quay.io/k8scsi/csi-provisioner:canary

Any suggestions on this?
Thanks

only online volume expansion not supported

What I am doing in my csi driver:

Step 1 - Added the following capabilities in the identity server, assuming that only online expansion will be supported:

csi.GetPluginCapabilitiesResponse{
		Capabilities: []*csi.PluginCapability{
			{
				Type: &csi.PluginCapability_Service_{
					Service: &csi.PluginCapability_Service{
						Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
					},
				},
			},
			{
				Type: &csi.PluginCapability_Service_{
					Service: &csi.PluginCapability_Service{
						Type: csi.PluginCapability_Service_VOLUME_ACCESSIBILITY_CONSTRAINTS,
					},
				},
			},
			{
				Type: &csi.PluginCapability_VolumeExpansion_{
					VolumeExpansion: &csi.PluginCapability_VolumeExpansion{
						Type: csi.PluginCapability_VolumeExpansion_ONLINE,
					},
				},
			},
		},
	}

Step 2 - The following are the other capabilities added in the driver to support volume operations, including expansion:

**ControllerServiceCapability:**
[]csi.ControllerServiceCapability_RPC_Type{
		csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
		csi.ControllerServiceCapability_RPC_PUBLISH_UNPUBLISH_VOLUME,
		csi.ControllerServiceCapability_RPC_LIST_VOLUMES,
		csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
	} 

**NodeServiceCapability:**
[]csi.NodeServiceCapability_RPC_Type{
		csi.NodeServiceCapability_RPC_STAGE_UNSTAGE_VOLUME,
		csi.NodeServiceCapability_RPC_GET_VOLUME_STATS,
		csi.NodeServiceCapability_RPC_EXPAND_VOLUME,
	}

**VolumeCapability:**
[]csi.VolumeCapability_AccessMode_Mode{
		csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
	}

Step 3 - Created a PVC with 10 GB and then tried to expand it from 10 to 20 GB by editing the PVC.

Result: The PVC is not expanded and the following events appear in the PVC describe output:

  Normal   Resizing            49m    external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c
  Normal   Resizing            41m    external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c
  Normal   Resizing            33m    external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c
  Warning  VolumeResizeFailed  25m    external-resizer vpc.block.csi.ibm.io  resize volume "pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c" by resizer "vpc.block.csi.ibm.io" failed: rpc error: code = Unavailable desc = transport is closing
  Normal   Resizing            23m    external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c
  Warning  VolumeResizeFailed  15m    external-resizer vpc.block.csi.ibm.io  resize volume "pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c" by resizer "vpc.block.csi.ibm.io" failed: rpc error: code = Unavailable desc = transport is closing
  Normal   Resizing            12m    external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c
  Normal   Resizing            2m24s  external-resizer vpc.block.csi.ibm.io  External resizer is resizing volume pvc-b61e2c37-1c02-47c3-90ce-5d0c9ad8a69c

What is expected:

Offline volume expansion should surface an info/error message to the user that it's not supported, and Kubernetes should not retry.

How to skip file system resize required

When I update a PVC's size, it gets a FileSystemResizePending condition, but I don't need to resize the file system. How can I skip it?


driver:

func NewDriver(nodeID, endpoint string, clusterName string, parentDir string) *yrDriver {
	glog.Infof("Driver: %v version: %v", driverName, version)
	glog.Infof("cluster namespace: %s", clusterName)
	d := &yrDriver{}

	d.endpoint = endpoint

	csiDriver := csicommon.NewCSIDriver(driverName, version, nodeID, replace(clusterName), replace(parentDir))
	csiDriver.AddVolumeCapabilityAccessModes([]csi.VolumeCapability_AccessMode_Mode{
		csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
	})
	csiDriver.AddControllerServiceCapabilities([]csi.ControllerServiceCapability_RPC_Type{
		csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
		csi.ControllerServiceCapability_RPC_PUBLISH_UNPUBLISH_VOLUME,
		csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
	})
	d.csiDriver = csiDriver

	return d
}

identity:

func (ids *identityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error) {
	return &csi.GetPluginCapabilitiesResponse{
		Capabilities: []*csi.PluginCapability{
			{
				Type: &csi.PluginCapability_Service_{
					Service: &csi.PluginCapability_Service{
						Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
					},
				},
			},
			{
				Type: &csi.PluginCapability_VolumeExpansion_{
					VolumeExpansion: &csi.PluginCapability_VolumeExpansion{
						Type: csi.PluginCapability_VolumeExpansion_ONLINE,
					},
				},
			},
		},
	}, nil
}

Resizer may run without initializing

resizeController.Run() waits for the informer caches to sync before spawning goroutines for syncPVCs - https://github.com/kubernetes-csi/external-resizer/blob/master/pkg/controller/controller.go#L250

If the cache syncing errors out, the error message is simply logged and the resizer container never restarts.

root@422f45813bff2a241fdfeda9996a783b [ ~ ]# kubectl logs vsphere-csi-controller-5594766d5b-p6q2z -n kube-system -c csi-resizer -f
I1007 22:13:51.988959       1 main.go:79] Version : v1.0.0-rc2
I1007 22:13:51.995046       1 connection.go:153] Connecting to unix:///csi/csi.sock
I1007 22:13:52.000134       1 common.go:111] Probing CSI driver for readiness
I1007 22:13:52.013139       1 csi_resizer.go:77] CSI driver name: "csi.vsphere.vmware.com"
W1007 22:13:52.013753       1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I1007 22:13:52.022003       1 controller.go:114] Register Pod informer for resizer csi.vsphere.vmware.com
I1007 22:13:52.026512       1 main.go:136] 1
I1007 22:13:52.027627       1 leaderelection.go:243] attempting to acquire leader lease  kube-system/external-resizer-csi-vsphere-vmware-com...
I1007 22:13:52.079012       1 leader_election.go:172] new leader detected, current leader: vsphere-csi-controller-5594766d5b-5vh99
I1007 22:14:11.410440       1 leaderelection.go:253] successfully acquired lease kube-system/external-resizer-csi-vsphere-vmware-com
I1007 22:14:11.410624       1 leader_election.go:172] new leader detected, current leader: vsphere-csi-controller-5594766d5b-p6q2z
I1007 22:14:11.411511       1 leader_election.go:165] became leader, starting
I1007 22:14:11.411683       1 controller.go:238] Starting external resizer csi.vsphere.vmware.com
I1007 22:14:11.612176       1 controller.go:248] Cannot sync pod, pv or pvc caches
I1007 22:14:11.612201       1 controller.go:253] Shutting down external resizer csi.vsphere.vmware.com
^C

root@422f45813bff2a241fdfeda9996a783b [ ~ ]# kubectl get pod -n kube-system
NAME                                      READY   STATUS    RESTARTS   AGE
vsphere-csi-controller-5594766d5b-p6q2z   6/6     Running   0          6m30s
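
A minimal sketch of the kind of fix being suggested, assuming the controller's Run() has access to the informer sync functions (names illustrative): treat a failed cache sync as fatal so the container restarts instead of idling.

package main

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// waitForCachesOrDie aborts the process when the informer caches fail to sync,
// so the kubelet restarts the container instead of leaving it running
// uninitialized. The sync functions would come from the pod/PV/PVC informers.
func waitForCachesOrDie(stopCh <-chan struct{}, syncs ...cache.InformerSynced) {
	if !cache.WaitForCacheSync(stopCh, syncs...) {
		klog.Fatal("Cannot sync pod, pv or pvc caches; exiting")
	}
}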

PVC used by a job doesn't get resize after the pod of the job completed

Summary:
We have a setup in which the external-resizer is used with a storage provider that only supports offline expansion (i.e., only PluginCapability_VolumeExpansion_OFFLINE). We deployed a Job that uses a PVC provisioned by the storage provider. While the Job pod is running, we resize the PVC by modifying spec.resources.requests.storage. The PVC cannot be resized while the pod is running, as expected. However, after the Job pod has completed, the PVC still doesn't get resized; the external-resizer doesn't send the resizing gRPC call to the storage provider. The PVC is stuck in this state forever until we manually delete the Job pod.

Reproduce steps:

  1. Deploy external-resizer together with a storage provider (we use Longhorn)

  2. Don't set the --handle-volume-inuse-error flag for the external-resizer. This means that, by default, the external-resizer will handle volume-in-use errors in the resize controller, link

  3. Deploy a job that uses a PVC as below. The job creates a pod that sleeps for 2 minutes and then completes.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-job-pvc
      namespace: default
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: longhorn
      resources:
        requests:
          storage: 1Gi
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: test-job
      namespace: default
    spec:
      backoffLimit: 1
      template:
        metadata:
          name: test-job
        spec:
          containers:
            - name: test-job
              image: ubuntu:latest
              imagePullPolicy: IfNotPresent
              securityContext:
                privileged: true
              command: ["/bin/sh"]
              args: ["-c", "echo 'sleep for 120s then exit'; sleep 120"]
              volumeMounts:
                - mountPath: /data
                  name: vol
          restartPolicy: OnFailure
          volumes:
            - name: vol
              persistentVolumeClaim:
                claimName: test-job-pvc
    
  4. While the job pod is running, try to expand the PVC by editing spec.resources.requests.storage

  5. Observe that the resizing fails

  6. Wait for the job pod to complete

  7. Observe that the PVC stays stuck in its current state forever. It doesn't get resized because the external-resizer doesn't attempt the gRPC expand call to the storage provider.

Expected Behavior:

Once the job pod has completed, the PVC is no longer considered to be in use. Therefore the external-resizer should attempt the gRPC expand call to the storage provider.

Proposal:
We dug into the source code and saw that:

  1. This check prevents the external-resizer from retrying if the PVC previously hit an in-use error AND it is still in the ctrl.usedPVCs map
  2. The problem is that the PVC is never removed from the ctrl.usedPVCs map when a pod moves to a completed phase; the PVC is only removed when the pod is deleted, link
  3. We think that the logic over here should be changed to handle the case when the pod becomes completed, i.e.:
    func (ctrl *resizeController) updatePod(oldObj, newObj interface{}) {
        pod := parsePod(newObj)
        if pod == nil {
    	    return
        }
        
        if isPodTerminated(pod) {
    	    ctrl.usedPVCs.removePod(pod)
        } else {
    	    ctrl.usedPVCs.addPod(pod)
        }
    }
    

Environment:

  • external-resizer v1.2.0
  • Longhorn v1.2.2

Update Klog to v2

External-resizer is still using klog v1, while the other sidecars are using klog v2, so this should be in sync with the rest of the sidecars. I think it should be upgraded to klog v2.

external-resizer crashing in azuredisk-csi-driver and azurefile-csi-driver

From kubernetes-sigs/azurefile-csi-driver#495

What happened:
After installing azurefile-csi-driver and azuredisk-csi-driver in a Kubernetes cluster, the csi-resizer container, inside the csi-azurefile-controller and csi-azuredisk-controller pods, is crashing every 1 or 2 minutes with the following message:

csi-resizer log:

...
I1211 12:27:26.339777       1 leaderelection.go:283] successfully renewed lease kube-system/external-resizer-file-csi-azure-com
I1211 12:27:31.349381       1 leaderelection.go:283] successfully renewed lease kube-system/external-resizer-file-csi-azure-com
runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed

runtime stack:
runtime.throw(0x15d27f3, 0xc)
    /usr/lib/go-1.14/src/runtime/panic.go:1112 +0x72
runtime.mlockGsignal(0xc000682a80)
    /usr/lib/go-1.14/src/runtime/os_linux_x86.go:72 +0x107
runtime.mpreinit(0xc000079180)
    /usr/lib/go-1.14/src/runtime/os_linux.go:341 +0x78
runtime.mcommoninit(0xc000079180)
    /usr/lib/go-1.14/src/runtime/proc.go:630 +0x108
runtime.allocm(0xc00004f800, 0x1672e98, 0x14f676dd7e26c)
    /usr/lib/go-1.14/src/runtime/proc.go:1390 +0x14e
runtime.newm(0x1672e98, 0xc00004f800)
    /usr/lib/go-1.14/src/runtime/proc.go:1704 +0x39
runtime.startm(0x0, 0xc000103201)
    /usr/lib/go-1.14/src/runtime/proc.go:1869 +0x12a
runtime.wakep(...)
    /usr/lib/go-1.14/src/runtime/proc.go:1953
runtime.resetspinning()
    /usr/lib/go-1.14/src/runtime/proc.go:2415 +0x93
runtime.schedule()
    /usr/lib/go-1.14/src/runtime/proc.go:2527 +0x2de
runtime.park_m(0xc000103200)
    /usr/lib/go-1.14/src/runtime/proc.go:2690 +0x9d
runtime.mcall(0x0)
    /usr/lib/go-1.14/src/runtime/asm_amd64.s:318 +0x5b

goroutine 1 [select, 2 minutes]:
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1.1(0x13763e0, 0x0, 0xc0000eee40)
...

What you expected to happen:
The container should not fail so frequently.

How to reproduce it:
The failure started right after installing v0.7.0 of azurefile-csi-driver. I upgraded to v0.9.0 (for both azurefile and azuredisk) with the same results. The Kubernetes cluster is composed of 3 master nodes and 3 workers running on Azure VMs (not AKS).

Anything else we need to know?:
Found a couple of issues in the golang/go repository that seem to be related.

Possibly upgrading the Go version from 1.14 to 1.15 will solve the problem.

Environment:

  • CSI Driver version: v0.7.0 and v0.9.0
  • Kubernetes version (use kubectl version): v1.19.14
  • OS (e.g. from /etc/os-release): Ubuntu v20.04.1 LTS
  • Kernel (e.g. uname -a): 5.4.0-1032-azure #33-Ubuntu SMP Fri Nov 13 14:23:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: Helm v3.4.2
  • Others:
    • Master node size: Standard D2ds_v4 (2 vcpus, 8 GiB memory)
    • Worker node size: Standard D16ds_v4 (16 vcpus, 64 GiB memory)

Complete log file: csi-resizer.log

csi-resizer logs a warning every time a new PVC is created

Starting with v1.2.0, csi-resizer outputs the following warning message every time a new PVC is created.

I0816 07:50:11.380497       1 controller.go:291] Started PVC processing "default/sample-volume-jtdql"
W0816 07:50:11.380594       1 controller.go:318] PV "" bound to PVC default/sample-volume-jtdql not found

Until v1.1.0, it output the following info-level messages when a new PVC was created.

I0816 07:24:52.092956       1 controller.go:281] Started PVC processing "default/sample-volume-fffx6"
I0816 07:24:52.092986       1 controller.go:304] No need to resize PVC "default/sample-volume-fffx6"

This is a side effect of this commit.
The case was handled by this block until v1.1.0, but it has been handled here since v1.2.0.

I think the log level should be changed to info, or an empty check of pvc.Spec.VolumeName should be added before this point.
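
A minimal sketch of the suggested guard (placement and surrounding control flow are illustrative):

package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// shouldProcessPVC skips PVCs that are not bound yet: an empty VolumeName just
// means provisioning has not finished, not that a PV went missing, so this
// case should not be logged as a warning.
func shouldProcessPVC(pvc *v1.PersistentVolumeClaim) bool {
	if pvc.Spec.VolumeName == "" {
		klog.V(4).Infof("PVC %s/%s is not bound to a PV yet, skipping", pvc.Namespace, pvc.Name)
		return false
	}
	return true
}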

Resizer sidecar doesn't consistently increase the time interval during exponential backoff on errors

csi-resizer v0.5 retries ControllerExpandVolume after random intervals of time instead of an increasing interval during exponential backoff on an error.
Refer to the event logs I got when I tried to perform online expansion while the CSI driver supports offline expansion only: https://gist.github.com/shalini-b/e4c8256cd46eb096e0efc3ae322ba644. The time logged for each event is not consistently increasing.

Steps to reproduce -

  1. Use a CSI driver which uses csi-resizer v0.5.0
  2. Perform any operation which raises an error in ControllerExpandVolume call. For example, try executing an online volume expansion in a driver which only supports offline expansion. This will ensure csi-resizer keeps retrying ControllerExpandVolume till it succeeds.
  3. Check the events logged in csi-resizer while it is retrying with exponential backoff.
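
For context, the behavior one would expect comes from the standard client-go exponential-failure rate limiter, roughly as in this sketch (queue name illustrative): each item's requeue delay should double from the base delay up to the cap rather than vary randomly.

package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newResizeQueue returns a work queue whose per-item retry delay doubles from
// baseDelay up to maxDelay, matching the --retry-interval-start and
// --retry-interval-max semantics described in the README.
func newResizeQueue(baseDelay, maxDelay time.Duration) workqueue.RateLimitingInterface {
	return workqueue.NewNamedRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(baseDelay, maxDelay),
		"resize-pvc")
}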

VolumeExpansion ONLINE/OFFLINE capability is not checked

When a CSI driver reports only PluginCapability_VolumeExpansion_OFFLINE, the external-resizer still calls ControllerExpandVolume when the PVC is used by a Pod.

Instead, the external-resizer should check the VolumeExpansion plugin capability and only invoke ControllerExpandVolume when PluginCapability_VolumeExpansion_ONLINE is supported by the CSI driver.
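
A minimal sketch of the capability check being asked for (response plumbing elided):

package main

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

// supportsOnlineExpansion reports whether the plugin advertised ONLINE volume
// expansion in its GetPluginCapabilities response; only then should
// ControllerExpandVolume be called for a volume that is in use by a pod.
func supportsOnlineExpansion(caps []*csi.PluginCapability) bool {
	for _, c := range caps {
		exp := c.GetVolumeExpansion()
		if exp != nil && exp.GetType() == csi.PluginCapability_VolumeExpansion_ONLINE {
			return true
		}
	}
	return false
}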

Add resizing support in csi-mock driver

  • Add general resize support
  • Add configurable support for node expansion
  • Add configurable support for controller expansion
  • Add configuration support for online/offline expansion

Problem with volumes that support node only expansion

It looks like if a CSI volume supports node-only expansion and does not have the controller EXPAND_VOLUME capability, this resize controller will exit.

But the problem is that in that case the PV will never be updated, and node-side expansion currently expects the PV to have the correct size before node-side expansion is called. Also, even if node-side expansion handled updating the PV, there would be resistance to updating the PV from the node.

I am thinking this controller should not exit if a CSI volume does not support control-plane volume expansion, but should instead do a no-op expansion and update the PV regardless. This is how flexvolume resizing has been implemented.

cc @mlmhl @msau42 @chakri-nelluri

csi-resizer:v1.0.0 image having VA issues

The following are the vulnerability assessment (VA) issues in the csi-resizer:v1.0.0 image.

Vulnerable packages found:

  • DLA-2542-1 (Policy Status: Active): affected package tzdata; upgrade tzdata to >= 2021a-0+deb9u1
  • DLA-2509-1 (Policy Status: Active): affected package tzdata; upgrade tzdata to >= 2021a-0+deb9u1

Hostpath PV is resized but PVC is not when resizing offline

When trying to expand a volume that is offline using the CSI hostpath driver, the PV is expanded but the PVC is not. There is probably a bug in the CSI hostpath driver implementation, but the external-resizer should do the check and prevent this from happening.

How to reproduce:

  1. I deployed csi-driver-host-path for k8s 1.14 using the script in the following location.
    https://github.com/kubernetes-csi/csi-driver-host-path/tree/master/deploy/kubernetes-1.14

  2. kubectl create -f https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/examples/csi-storageclass.yaml

  3. kubectl apply -f https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/examples/csi-pvc.yaml

  4. kubectl apply -f https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/examples/csi-app.yaml

  5. Modified csi-pvc.yaml to increase the PVC size to 2Gi.
    external-resizer worked properly and expanded the online PVC/PV size to 2Gi.

  6. Deleted the pod created using csi-app.yaml. PVC is now offline.

  7. Modified csi-pvc.yaml to change the PVC size to 3Gi.

  8. PV was increased to 3Gi, but PVC remained at 2Gi.

root@k8s_ubuntu1:~/go/src/github.com/kubernetes-csi/csi-driver-host-path/examples# kubectl get pvc
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
csi-pvc   Bound    pvc-67c942df-df41-11e9-a872-000c296bb855   2Gi        RWO            csi-hostpath-sc   15h
root@k8s_ubuntu1:~/go/src/github.com/kubernetes-csi/csi-driver-host-path/examples# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS      REASON   AGE
pvc-67c942df-df41-11e9-a872-000c296bb855   3Gi        RWO            Delete           Bound    default/csi-pvc   csi-hostpath-sc            15h

Not able to pull v1.3.0 release image

$ docker pull k8s.gcr.io/sig-storage/csi-resizer:v1.3.0
Error response from daemon: manifest for k8s.gcr.io/sig-storage/csi-resizer:v1.3.0 not found: manifest unknown: Failed to fetch "v1.3.0" from request "/v2/sig-storage/csi-resizer/manifests/v1.3.0".

Not sure; it may just be a sync delay, though.

cc @gnufied
