piraeusdatastore / piraeus-ha-controller
High Availability Controller for stateful workloads using storage provisioned by Piraeus
License: Apache License 2.0
nit: Shouldn't we add an option to ignore other types of storage? E.g. if iSCSI is used, we can't guarantee successful fencing for it.
Originally posted by @kvaps in piraeusdatastore/helm-charts#13 (comment)
After the piraeus-operator 1.3.0 release (https://github.com/piraeusdatastore/piraeus-operator/releases/tag/v1.3.0), I enabled the HA Controller in Helm as described. Since then the HA Controller has been restarted 150 times, which seems high. A Nagios monitoring script alerts after some tens of pod restarts, so this alert would always fire eventually, even if the pod is deleted to reset the restart counter. Please look into whether this is a bug and whether it is fixable.
$ k get po
NAME READY STATUS RESTARTS AGE
piraeus-op-cs-controller-5db495d656-gnkv5 1/1 Running 4 5d
piraeus-op-csi-controller-6ccd9fbc44-cgw49 6/6 Running 0 3d22h
piraeus-op-csi-node-5j2f2 3/3 Running 0 3d22h
piraeus-op-csi-node-7rbsr 3/3 Running 0 3d22h
piraeus-op-csi-node-hdqhx 3/3 Running 0 3d22h
piraeus-op-etcd-0 1/1 Running 3 5d
piraeus-op-etcd-1 1/1 Running 3 5d
piraeus-op-etcd-2 1/1 Running 3 5d
piraeus-op-ha-controller-df776887b-j59ms 1/1 Running 150 3d22h
piraeus-op-ns-node-4whrm 1/1 Running 1 4d18h
piraeus-op-ns-node-b4zpq 1/1 Running 1 4d18h
piraeus-op-ns-node-pv429 1/1 Running 1 4d18h
piraeus-op-operator-7466ddd49c-h776t 1/1 Running 4 5d
Here are the logs of the most recent restarts, I hope it helps:
time="2020-12-30T09:31:28Z" level=info msg="starting piraeus-ha-controller" version=v0.1.1
I1230 09:31:28.799096 1 leaderelection.go:243] attempting to acquire leader lease piraeus/piraeus-ha-controller...
I1230 09:31:28.870384 1 leaderelection.go:253] successfully acquired lease piraeus/piraeus-ha-controller
time="2020-12-30T09:31:28Z" level=info msg="new leader" leader=piraeus-op-ha-controller-df776887b-j59ms
time="2020-12-30T09:31:28Z" level=info msg="gained leader status"
time="2020-12-30T10:03:12Z" level=fatal msg="failed to run HA Controller" error="pvc updates closed unexpectedly"
time="2020-12-30T10:03:12Z" level=info msg="starting piraeus-ha-controller" version=v0.1.1
I1230 10:03:12.539253 1 leaderelection.go:243] attempting to acquire leader lease piraeus/piraeus-ha-controller...
I1230 10:03:12.578347 1 leaderelection.go:253] successfully acquired lease piraeus/piraeus-ha-controller
time="2020-12-30T10:03:12Z" level=info msg="gained leader status"
time="2020-12-30T10:03:12Z" level=info msg="new leader" leader=piraeus-op-ha-controller-df776887b-j59ms
time="2020-12-30T10:40:24Z" level=fatal msg="failed to run HA Controller" error="pvc updates closed unexpectedly"
time="2020-12-30T10:40:25Z" level=info msg="starting piraeus-ha-controller" version=v0.1.1
I1230 10:40:25.837975 1 leaderelection.go:243] attempting to acquire leader lease piraeus/piraeus-ha-controller...
I1230 10:40:25.870270 1 leaderelection.go:253] successfully acquired lease piraeus/piraeus-ha-controller
time="2020-12-30T10:40:25Z" level=info msg="new leader" leader=piraeus-op-ha-controller-df776887b-j59ms
time="2020-12-30T10:40:25Z" level=info msg="gained leader status"
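The fatal "pvc updates closed unexpectedly" suggests the controller exits whenever its PVC watch stream closes, which Kubernetes API servers do routinely (connection timeouts, etcd compaction), so the pod is restarted instead of just reconnecting. A minimal Go sketch of the reconnect-instead-of-exit pattern; `watchPVCs` and `runWatchLoop` are hypothetical names standing in for the controller's actual watch plumbing:

```go
package main

import (
	"fmt"
	"time"
)

// watchPVCs simulates a Kubernetes watch stream: the server closes the
// event channel periodically, which is expected behaviour, not an error.
func watchPVCs() <-chan string {
	ch := make(chan string)
	go func() {
		defer close(ch) // server side eventually closes the stream
		ch <- "pvc-update"
	}()
	return ch
}

// runWatchLoop re-establishes the watch when the stream closes, instead of
// exiting fatally with "pvc updates closed unexpectedly".
// It returns the total number of events consumed.
func runWatchLoop(connect func() <-chan string, reconnects int) int {
	n := 0
	for i := 0; i <= reconnects; i++ {
		for range connect() {
			n++
		}
		time.Sleep(time.Millisecond) // backoff before reconnecting
	}
	return n
}

func main() {
	fmt.Println("events seen:", runWatchLoop(watchPVCs, 2))
}
```

In client-go this pattern is what informers (or `RetryWatcher`) provide out of the box, which may be the more idiomatic fix than hand-rolled reconnect logic.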
Related Helm values:
haController:
  enabled: true
  image: quay.io/piraeusdatastore/piraeus-ha-controller:v0.1.1
  affinity: {}
  tolerations: []
  resources:
    limits:
      cpu: "0.2"
      memory: "250Mi"
    requests:
      cpu: "0.1"
      memory: "100Mi"
  replicas: 1
The readme (https://github.com/piraeusdatastore/piraeus-ha-controller#deploy-your-stateful-workloads) gives a StatefulSet example using the linstor.csi.linbit.com/on-storage-lost: remove label and mentions stateful applications. I was wondering if I could use the label for other kinds of pods, namely Deployment, DaemonSet, or bare Pod. I'm not that proficient in Go, but the source here (https://github.com/piraeusdatastore/piraeus-ha-controller/blob/56f633f9363be272bee40be9bc868c585f8695ae/pkg/hacontroller/controller.go) mentions only pods, so it seems I could use the label on kinds other than StatefulSets.
As StatefulSet members have their own PVCs automatically created, it's natural that replicas don't interfere. The same goes for DaemonSet members running on each node. What I'm more interested in is the Deployment kind. I'm running most of my apps as Deployments with replica count 1. Would the HA Controller work with these to speed up PV reattachment on another node when the original node goes down? (I could naturally test it by killing a node, but I'd rather not.) Maybe a clarification of what counts as a stateful app is advisable in the HA Controller and operator readmes once this issue is answered.
Bonus question: what's the case with Deployments with replica count >1? (This is more of a general Piraeus usage question, not strictly HA Controller-related.) I've observed a small delay when a Deployment had a pod running on one node, I deleted that pod, and the scheduler started another pod on another node. In this situation the effective replica count was more like 2, as one pod was in Terminating state and the new one in ContainerCreating state. The new pod couldn't flip to Running until the first one terminated, and even then there was a small delay of around 10-15 s until the PV was reassigned from the first node to the new one. The related PVC was defined with ReadWriteMany. Would Piraeus be able to mount the same PV on two nodes at a time? (I suppose not, as there would be two "primary" mounts, and that could interfere with the underlying DRBD replication. I think I've read something like this somewhere, but I'm not sure where.) Would it work with Deployment replica count >1 only if the replicas all run on the same node?
I0919 15:12:47.660781 1 merged_client_builder.go:121] Using in-cluster configuration
I0919 15:12:47.719448 1 agent.go:92] setting up PersistentVolume informer
I0919 15:12:47.725963 1 agent.go:121] setting up Pod informer
I0919 15:12:47.726131 1 agent.go:140] setting up VolumeAttachment informer
I0919 15:12:47.752953 1 agent.go:179] version: v1.1.0
I0919 15:12:47.753039 1 agent.go:180] node: 4c
I0919 15:12:47.753064 1 agent.go:182] setting up event broadcaster
I0919 15:12:47.755290 1 agent.go:189] setting up periodic reconciliation ticker
I0919 15:12:47.758167 1 drbd.go:39] updating drbd state
I0919 15:12:47.764229 1 agent.go:224] starting reconciliation
I0919 15:12:47.764360 1 drbd.go:60] Checking if DRBD is loaded
I0919 15:12:47.764624 1 drbd.go:70] Command: drbdsetup status --json
I0919 15:12:47.764744 1 agent.go:247] managing node taints failed: own node does not exist
I0919 15:12:47.764783 1 agent.go:250] Own node taints synced
I0919 15:12:47.769233 1 reflector.go:219] Starting reflector *v1.PersistentVolume (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.769394 1 reflector.go:255] Listing and watching *v1.PersistentVolume from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.769413 1 reflector.go:219] Starting reflector *v1.Pod (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.769499 1 reflector.go:255] Listing and watching *v1.Pod from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.776946 1 reflector.go:219] Starting reflector *v1.VolumeAttachment (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.777261 1 reflector.go:255] Listing and watching *v1.VolumeAttachment from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.787303 1 reflector.go:219] Starting reflector *v1.Node (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.787502 1 reflector.go:255] Listing and watching *v1.Node from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.867921 1 agent.go:214] drbd syncer done
I0919 15:12:47.868245 1 reflector.go:225] Stopping reflector *v1.Node (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.869509 1 reflector.go:225] Stopping reflector *v1.VolumeAttachment (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
E0919 15:12:47.869479 1 run.go:74] "command failed" err="failed to parse drbdsetup json: json: cannot unmarshal number 18446744073709551608 into Go struct field DrbdConnection.connections.ap-in-flight of type int"
Stream closed EOF for piraeus/ha-8682e2af-piraeus-ha-controller-fxp2z (piraeus-ha-controller)
Relevant portion of drbdsetup status --json:
{
  "peer-node-id": 0,
  "name": "4b",
  "connection-state": "Connecting",
  "congested": false,
  "peer-role": "Unknown",
  "ap-in-flight": 18446744073709551608,
  "rs-in-flight": 0,
  "peer_devices": [
    {
      "volume": 0,
      "replication-state": "Off",
      "peer-disk-state": "DUnknown",
      "peer-client": false,
      "resync-suspended": "no",
      "received": 0,
      "sent": 0,
      "out-of-sync": 0,
      "pending": 0,
      "unacked": 0,
      "has-sync-details": false,
      "has-online-verify-details": false,
      "percent-in-sync": 100.00
    } ]
} ]
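The offending number, 18446744073709551608, is 2^64 − 8, i.e. the value −8 printed as an unsigned 64-bit integer, which cannot be unmarshalled into a signed Go int. A sketch of one possible workaround (the struct and function names are hypothetical, not the controller's actual types): decode the field as uint64 and reinterpret the bits to recover the signed value:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// 18446744073709551608 is the two's-complement image of -8 printed as an
// unsigned 64-bit value; unmarshalling it into a signed int fails.
const raw = `{"ap-in-flight": 18446744073709551608}`

// conn is a hypothetical stand-in for the controller's DrbdConnection field.
type conn struct {
	ApInFlight uint64 `json:"ap-in-flight"`
}

// signedApInFlight decodes the field as uint64, then reinterprets the bits
// as int64, turning the wrapped-around kernel value back into -8.
func signedApInFlight(data string) (int64, error) {
	var c conn
	if err := json.Unmarshal([]byte(data), &c); err != nil {
		return 0, err
	}
	return int64(c.ApInFlight), nil
}

func main() {
	v, err := signedApInFlight(raw)
	fmt.Println(v, err) // -8 <nil>
}
```

Whether DRBD should ever report a negative in-flight count is a separate question; the parsing side can at least tolerate it this way.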
Version - HA controller 1.1.0
We have often seen that taints are not removed from nodes, so pods are not scheduled. Moreover, the taints come back on the nodes as soon as you remove them manually.
The taints mostly appear while nodes are rebooting, for example during a node upgrade and reboot.
Additionally, both replicas of two resources also went into the Outdated state.
For instance,
[~]# kubectl describe node | grep -i taint
Taints: drbd.linbit.com/lost-quorum:NoSchedule
Taints: drbd.linbit.com/force-io-error:NoSchedule
Taints: drbd.linbit.com/lost-quorum:NoSchedule
| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode1.galwayan.com | 7000 | Unused | Ok | Outdated | 2022-09-12 14:20:06 |
| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode2.galwayan.com | 7000 | Unused | Ok | Outdated | 2022-09-12 14:20:02 |
| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode3.galwayan.com | 7000 | Unused | Ok | Diskless | 2022-09-12 14:20:04 |
Settings defined in the storage class as per the HA controller requirements:
DrbdOptions/auto-quorum: suspend-io
DrbdOptions/Resource/on-no-data-accessible: suspend-io
DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
DrbdOptions/Net/rr-conflict: retry-connect
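For context, these options would live in the StorageClass parameters. The following is only a sketch: the StorageClass name, pool name, and placement count are illustrative, and the exact parameter key prefix for DRBD options can vary between CSI driver versions.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-ha          # illustrative name
provisioner: linstor.csi.linbit.com
parameters:
  autoPlace: "3"            # illustrative placement count
  storagePool: lvm-thin     # illustrative pool name
  DrbdOptions/auto-quorum: suspend-io
  DrbdOptions/Resource/on-no-data-accessible: suspend-io
  DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
  DrbdOptions/Net/rr-conflict: retry-connect
```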
Hi, we ran into a problem where piraeus-ha-controller does not reconcile failing volume attachments due to "PV not associated to a PVC, nothing to do":
time="2021-06-04T15:09:03Z" level=trace msg=update name=nextcloud-nfs-server-provisioner-0 namespace=nextcloud-nfs-server-provisioner-0 resource=Pod
time="2021-06-04T15:09:05Z" level=trace msg="start reconciling failing volume attachments"
time="2021-06-04T15:09:05Z" level=trace msg="finished reconciling failing volume attachments"
time="2021-06-04T15:09:09Z" level=trace msg="Pod watch resource version updated" resource-version=78535013
time="2021-06-04T15:09:13Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://linstor-controller:3371/v1/resource-definitions/pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca/resources'"
time="2021-06-04T15:09:13Z" level=trace msg="lost pv" lostPV=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:09:13Z" level=trace msg="start reconciling failing volume attachments"
time="2021-06-04T15:09:13Z" level=info msg="processing failing pv" pv=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:09:13Z" level=debug msg="PV not associated to a PVC, nothing to do" pv=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:09:13Z" level=trace msg="finished reconciling failing volume attachments"
# kubectl get pvc -n nfs data-nextcloud-nfs-server-provisioner-0 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: linstor.csi.linbit.com
  creationTimestamp: "2021-04-08T12:32:41Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: nfs-server-provisioner
    release: nextcloud
  name: data-nextcloud-nfs-server-provisioner-0
  namespace: nfs
  resourceVersion: "27438361"
  uid: efb31302-5feb-4dbe-93f5-8994eb08c6ca
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: linstor-1
  volumeMode: Filesystem
  volumeName: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  phase: Bound
# kubectl get pv -o yaml pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: linstor.csi.linbit.com
  creationTimestamp: "2021-04-08T12:32:43Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-attacher/linstor-csi-linbit-com
  name: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
  resourceVersion: "27438405"
  uid: 5064a61a-88a1-47a7-a0bd-80669bf857f8
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: data-nextcloud-nfs-server-provisioner-0
    namespace: nfs
    resourceVersion: "27438312"
    uid: efb31302-5feb-4dbe-93f5-8994eb08c6ca
  csi:
    driver: linstor.csi.linbit.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1617814042512-8081-linstor.csi.linbit.com
    volumeHandle: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
  mountOptions:
  - errors=remount-ro
  persistentVolumeReclaimPolicy: Delete
  storageClassName: linstor-1
  volumeMode: Filesystem
status:
  phase: Bound
# kubectl get volumeattachments.storage.k8s.io -o yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: m1c29
  creationTimestamp: "2021-06-04T15:01:46Z"
  finalizers:
  - external-attacher/linstor-csi-linbit-com
  name: csi-0c36bf3aa3e14cee55d5e4f944e16a3e408d87aaad9ab86cc0255fdb08f40206
  resourceVersion: "78528835"
  uid: 94b6f060-5b73-49ed-a948-584f7c25e137
spec:
  attacher: linstor.csi.linbit.com
  nodeName: m1c29
  source:
    persistentVolumeName: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
status:
  attached: true
Any ideas?
The purpose of this issue is to track progress on making the HA Controller production ready (i.e. always enabled by default).
As of right now (v0.3.0), the HA Controller works, but only when specific conditions are met. Overall, the current issues mean that I'd hesitate to automatically enable the HA Controller for all pods (see #12).
My current plan for the HA Controller is a complete rewrite based on the things learned from the current version:
One of the main issues with this is of course how we can communicate a "may_promote" event to the second component. My idea here would be an annotation on the PV. The annotation is added whenever any node sees may_promote: yes. It should be removed by the node that promotes.
The "cluster" component would watch for this annotation on PVs and check whether failover needs to be triggered. The failover should happen immediately if a k8s node lists the volume in the volumesInUse field of its status. The failover process uses the same steps as the current implementation (delete VolumeAttachment, delete Pod). However, it might not be necessary to force-delete the VA to trigger a graceful detach from the CSI driver side (needs to be investigated). There should probably also be a (small) grace period for the pod to shut down in case the node in general is still online.
There is also the question of whether the node agent should help things along when it detects a disconnect on the local node (forcing I/O errors, unmounting, etc.). This has to be investigated further.
This is a proposal to increase the monitoring coverage of the HA controller, which currently only observes pods (StatefulSets) if they have been deployed with the label linstor.csi.linbit.com/on-storage-lost: remove (or any other configured label). If you use LINSTOR in a default StorageClass, all kinds of workloads may get deployed that will most likely not carry the label, so the HA controller cannot watch their storage's health state. In the event of a node outage the HA controller will only redeploy those workloads that have the configured label set; all others will remain in an unhealthy/unknown state.
IMHO it would be more convenient if the HA controller monitored all workloads/StatefulSets with DRBD volumes attached, without any manually added label. If the discovery implementation STORK uses seems inappropriate, an admission controller (https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/) for create and update events, which adds the label to every workload automatically, could be an alternative.
Of course this automatic discovery should be configurable, to give administrators the freedom to decide which approach fits their needs best. But in order to replace STORK with the HA controller (which is definitely the faster and more accurate option), auto-discovery of relevant workloads is a must-have feature.
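The admission-controller alternative would return a JSONPatch adding the label on create/update. The sketch below covers only the patch construction, not a full webhook server; it assumes the label key from the readme, and everything else is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

const haLabel = "linstor.csi.linbit.com/on-storage-lost"

// buildLabelPatch returns the JSONPatch a mutating webhook could return to
// add the HA controller label to a workload's pod template. It returns nil
// when the label is already present.
func buildLabelPatch(existing map[string]string) ([]byte, error) {
	if _, ok := existing[haLabel]; ok {
		return nil, nil // already labelled, nothing to patch
	}
	var patch []map[string]interface{}
	if len(existing) == 0 {
		// /metadata/labels may not exist yet; create the whole map in one op.
		patch = append(patch, map[string]interface{}{
			"op":    "add",
			"path":  "/metadata/labels",
			"value": map[string]string{haLabel: "remove"},
		})
	} else {
		// JSONPatch (RFC 6902) escapes '/' in path segments as '~1'.
		patch = append(patch, map[string]interface{}{
			"op":    "add",
			"path":  "/metadata/labels/linstor.csi.linbit.com~1on-storage-lost",
			"value": "remove",
		})
	}
	return json.Marshal(patch)
}

func main() {
	p, _ := buildLabelPatch(nil)
	fmt.Println(string(p))
}
```

A real webhook would also need the opt-out configurability mentioned above, e.g. skipping namespaces or workloads that carry an explicit "ignore" annotation.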
We created a bunch of new PVCs at the same time (when setting the debug option from piraeusdatastore/linstor-csi#172), and now the ha-controller is continuously evicting the pods due to lost quorum.
It seems that none of the nodes has been made Primary for the volume on the LINSTOR side:
│ pvc-b5a0c859-161b-498d-ac4f-00de2311a912 │ dedi1-node1.23-106-60-155.lon-01.uk │ 7008 │ Unused │ │ Unknown │ 2022-08-31 14:39:52 │
│ pvc-b5a0c859-161b-498d-ac4f-00de2311a912 │ vm6-cplane1.23-106-61-231.lon-01.uk │ 7008 │ Unused │ │ Unknown │ │
│ pvc-b5a0c859-161b-498d-ac4f-00de2311a912 │ vm9-node2.23-106-61-193.lon-01.uk │ 7008 │ Unused │ │ Unknown │ 2022-08-31 14:41:02 │
This is causing the ha-controller to continuously delete the pods when they are re-created:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning VolumeWithoutQuorum 70s linstor.linbit.com/HighAvailabilityController Pod was evicted because attached volume lost quorum
Warning VolumeWithoutQuorum 60s linstor.linbit.com/HighAvailabilityController Pod was evicted because attached volume lost quorum
Warning FailedScheduling 79s default-scheduler 0/4 nodes are available: 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {drbd.linbit.com/lost-quorum: }, 2 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
Normal Scheduled 74s default-scheduler Successfully assigned team-100/supabase-data-nfs-server-provisioner-0 to dedi1-node1.23-106-60-155.lon-01.uk
Warning FailedAttachVolume 70s attachdetach-controller AttachVolume.Attach failed for volume "pvc-b5a0c859-161b-498d-ac4f-00de2311a912" : volume attachment is being deleted
Could it be that the same timeout affecting us in piraeusdatastore/linstor-csi#172 is causing none of the nodes to become Primary?