piraeusdatastore / piraeus-ha-controller
High Availability Controller for stateful workloads using storage provisioned by Piraeus
License: Apache License 2.0
nit: Shouldn't we add an option to ignore other types of storage? E.g. if iSCSI is used, we can't guarantee successful fencing for it.
Originally posted by @kvaps in piraeusdatastore/helm-charts#13 (comment)
After the piraeus-operator 1.3.0 release (https://github.com/piraeusdatastore/piraeus-operator/releases/tag/v1.3.0), I enabled the HA Controller in Helm as described. Since then the HA Controller has been restarted 150 times, which seems high. A Nagios monitoring script alerts after some tens of pod restarts, so this alert would always fire eventually, even if the pod is deleted to reset the restart counter. Please look into whether this is a bug and whether it is fixable.
$ k get po
NAME READY STATUS RESTARTS AGE
piraeus-op-cs-controller-5db495d656-gnkv5 1/1 Running 4 5d
piraeus-op-csi-controller-6ccd9fbc44-cgw49 6/6 Running 0 3d22h
piraeus-op-csi-node-5j2f2 3/3 Running 0 3d22h
piraeus-op-csi-node-7rbsr 3/3 Running 0 3d22h
piraeus-op-csi-node-hdqhx 3/3 Running 0 3d22h
piraeus-op-etcd-0 1/1 Running 3 5d
piraeus-op-etcd-1 1/1 Running 3 5d
piraeus-op-etcd-2 1/1 Running 3 5d
piraeus-op-ha-controller-df776887b-j59ms 1/1 Running 150 3d22h
piraeus-op-ns-node-4whrm 1/1 Running 1 4d18h
piraeus-op-ns-node-b4zpq 1/1 Running 1 4d18h
piraeus-op-ns-node-pv429 1/1 Running 1 4d18h
piraeus-op-operator-7466ddd49c-h776t 1/1 Running 4 5d
Here are the logs of the most recent restarts, I hope it helps:
time="2020-12-30T09:31:28Z" level=info msg="starting piraeus-ha-controller" version=v0.1.1
I1230 09:31:28.799096 1 leaderelection.go:243] attempting to acquire leader lease piraeus/piraeus-ha-controller...
I1230 09:31:28.870384 1 leaderelection.go:253] successfully acquired lease piraeus/piraeus-ha-controller
time="2020-12-30T09:31:28Z" level=info msg="new leader" leader=piraeus-op-ha-controller-df776887b-j59ms
time="2020-12-30T09:31:28Z" level=info msg="gained leader status"
time="2020-12-30T10:03:12Z" level=fatal msg="failed to run HA Controller" error="pvc updates closed unexpectedly"
time="2020-12-30T10:03:12Z" level=info msg="starting piraeus-ha-controller" version=v0.1.1
I1230 10:03:12.539253 1 leaderelection.go:243] attempting to acquire leader lease piraeus/piraeus-ha-controller...
I1230 10:03:12.578347 1 leaderelection.go:253] successfully acquired lease piraeus/piraeus-ha-controller
time="2020-12-30T10:03:12Z" level=info msg="gained leader status"
time="2020-12-30T10:03:12Z" level=info msg="new leader" leader=piraeus-op-ha-controller-df776887b-j59ms
time="2020-12-30T10:40:24Z" level=fatal msg="failed to run HA Controller" error="pvc updates closed unexpectedly"
time="2020-12-30T10:40:25Z" level=info msg="starting piraeus-ha-controller" version=v0.1.1
I1230 10:40:25.837975 1 leaderelection.go:243] attempting to acquire leader lease piraeus/piraeus-ha-controller...
I1230 10:40:25.870270 1 leaderelection.go:253] successfully acquired lease piraeus/piraeus-ha-controller
time="2020-12-30T10:40:25Z" level=info msg="new leader" leader=piraeus-op-ha-controller-df776887b-j59ms
time="2020-12-30T10:40:25Z" level=info msg="gained leader status"
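The fatal "pvc updates closed unexpectedly" suggests the controller exits whenever its PVC watch stream closes, which Kubernetes API servers do routinely (connection timeouts, etcd compaction), so the pod is restarted instead of just reconnecting. A minimal Go sketch of the reconnect-instead-of-exit pattern; `watchPVCs` and `runWatchLoop` are hypothetical names standing in for the controller's actual watch plumbing:

```go
package main

import (
	"fmt"
	"time"
)

// watchPVCs simulates a Kubernetes watch stream: the server closes the
// event channel periodically, which is expected behaviour, not an error.
func watchPVCs() <-chan string {
	ch := make(chan string)
	go func() {
		defer close(ch) // server side eventually closes the stream
		ch <- "pvc-update"
	}()
	return ch
}

// runWatchLoop re-establishes the watch when the stream closes, instead of
// exiting fatally with "pvc updates closed unexpectedly".
// It returns the total number of events consumed.
func runWatchLoop(connect func() <-chan string, reconnects int) int {
	n := 0
	for i := 0; i <= reconnects; i++ {
		for range connect() {
			n++
		}
		time.Sleep(time.Millisecond) // backoff before reconnecting
	}
	return n
}

func main() {
	fmt.Println("events seen:", runWatchLoop(watchPVCs, 2))
}
```

In client-go this pattern is what informers (or `RetryWatcher`) provide out of the box, which may be the more idiomatic fix than hand-rolled reconnect logic.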
Related Helm values:
haController:
  enabled: true
  image: quay.io/piraeusdatastore/piraeus-ha-controller:v0.1.1
  affinity: {}
  tolerations: []
  resources:
    limits:
      cpu: "0.2"
      memory: "250Mi"
    requests:
      cpu: "0.1"
      memory: "100Mi"
  replicas: 1
The readme (https://github.com/piraeusdatastore/piraeus-ha-controller#deploy-your-stateful-workloads) gives a StatefulSet example using the linstor.csi.linbit.com/on-storage-lost: remove label and mentions stateful applications. I was wondering if I could use the label for other kinds of pods, namely Deployment, DaemonSet, or bare Pod. I'm not that proficient in Go, but the source here (https://github.com/piraeusdatastore/piraeus-ha-controller/blob/56f633f9363be272bee40be9bc868c585f8695ae/pkg/hacontroller/controller.go) mentions only pods, so it seems I could use the label on kinds other than StatefulSets.
As StatefulSet members have their own PVCs automatically created, it's natural that replicas don't interfere. The same goes for DaemonSet members running on each node. What I'm more interested in is the Deployment kind. I'm running most of my apps as Deployments with replica count 1. Would the HA Controller work with these to speed up PV reattachment on another node when the original node goes down? (I could naturally test it by killing a node, but I'd rather not.) Maybe a clarification of what counts as a stateful app is advisable in the HA Controller and operator readmes once this issue is answered.
Bonus question: what's the case with Deployments with replica count >1? (This is more of a general Piraeus usage question, not strictly HA Controller-related.) I've observed a small delay when a Deployment had a pod running on one node, I deleted that pod, and the scheduler started another pod on another node. In this situation the effective replica count was more like 2, as one pod was in Terminating state and the new one in ContainerCreating state. The new pod couldn't flip to Running until the first one terminated, and even then there was a small delay of around 10-15 s until the PV was reassigned from the first node to the new one. The related PVC was defined with ReadWriteMany. Would Piraeus be able to mount the same PV on two nodes at a time? (I suppose not, as there would be two "primary" mounts, and that could interfere with the underlying DRBD replication. I think I've read something like this somewhere, but I'm not sure where.) Would it work with Deployment replica count >1 only if the replicas all run on the same node?
I0919 15:12:47.660781 1 merged_client_builder.go:121] Using in-cluster configuration
I0919 15:12:47.719448 1 agent.go:92] setting up PersistentVolume informer
I0919 15:12:47.725963 1 agent.go:121] setting up Pod informer
I0919 15:12:47.726131 1 agent.go:140] setting up VolumeAttachment informer
I0919 15:12:47.752953 1 agent.go:179] version: v1.1.0
I0919 15:12:47.753039 1 agent.go:180] node: 4c
I0919 15:12:47.753064 1 agent.go:182] setting up event broadcaster
I0919 15:12:47.755290 1 agent.go:189] setting up periodic reconciliation ticker
I0919 15:12:47.758167 1 drbd.go:39] updating drbd state
I0919 15:12:47.764229 1 agent.go:224] starting reconciliation
I0919 15:12:47.764360 1 drbd.go:60] Checking if DRBD is loaded
I0919 15:12:47.764624 1 drbd.go:70] Command: drbdsetup status --json
I0919 15:12:47.764744 1 agent.go:247] managing node taints failed: own node does not exist
I0919 15:12:47.764783 1 agent.go:250] Own node taints synced
I0919 15:12:47.769233 1 reflector.go:219] Starting reflector *v1.PersistentVolume (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.769394 1 reflector.go:255] Listing and watching *v1.PersistentVolume from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.769413 1 reflector.go:219] Starting reflector *v1.Pod (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.769499 1 reflector.go:255] Listing and watching *v1.Pod from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.776946 1 reflector.go:219] Starting reflector *v1.VolumeAttachment (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.777261 1 reflector.go:255] Listing and watching *v1.VolumeAttachment from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.787303 1 reflector.go:219] Starting reflector *v1.Node (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.787502 1 reflector.go:255] Listing and watching *v1.Node from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.867921 1 agent.go:214] drbd syncer done
I0919 15:12:47.868245 1 reflector.go:225] Stopping reflector *v1.Node (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
I0919 15:12:47.869509 1 reflector.go:225] Stopping reflector *v1.VolumeAttachment (15m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167
E0919 15:12:47.869479 1 run.go:74] "command failed" err="failed to parse drbdsetup json: json: cannot unmarshal number 18446744073709551608 into Go struct field DrbdConnection.connections.ap-in-flight of type int"
Stream closed EOF for piraeus/ha-8682e2af-piraeus-ha-controller-fxp2z (piraeus-ha-controller)
Relevant portion of drbdsetup status --json:
{
  "peer-node-id": 0,
  "name": "4b",
  "connection-state": "Connecting",
  "congested": false,
  "peer-role": "Unknown",
  "ap-in-flight": 18446744073709551608,
  "rs-in-flight": 0,
  "peer_devices": [
    {
      "volume": 0,
      "replication-state": "Off",
      "peer-disk-state": "DUnknown",
      "peer-client": false,
      "resync-suspended": "no",
      "received": 0,
      "sent": 0,
      "out-of-sync": 0,
      "pending": 0,
      "unacked": 0,
      "has-sync-details": false,
      "has-online-verify-details": false,
      "percent-in-sync": 100.00
    } ]
} ]
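The offending number, 18446744073709551608, is 2^64 − 8, i.e. the value −8 printed as an unsigned 64-bit integer, which cannot be unmarshalled into a signed Go int. A sketch of one possible workaround (the struct and function names are hypothetical, not the controller's actual types): decode the field as uint64 and reinterpret the bits to recover the signed value:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// 18446744073709551608 is the two's-complement image of -8 printed as an
// unsigned 64-bit value; unmarshalling it into a signed int fails.
const raw = `{"ap-in-flight": 18446744073709551608}`

// conn is a hypothetical stand-in for the controller's DrbdConnection field.
type conn struct {
	ApInFlight uint64 `json:"ap-in-flight"`
}

// signedApInFlight decodes the field as uint64, then reinterprets the bits
// as int64, turning the wrapped-around kernel value back into -8.
func signedApInFlight(data string) (int64, error) {
	var c conn
	if err := json.Unmarshal([]byte(data), &c); err != nil {
		return 0, err
	}
	return int64(c.ApInFlight), nil
}

func main() {
	v, err := signedApInFlight(raw)
	fmt.Println(v, err) // -8 <nil>
}
```

Whether DRBD should ever report a negative in-flight count is a separate question; the parsing side can at least tolerate it this way.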
Version - HA controller 1.1.0
We have often seen that taints are not removed from nodes, so pods are not scheduled. Moreover, the taints come back on the nodes as soon as you remove them manually.
The taints mostly appear while nodes are rebooting, for example during a node upgrade and reboot.
Additionally, both replicas of two resources also went into the Outdated state.
For instance,
[~]# kubectl describe node | grep -i taint
Taints: drbd.linbit.com/lost-quorum:NoSchedule
Taints: drbd.linbit.com/force-io-error:NoSchedule
Taints: drbd.linbit.com/lost-quorum:NoSchedule
| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode1.galwayan.com | 7000 | Unused | Ok | Outdated | 2022-09-12 14:20:06 |
| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode2.galwayan.com | 7000 | Unused | Ok | Outdated | 2022-09-12 14:20:02 |
| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode3.galwayan.com | 7000 | Unused | Ok | Diskless | 2022-09-12 14:20:04 |
Settings defined in the storage class as per the HA controller requirements:
DrbdOptions/auto-quorum: suspend-io
DrbdOptions/Resource/on-no-data-accessible: suspend-io
DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
DrbdOptions/Net/rr-conflict: retry-connect
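For context, these options would live in the StorageClass parameters. The following is only a sketch: the StorageClass name, pool name, and placement count are illustrative, and the exact parameter key prefix for DRBD options can vary between CSI driver versions.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-ha          # illustrative name
provisioner: linstor.csi.linbit.com
parameters:
  autoPlace: "3"            # illustrative placement count
  storagePool: lvm-thin     # illustrative pool name
  DrbdOptions/auto-quorum: suspend-io
  DrbdOptions/Resource/on-no-data-accessible: suspend-io
  DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
  DrbdOptions/Net/rr-conflict: retry-connect
```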
Hi, we ran into a problem where piraeus-ha-controller does not reconcile failing volume attachments due to "PV not associated to a PVC, nothing to do":
time="2021-06-04T15:09:03Z" level=trace msg=update name=nextcloud-nfs-server-provisioner-0 namespace=nextcloud-nfs-server-provisioner-0 resource=Pod
time="2021-06-04T15:09:05Z" level=trace msg="start reconciling failing volume attachments"
time="2021-06-04T15:09:05Z" level=trace msg="finished reconciling failing volume attachments"
time="2021-06-04T15:09:09Z" level=trace msg="Pod watch resource version updated" resource-version=78535013
time="2021-06-04T15:09:13Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://linstor-controller:3371/v1/resource-definitions/pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca/resources'"
time="2021-06-04T15:09:13Z" level=trace msg="lost pv" lostPV=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:09:13Z" level=trace msg="start reconciling failing volume attachments"
time="2021-06-04T15:09:13Z" level=info msg="processing failing pv" pv=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:09:13Z" level=debug msg="PV not associated to a PVC, nothing to do" pv=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:09:13Z" level=trace msg="finished reconciling failing volume attachments"
# kubectl get pvc -n nfs data-nextcloud-nfs-server-provisioner-0 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: linstor.csi.linbit.com
  creationTimestamp: "2021-04-08T12:32:41Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: nfs-server-provisioner
    release: nextcloud
  name: data-nextcloud-nfs-server-provisioner-0
  namespace: nfs
  resourceVersion: "27438361"
  uid: efb31302-5feb-4dbe-93f5-8994eb08c6ca
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: linstor-1
  volumeMode: Filesystem
  volumeName: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  phase: Bound
# kubectl get pv -o yaml pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: linstor.csi.linbit.com
  creationTimestamp: "2021-04-08T12:32:43Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-attacher/linstor-csi-linbit-com
  name: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
  resourceVersion: "27438405"
  uid: 5064a61a-88a1-47a7-a0bd-80669bf857f8
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: data-nextcloud-nfs-server-provisioner-0
    namespace: nfs
    resourceVersion: "27438312"
    uid: efb31302-5feb-4dbe-93f5-8994eb08c6ca
  csi:
    driver: linstor.csi.linbit.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1617814042512-8081-linstor.csi.linbit.com
    volumeHandle: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
  mountOptions:
  - errors=remount-ro
  persistentVolumeReclaimPolicy: Delete
  storageClassName: linstor-1
  volumeMode: Filesystem
status:
  phase: Bound
# kubectl get volumeattachments.storage.k8s.io -o yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: m1c29
  creationTimestamp: "2021-06-04T15:01:46Z"
  finalizers:
  - external-attacher/linstor-csi-linbit-com
  name: csi-0c36bf3aa3e14cee55d5e4f944e16a3e408d87aaad9ab86cc0255fdb08f40206
  resourceVersion: "78528835"
  uid: 94b6f060-5b73-49ed-a948-584f7c25e137
spec:
  attacher: linstor.csi.linbit.com
  nodeName: m1c29
  source:
    persistentVolumeName: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
status:
  attached: true
Any ideas?
The purpose of this issue is to track progress on making the HA Controller production ready (i.e. always enabled by default).
As of right now (v0.3.0), the HA Controller works, but only when specific conditions are met. Overall, the current issues mean that I'd hesitate to automatically enable the HA Controller for all pods (see #12).
My current plan for the HA Controller is a complete rewrite based on the things learned from the current version:
One of the main issues with this is of course how we can communicate a "may_promote" event to the second component. My idea here would be an annotation on the PV. The annotation is added whenever any node sees may_promote: yes. It should be removed by the node that promotes.
The "cluster" component would watch for this annotation on PVs and check whether failover needs to be triggered. The failover should happen immediately if a k8s node lists the volume in the volumesInUse field of its status. The failover process uses the same steps as the current implementation (delete VolumeAttachment, delete Pod). However, it might not be necessary to force-delete the VA to trigger a graceful detach from the CSI driver side (needs to be investigated). There should probably also be a (small) grace period for the pod to shut down in case the node in general is still online.
There is also the question of whether the node agent should help things along when it detects a disconnect on the local node (forcing I/O errors, unmounting, etc.). This has to be investigated further.
This is a proposal to increase the monitoring coverage of the HA controller, which currently only observes pods (StatefulSets) if they have been deployed with the label linstor.csi.linbit.com/on-storage-lost: remove (or any other configured label). If you use LINSTOR in a default StorageClass, all kinds of workloads may get deployed that will most likely not carry the label, so the HA controller cannot watch their storage's health state. In the event of a node outage the HA controller will only redeploy those workloads that have the configured label set; all others will remain in an unhealthy/unknown state.
IMHO it would be more convenient if the HA controller monitored all workloads/StatefulSets with DRBD volumes attached, without any manually added label. If the discovery implementation STORK uses seems inappropriate, an admission controller (https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/) for create and update events, which adds the label to every workload automatically, could be an alternative.
Of course this automatic discovery should be configurable, to give administrators the freedom to decide which approach fits their needs best. But in order to replace STORK with the HA controller (which is definitely the faster and more accurate option), auto-discovery of relevant workloads is a must-have feature.
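The admission-controller alternative would return a JSONPatch adding the label on create/update. The sketch below covers only the patch construction, not a full webhook server; it assumes the label key from the readme, and everything else is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

const haLabel = "linstor.csi.linbit.com/on-storage-lost"

// buildLabelPatch returns the JSONPatch a mutating webhook could return to
// add the HA controller label to a workload's pod template. It returns nil
// when the label is already present.
func buildLabelPatch(existing map[string]string) ([]byte, error) {
	if _, ok := existing[haLabel]; ok {
		return nil, nil // already labelled, nothing to patch
	}
	var patch []map[string]interface{}
	if len(existing) == 0 {
		// /metadata/labels may not exist yet; create the whole map in one op.
		patch = append(patch, map[string]interface{}{
			"op":    "add",
			"path":  "/metadata/labels",
			"value": map[string]string{haLabel: "remove"},
		})
	} else {
		// JSONPatch (RFC 6902) escapes '/' in path segments as '~1'.
		patch = append(patch, map[string]interface{}{
			"op":    "add",
			"path":  "/metadata/labels/linstor.csi.linbit.com~1on-storage-lost",
			"value": "remove",
		})
	}
	return json.Marshal(patch)
}

func main() {
	p, _ := buildLabelPatch(nil)
	fmt.Println(string(p))
}
```

A real webhook would also need the opt-out configurability mentioned above, e.g. skipping namespaces or workloads that carry an explicit "ignore" annotation.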
We created a bunch of new PVCs at the same time (when setting the debug option from piraeusdatastore/linstor-csi#172), and now the ha-controller is continuously evicting the pods due to lost quorum.
It seems that none of the nodes has been made Primary for the volume on the LINSTOR side:
│ pvc-b5a0c859-161b-498d-ac4f-00de2311a912 │ dedi1-node1.23-106-60-155.lon-01.uk │ 7008 │ Unused │ │ Unknown │ 2022-08-31 14:39:52 │
│ pvc-b5a0c859-161b-498d-ac4f-00de2311a912 │ vm6-cplane1.23-106-61-231.lon-01.uk │ 7008 │ Unused │ │ Unknown │ │
│ pvc-b5a0c859-161b-498d-ac4f-00de2311a912 │ vm9-node2.23-106-61-193.lon-01.uk │ 7008 │ Unused │ │ Unknown │ 2022-08-31 14:41:02 │
This is causing the ha-controller to continuously delete the pods when they are re-created:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning VolumeWithoutQuorum 70s linstor.linbit.com/HighAvailabilityController Pod was evicted because attached volume lost quorum
Warning VolumeWithoutQuorum 60s linstor.linbit.com/HighAvailabilityController Pod was evicted because attached volume lost quorum
Warning FailedScheduling 79s default-scheduler 0/4 nodes are available: 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {drbd.linbit.com/lost-quorum: }, 2 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
Normal Scheduled 74s default-scheduler Successfully assigned team-100/supabase-data-nfs-server-provisioner-0 to dedi1-node1.23-106-60-155.lon-01.uk
Warning FailedAttachVolume 70s attachdetach-controller AttachVolume.Attach failed for volume "pvc-b5a0c859-161b-498d-ac4f-00de2311a912" : volume attachment is being deleted
Could it be that the same timeout affecting us in piraeusdatastore/linstor-csi#172 is causing none of the nodes to become Primary?