Steps to reproduce the issue: Redis-operator and redis-cluster up

Full Cluster Shutdown of all k8s-nodes results redis-cluster pods in Error State (operator-for-redis-cluster, open issue)

ibm commented on July 17, 2024

Full Cluster Shutdown of all k8s-nodes results redis-cluster pods in Error State

from operator-for-redis-cluster.

Comments (10)

TANISH-18 commented on July 17, 2024

@cin can you please take a look?

cin commented on July 17, 2024

Thanks for the issue. This looks like a bug, @TANISH-18. I don't currently have an OpenStack cluster to test against. Maybe there's a way to reproduce the issue w/out OpenStack. So you basically shut down all the worker nodes and then brought them back up to get into this state? Do you see anything in the operator logs related to the issue? Also, what is the actual error on the pods? Thanks in advance!

TANISH-18 commented on July 17, 2024

So you basically shut down all the worker nodes and then brought them back up to get into this state?

Yes, correct.

Do you see anything in the operator logs related to the issue?

Yes. One thing I noticed is that it keeps trying to reconnect to the pods that are stuck in the Error state:
I1115 15:42:30.128018 7 connections.go:120] Cannot connect to 192.168.190.20:6379
I1115 15:42:30.128050 7 connections.go:196] Can't connect to 192.168.190.20:6379: dial tcp 192.168.190.20:6379: i/o timeout
I1115 15:42:30.131949 7 clusterinfo.go:163] Temporary inconsistency between nodes is possible. If the following inconsistency message persists for more than 20 mins, any cluster operation (scale, rolling update) should be avoided before the message is gone
I1115 15:42:30.131978 7 clusterinfo.go:164] Inconsistency from 192.168.190.54:6379:

It looks like the operator keeps trying to connect to the pods in the Error state, but it never tries to delete those pods or spawn new ones. One thing I observed: as soon as I delete the Redis pods stuck in the Error state with kubectl delete, new pods are spawned and all of them come up healthy.

what is the actual error on the pods?

The redis-cluster pod status is Failed and the reason is Terminated:
Status: Failed
Reason: Terminated
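
To make the condition concrete, here is a minimal Go sketch of picking out pods in that state. The `Pod` struct and the `failedPods` helper are hypothetical stand-ins written for illustration; they are not the operator's code and not the real Kubernetes API types. The pod names in `main` are made up as well.

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down stand-in for the pod status fields
// shown above (Status and Reason). It is not the real corev1.Pod type.
type Pod struct {
	Name   string
	Phase  string // e.g. "Running", "Pending", "Failed"
	Reason string // e.g. "Terminated"
}

// failedPods returns the pods whose phase is "Failed", preserving order.
func failedPods(pods []Pod) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.Phase == "Failed" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Made-up pod names, for illustration only.
	pods := []Pod{
		{Name: "redis-a", Phase: "Running"},
		{Name: "redis-b", Phase: "Failed", Reason: "Terminated"},
	}
	for _, p := range failedPods(pods) {
		fmt.Printf("%s: %s (%s)\n", p.Name, p.Phase, p.Reason)
	}
}
```

A check like this only identifies the stuck pods; what to do with them is a separate decision.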

cin commented on July 17, 2024

Oh, it sounds like the operator is not recognizing the failed state and is just waiting for the pods to "fix themselves" (which they won't in this case bc the pods aren't managed by any type of replicaset -- they're managed by the operator). I'll see if I can reproduce locally w/kind.

TANISH-18 commented on July 17, 2024

Hey @cin, did you get a chance to reproduce the issue?

cin commented on July 17, 2024

Sorry, I got pulled into some other things yesterday and didn't get a chance to test. I'll make some time today.

cin commented on July 17, 2024

@TANISH-18, I was able to try this in one of our clusters (there's no good way to "restart" nodes in kind). I just rebooted all 3 nodes in the cluster at once. The pods all restarted and came back up fine. I think the main difference w/what I just did is that the pods were still there and were just restarted. I'm guessing your pods were recreated? Here's how my pods look after the reboot.

op-operator-for-redis-65d5f6fc78-qlgtg       1/1     Running   1 (8m22s ago)   27m   172.30.142.155   10.209.206.175
rc-node-for-redis-metrics-586bbb87c8-dtz42   1/1     Running   1 (8m22s ago)   26m   172.30.142.149   10.209.206.175
rediscluster-rc-node-for-redis-5gp9t         2/2     Running   2 (8m22s ago)   25m   172.30.142.152   10.209.206.175
rediscluster-rc-node-for-redis-cf479         2/2     Running   2 (8m46s ago)   26m   172.30.45.133    10.185.151.142
rediscluster-rc-node-for-redis-r94nt         2/2     Running   2 (8m35s ago)   26m   172.30.121.8     10.38.252.103

Did your operator or redis pods ever restart? Am I correct in thinking you got your cluster back to a healthy state by deleting the errored out pods? The prospect of automating that is a bit scary -- not because it's hard but because it's deleting resources. Since you've effectively lost all cache at that point, you can just reinstall the CR as well (easier than deleting all pods). That's not a great answer if you want to go autopilot on disaster recovery though. In the meantime, I'll see if I can get an openshift cluster approved.

TANISH-18 commented on July 17, 2024

@cin Yes, my operator pod restarts. Deleting all the Redis pods does not reproduce this issue; in that case my redis-cluster pods come back as well. The issue appears when all the k8s nodes are shut down from OpenStack and started again after 5-10 minutes, which is basically the power outage case.

Anyway, I found the fix. During reconciliation the operator keeps trying to connect to the pods in the Error state, so we need to delete the failed pods while polling for failed Redis pods. Once all the pods in the Failed state are deleted, the redis-cluster pods come back.
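
The idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual patch: `Pod`, `PodDeleter`, `deleteFailedPods`, and `recordingDeleter` are hypothetical names, and a real operator would put a Kubernetes client behind the deleter interface.

```go
package main

import "fmt"

// Pod is a hypothetical minimal view of a pod: its name and phase only.
type Pod struct {
	Name  string
	Phase string // e.g. "Running", "Failed"
}

// PodDeleter is a hypothetical interface. In a real operator a Kubernetes
// client would implement it; here it keeps the sketch self-contained.
type PodDeleter interface {
	DeletePod(name string) error
}

// deleteFailedPods runs one polling pass: it deletes every pod in phase
// "Failed" and leaves all other pods alone. It returns the names that were
// deleted and the first deletion error seen (nil if none). A failed delete
// does not stop the pass; the remaining pods are still processed.
func deleteFailedPods(pods []Pod, d PodDeleter) ([]string, error) {
	var deleted []string
	var firstErr error
	for _, p := range pods {
		if p.Phase != "Failed" {
			continue
		}
		if err := d.DeletePod(p.Name); err != nil {
			if firstErr == nil {
				firstErr = fmt.Errorf("delete pod %s: %w", p.Name, err)
			}
			continue
		}
		deleted = append(deleted, p.Name)
	}
	return deleted, firstErr
}

// recordingDeleter is a test double that records which pods it was asked
// to delete instead of talking to a cluster.
type recordingDeleter struct{ calls []string }

func (r *recordingDeleter) DeletePod(name string) error {
	r.calls = append(r.calls, name)
	return nil
}

func main() {
	// Made-up pod names, for illustration only.
	pods := []Pod{
		{Name: "redis-a", Phase: "Running"},
		{Name: "redis-b", Phase: "Failed"},
		{Name: "redis-c", Phase: "Failed"},
	}
	deleted, err := deleteFailedPods(pods, &recordingDeleter{})
	fmt.Println("deleted:", deleted, "err:", err)
}
```

Putting the delete call behind an interface keeps the destructive step easy to test, or to swap for a log-only implementation, before it is pointed at a real cluster.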

cin commented on July 17, 2024

UPDATE: I went a step further and deleted the only worker pool in my test cluster. The redis node pods completely went away as expected. The operator and metrics pods went into a pending state (also expected). I then created a new worker pool and everything came back w/out incident. I'm starting to think this is an openshift thing only. I wonder what's different...

cin commented on July 17, 2024

@TANISH-18 you may want to try out the latest version of the operator as #84 may have resolved this issue as well.
