Comments (10)

TANISH-18 commented on July 17, 2024

@cin can you please take a look?

from operator-for-redis-cluster.

cin commented on July 17, 2024

Thanks for the issue. This looks like a bug, @TANISH-18. I don't currently have an OpenStack cluster to test against. Maybe there's a way to reproduce the issue without OpenStack. So you basically shut down all the worker nodes and then brought them back up to get into this state? Do you see anything in the operator logs related to the issue? Also, what is the actual error on the pods? Thanks in advance!

TANISH-18 commented on July 17, 2024

So you basically shut down all the worker nodes and then brought them back up to get into this state?

  • Yes, correct.

Do you see anything in the operator logs related to the issue?

  • Yes. One thing I noticed is that it keeps trying to reconnect to the pods that are stuck in the error state:
    I1115 15:42:30.128018 7 connections.go:120] Cannot connect to 192.168.190.20:6379
    I1115 15:42:30.128050 7 connections.go:196] Can't connect to 192.168.190.20:6379: dial tcp 192.168.190.20:6379: i/o timeout
    I1115 15:42:30.131949 7 clusterinfo.go:163] Temporary inconsistency between nodes is possible. If the following inconsistency message persists for more than 20 mins, any cluster operation (scale, rolling update) should be avoided before the message is gone
    I1115 15:42:30.131978 7 clusterinfo.go:164] Inconsistency from 192.168.190.54:6379:

It looks like the operator is trying to connect to the pods in the error state, but it does not try to delete those pods or spawn new ones. One thing I observed: as soon as I delete the Redis pods stuck in the error state with kubectl delete, new pods are spawned and all the pods come up healthy.

What is the actual error on the pods?

  • The Redis cluster pod status is Failed and the reason is Terminated.
    Status: Failed
    Reason: Terminated

cin commented on July 17, 2024

Oh, it sounds like the operator is not recognizing the failed state and is just waiting for the pods to "fix themselves" (which they won't in this case, because the pods aren't managed by any type of ReplicaSet; they're managed by the operator). I'll see if I can reproduce locally with kind.

TANISH-18 commented on July 17, 2024

Hey @cin, did you get a chance to reproduce the issue?

cin commented on July 17, 2024

Sorry I got pulled into some other things yesterday and didn't get a chance to test. Will make some time today.

cin commented on July 17, 2024

@TANISH-18, I was able to try this in one of our clusters (there's no good way to "restart" nodes in kind). I just rebooted all 3 nodes in the cluster at once. The pods all restarted and came back up fine. I think the main difference w/what I just did is that the pods were still there and were just restarted. I'm guessing your pods were recreated? Here's how my pods look after the reboot.

NAME                                         READY   STATUS    RESTARTS        AGE   IP               NODE
op-operator-for-redis-65d5f6fc78-qlgtg       1/1     Running   1 (8m22s ago)   27m   172.30.142.155   10.209.206.175
rc-node-for-redis-metrics-586bbb87c8-dtz42   1/1     Running   1 (8m22s ago)   26m   172.30.142.149   10.209.206.175
rediscluster-rc-node-for-redis-5gp9t         2/2     Running   2 (8m22s ago)   25m   172.30.142.152   10.209.206.175
rediscluster-rc-node-for-redis-cf479         2/2     Running   2 (8m46s ago)   26m   172.30.45.133    10.185.151.142
rediscluster-rc-node-for-redis-r94nt         2/2     Running   2 (8m35s ago)   26m   172.30.121.8     10.38.252.103

Did your operator or redis pods ever restart? Am I correct in thinking you got your cluster back to a healthy state by deleting the errored out pods? The prospect of automating that is a bit scary -- not because it's hard but because it's deleting resources. Since you've effectively lost all cache at that point, you can just reinstall the CR as well (easier than deleting all pods). That's not a great answer if you want to go autopilot on disaster recovery though. In the meantime, I'll see if I can get an openshift cluster approved.

TANISH-18 commented on July 17, 2024

@cin yes, my operator pod restarts. Deleting all the Redis pods will not reproduce this issue; in that case my Redis cluster pods came back too. The issue shows up when all the K8s nodes are deleted from OpenStack and started again after 5-10 minutes, basically the power outage case.

Anyway, I found the fix. During reconciliation the operator tries to connect to pods in the error state, so we need to delete failed pods while polling for failed Redis pods. Once all the pods in the failed state are deleted, the Redis cluster pods come back.
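
For anyone hitting the same problem, here is a rough sketch of that idea, not the operator's actual code. It is written in Python with plain data classes so it runs without a cluster; `Pod`, `split_failed`, `reconcile_once`, and the `delete_pod` callback are all hypothetical names made up for illustration. The only thing taken from this thread is the rule itself: pods whose status is Failed get deleted during the polling pass instead of being retried.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    """Hypothetical stand-in for a Kubernetes pod; only the fields used here."""
    name: str
    phase: str        # e.g. "Running", "Pending", "Failed"
    reason: str = ""  # e.g. "Terminated"


def split_failed(pods):
    """Partition pods into (healthy, failed) based on their status phase."""
    healthy = [p for p in pods if p.phase != "Failed"]
    failed = [p for p in pods if p.phase == "Failed"]
    return healthy, failed


def reconcile_once(pods, delete_pod):
    """One polling pass: delete failed pods and return the rest.

    delete_pod is a caller-supplied callback (an assumption of this sketch).
    In a real operator it would issue the API delete so that replacement
    pods can be created on the next pass.
    """
    healthy, failed = split_failed(pods)
    for pod in failed:
        delete_pod(pod.name)
    # Only the healthy pods are worth connecting to in this pass.
    return healthy


if __name__ == "__main__":
    pods = [
        Pod("redis-a", "Failed", "Terminated"),
        Pod("redis-b", "Running"),
    ]
    deleted = []
    remaining = reconcile_once(pods, deleted.append)
    print("deleted:", deleted)
    print("remaining:", [p.name for p in remaining])
```

Passing the delete action in as a callback keeps the decision (which pods to drop) separate from the side effect (actually deleting them), so the destructive part is easy to guard or dry-run.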

cin commented on July 17, 2024

UPDATE: I went a step further and deleted the only worker pool in my test cluster. The redis node pods completely went away as expected. The operator and metrics pods went into a pending state (also expected). I then created a new worker pool and everything came back without incident. I'm starting to think this is an openshift thing only. I wonder what's different...

cin commented on July 17, 2024

@TANISH-18 you may want to try out the latest version of the operator as #84 may have resolved this issue as well.
