Comments (10)
@cin can you please take a look?
from operator-for-redis-cluster.
Thanks for the issue. This looks like a bug, @TANISH-18. I don't currently have an OpenStack cluster to test against. Maybe there's a way to reproduce the issue without OpenStack. So you basically shut down all the worker nodes and then brought them back up to get into this state? Do you see anything in the operator logs related to the issue? Also, what is the actual error on the pods? Thanks in advance!
So you basically shutdown all the worker nodes and then brought them back up to get into this state?
- Yes, correct.
Do you see anything in the operator logs related to the issue?
- Yes. One thing I noticed is that it keeps trying to reconnect to the pods that are stuck in the error state.
I1115 15:42:30.128018 7 connections.go:120] Cannot connect to 192.168.190.20:6379
I1115 15:42:30.128050 7 connections.go:196] Can't connect to 192.168.190.20:6379: dial tcp 192.168.190.20:6379: i/o timeout
I1115 15:42:30.131949 7 clusterinfo.go:163] Temporary inconsistency between nodes is possible. If the following inconsistency message persists for more than 20 mins, any cluster operation (scale, rolling update) should be avoided before the message is gone
I1115 15:42:30.131978 7 clusterinfo.go:164] Inconsistency from 192.168.190.54:6379:
It looks like the operator is trying to connect to the pods in the error state, but it did not try to delete those pods or spawn new ones. One thing I observed: as soon as I deleted the Redis pods stuck in the error state with kubectl delete, new pods were spawned and all the pods were alive.
what is the actual error on the pods?
- The Redis cluster pod status is Failed and the reason is Terminated.
Status: Failed
Reason: Terminated
Oh, it sounds like the operator is not recognizing the failed state and is just waiting for the pods to "fix themselves" (which they won't in this case bc the pods aren't managed by any type of replicaset -- they're managed by the operator). I'll see if I can reproduce locally w/kind.
Hey @cin, did you get a chance to reproduce the issue?
Sorry I got pulled into some other things yesterday and didn't get a chance to test. Will make some time today.
@TANISH-18, I was able to try this in one of our clusters (there's no good way to "restart" nodes in kind). I just rebooted all 3 nodes in the cluster at once. The pods all restarted and came back up fine. I think the main difference w/what I just did is that the pods were still there and were just restarted. I'm guessing your pods were recreated? Here's how my pods look after the reboot.
op-operator-for-redis-65d5f6fc78-qlgtg 1/1 Running 1 (8m22s ago) 27m 172.30.142.155 10.209.206.175
rc-node-for-redis-metrics-586bbb87c8-dtz42 1/1 Running 1 (8m22s ago) 26m 172.30.142.149 10.209.206.175
rediscluster-rc-node-for-redis-5gp9t 2/2 Running 2 (8m22s ago) 25m 172.30.142.152 10.209.206.175
rediscluster-rc-node-for-redis-cf479 2/2 Running 2 (8m46s ago) 26m 172.30.45.133 10.185.151.142
rediscluster-rc-node-for-redis-r94nt 2/2 Running 2 (8m35s ago) 26m 172.30.121.8 10.38.252.103
Did your operator or redis pods ever restart? Am I correct in thinking you got your cluster back to a healthy state by deleting the errored out pods? The prospect of automating that is a bit scary -- not because it's hard but because it's deleting resources. Since you've effectively lost all cache at that point, you can just reinstall the CR as well (easier than deleting all pods). That's not a great answer if you want to go autopilot on disaster recovery though. In the meantime, I'll see if I can get an openshift cluster approved.
@cin Yes, my operator pod restarts. Deleting all the Redis pods will not reproduce this issue; in that case my Redis cluster pods came back too. The issue occurs when all the K8s nodes are deleted from OpenStack and then started again after 5-10 minutes, basically the power outage case.
Anyway, I found the fix. During reconciliation the operator tries to connect to pods in the error state, so we need to delete failed pods while polling for failed Redis pods. Once all the pods in the failed state are deleted, the Redis cluster pods come back.
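A minimal sketch of that idea, assuming nothing about the operator's real code: the `Pod` struct, the `PodDeleter` interface, and `deleteFailedPods` are hypothetical stand-ins for the Kubernetes pod object and client, used only to show the "find pods in the Failed phase and delete them so they get recreated" step.

```go
package main

import "fmt"

// PodPhase mirrors the phase strings Kubernetes reports in a pod's status.
type PodPhase string

const (
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
)

// Pod is a hypothetical, stripped-down stand-in for a Kubernetes pod object.
type Pod struct {
	Name  string
	Phase PodPhase
}

// PodDeleter is a hypothetical interface standing in for the API client
// the operator would use to delete a pod.
type PodDeleter interface {
	DeletePod(name string) error
}

// deleteFailedPods deletes every pod whose phase is Failed and returns the
// names it deleted. It stops at the first deletion error.
func deleteFailedPods(pods []Pod, d PodDeleter) ([]string, error) {
	var deleted []string
	for _, p := range pods {
		if p.Phase != PodFailed {
			continue // healthy or pending pods are left alone
		}
		if err := d.DeletePod(p.Name); err != nil {
			return deleted, fmt.Errorf("deleting pod %s: %w", p.Name, err)
		}
		deleted = append(deleted, p.Name)
	}
	return deleted, nil
}

// recordingDeleter is a fake PodDeleter that only records what it was asked
// to delete, so the sketch runs without a cluster.
type recordingDeleter struct{ names []string }

func (r *recordingDeleter) DeletePod(name string) error {
	r.names = append(r.names, name)
	return nil
}

func main() {
	pods := []Pod{
		{Name: "rediscluster-a", Phase: PodRunning},
		{Name: "rediscluster-b", Phase: PodFailed},
		{Name: "rediscluster-c", Phase: PodFailed},
	}
	deleted, err := deleteFailedPods(pods, &recordingDeleter{})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("deleted:", deleted)
}
```

In a real reconcile loop the deleter would be backed by the Kubernetes API, and the operator would then create replacement pods on its next pass, the same effect as the manual kubectl delete described earlier in the thread.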
UPDATE: I went a step further and deleted the only worker pool in my test cluster. The redis node pods completely went away as expected. The operator and metrics pods went into a pending state (also expected). I then recreated a new worker pool and everything came back w/out incident. I'm starting to think this is an openshift thing only. I wonder what's different...
@TANISH-18 you may want to try out the latest version of the operator as #84 may have resolved this issue as well.