Comments (20)
> So you want to run this cross region (across multiple Kubernetes clusters)?
It's a single k8s cluster.
We have multiple zones, and each zone has multiple k8s nodes. Ideally the operator would schedule masters and replicas in different zones, but when it cannot do that, i.e. when it goes into best-effort mode, the operator should also try to keep the master and replica on different k8s nodes within the zone.
> so I'm not sure exactly what you're proposing.
Maybe I can try creating a PR for it.
Apologies for the late response, I was out sick.
Do you have zoneAwareReplication enabled? We try not to put primaries and replicas in the same zone if possible.
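For reference, this is driven by a chart value. A minimal sketch of a values override, assuming the key sits at the top level of the node-for-redis chart's values.yaml (the exact location may differ between chart versions):

```yaml
# Sketch of a node-for-redis values override -- the key's location may differ per chart version.
zoneAwareReplication: true   # spread primaries and replicas across zones when possible
```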
@cin we do have zoneAwareReplication enabled. How does the operator identify zones?
Cluster Status:
  Status:
    Cluster:
      Label Selector Path: app.kubernetes.io/component=database,app.kubernetes.io/instance=cluster,app.kubernetes.io/name=node-for-redis,redis-operator.k8s.io/cluster-name=cluster-node-for-redis
      Max Replication Factor: 1
      Min Replication Factor: 1
      Nodes:
        Id: e40ed93492a443b683633c0c278461673cc613fc
        Ip: 192.168.250.197
        Pod Name: rediscluster-cluster-node-for-redis-whcb7
        Port: 6379
        Role: Primary
        Slots:
          0-1
          5464-10923
        Zone: cnaDevPool-NoSRIOV
        Id: a698d551e968af578b93c102808b9de063067b4f
        Ip: 192.168.76.69
        Pod Name: rediscluster-cluster-node-for-redis-m87m2
        Port: 6379
        Primary Ref: 89b8f81f6e390caf9d4080b92e96c9bc6d5bc432
        Role: Replica
        Zone: cnaDevPool-NoSRIOV
        Id: cc3cbb3d49e9c6cc1df8e751d31cb4f4401d6f34
        Ip: 192.168.76.70
        Pod Name: rediscluster-cluster-node-for-redis-hh7ms
        Port: 6379
        Role: Primary
        Slots:
          2-5461
        Zone: cnaDevPool-NoSRIOV
        Id: a2906900b89c057b4d175c9803e8a4847ebd9c4c
        Ip: 192.168.20.197
        Pod Name: rediscluster-cluster-node-for-redis-5rvsx
        Port: 6379
        Primary Ref: e40ed93492a443b683633c0c278461673cc613fc
        Role: Replica
        Zone: cnaDevPool-NoSRIOV
        Id: ff18d5f2ee4c073edf9763bad42a310a4829d8fb
        Ip: 192.168.76.71
        Pod Name: rediscluster-cluster-node-for-redis-qfhtx
        Port: 6379
        Primary Ref: cc3cbb3d49e9c6cc1df8e751d31cb4f4401d6f34
        Role: Replica
        Zone: cnaDevPool-NoSRIOV
        Id: 89b8f81f6e390caf9d4080b92e96c9bc6d5bc432
        Ip: 192.168.250.196
        Pod Name: rediscluster-cluster-node-for-redis-t57vw
        Port: 6379
        Role: Primary
        Slots:
          5462-5463
          10924-16383
        Zone: cnaDevPool-NoSRIOV
Here's the logic in the chart. We're using the topology.kubernetes.io/zone label as the topologyKey. Is that label maybe not set, or do you only have one zone and we maybe have a bug?
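For context, that topologyKey typically feeds a pod anti-affinity term roughly shaped like the following; this is a generic sketch, not the chart's exact template, and the label selector is illustrative:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: node-for-redis   # illustrative selector
          topologyKey: topology.kubernetes.io/zone     # prefer spreading the Redis pods across zones
```

If every node carries the same topology.kubernetes.io/zone value, a preferred term like this is satisfied trivially and provides no spreading.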
FWIW, you can define a host anti-affinity (or w/e affinity you'd like) by overriding the affinity value in the chart.
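A minimal sketch of such an override, assuming the chart exposes the pod affinity under a top-level affinity value as described; the selector labels are taken from the cluster status above and may need adjusting:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: node-for-redis
            redis-operator.k8s.io/cluster-name: cluster-node-for-redis
        topologyKey: kubernetes.io/hostname   # hard rule: no two Redis pods on the same worker
```

Note that a required host anti-affinity like this needs at least as many schedulable workers as Redis pods, and (as discussed below) it still doesn't control which pod becomes a primary or a replica.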
> We're using the topology.kubernetes.io/zone label as the topologyKey.
Looks like all our nodes have the same label value for topology.kubernetes.io/zone.
> FWIW, you can define a host anti-affinity (or w/e affinity you'd like) by overriding the affinity value in the chart.
Ah awesome, this is super helpful; I can try it out. Thanks for your help!
@cin how would this make sure that pods for the same shard are not scheduled on the same node? Wouldn't this apply generally to all pods?
Good point. An affinity won't help here. I will take a look at the scheduling code -- there's probably something we can do there. This has never come up for us as we generally don't share database nodes' resources. Just curious...how big is your cluster? Why aren't you using multiple zones for your workers?
Actually, can you link your rediscluster plugin output?
> Actually, can you link your rediscluster plugin output?
Yeah, plugin output:
POD NAME IP NODE ID ZONE USED MEMORY MAX MEMORY KEYS SLOTS
+ rediscluster-cluster-node-for-redis-hh7ms 192.168.76.70 10.164.33.104 cc3cbb3d49e9c6cc1df8e751d31cb4f4401d6f34 cnaDevPool-NoSRIOV 20.56M 9.44G 2-5461
| rediscluster-cluster-node-for-redis-qfhtx 192.168.76.71 10.164.33.104 ff18d5f2ee4c073edf9763bad42a310a4829d8fb cnaDevPool-NoSRIOV 2.58M 9.44G
+ rediscluster-cluster-node-for-redis-t57vw 192.168.250.196 10.164.33.103 89b8f81f6e390caf9d4080b92e96c9bc6d5bc432 cnaDevPool-NoSRIOV 2.67M 9.44G 5462-5463 10924-16383
| rediscluster-cluster-node-for-redis-m87m2 192.168.76.69 10.164.33.104 a698d551e968af578b93c102808b9de063067b4f cnaDevPool-NoSRIOV 14.66M 9.44G
+ rediscluster-cluster-node-for-redis-whcb7 192.168.250.197 10.164.33.103 e40ed93492a443b683633c0c278461673cc613fc cnaDevPool-NoSRIOV 8.61M 9.44G 0-1 5464-10923
| rediscluster-cluster-node-for-redis-5rvsx 192.168.20.197 10.164.33.102 a2906900b89c057b4d175c9803e8a4847ebd9c4c cnaDevPool-NoSRIOV 2.62M 9.44G
NAME NAMESPACE PODS OPS STATUS REDIS STATUS NB PRIMARY REPLICATION ZONE SKEW
cluster-node-for-redis default 6/6/6 ClusterOK OK 3/3 1-1/1 0/0/BALANCED
It's interesting that the k8s scheduler put 3 pods on the 104 worker. After the pods are scheduled, that's when the operator kicks in to decide what type of Redis pod it's going to be. In this case, it had to go into best effort mode. One thing you could possibly do to help this would be to give your Redis pods more memory/CPU so no more than 2 Redis pods can be scheduled on the same worker. I think that's still not going to eliminate issues where two primaries or replicas get scheduled on the same worker. We should confirm this after bumping up the resources per Redis pod.
Do you have info level logging enabled by chance? We log quite a bit of info about what choices were made when determining the Redis node type (primary/replica).
TBH, I'm not sure you want your cluster set up this way (not saying we shouldn't look into making the operator behave better in this case -- especially for dev/qa/testing clusters). Is there a reason you're all in one zone? Are you sure you'll safely be able to share resources w/replicas and primaries running on the same worker? One or more nodes in this zone could go out for w/e reason. Obviously a zone outage would mean certain downtime.
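For illustration, the resource bump being suggested would look something like this in the chart values; the path and the numbers are placeholders and depend on your worker size:

```yaml
# Hypothetical sizing: with ~32Gi workers, requests this large leave room for at most
# two Redis pods per worker, so three pods can no longer land on the same node.
resources:
  requests:
    cpu: "2"
    memory: 12Gi
  limits:
    cpu: "2"
    memory: 12Gi
```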
> Is there a reason you're all in one zone?
This is a dev environment. The prod env might be different from this.
> Do you have info level logging enabled by chance?
I do not, I can enable it and try again.
Apart from this, do you think it would make sense to add labels to pods identifying which shard they belong to?
I just tried it several times w/out it and at least the primaries and replicas were all on different nodes. You definitely wouldn't want two primaries (or two replicas) on a worker node. I'll look at the code and see if there's anything we can do to ensure primaries and their replicas don't end up on the same worker.
> Apart from this, do you think it would make sense to add labels to pods identifying which shard they belong to?
Do you mean, adding a label for which slots the primary holds? My only argument against doing that is that labels have a length limit and as nodes get added/removed their slots get shifted around (this list can actually grow pretty large). We added a lot of this information to the redis cluster plugin for this purpose. Until then it was hard to even figure out which nodes were primaries.
> I'll look at the code and see if there's anything we can do to ensure primaries and their replicas don't end up on the same worker.
Thanks!
> Do you mean, adding a label for which slots the primary holds?
I mean all the pods belonging to the same shard (primaries and replicas) should have a label like my-redis-shard=0.
> My only argument against doing that is that labels have a length limit and as nodes get added/removed their slots get shifted around (this list can actually grow pretty large).
Since we are just keeping track of the current shard members, I don't think this would grow big.
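For illustration, the proposal is roughly the following on every pod in a shard; my-redis-shard is a hypothetical label name, not something the operator sets today:

```yaml
metadata:
  labels:
    app.kubernetes.io/name: node-for-redis
    my-redis-shard: "0"   # hypothetical: identical value on a primary and all of its replicas
```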
Ah, it's all coming back to me now -- it's been a while since I looked at this part of the code. The way the redis node type selection is done is a bit odd. Pods are not part of a deployment or statefulset or anything. This gives a lot of flexibility in what you can do w/downed workers, outages, etc. Only after all pods have been created are primaries and replicas sorted out. It's all done in one pass so the algorithm suffers from the problem of not knowing where other primaries/replicas have been scheduled (or could be scheduled); so by the time it gets to the end of the list, there's only one pod that can become a replica for the primary in question (and it happens to be on the same worker sometimes). I think I can make it better by not allowing it to pick replicas on the same worker in the "optimal" selection phase. I will test some things out today and see if it helps.
In regard to the label, where do you get the shard ID? I am not aware of such a property in Redis. There's the ID string, but that wouldn't help you identify the replicas easily. The other issue I have with this is that we'd be adding the label after the pod has been created (bc we don't know what slots a pod will hold until after it's been scheduled), and it also feels like treating the pods more like pets than cattle.
Hey, thanks again for looking into it. I was checking out the code and found the simple logic that is used for pod scheduling. For our dev environment, I changed the topologyKey from topology.kubernetes.io/zone to a node-unique label, but for our prod env we want the pods to be spread over multiple zones as well as racks. To do that, we were thinking about adding a new chart parameter, rackAwareReplication, which would essentially add another topology key in the RedisCluster CR plus some changes to the scheduling logic mentioned above. Wdyt about that solution?
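A sketch of what that proposal could look like as chart values; rackAwareReplication and the rack topology key shown here are hypothetical, not existing chart options:

```yaml
zoneAwareReplication: true                      # existing option: zone-level spreading
rackAwareReplication: true                      # proposed option (does not exist today)
rackTopologyKey: topology.kubernetes.io/rack    # hypothetical node label identifying the rack
```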
> In regard to the label, where do you get the shard ID? I am not aware of such a property in Redis.
Yeah, there isn't such an ID. I just meant the shards could be given a unique ID in the operator, which we could use to keep track of a shard's primaries and secondaries.
> The other issue I have with this is that we'd be adding the label after the pod has been created (bc we don't know what slots a pod will hold until after it's been scheduled), and it also feels like treating the pods more like pets than cattle.
In that case, the labels won't really help with scheduling.
I'm glad you have a workaround for the scheduling behavior. :) I tend to treat zones as racks in our operators because you don't really get that level of control/detail on managed cloud services. In your prod configuration, if you use the zoneAwareReplication feature along with the default zone label, pods should distribute properly across zones, and replicas should end up in different zones than primaries (and other replicas). This should work even w/out the PR that I started.
In regard to the PR, things seem to be working much better. Deterministically placing (striping) the pods has helped immensely. You can still end up w/some weird situations when one node gets more pods scheduled than others, but that is to be expected (and can be fixed by altering the pod's resources to better fit the node). We should add some similar logic on scale down, as it isn't picking the "ideal" pod(s) to remove. For example, if the RF is greater than or equal to the number of nodes, the current algorithm will oftentimes leave you with replicas on the primary's node.
> I'm glad you have a workaround for the scheduling behavior.
Unfortunately, the workaround is just for the dev environment. In our prod env we would still need zoneAwareReplication as well as rackAwareReplication, since we want our pods to be distributed over zones (regions) as well as racks.
> In regard to the PR, things seem to be working much better.
That's awesome. I'll try to read through the scheduling logic. Thanks!
I'm confused about how your cluster is set up. So you want to run this cross region (across multiple Kubernetes clusters)? There's no concept of racks in Kubernetes, so I'm not sure exactly what you're proposing. https://kubernetes.io/docs/setup/best-practices/multiple-zones/ is the only thing I'm aware of in this regard.
> the operator should also try to keep the master and replica on different k8s nodes within the zone.
It looks like your PR is actually trying to do this.