ibm / operator-for-redis-cluster Goto Github PK

IBM Operator for Redis Cluster

Home Page: https://ibm.github.io/operator-for-redis-cluster

License: MIT License

Makefile 0.51% Go 96.34% Smarty 1.02% Shell 2.13%

operator-for-redis-cluster's Introduction

operator-for-redis-cluster

The goal of this project is to simplify the deployment and management of a Redis cluster in a Kubernetes environment. It started internally at Amadeus in 2016, where it was initially designed to run on Openshift. This is the third version of the Redis operator, which leverages the Operator SDK framework for operators.

Documentation

This project's documentation is hosted at https://ibm.github.io/operator-for-redis-cluster.

operator-for-redis-cluster's People

Contributors

Stargazers

Watchers

operator-for-redis-cluster's Issues

Deleting master pod from one shard causes master updates in other shards

We had a cluster deployed:

[root@platformtester-k8-master-1-a00061-4b2c09f7b009c1b8 scripts]# kubectl-rc
  POD NAME                             IP               NODE         ID                                        ZONE             USED MEMORY  MAX MEMORY  KEYS  SLOTS
  + rediscluster-node-for-redis-cjtgr  192.168.172.70   10.15.11.85  edb9266a55c8a2806cb860724984dd5431d7aae0  platform-zone-B  3.00M        2.80G             6554-9830
  | rediscluster-node-for-redis-bzlgd  192.168.193.99   10.15.11.82  a91df8f7488bec237b41ffc851f0a1fadb24cbd7  platform-zone-A  5.18M        2.80G           
  | rediscluster-node-for-redis-zksts  192.168.62.44    10.15.11.84  fe7a82302fe6f39f771c11d1e67211ea1008966d  platform-zone    2.95M        2.80G           
  + rediscluster-node-for-redis-fct4j  192.168.62.41    10.15.11.84  dfe280003c3a1e09310e6ae1ecf38c50331549b9  platform-zone    5.40M        2.80G             0-3276
  | rediscluster-node-for-redis-2pcfl  192.168.81.119   10.15.11.83  091f077dcb45c6dce7f56e1ab662af0edcf66bc7  platform-zone-A  2.90M        2.80G           
  | rediscluster-node-for-redis-d6zr4  192.168.172.117  10.15.11.85  3dd335dd3d359aff5c868e0ece5336edbcf4e255  platform-zone-B  3.94M        2.80G           
  + rediscluster-node-for-redis-jhd9k  192.168.81.118   10.15.11.83  ccafee5678d33943c2d60003a3f55ac413ddf193  platform-zone-A  8.22M        2.80G             3277-6553
  | rediscluster-node-for-redis-8wsfq  192.168.62.43    10.15.11.84  5c43965a7f444284d19af0e76f26fba01b8a7844  platform-zone    3.14M        2.80G           
  | rediscluster-node-for-redis-xwnpb  192.168.57.109   10.15.11.87  6522856556acb5fe7b34c2d7d498a88c22f5757e  platform-zone-C  3.79M        2.80G           
  + rediscluster-node-for-redis-p2fdg  192.168.57.103   10.15.11.87  24a87a775d54adfe2069fc20368704800707d3c2  platform-zone-C  10.95M       2.80G             13108-16383
  | rediscluster-node-for-redis-cz2rv  192.168.193.100  10.15.11.82  f24a67e5090cec59b0301b3934773d5975eb6619  platform-zone-A  3.26M        2.80G           
  | rediscluster-node-for-redis-d9qw9  192.168.81.117   10.15.11.83  cc82bc2e2cffefeaeb34983b224e602505e5c163  platform-zone-A  3.05M        2.80G           
  + rediscluster-node-for-redis-p8l5q  192.168.193.97   10.15.11.82  ff537ede00a7c508ae4d322c83a45edf6b99570a  platform-zone-A  5.74M        2.80G             9831-13107
  | rediscluster-node-for-redis-fcq8r  192.168.172.127  10.15.11.85  7763b26883837808d9a8550af4cfc85a3938b7e1  platform-zone-B  2.94M        2.80G           
  | rediscluster-node-for-redis-nqwpj  192.168.57.110   10.15.11.87  35275718f7c3fb032f439855505a88f3e6d458ca  platform-zone-C  3.45M        2.80G           

  NAME            NAMESPACE          PODS      OPS STATUS  REDIS STATUS  NB PRIMARY  REPLICATION  ZONE SKEW
  node-for-redis  fed-redis-cluster  15/15/15  ClusterOK   OK            5/5         2-2/2        1/2/BALANCED

And we were performing some resiliency tests on it. So, we deleted a master pod for one of the shards and check how it recovers.

[root@platformtester-k8-master-1-a00061-4b2c09f7b009c1b8 scripts]# kc delete pod rediscluster-node-for-redis-jhd9k 
pod "rediscluster-node-for-redis-jhd9k" deleted

The cluster was successfully able to recover, and a replica pod replaced the deleted master pod.

[root@platformtester-k8-master-1-a00061-4b2c09f7b009c1b8 scripts]# kubectl-rc
  POD NAME                             IP               NODE         ID                                        ZONE             USED MEMORY  MAX MEMORY  KEYS        SLOTS
  + rediscluster-node-for-redis-2pcfl  192.168.81.119   10.15.11.83  091f077dcb45c6dce7f56e1ab662af0edcf66bc7  platform-zone-A  635.45M      2.80G       db0=143194  0-3276
  | rediscluster-node-for-redis-d6zr4  192.168.172.117  10.15.11.85  3dd335dd3d359aff5c868e0ece5336edbcf4e255  platform-zone-B  633.02M      2.80G       db0=143194
  | rediscluster-node-for-redis-fct4j  192.168.62.41    10.15.11.84  dfe280003c3a1e09310e6ae1ecf38c50331549b9  platform-zone    637.19M      2.80G       db0=143194
  + rediscluster-node-for-redis-cjtgr  192.168.172.70   10.15.11.85  edb9266a55c8a2806cb860724984dd5431d7aae0  platform-zone-B  599.32M      2.80G       db0=143255  6554-9830
  | rediscluster-node-for-redis-bzlgd  192.168.193.99   10.15.11.82  a91df8f7488bec237b41ffc851f0a1fadb24cbd7  platform-zone-A  599.25M      2.80G       db0=143255
  | rediscluster-node-for-redis-zksts  192.168.62.44    10.15.11.84  fe7a82302fe6f39f771c11d1e67211ea1008966d  platform-zone    599.58M      2.80G       db0=143255
  + rediscluster-node-for-redis-d9qw9  192.168.81.117   10.15.11.83  cc82bc2e2cffefeaeb34983b224e602505e5c163  platform-zone-A  597.94M      2.80G       db0=142787  13108-16383
  | rediscluster-node-for-redis-cz2rv  192.168.193.100  10.15.11.82  f24a67e5090cec59b0301b3934773d5975eb6619  platform-zone-A  597.33M      2.80G       db0=142787
  | rediscluster-node-for-redis-p2fdg  192.168.57.103   10.15.11.87  24a87a775d54adfe2069fc20368704800707d3c2  platform-zone-C  597.23M      2.80G       db0=142787
  + rediscluster-node-for-redis-p8l5q  192.168.193.97   10.15.11.82  ff537ede00a7c508ae4d322c83a45edf6b99570a  platform-zone-A  602.21M      2.80G       db0=143685  9831-13107
  | rediscluster-node-for-redis-fcq8r  192.168.172.127  10.15.11.85  7763b26883837808d9a8550af4cfc85a3938b7e1  platform-zone-B  602.91M      2.80G       db0=143685
  | rediscluster-node-for-redis-nqwpj  192.168.57.110   10.15.11.87  35275718f7c3fb032f439855505a88f3e6d458ca  platform-zone-C  602.56M      2.80G       db0=143685
  + rediscluster-node-for-redis-xwnpb  192.168.57.109   10.15.11.87  6522856556acb5fe7b34c2d7d498a88c22f5757e  platform-zone-C  596.71M      2.80G       db0=142831  3277-6553
  | rediscluster-node-for-redis-8wsfq  192.168.62.43    10.15.11.84  5c43965a7f444284d19af0e76f26fba01b8a7844  platform-zone    596.57M      2.80G       db0=142831
  | rediscluster-node-for-redis-9nxs4  192.168.81.120   10.15.11.83  d83782fb30b062810acffedde93ea6304de95e3d  platform-zone-A  601.99M      2.80G       db0=142831

  NAME            NAMESPACE          PODS      OPS STATUS  REDIS STATUS  NB PRIMARY  REPLICATION  ZONE SKEW
  node-for-redis  fed-redis-cluster  15/15/15  ClusterOK   OK            5/5         2-2/2        2/1/BALANCED

But if you check, the masters for other shards changed as well. This issue does not occur every time. Is this expected behavior?

[Question] Authenticating users with the Redis AUTH command

Hello, thanks for this useful software !
Can I use this operator to authenticate users with the Redis AUTH command?
Thanks

PodDisruptionBudget not updated when RedisCluster update

Problems

I am using v0.3.14 and found a problem that MinAvailable of PodDisruptionBudget is not updated when NumberOfPrimaries and ReplicationFactor of RedisCluster are updated.

Proposal

Add processing to recalculate and update MinAvailable of PodDisruptionBudget in reconcil of RedisCluster.

3-node cluster without any zones?

Hi there,

Just reading through the source and relevant issues in here to try and determine the node selection criteria when creating replicas. We run a 3-node Kubernetes cluster (with redis running outside the cluster currently) but are looking to move this into this operator.

From what I can gather, it looks like the replica placement is based on the zone topology key - what happens in a 3-node cluster, where there is no such thing as zones? Is the controller smart enough to not attach a replica to a primary on the same node? Obviously this would be undesired behaviour as a node going down would take the replica with it.

Happy to make a PR if pointed in the right direction!

PVC for data volume

Currently data volume in rediscluster.yaml helm template is defined as emptyDir. It would be greate to be able to define PersistentVolumeClaims in values.yaml.

Full Cluster Shutdown of all k8s-nodes results redis-cluster pods in Error State

Steps to reproduce the issue:
Redis-operator and redis-cluster up and running .
shutoff all K8 instances from openstack and startinstance after 5 mins .
Redis cluster pods are stuck in error state.
once all k8 nodes comes back, Redis-operator is up and running and it is able to bring up only 1 Redis-cluster pod.

There are two issues once all K8 cluster nodes came up.

Redis operator is not able to delete the pods stuck in error state.
Redis operator is only able to bring up 1 cluster pod. from the logs we can see after Reconciling one cluster pod it is finishing the Reconciling and also setting the CRO state OK.

kc get pods -n fed-redis-cluster -o wide

rediscluster-node-for-redis-9fhdz 0/2 Error 2 91m
rediscluster-node-for-redis-hj59b 0/2 Error 2 93m
rediscluster-node-for-redis-j69st 0/2 Error 0 37m
rediscluster-node-for-redis-w2m79 2/2 Running 0 36m

This is issue from Redis operator side. it is seen while performing the Resiliency Run for power outage case.

Max surge configuration

Is it possible to specify how many new pods can be created at one time? Similar as maxSurge settings for rollingUpdate for k8s deployments.

redis-cluster pods not distributing across multiple zones with topology spread constraint enabled for zone

Hi @cin , i am deploying fed-redis-cluster. even after enabling zone level topology i can see that pods of same shards are not spread across multiple zones.

POD NAME IP NODE ID ZONE USED MEMORY MAX MEMORY KEYS SLOTS

rediscluster-node-for-redis-4b86s 192.168.184.25 10.31.48.73 efba48e4f7426eca137f517f3d99470d92ac1521 smf-zone-c3 3.72M 2.80G db0=79 10924-16383
| rediscluster-node-for-redis-hxb77 192.168.215.39 10.31.48.89 154c075ffef92f4ac02a8d42dc3753ea2efab97b smf-zone-c1 3.37M 2.80G db0=79
| rediscluster-node-for-redis-pj4lt 192.168.24.59 10.31.48.71 68f05605dddd5f464efe1cd4888d830aa35658c1 smf-zone-c1 3.34M 2.80G db0=79
rediscluster-node-for-redis-mnrkt 192.168.138.167 10.31.48.87 50fc7c3df8f41bfd93f8f309b07bd1e353786c0e smf-zone-c0 3.82M 2.80G db0=83 5462-10923
| rediscluster-node-for-redis-4kmqk 192.168.7.136 10.31.48.76 89c69df5fa0dd1f10407ae0299b9df59c67e688b smf-zone-c3 3.39M 2.80G db0=83
| rediscluster-node-for-redis-bwh6n 192.168.215.2 10.31.48.89 c44b6fc635a33b1d1755d9eb206283f71ea8a825 smf-zone-c1 3.40M 2.80G db0=83
rediscluster-node-for-redis-n2s7j 192.168.215.52 10.31.48.89 2f82fc3f4ebff0f10b919545e2f774ea765c11c2 smf-zone-c1 4.11M 2.80G db0=99 0-5461
| rediscluster-node-for-redis-pn4zf 192.168.3.232 10.31.48.74 5398b6e6fd8f5fc7f3d1e2181901330ae4399a9f smf-zone-c3 3.45M 2.80G db0=99
| rediscluster-node-for-redis-s5v8l 192.168.34.14 10.31.48.88 6fc8a525a165c6119d29a367a21ea8a09cfb9cd1 smf-zone-c0 3.45M 2.80G db0=100

topologySpreadConstraints:

labelSelector:
matchLabels:
app.kubernetes.io/component: database
app.kubernetes.io/name: node-for-redis
maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app.kubernetes.io/component: database
app.kubernetes.io/name: node-for-redis
maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway

Pod Disruption Budget should use Min Available not Max unavailable

Expected Behavior

To prevent the Redis Cluster from going down, we need to make sure that only one of the pods that make up the Redis Cluster goes down.

Current Behavior

Currently set to 1 for Max unavailable in Pod Disruption Budget, but not working properly.

Name:             cluster-node-for-redis
Namespace:        default
Max unavailable:  1
Selector:         app.kubernetes.io/component=database,app.kubernetes.io/instance=cluster,app.kubernetes.io/name=node-for-redis,redis-operator.k8s.io/cluster-name=cluster-node-for-redis
Status:
    Allowed disruptions:  4
    Current:              6
    Desired:              2
    Total:                3

Possible Solution

Since the Kubernetes documentation states that "only Deployment, ReplicationController, ReplicaSet, and StatefulSet can use maxUnavailable", operator-for-redis-cluster
should use MinAvailable.

Name:           cluster-node-for-redis
Namespace:      default
Min available:  5
Selector:       app.kubernetes.io/component=database,app.kubernetes.io/instance=cluster,app.kubernetes.io/name=node-for-redis,redis-operator.k8s.io/cluster-name=cluster-node-for-redis
Status:
    Allowed disruptions:  1
    Current:              6
    Desired:              5
    Total:                6

[Resiliency] Pods stuck in Terminating are not removed by the operator

In some cases of k8s node failures, it can happen that a node has shut down but the pods on that node are stuck in a Terminating state. The operator tries to delete them, but since the k8s api doesn't do anything, they are still part of the redis cluster. This means that the operator will not spawn new node pods until the terminating pods are removed.

Is this project stale?

It seems like this project was ported from https://github.com/AmadeusITGroup/Redis-Operator.
Is it stale now as there seems to be not much activity?
Any alternatives you could recommend for Redis Cluster on K8s?

Open source Redis operator

Once the internal open source proposals are approved, migrate code to this repository.

[Resiliency] All Redis Node pods stuck in 1/2 readiness state after sequential deletion of all pods

Redis Cluster is not able to recover after the redis node pods are deleted sequentially. Example command:

for i in `kubectl get pods -n redis-cluster-ns --no-headers | awk '{print $1}'`; do kubectl delete pods -n redis-cluster-ns $i; sleep 10; done;

After new pods are spawned, they fail the readiness probe:

E0104 18:03:42.732541       1 redisnode.go:247] readiness check failed, err:Readiness failed, cluster slots response empty

[Question] Autoscaling Cluster

Hi, been using the project for awhile. Is it possible to automate scaling with an hpa?

Set Pod Anti-Affinity to distribute primaries and secondaries over Nodes

Hey folks, we were testing out the operator and saw that when deploying primary and secondary pods were not being distributed over nodes. Is there a way to set pod anti-affinity for redis node pods so that primary and secondary pods are not scheduled on the same node?

Redis cluster inservice upgrade is not working with zero downtime

While doing the Redis cluster in-service upgrade from older chart to newer chart. Redis operator is performing the Rolling update shard by shard which is fine.
but in one shard it is upgrading the master replicas at once due to which there is disconnection with the application.

can we improve the Rolling update logic such that first it upgrade the replicas of one shard then promote that replica to master and then we upgrade master and new master will join back as replica.?

Support for Redis v7 and newer

Hey team 👋 I was trying to upgrade my cluster from Redis version 6.2.x to 7.0.x. But it seems like v7 is not supported by the operator yet. This is the error that I get:

E0817 14:03:24.673063       1 redisnode.go:247] readiness check failed, err:Readiness failed, cluster slots response err: response returned from Conn: unmarshaling message off Conn: cannot decode resp array into string

Do we have any plans on supporting Redis v7?

Thanks!

Doesn't build on Windows

This is a GO application, it should be easy to build on Windows.

RedisCluster pod template annotations are not being passed to created pods

While trying to update RedisCluster pods' annotations via podAnnotations value in node-for-redis Helm chart, I discovered that they are not being passed to the resulting pods. Since podAnnotations value is directly passed to annotations in RedisCluster pod template, my hypothesis is that there is a bug in pod initialization code.

Reaching beta/ stable

Hello,

Is there any forecasts if/ when the operator might reach a beta/ stable status?
What might be missing for you to be comfortable to release it as a beta/ stable version?

Update PDB to v1 from v1beta

In order to support k8s 1.25, we need to update the code to use the v1 PDB.

Readiness probe failure on cluster deployment

I'm getting a cluster liveness probe failure when trying to deploy the cluster. How can I go about debugging that?

Haven't made any changes to the charts / values, just trying to deploy it as is, that's why I think it's a bug.

The only thing I added to the deployment was a namespace other than default. I built the docker images and pushed them to my cluster as well, and the operator chart installed smoothly. The node however is failing:

28m         Normal    Created             pod/rediscluster-cluster-node-for-redis-ntf9p   Created container redis-node
28m         Normal    Started             pod/rediscluster-cluster-node-for-redis-ntf9p   Started container redis-node
3m52s       Warning   Unhealthy           pod/rediscluster-cluster-node-for-redis-ntf9p   Liveness probe failed: HTTP probe failed with statuscode: 503
27m         Warning   Unhealthy           pod/rediscluster-cluster-node-for-redis-ntf9p   Readiness probe failed: HTTP probe failed with statuscode: 503