Comments (20)
@alok87 Yeah, this is working well on our end now. Thanks very much for all your work on this!
from k8s-worker-pod-autoscaler.
Not sure if this is a large enough snippet of logs but it does show the rapid scale down.
from k8s-worker-pod-autoscaler.
Tested using that and I am seeing some 40-second pauses (in line with our resync-period). However, I'm also seeing scale-downs every 200ms for certain deployments:
I0204 11:18:58.042487 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 61
I0204 11:18:58.243314 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 60
I0204 11:18:58.442599 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 59
I0204 11:18:58.642624 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 58
I0204 11:18:58.842958 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 57
<many more lines of this>
It seems that once it has decided to scale down, it keeps looping in the controller without respecting the resync-period
until the deployment is at the size WPA wants it to be.
from k8s-worker-pod-autoscaler.
Just tested this and I'm hitting an issue where it returns scale-up
when scaling down, therefore not respecting the cool-down:
I0215 11:25:32.221965 1 controller.go:458] cpapi_live_transport current: 96
I0215 11:25:32.221969 1 controller.go:459] cpapi_live_transport qMsgs: 98, desired: 95
I0215 11:25:32.240002 1 controller.go:519] cpapi_live_transport scaleOp: scale-up
It looks like GetScaleOperation expects desiredWorkers then currentWorkers, but the caller passes currentWorkers then desiredWorkers, so the GetScaleOperation logic is inverted!
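The argument-order bug described above can be sketched in a few lines. This is a hypothetical reconstruction (the function signature and return values here are illustrative, not the exact WPA code): a classifier that compares desired against current is correct only when the arguments arrive in the order it expects.

```go
package main

import "fmt"

// Hypothetical reconstruction of the function described above: it expects
// desiredWorkers first, then currentWorkers.
func GetScaleOperation(desiredWorkers, currentWorkers int32) string {
	if desiredWorkers > currentWorkers {
		return "scale-up"
	}
	if desiredWorkers < currentWorkers {
		return "scale-down"
	}
	return "no-op"
}

func main() {
	current, desired := int32(96), int32(95)

	// Correct call order: desired first, then current.
	fmt.Println(GetScaleOperation(desired, current)) // scale-down

	// Buggy call order (arguments swapped): current first, then desired.
	// The result is inverted to scale-up, matching the log above.
	fmt.Println(GetScaleOperation(current, desired)) // scale-up
}
```

Swapping the arguments at the call site silently inverts every scaling decision, which is why a downscale was logged as scale-up.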
from k8s-worker-pod-autoscaler.
Specifying secondsToProcessOneJob
will help here.
If secondsToProcessOneJob is specified, WPA will keep your minimum replicas in line with the rate of messages sent to the queue (RPM).
Basically this is the logic (code):
workersBasedOnMessagesSent := int32(math.Ceil((secondsToProcessOneJob * messagesSentPerMinute) / 60))
if workersBasedOnMessagesSent > minWorkers {
    return workersBasedOnMessagesSent
}
return minWorkers
- messagesSentPerMinute is the average number of messages sent to the SQS queue in one minute (code).
- minWorkers is the replica count you have specified as the minimum in the WPA spec (kubectl edit wpa).
- secondsToProcessOneJob is the value you specify.
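Putting the snippet and the definitions above together, here is a runnable sketch of that floor computation (the wrapper function name is mine; the body mirrors the snippet):

```go
package main

import (
	"fmt"
	"math"
)

// minWorkersFromRPM mirrors the logic quoted above: derive a worker floor
// from the average messages sent per minute and secondsToProcessOneJob,
// and fall back to the spec minimum when the queue is quiet.
func minWorkersFromRPM(secondsToProcessOneJob, messagesSentPerMinute float64, minWorkers int32) int32 {
	workersBasedOnMessagesSent := int32(math.Ceil((secondsToProcessOneJob * messagesSentPerMinute) / 60))
	if workersBasedOnMessagesSent > minWorkers {
		return workersBasedOnMessagesSent
	}
	return minWorkers
}

func main() {
	// 120 msgs/min arriving, 2s to process each job:
	// ceil((2 * 120) / 60) = 4 workers, overriding a spec minimum of 1.
	fmt.Println(minWorkersFromRPM(2, 120, 1)) // 4

	// A quiet queue (10 msgs/min) falls back to the spec minimum of 3.
	fmt.Println(minWorkersFromRPM(2, 10, 3)) // 3
}
```

In other words, the effective minimum is whichever is larger: the spec minimum or the throughput-derived worker count.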
Can you try this and let us know if this helps?
from k8s-worker-pod-autoscaler.
@michael-careplanner I think the resync-period
configuration is set to a very low value (like 200ms), which would explain the rapid scaling decisions. The default value is 20 seconds, which works well for us. Please confirm if this is the case, and if so, is there any particular reason for setting it to a value lower than 20 seconds? Scaling up from 0 is instantaneous anyway due to the long-polling logic, and I can't figure out another use case for setting a small value for resync-period.
from k8s-worker-pod-autoscaler.
@justjkk Looking at our config, resync-period is actually set to 40 seconds, up from the default of 20. Agreed that setting it to a very low value doesn't make much sense, but that isn't the case in our config.
@alok87 Thanks for your reply. I am planning to look at secondsToProcessOneJob
at some point, but I feel that's a separate issue from this one. I still don't think WPA should be making scale-down decisions so close together, especially given the resync period we've set.
from k8s-worker-pod-autoscaler.
Is it possible to share the WPA log for a particular queue with -v=4
verbosity? It will help in debugging.
from k8s-worker-pod-autoscaler.
The resync-period is not being respected. I see the control loop executing every 200ms instead of every 20 seconds.
I0204 08:25:22.941872 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 31
I0204 08:25:22.968042 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 30
I0204 08:25:23.148417 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 29
I0204 08:25:23.350866 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 28
I think the event handlers need to use AddEventHandlerWithResyncPeriod;
WPA is currently using AddEventHandler.
Made a change.
Can you try this image: practodev/workerpodautoscaler:v1.5.0-5-gfef9160
and send me the logs again if possible? I want to test whether this fixes the problem!
from k8s-worker-pod-autoscaler.
OK, I think that since we are updating the WPA status on every reconciliation loop, it is creating a new event every time without respecting the resync period.
Updating the WPA status only when something has changed, and treating unchanged events as no-ops, could prevent this.
from k8s-worker-pod-autoscaler.
@michael-careplanner can you confirm whether the 200ms loop is happening for the queues/WPA objects that hit the else branch here:
if workerPodAutoScaler.Status.CurrentReplicas == currentWorkers &&
    workerPodAutoScaler.Status.AvailableReplicas == availableWorkers &&
    workerPodAutoScaler.Status.DesiredReplicas == desiredWorkers &&
    workerPodAutoScaler.Status.CurrentMessages == queueMessages {
    klog.V(4).Infof("%s/%s: WPA status is already up to date\n", namespace, name)
    return
} else {
    klog.V(4).Infof("%s/%s: Updating wpa status\n", namespace, name)
}
from k8s-worker-pod-autoscaler.
Yes, I see lots of messages like this while it's in the 200ms reconciliation loop:
controller.go:770] live-api/cpapi-transport: Updating wpa status
Followed by a single controller.go:767] live-api/cpapi-transport: WPA status is already up to date
at the end of the fast loop. So it looks like it's repeatedly executing the if statement above until it hits the up-to-date case, and only then does it roughly respect the resync-period
before running again:
I0208 08:49:18.857614 1 controller.go:767] live-api/cpapi-transport: WPA status is already up to date
I0208 08:49:53.650846 1 controller.go:626] cpapi_live_transport min=1, max=300, targetBacklog=10
from k8s-worker-pod-autoscaler.
Yes, then it is confirmed: the status update is leading to very fast re-queues. We can't stop updating the status, since the queue messages and other details are updated with every update, and the dashboards are built on top of this status.
We may have to consider a scale-down cool-off after all: --scale-down-delay-after-scale-up
from k8s-worker-pod-autoscaler.
Hello @michael-careplanner
Added scale-down-delay-after-last-scale-activity:
--scale-down-delay-after-last-scale-activity
    Scale down delay after the last scale up or down (defaults to 600 seconds, i.e. 10 minutes).
I have written test cases for it, but have not tested it in any Kubernetes cluster yet.
Here is the image: practodev/workerpodautoscaler:v1.5.0-9-gbeb0731
from k8s-worker-pod-autoscaler.
@michael-careplanner Fixed it here.
New image with the above change:
practodev/workerpodautoscaler:v1.5.0-11-g7d12579
from k8s-worker-pod-autoscaler.
Tested that version; it now correctly infers the scale type, but it's still not respecting the cooldown. Log:
I0216 09:48:55.389034 1 controller.go:660] cpapi_live_transport min=1, max=300, targetBacklog=10
I0216 09:48:55.389049 1 controller.go:683] cpapi_live_transport qMsgs=55, qMsgsPerMin=-1
I0216 09:48:55.389057 1 controller.go:685] cpapi_live_transport secToProcessJob=0, maxDisruption=1%
I0216 09:48:55.389061 1 controller.go:687] cpapi_live_transport current=115, idle=-1
I0216 09:48:55.389066 1 controller.go:689] cpapi_live_transport minComputed=1, maxDisruptable=2
I0216 09:48:55.389073 1 controller.go:458] cpapi_live_transport current: 115
I0216 09:48:55.389079 1 controller.go:459] cpapi_live_transport qMsgs: 55, desired: 113
I0216 09:48:55.389091 1 scale_operation.go:60] cpapi_live_transport scaleDown is allowed, cooloff passed
I0216 09:48:55.394090 1 controller.go:519] cpapi_live_transport scaleOp: scale-down
I0216 09:48:55.394109 1 controller.go:806] live-api/cpapi-transport: Updating wpa status
I0216 09:48:55.587844 1 controller.go:827] live-api/cpapi-transport: Updated wpa status
I0216 09:48:55.587944 1 controller.go:660] cpapi_live_transport min=1, max=300, targetBacklog=10
I0216 09:48:55.587958 1 controller.go:683] cpapi_live_transport qMsgs=55, qMsgsPerMin=-1
I0216 09:48:55.587965 1 controller.go:685] cpapi_live_transport secToProcessJob=0, maxDisruption=1%
I0216 09:48:55.587977 1 controller.go:687] cpapi_live_transport current=113, idle=-1
I0216 09:48:55.587982 1 controller.go:689] cpapi_live_transport minComputed=1, maxDisruptable=2
I0216 09:48:55.587988 1 controller.go:458] cpapi_live_transport current: 113
I0216 09:48:55.587994 1 controller.go:459] cpapi_live_transport qMsgs: 55, desired: 111
I0216 09:48:55.588005 1 scale_operation.go:60] cpapi_live_transport scaleDown is allowed, cooloff passed
I0216 09:48:55.597267 1 controller.go:519] cpapi_live_transport scaleOp: scale-down
I0216 09:48:55.597284 1 controller.go:806] live-api/cpapi-transport: Updating wpa status
I0216 09:48:55.788992 1 controller.go:827] live-api/cpapi-transport: Updated wpa status
I've had a good look over your PR but can't see where the bug is...
from k8s-worker-pod-autoscaler.
I am guessing the status update for lastScaleTime is not being reflected in your WPA object:
status:
  LastScaleTime:
This should get updated after every scale activity. It should start as nil.
I am testing this in one of our clusters.
from k8s-worker-pod-autoscaler.
@michael-careplanner Hey! The issue was that scaleDownDelay
was being set to 0 due to a bug. I fixed it in f1be3a3.
Thanks for bringing up this issue!! I saw the same problem in one of our high-throughput queues. Here is the dashboard image after this change; the scale down becomes smooth 👍🏼 (this worker did not set secondsToProcessOneJob).
If someone still needs the old behaviour, they would need to set scaleDownDelay to a value lower than the default of 10 minutes.
The image below has the fix; let me know how it works out for you!:
practodev/workerpodautoscaler:v1.5.0-12-gf1be3a3
from k8s-worker-pod-autoscaler.
@michael-careplanner This is good to release from our end; it's working perfectly in our latest cluster. Did you get a chance to test it?
from k8s-worker-pod-autoscaler.
Release v1.6.0 https://github.com/practo/k8s-worker-pod-autoscaler/releases/tag/v1.6.0
from k8s-worker-pod-autoscaler.