Comments (20)
Hi guys,
Just found this thread and I'm not sure the approach you're discussing is the most natural way to scale Kafka consumers. Not that the ideas aren't great; they just seem more aligned with something like Azure Service Bus than Kafka.
Ideally, queue length shouldn't be used to decide how many instances to spin up, as having multiple instances processing against a given partition would compromise ordered processing of messages. A better approach is to see how many partitions of the given topic the consumer is lagging against, and to spin up an instance for each. The difficulty is that to achieve consumption at scale you really want the function assigned to a given partition to hang around for a while and process lots of messages rather than just one, thus taking advantage of librdkafka's batching logic.
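A minimal sketch of that lag-driven idea in Python (everything here is illustrative: a real implementation would fetch committed offsets and high watermarks from the Kafka admin client or from consumer statistics, rather than take them as dicts):

```python
# Hypothetical sketch: scale out to one instance per lagging partition,
# capped at the partition count so instances never outnumber partitions
# (which would break per-partition ordering).
# `committed` and `high_watermark` map partition id -> offset; in practice
# these numbers would come from the admin client or consumer statistics.

def desired_instances(committed, high_watermark, min_instances=1):
    lagging = sum(
        1
        for p, hw in high_watermark.items()
        if hw - committed.get(p, 0) > 0
    )
    return max(min_instances, min(lagging, len(high_watermark)))
```

For example, with two partitions where only partition 1 is behind, this yields a single instance; with every partition lagging, it yields one instance per partition.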
I've been running Kafka consumers very successfully in WebJobs, but without any dynamic scaling. If you crack this problem in Functions, that would be amazing. It might be worth paying attention to Knative's approach, though I suspect it works via an observer container that decides how many worker containers to spin up. Without an observer, I think you're going to have to be mega-creative to come up with a good solution that's a natural fit for Kafka.
Again, apologies if all of this is obvious to you. I'm not trying to poke holes; I'm keen to see a great solution eventuate from this. I'll definitely use it!
Cheers
Johann
from azure-functions-kafka-extension.
Correct
from azure-functions-kafka-extension.
Do we need to expose statistics or additional information to the scaler, or is it going directly to the source?
from azure-functions-kafka-extension.
Hi @fbeltrao, yup that sounds like a perfectly good and standard design. I will go through the code over the next few weeks and provide more feedback then too.
The main question is how you will decide whether or not to add new hosts. An extreme option is to always have one host per partition, which means rapid processing but inefficient cost; you might as well go for continuous WebJobs. Alternatively, you need some intelligence that decides scale-up is necessary, and I suppose you have a few options:
- Scale out based on the number of partitions for the consumer group which have consumer lag. As previously mentioned, this could be done via the admin client (if it supports this) or via consumer statistics. This would be the best option in terms of cost effectiveness while still supporting high throughput.
- Scale out based on CPU usage (which doesn't really make much sense, because CPU usage doesn't indicate that there is work to do on other partitions) or some other arbitrary measure.
- Scale out if the current set of hosts is busy, since that indicates there are active messages on the topic, though it's no guarantee that there will be any work for an additional host to do.
The other issue is the short lifespan of functions. I suppose your scaling engine will need to account for the fact that a function can only run for so long when it's running on a consumption-based plan. The extension will need the smarts to shut down cleanly, and the engine will need to decide whether to instantly spin up another host because there's still more work to do.
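To make the clean-shutdown idea concrete, here is a hedged Python sketch of a drain loop that respects a time budget. `poll`, `process`, and `commit` are hypothetical stand-ins for the real client calls, and the 270-second budget is just an illustrative number below a typical consumption-plan timeout:

```python
import time

def consume_until_deadline(poll, process, commit, max_runtime_s=270.0):
    """Drain messages until the time budget is nearly spent, then commit
    and return cleanly so the scaling engine can decide whether to start
    a fresh host. All three callables are illustrative stand-ins."""
    deadline = time.monotonic() + max_runtime_s
    drained = 0
    while time.monotonic() < deadline:
        msg = poll()
        if msg is None:          # nothing left to do right now
            break
        process(msg)
        drained += 1
    commit()                     # leave offsets clean before shutdown
    return drained
```

The return value lets the caller distinguish "stopped because the topic was drained" from "stopped because time ran out with work remaining", which is exactly the signal the engine needs to decide whether to spin up a replacement host.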
from azure-functions-kafka-extension.
This is being worked out in a related and private repo. Will link to this issue from there.
--
Also, in the short term, this trigger is expected to work in the Azure service only on the App Service plan with autoscaling. Getting Kafka scaling logic into the Azure Functions multi-tenant scale controller is still being evaluated based on customer need. Since Kafka is generally not exposed publicly, it remains to be seen whether there is interest in a multi-tenant, non-VNet-connected Kafka trigger. Once we have VNet-enabled triggers, or potentially custom trigger support, or enough customer interest, we can consider supporting this in the Azure Functions service scale controller.
from azure-functions-kafka-extension.
Does this mean we won't need a custom scaler if we plan on hosting this in Azure in an App Service (until Consumption is supported)?
from azure-functions-kafka-extension.
Awesome. Might just close this one then :)
Or, leave it open until we verify we can deploy this custom scaler into an App Svc environment.
from azure-functions-kafka-extension.
The App Svc environment won't have a custom scaler; it will rely on auto-scaling, which scales based on CPU and memory. So the short-term solution (sans that private repo thing I linked above) is to rely on CPU metrics to drive scaling.
Allowing you to use a custom scaler in something like App Service, or even Consumption, is on the Functions backlog, but it's hard to know what an ETA would be. I don't expect it this calendar year. If there's enough customer demand, probably the quickest route is for us to provide support in the Azure Functions Consumption scale controller; that's a much more straightforward path. Not easy, but easier than custom scalers.
from azure-functions-kafka-extension.
Not sure I understand how CPU metrics will help figure out that our Function is falling behind.
from azure-functions-kafka-extension.
Also, I'm sorry: I thought this comment was in this repo, but I added it to the other. The short answer is: it won't.
Also, in the short term, this trigger is expected to work in the Azure service only on the App Service plan with autoscaling. Getting Kafka scaling logic into the Azure Functions multi-tenant scale controller is still being evaluated based on customer need. Since Kafka is generally not exposed publicly, it remains to be seen whether there is interest in a multi-tenant, non-VNet-connected Kafka trigger. Once we have VNet-enabled triggers, or potentially custom trigger support, or enough customer interest, we can consider supporting this in the Azure Functions service scale controller.
from azure-functions-kafka-extension.
Hmmm, this is why I was thinking we'd need something that would check the queue length and scale our App Svc accordingly.
I know @anirudhgarg and I discussed this briefly when we met.
Even if I manually scale the App Svc plan (that hosts my Functions), how will it know to spin up additional instances of the Function? Something needs to tell it to spin up new instances, or tear down existing ones (or does the App Svc plan not do this, and just keep instances sitting idle, waiting for work?).
from azure-functions-kafka-extension.
You aren't wrong, but there's no feature or existing mechanism to do this today. So there's nothing we can build that could check the queue length and then use that to drive scaling decisions for an App Service plan.
from azure-functions-kafka-extension.
Thanks @jcooper1982. Very good points. We currently only have one consumer per partition to keep ordered delivery intact. Where we're a bit stuck right now is deciding how many instances we need. Current thinking is that we need some kind of observer, but what should it observe?
from azure-functions-kafka-extension.
Hi @ryancrawcour, consumer lag per partition for the consumer group is definitely the right thing to observe, because then you can spin up one instance per partition that the consumer is lagging against, ensuring you have only just enough active functions running at a given time. Is it actually possible to build a custom observer for Kafka in the Functions engine?
I believe there is an admin client in the latest Confluent SDK (though I haven't played with it yet) which you might be able to get consumer lag out of. Alternatively, you could get it from consumer statistics.
A trick I've used before (I've written a Kafka .NET SDK wrapping Confluent) is to have multiple threads within a consumer, with logic to stage messages in memory if they share a key with a message already being processed. This obviously requires some throttling controls to ensure you don't overrun memory, but it offers the best of all worlds. I'd be happy to contribute (probably not for a few weeks) if you're keen to enable this as an option; it was an optional feature in my SDK as well.
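A rough sketch of that staging trick (illustrative only; the class and method names are invented, and a production version needs thread safety plus the throttling controls mentioned above):

```python
from collections import defaultdict, deque

class KeyOrderedDispatcher:
    """Stage messages whose key is already being processed, so multiple
    workers can run in parallel while per-key ordering is preserved."""

    def __init__(self):
        self.in_flight = set()            # keys currently being processed
        self.staged = defaultdict(deque)  # key -> messages waiting their turn

    def try_dispatch(self, key, message):
        """Return the message if it may run now, else stage it for later."""
        if key in self.in_flight:
            self.staged[key].append(message)
            return None
        self.in_flight.add(key)
        return message

    def complete(self, key):
        """Called when a worker finishes a message; releases the next
        staged message for this key, if any."""
        if self.staged[key]:
            return self.staged[key].popleft()  # key stays in flight
        self.in_flight.discard(key)
        self.staged.pop(key, None)
        return None
```

Messages with distinct keys dispatch immediately and run concurrently; a second message for an in-flight key waits until `complete` releases it, so no two messages for the same key are ever processed at once.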
from azure-functions-kafka-extension.
Thanks @jcooper1982 we'd value your contributions whenever you are able.
from azure-functions-kafka-extension.
Hi @jcooper1982 thanks for your feedback!
Current design:
- Let librdkafka manage partition assignment
- Adding new hosts rearranges partitions; scaling out up to Topic.PartitionCount
- Order is guaranteed at the partition level
- Single consumer per partition; a consumer may own one or more partitions
I believe this is a good general implementation, assuming you have well-distributed partition load and processing.
However, if partitions are unbalanced, we would have to manage partition assignments manually and spawn new hosts to handle lagging partitions in isolation. I'm not sure that makes sense, as the source of the problem is most likely a sub-optimal partitioning strategy.
Does that make sense?
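A toy illustration of why Topic.PartitionCount is the natural scale-out ceiling (the round-robin assignor here is just a stand-in for librdkafka's real rebalancing logic):

```python
def assign_partitions(partition_count, host_count):
    """Round-robin partitions across hosts, mimicking how adding hosts
    rebalances the consumer group. Hosts beyond the partition count get
    nothing to do, which is why scaling past Topic.PartitionCount buys
    no extra throughput. Illustrative only."""
    active_hosts = min(host_count, partition_count)
    assignment = {h: [] for h in range(host_count)}
    for p in range(partition_count):
        assignment[p % active_hosts].append(p)
    return assignment
```

With four partitions and two hosts, each host owns two partitions; with two partitions and three hosts, the third host sits idle.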
from azure-functions-kafka-extension.
I wouldn't worry about the function execution time, because the limit you're talking about applies to the custom code that will be triggered by this extension. AFAIK the listening part does not have a limit, and will be shut down only if the scale controller decides to scale it down.
from azure-functions-kafka-extension.
Fantastic. I'll dive into this a bit deeper soon and will hopefully have a better appreciation of the underlying architecture of Azure Functions by then, since I only understand it at a conceptual level right now.
I'll also speak with my (soon-to-be new) employer about what level of collaboration they're willing to commit to, since the SDK I wrote for them likely has quite a few reusable parts that could add a lot of value here.
from azure-functions-kafka-extension.
Just updating this: we now have a way to create a custom trigger that runs in VNet-enabled plans (Premium). We should be able to make this trigger run well in the Premium plan (Linux and Windows) now.
from azure-functions-kafka-extension.
Azure Functions Premium now supports scaling based on Kafka topic length.
Support also exists in KEDA.
from azure-functions-kafka-extension.