
Comments (20)

jcooper1982 commented on June 20, 2024

Hi guys,

Just found this thread, and I'm not sure the approach you're discussing is the most natural way to scale Kafka consumers. Not that your ideas aren't great; they just seem more aligned with something like Azure Service Bus than Kafka.

Ideally, queue length shouldn't be used to decide how many instances to spin up, as that would compromise ordered processing if you have multiple instances consuming from a given partition. Rather, the best way to handle this is to see how many partitions of the given topic the consumer group is lagging against and to spin up an instance for each. The difficulty is that, to achieve consumption at scale, you really want the function assigned to a given partition to hang around for a while and process lots of messages rather than just one, thus taking advantage of librdkafka's batching logic.
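To make that ceiling concrete: in Kafka's consumer-group model, at most one consumer in a group reads from a given partition, so a queue-length-based scaler can easily over-provision. A trivial Python sketch (the function name and numbers are hypothetical, purely for illustration):

```python
def effective_parallelism(instance_count: int, partition_count: int) -> int:
    """Within one consumer group, at most one consumer reads a partition,
    so instances beyond the partition count receive nothing."""
    return min(instance_count, partition_count)

# A queue-length scaler asking for 10 instances on a 4-partition topic
# still gets only 4 doing useful work:
print(effective_parallelism(10, 4))  # 4
```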

I've been running Kafka consumers very successfully in WebJobs, but without any dynamic scaling. If you crack this problem in Functions, it would be amazing. It might be worth paying attention to Knative's approach, though I suspect it works via an observer container that decides how many worker containers to spin up. Without the observer, I think you're going to have to be mega-creative to come up with a good solution that is a natural fit for Kafka.

Again apologies if all of this is obvious to you guys. Not trying to poke holes but am instead keen to see a great solution eventuate out of this. I’ll definitely use it!

Cheers
Johann

from azure-functions-kafka-extension.

jeffhollan commented on June 20, 2024

Correct


fbeltrao commented on June 20, 2024

Do we need to expose statistics or additional information to the scaler or is it going directly to the source?


jcooper1982 commented on June 20, 2024

Hi @fbeltrao, yup that sounds like a perfectly good and standard design. I will go through the code over the next few weeks and provide more feedback then too.

The main question is how you will decide whether to add new hosts. An extreme option is to always have one host per partition, which means rapid processing but poor cost efficiency; at that point you might as well go for continuous WebJobs. Alternatively you need some intelligence that decides when scale-out is necessary, and I suppose you have a few options:

  • Scale out based on the number of partitions in the consumer group that have consumer lag. As previously mentioned, this could be done via the admin client if that function is supported, or via consumer statistics. This would be the ultimate option in terms of cost effectiveness while still supporting high throughput.
  • Scale out based on CPU usage (which doesn't really make much sense, because CPU usage doesn't indicate that there is work to do on other partitions) or some other arbitrary measure.
  • Scale out if the current set of hosts is busy, since that indicates there are active messages on the topic, though it's no guarantee that an additional host will have any work to do.
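For illustration, the first option could be sketched roughly like this in Python (the function name, lag figures, and threshold are all hypothetical, not part of the extension):

```python
def desired_instances(partition_lag: dict, partition_count: int,
                      lag_threshold: int = 0) -> int:
    """One instance per partition the consumer group is lagging on,
    capped at the partition count (extra hosts would sit idle) and
    floored at one so a listener always stays up."""
    lagging = sum(1 for lag in partition_lag.values() if lag > lag_threshold)
    return min(max(lagging, 1), partition_count)

# e.g. per-partition lag observed for a 4-partition topic:
print(desired_instances({0: 1200, 1: 0, 2: 87, 3: 0}, partition_count=4))  # 2
```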

The other issue is the short lifespan of functions. I suppose your scaling engine will need to account for the fact that a function can only run for so long if it's running within a consumption based plan. The extension is going to have to have smarts to auto shutdown in a clean fashion and the engine is going to need to decide whether to instantly spin up another host because there's still more work to do.


jeffhollan commented on June 20, 2024

This is being worked out in a related and private repo. Will link to this issue from there.

--

Also it's expected for the short term in the Azure service this trigger will only work on the App Service Plan with autoscaling. Getting kafka scaling logic in the Azure Functions multi-tenant scale controller is still being evaluated based on customer need. Since Kafka is generally not exposed publicly it's still to be seen if there is interest in a multi-tenant non-VNet connected Kafka trigger. Once we have VNet enabled triggers or potentially custom trigger support or enough customer interest we can consider supporting and providing in the Azure Functions service scale controller.


ryancrawcour commented on June 20, 2024

Does this mean we won't need a custom scaler if we plan on hosting this in Azure in an App Service (until Consumption is supported)?


ryancrawcour commented on June 20, 2024

Awesome. Might just close this one then :)
Or, leave it open until we verify we can deploy this custom scaler in to an App Svc environment.


jeffhollan commented on June 20, 2024

The App Svc environment won't have a custom scaler - it will rely on auto-scaling, which scales based on CPU and memory. So the short-term solution (sans that private repo thing I linked above) is to rely on CPU metrics to drive scaling.

Allowing you to use a custom scaler in something like App Service or even Consumption is on the Functions backlog, but it's hard to know what an ETA would be; I don't expect this calendar year. If there's enough customer demand, probably the quickest route is for us to provide support in the Azure Functions Consumption scale controller - that's a much more straightforward path. Not easy, but easier than custom scalers.


ryancrawcour commented on June 20, 2024

Not sure I understand how CPU metrics will help figure out that our Function is falling behind.


jeffhollan commented on June 20, 2024

Also, I'm sorry - I thought this comment was in this repo, but I added it to the other. The short answer is: it won't.



ryancrawcour commented on June 20, 2024

Hmmm, this is why I was thinking we'd need something that checks the queue length and scales our App Svc accordingly.

I know @anirudhgarg and I discussed this briefly when we met.

Even if I scale the App Svc (that hosts my Functions) manually, how will it know to spin up additional instances of the Function? Something needs to tell it to spin up new instances, or tear down existing ones (or does the App Svc plan not do this, and just keep instances sitting idle waiting for work?).


jeffhollan commented on June 20, 2024

You aren't wrong - but there's no feature or existing way to do this today, so there's nothing we can build that would check the queue length and use that to drive scaling decisions for an App Service plan.


ryancrawcour commented on June 20, 2024

Thanks @jcooper1982. Very good points. We currently only have one consumer per partition to ensure we keep ordered delivery intact. In terms of how many instances we need etc. this is where we're a bit stuck right now. Current thinking is that we need some kind of observer, but what should it observe on?


jcooper1982 commented on June 20, 2024

Hi @ryancrawcour, per-partition consumer lag for the consumer group is definitely the right thing to observe, because then you can spin up one instance per lagging partition, ensuring you only ever have just enough active functions running at a given time. Is it actually possible to build a custom observer for Kafka in the Functions engine?

I believe there is an admin client in the latest Confluent SDK (though I haven't played with it yet) which you might be able to get consumer lag out of. Alternatively, you could get it from consumer statistics.
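On the consumer-statistics route: librdkafka emits a statistics JSON payload (when `statistics.interval.ms` is configured) that includes a per-partition `consumer_lag` field. A rough Python sketch of extracting lagging partitions from such a payload; the sample below is fabricated and heavily trimmed:

```python
import json

def lagging_partitions(stats_json: str, threshold: int = 0) -> dict:
    """Map topic -> sorted list of partitions whose consumer_lag exceeds
    the threshold, given a librdkafka statistics JSON string."""
    stats = json.loads(stats_json)
    result = {}
    for topic, tdata in stats.get("topics", {}).items():
        lagging = [
            int(p) for p, pdata in tdata.get("partitions", {}).items()
            if p != "-1" and pdata.get("consumer_lag", -1) > threshold
        ]
        if lagging:
            result[topic] = sorted(lagging)
    return result

# Fabricated, trimmed-down statistics payload:
sample = json.dumps({
    "topics": {
        "orders": {
            "partitions": {
                "-1": {"consumer_lag": -1},   # librdkafka's internal partition
                "0": {"consumer_lag": 1500},
                "1": {"consumer_lag": 0},
                "2": {"consumer_lag": 42},
            }
        }
    }
})
print(lagging_partitions(sample))  # {'orders': [0, 2]}
```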

A trick I've used before (I've written a Kafka .NET SDK wrapping the Confluent client) is to have multiple threads within a consumer, with logic to stage messages in memory if they share a key with a message already being processed. This obviously requires some throttling controls to ensure you don't overrun memory, but it does offer the best of all worlds. I'll be happy to contribute (probably not for a few weeks) if you're keen to enable this as an option. It was an optional feature in my SDK as well.
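To illustrate the staging idea, here is a simplified, single-threaded Python model (the class is mine, not from any SDK; a real multi-threaded version would guard these structures with a lock and bound the staged memory):

```python
from collections import defaultdict, deque

class KeyedDispatcher:
    """Stage messages whose key is already in flight, so distinct keys
    can be processed in parallel while per-key order is preserved."""
    def __init__(self):
        self.in_flight = set()              # keys currently being processed
        self.staged = defaultdict(deque)    # key -> deferred messages

    def dispatch(self, key, value):
        """Return a (key, value) to process now, or None if staged."""
        if key in self.in_flight:
            self.staged[key].append(value)
            return None
        self.in_flight.add(key)
        return (key, value)

    def complete(self, key):
        """Mark key done; return the next staged message for it, if any."""
        if self.staged[key]:
            return (key, self.staged[key].popleft())
        self.in_flight.discard(key)
        return None
```

For example, two messages with key "a" are never handed out concurrently: the second is held until `complete("a")` releases it, while a message with key "b" proceeds immediately.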


ryancrawcour commented on June 20, 2024

Thanks @jcooper1982 we'd value your contributions whenever you are able.


fbeltrao commented on June 20, 2024

Hi @jcooper1982 thanks for your feedback!

Current design:

  • Let librdkafka manage partition assignment
  • Adding new hosts will rearrange partitions. Scaling up to Topic.PartitionCount
  • Order is guaranteed at partition level
  • Single consumer per partition. Consumer has 1+ partitions

I believe this is a good general implementation, assuming you have well-distributed partition load and processing.

However, if partitions are unbalanced, we would have to manage partition assignments manually and spawn new hosts to handle lagging partitions in isolation. I'm not sure that makes sense, as the source of the problem is most likely a sub-optimal partitioning strategy.

Does it make sense?
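As a rough illustration of that design, a hypothetical round-robin assignment shows how adding hosts rearranges partitions, up to the point where hosts outnumber partitions (librdkafka's actual rebalance strategy is configurable and more involved than this):

```python
def assign_round_robin(partition_count: int, host_count: int) -> dict:
    """Spread partitions over hosts round-robin; hosts beyond the
    partition count receive nothing, matching the scale-up ceiling."""
    hosts = min(host_count, partition_count)
    assignment = {h: [] for h in range(hosts)}
    for p in range(partition_count):
        assignment[p % hosts].append(p)
    return assignment

print(assign_round_robin(4, 2))  # {0: [0, 2], 1: [1, 3]}
print(assign_round_robin(4, 8))  # {0: [0], 1: [1], 2: [2], 3: [3]}
```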


fbeltrao commented on June 20, 2024

I wouldn't worry about the function execution time, because the limit you are talking about applies to the custom code triggered by this extension. AFAIK the listening part does not have a limit and will be shut down only if the scale controller decides to scale it down.


jcooper1982 commented on June 20, 2024

Fantastic. Will dive into this a bit deeper soon and will hopefully have a better appreciation for the underlying architecture of Azure functions by then since I only understand this at a conceptual level right now.

Will also speak with my (soon to be new) employer about what level of collaboration they are willing to commit, since the SDK I wrote for them will likely have quite a few reusable parts here that could add a lot of value.


jeffhollan commented on June 20, 2024

Just updating this - we now have a way to create a custom trigger that runs in VNet-enabled plans (Premium). We should be able to make this trigger run well in the Premium plan (Linux and Windows) now.


ryancrawcour commented on June 20, 2024

Azure Functions Premium now supports scaling based on Kafka topic length, and support also exists in KEDA.
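For reference, a KEDA `ScaledObject` using its Kafka scaler looks roughly like this (every name and address below is a placeholder, not taken from this extension):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-function-scaler
spec:
  scaleTargetRef:
    name: my-kafka-function          # hypothetical deployment name
  maxReplicaCount: 32                # keep this <= the topic's partition count
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: broker:9092        # hypothetical broker address
        consumerGroup: my-consumer-group
        topic: my-topic
        lagThreshold: "100"          # scale out when lag per partition exceeds this
```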


