
python-openai-loadbalancer's People

Contributors

simonkurtz-msft


python-openai-loadbalancer's Issues

Investigate Streaming

I have not yet focused on streaming. As I only change the URL and the Host header, I don't immediately expect issues with streaming, but it needs to be tested.

Replace random with a truer uniform selection

Consider using numpy for a uniform distribution to select an available backend. Part of the challenge is that the availableBackends array changes in size, making it difficult to obtain a consistent uniform distribution.
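As a note, `random.choice` already draws uniformly over whatever list it receives, so a single draw over the current availableBackends is uniform at that moment even as the list shrinks and grows. A minimal sketch (the backend names and helper below are hypothetical, not the module's actual API):

```python
import random
from collections import Counter

def select_backend(available_backends, rng=random):
    """Uniformly select one of the currently available backends.

    random.choice is uniform over the list it receives, so each call is
    uniform over whichever backends are available at that moment,
    regardless of how the list has changed since the last call.
    """
    if not available_backends:
        raise ValueError("no backends available")
    return rng.choice(available_backends)

# Sanity check: over many draws, each backend is picked roughly equally often.
counts = Counter(select_backend(["b1", "b2", "b3"]) for _ in range(30_000))
print(counts)  # each count should land near 10,000
```

If per-call uniformity is what matters, numpy may not be required; if a guarantee across the whole sequence of calls is wanted, that is a different (and harder) property to define.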

Support for authentication keys

The examples in the code/tests leverage managed identity authorization via a token_provider.
If the current solution supports connecting to endpoints with API keys, it would be good to update the documentation.

If the solution does not support connecting to endpoints via API keys at the moment, this is a feature request :)
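For reference, a sketch of how the two authentication styles could be kept side by side when constructing the client. The helper and the api_version value are assumptions for illustration; the `openai` package's `AzureOpenAI` client accepts either an `api_key` or an `azure_ad_token_provider`:

```python
def client_kwargs(endpoint, *, api_key=None, token_provider=None):
    """Build AzureOpenAI constructor kwargs for key- or identity-based auth.

    Hypothetical helper: the load balancer only rewrites URL and Host, so it
    should be orthogonal to whichever authentication style is chosen.
    """
    if (api_key is None) == (token_provider is None):
        raise ValueError("provide exactly one of api_key or token_provider")
    kwargs = {
        "azure_endpoint": endpoint,
        "api_version": "2024-02-01",  # assumed placeholder version
    }
    if api_key is not None:
        kwargs["api_key"] = api_key
    else:
        kwargs["azure_ad_token_provider"] = token_provider
    return kwargs

# Usage (sketch): AzureOpenAI(**client_kwargs(endpoint, api_key="..."))
#             or  AzureOpenAI(**client_kwargs(endpoint, token_provider=provider))
```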

Account for `retry-after-ms` header

Presently, the module only accounts for the retry-after header, which contains a value in seconds. Azure OpenAI Provisioned Throughput (PTU) deployments also pass back the retry-after-ms header, which contains a value in milliseconds. This value may be preferable to the coarser seconds value.

From the PTU documentation:

*A 429 response indicates that the allocated PTUs are fully consumed at the time of the call. The response includes the retry-after-ms and retry-after headers that tell you the time to wait before the next call will be accepted.*

As such, ignoring retry-after-ms only incurs a longer delay when there is simply no PTU capacity left to consume (overall and/or tokens-per-minute?). This may be an acceptable situation that does not surface too often.
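The header preference described above could be handled by a small helper that normalizes both headers to milliseconds (a sketch; the function name and the 1000 ms fallback are assumptions, not the module's current behavior):

```python
def retry_delay_ms(headers):
    """Return the delay to wait in milliseconds, preferring retry-after-ms.

    PTU 429 responses can carry both retry-after-ms (milliseconds) and
    retry-after (seconds); the finer-grained header wins when present.
    """
    ms = headers.get("retry-after-ms")
    if ms is not None:
        return int(ms)
    seconds = headers.get("retry-after")
    if seconds is not None:
        return int(seconds) * 1000
    return 1000  # assumed fallback when neither header is present

print(retry_delay_ms({"retry-after-ms": "250", "retry-after": "1"}))  # 250, not 1000
```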


One approach may be to convert everything internal to the module to milliseconds. These resources provide the starting point for the enhancement:

cc @kristapratico

Add Support for OpenAI

Currently, the code only targets Azure OpenAI. It would be great to also cover OpenAI scenarios.

Use lowest retryAfter when no backends are available

We keep track of the LastRetryAfter value in load_balanced.py. When no backends are available and the last attempted backend happens to have a high retryAfter value, we would wait longer than necessary, as other backends may have become available again.

What I believe we should do instead is return the nearest upcoming retry_after datetime property in the backends collection. The spec allows for either a datetime or seconds, and I expect the httpx client to honor both.
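A sketch of that "nearest upcoming retry_after" selection (the function name and the datetime representation are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone

def seconds_until_first_available(retry_after_times, now=None):
    """When no backend is available, wait only until the soonest one recovers.

    retry_after_times: the retry_after datetimes of all throttled backends.
    """
    now = now or datetime.now(timezone.utc)
    soonest = min(retry_after_times)
    return max(0.0, (soonest - now).total_seconds())

now = datetime(2024, 1, 1, tzinfo=timezone.utc)
times = [now + timedelta(seconds=30), now + timedelta(seconds=5)]
print(seconds_until_first_available(times, now))  # 5.0 -- not 30
```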

Add Support for Multiple Models

Presently, the backends are model agnostic. That means that every model being used by the implementer of this code must reside on every Azure OpenAI instance that is defined in the backend. This could be limiting because it would require a lowest common denominator. Take these backends, for example:

  • Backend 1 supports model A
  • Backend 2 supports models A & B
  • Backend 3 supports model A
  • Backend 4 supports model B
  • Backend 5 supports models A & B

Today, the backend pool can only use backends 2 and 5.

If the backend list could take model into consideration, the following would apply per model:

  • Model A: backends 1, 2, 3, and 5
  • Model B: backends 2, 4, and 5

I am interested to hear whether there is value in being able to specify backends per model or whether this is a potential solution in search of a problem.
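If that direction has merit, the configuration change could be as small as attaching a model set to each backend and filtering the pool per request (a hypothetical sketch, not the current configuration format):

```python
# Hypothetical extension: each backend declares which models it hosts.
BACKENDS = {
    "backend-1": {"A"},
    "backend-2": {"A", "B"},
    "backend-3": {"A"},
    "backend-4": {"B"},
    "backend-5": {"A", "B"},
}

def backends_for_model(backends, model):
    """Filter the pool down to the backends that host the requested model."""
    return sorted(name for name, models in backends.items() if model in models)

print(backends_for_model(BACKENDS, "A"))  # backends 1, 2, 3, and 5
print(backends_for_model(BACKENDS, "B"))  # backends 2, 4, and 5
```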

Add Unit Tests

The repo presently does not have any unit tests, but they are needed to mock the handling of various situations with (Azure) OpenAI backends.

Add Python code coverage as well.
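As a starting point, backend responses can be stubbed without any network calls, e.g. with unittest.mock. The failover helper below is a stand-in for the module's real retry logic, not its actual API:

```python
from unittest.mock import MagicMock

def first_successful(responses):
    """Stand-in for failover: return the first response that is not a 429."""
    for resp in responses:
        if resp.status_code != 429:
            return resp
    return None

def test_failover_skips_throttled_backend():
    # Simulate one throttled backend and one healthy one.
    throttled = MagicMock(status_code=429)
    ok = MagicMock(status_code=200)
    assert first_successful([throttled, ok]) is ok
    assert first_successful([throttled]) is None

test_failover_skips_throttled_backend()
```

Real tests would mock at the httpx transport level so the load balancer's URL/Host rewriting is exercised too.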
