
python-openai-loadbalancer's People

Contributors

simonkurtz-msft


python-openai-loadbalancer's Issues

Investigate Streaming

I have not yet focused on streaming. As I only change the URL and the Host header, I don't immediately expect issues with streaming, but it needs to be tested.

Replace random with a truer uniform selection

Consider using numpy for a uniform distribution to select an available backend. Part of the challenge is that the availableBackends array changes in size, making it difficult to obtain a consistent uniform distribution.
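As a note, `random.choice` already draws uniformly over whatever list it receives, so a single draw over the current availableBackends is uniform at that moment even as the list shrinks and grows. A minimal sketch (the backend names and helper below are hypothetical, not the module's actual API):

```python
import random
from collections import Counter

def select_backend(available_backends, rng=random):
    """Uniformly select one of the currently available backends.

    random.choice is uniform over the list it receives, so each call is
    uniform over whichever backends are available at that moment,
    regardless of how the list has changed since the last call.
    """
    if not available_backends:
        raise ValueError("no backends available")
    return rng.choice(available_backends)

# Sanity check: over many draws, each backend is picked roughly equally often.
counts = Counter(select_backend(["b1", "b2", "b3"]) for _ in range(30_000))
print(counts)  # each count should land near 10,000
```

If per-call uniformity is what matters, numpy may not be required; if a guarantee across the whole sequence of calls is wanted, that is a different (and harder) property to define.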

Support for authentication keys

The examples in the code/tests leverage managed identity authorization via a token_provider.
If the current solution supports connecting to endpoints with API keys, it would be good to update the documentation.

If the solution does not support connecting to endpoints via API keys at the moment, this is a feature request :)
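For reference, a sketch of how the two authentication styles could be kept side by side when constructing the client. The helper and the api_version value are assumptions for illustration; the `openai` package's `AzureOpenAI` client accepts either an `api_key` or an `azure_ad_token_provider`:

```python
def client_kwargs(endpoint, *, api_key=None, token_provider=None):
    """Build AzureOpenAI constructor kwargs for key- or identity-based auth.

    Hypothetical helper: the load balancer only rewrites URL and Host, so it
    should be orthogonal to whichever authentication style is chosen.
    """
    if (api_key is None) == (token_provider is None):
        raise ValueError("provide exactly one of api_key or token_provider")
    kwargs = {
        "azure_endpoint": endpoint,
        "api_version": "2024-02-01",  # assumed placeholder version
    }
    if api_key is not None:
        kwargs["api_key"] = api_key
    else:
        kwargs["azure_ad_token_provider"] = token_provider
    return kwargs

# Usage (sketch): AzureOpenAI(**client_kwargs(endpoint, api_key="..."))
#             or  AzureOpenAI(**client_kwargs(endpoint, token_provider=provider))
```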

Account for `retry-after-ms` header

Presently, the module only accounts for the retry-after header, which contains a value in seconds. Azure OpenAI Provisioned Throughput (PTU) deployments also pass back the retry-after-ms header, which contains a value in milliseconds. This value may be preferable to the coarser seconds value.

From the PTU documentation:

*A 429 response indicates that the allocated PTUs are fully consumed at the time of the call. The response includes the retry-after-ms and retry-after headers that tell you the time to wait before the next call will be accepted.*

As such, ignoring retry-after-ms only incurs a longer delay when there is simply no PTU capacity left to consume (overall and/or tokens-per-minute?). This may be an acceptable situation that does not surface too often.
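The header preference described above could be handled by a small helper that normalizes both headers to milliseconds (a sketch; the function name and the 1000 ms fallback are assumptions, not the module's current behavior):

```python
def retry_delay_ms(headers):
    """Return the delay to wait in milliseconds, preferring retry-after-ms.

    PTU 429 responses can carry both retry-after-ms (milliseconds) and
    retry-after (seconds); the finer-grained header wins when present.
    """
    ms = headers.get("retry-after-ms")
    if ms is not None:
        return int(ms)
    seconds = headers.get("retry-after")
    if seconds is not None:
        return int(seconds) * 1000
    return 1000  # assumed fallback when neither header is present

print(retry_delay_ms({"retry-after-ms": "250", "retry-after": "1"}))  # 250, not 1000
```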


One approach may be to convert everything internal to the module to milliseconds. These resources provide the starting point for the enhancement:

cc @kristapratico

Add Support for OpenAI

Currently, the code only targets Azure OpenAI. It would be great to also cover OpenAI scenarios.

Use lowest retryAfter when no backends are available

We keep track of the LastRetryAfter value in load_balanced.py. When no backends are available and the last attempted backend happens to have a high retryAfter value, we would wait longer than necessary, as other backends may have become available again.

What I believe we should do instead is return the nearest upcoming retry_after datetime property in the backends collection. The spec allows for either a datetime or seconds, and I expect the httpx client to honor both.
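A sketch of that "nearest upcoming retry_after" selection (the function name and the datetime representation are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone

def seconds_until_first_available(retry_after_times, now=None):
    """When no backend is available, wait only until the soonest one recovers.

    retry_after_times: the retry_after datetimes of all throttled backends.
    """
    now = now or datetime.now(timezone.utc)
    soonest = min(retry_after_times)
    return max(0.0, (soonest - now).total_seconds())

now = datetime(2024, 1, 1, tzinfo=timezone.utc)
times = [now + timedelta(seconds=30), now + timedelta(seconds=5)]
print(seconds_until_first_available(times, now))  # 5.0 -- not 30
```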

Add Support for Multiple Models

Presently, the backends are model agnostic. That means that every model being used by the implementer of this code must reside on every Azure OpenAI instance that is defined in the backend. This could be limiting because it would require a lowest common denominator. Take these backends, for example:

  • Backend 1 supports model A
  • Backend 2 supports models A & B
  • Backend 3 supports model A
  • Backend 4 supports model B
  • Backend 5 supports models A & B

Today, the backend pool can only use backends 2 and 5.

If the backend list could take model into consideration, the following would apply per model:

  • Model A: backends 1, 2, 3, and 5
  • Model B: backends 2, 4, and 5

I am interested to hear whether there is value in being able to specify backends per model or whether this is a potential solution in search of a problem.
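If that direction has merit, the configuration change could be as small as attaching a model set to each backend and filtering the pool per request (a hypothetical sketch, not the current configuration format):

```python
# Hypothetical extension: each backend declares which models it hosts.
BACKENDS = {
    "backend-1": {"A"},
    "backend-2": {"A", "B"},
    "backend-3": {"A"},
    "backend-4": {"B"},
    "backend-5": {"A", "B"},
}

def backends_for_model(backends, model):
    """Filter the pool down to the backends that host the requested model."""
    return sorted(name for name, models in backends.items() if model in models)

print(backends_for_model(BACKENDS, "A"))  # backends 1, 2, 3, and 5
print(backends_for_model(BACKENDS, "B"))  # backends 2, 4, and 5
```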

Add Unit Tests

The repo presently does not have any unit tests, but they are needed to mock the handling of various situations with (Azure) OpenAI backends.

Add Python code coverage as well.
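As a starting point, backend responses can be stubbed without any network calls, e.g. with unittest.mock. The failover helper below is a stand-in for the module's real retry logic, not its actual API:

```python
from unittest.mock import MagicMock

def first_successful(responses):
    """Stand-in for failover: return the first response that is not a 429."""
    for resp in responses:
        if resp.status_code != 429:
            return resp
    return None

def test_failover_skips_throttled_backend():
    # Simulate one throttled backend and one healthy one.
    throttled = MagicMock(status_code=429)
    ok = MagicMock(status_code=200)
    assert first_successful([throttled, ok]) is ok
    assert first_successful([throttled]) is None

test_failover_skips_throttled_backend()
```

Real tests would mock at the httpx transport level so the load balancer's URL/Host rewriting is exercised too.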
