Comments (4)
For anyone facing this issue in a large-scale deployment, here are some tips and tricks:
- You cannot run a single k8s Deployment across multiple nodes, because an EBS volume is attached to one node and can't be shared. My approach was to create one deployment per node (with suffixes 1, 2, 3, ...) and add a nodeSelector pinning each one to a specific K8S node (a sketch of this per-node pattern follows this list).
- Until the streaming issue is fixed by @chenhunghan and collaborators, you cannot have more than one concurrent stream per instance, otherwise the tokens get mixed up and are not separated by request ID.
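Roughly, stamping out the per-node deployments looks like this (a sketch using the official kubernetes Python client rather than raw manifests; the node names, namespace, and image tag are placeholders, not my exact values):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster
apps = client.AppsV1Api()

nodes = ["node-1", "node-2", "node-3"]  # placeholder node names

for i, node in enumerate(nodes, start=1):
    labels = {"app": f"llm-{i}"}
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name=f"llm-{i}"),
        spec=client.V1DeploymentSpec(
            replicas=1,
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(
                    # Pin each deployment to exactly one node, since the EBS
                    # volume can only be attached on that node.
                    node_selector={"kubernetes.io/hostname": node},
                    containers=[
                        client.V1Container(
                            name="ialacol",
                            image="ghcr.io/chenhunghan/ialacol:latest",  # placeholder image tag
                        )
                    ],
                ),
            ),
        ),
    )
    apps.create_namespaced_deployment(namespace="default", body=deployment)
```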
Current solution to point 2
Use NGINX Plus (required, since the `queue` directive below is a commercial feature) in front of the LLM nodes to set up HTTP load balancing across all the nodes and ensure that each node has only one active connection at any given moment. Here are my conf files:
nginx.conf
user nginx;
worker_processes 1; # max_conns is only checked locally per worker process, so more than one worker would still send multiple requests to the same node during streaming; cap at 1
error_log /var/log/nginx/error.log notice;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    #tcp_nopush on;

    keepalive_timeout 5;

    #gzip on;

    include /etc/nginx/conf.d/*.conf;
}
default.conf
upstream llm {
    least_conn;
    # One entry per K8S node; max_conns=1 ensures at most one active connection
    # (i.e. one streaming request) per node at any time. Even with a single node
    # this still enforces one active connection. Register new nodes here to
    # scale horizontally.
    server ip:port max_conns=1; # K8S Node 1
    server ip:port max_conns=1; # K8S Node 2
    server ip:port max_conns=1; # K8S Node 3
    # Requests beyond the connection limit wait here (NGINX Plus only).
    queue 200 timeout=70;
}

server {
    listen 3000;
    location / {
        proxy_set_header Host llm-gateway.domain.com;
        proxy_pass http://llm;
    }
}
This solution unfortunately limits throughput; the ideal fix is for streaming to work properly with simultaneous connections.
from ialacol.
Interesting issue. This could be the LLM's state not being reset when the second request arrives; we do pass the reset parameter to reset the LLM, so this might be a bug in upstream ctransformers or exllama.
I don't have a solution right now. If it's possible to change your service architecture, consider putting a queue in front of your requests, for example by deploying a queue service in front of ialacol: script B waits for A to be processed and starts immediately after A is done. THREADS unfortunately won't help, because that setting is for CPU inference and does not apply to GPTQ models.
If you need to handle requests in parallel, my suggestion is to use 2 EC2 instances and a load balancer (https://github.com/ialacol/text-inference-batcher can be used as a load balancer as well) to distribute the requests so that one ialacol instance only processes one request at a time.
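A rough sketch of the queue idea, assuming ialacol's OpenAI-compatible chat completions endpoint (the URL and model name below are placeholders): a single worker drains the queue, so the next request only starts after the previous one has fully finished.

```python
import queue
import threading

import requests

IALACOL_URL = "http://ialacol.example.com:8000/v1/chat/completions"  # placeholder endpoint
MODEL = "Llama-2-13B-GPTQ"  # placeholder model name

jobs: "queue.Queue[str]" = queue.Queue()
results: "queue.Queue[str]" = queue.Queue()


def worker() -> None:
    while True:
        prompt = jobs.get()
        # Only one request is in flight at any time; the next job starts
        # only after this response has fully arrived.
        resp = requests.post(
            IALACOL_URL,
            json={"model": MODEL, "messages": [{"role": "user", "content": prompt}]},
            timeout=300,
        )
        results.put(resp.json()["choices"][0]["message"]["content"])
        jobs.task_done()


threading.Thread(target=worker, daemon=True).start()

# "Script A" and "script B" just enqueue; B is processed only after A is done.
jobs.put("request from script A")
jobs.put("request from script B")
jobs.join()

while not results.empty():
    print(results.get())
```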
from ialacol.
Thank you for the prompt response (no pun intended) @chenhunghan. Below is my setup: Kubernetes with 6 g5.48xlarge nodes running GPTQ, each with two pods of a 13B GPTQ model (for load balancing within a node), so 2 * 6 = 12 pods running in total, and each node has a DNS name for inference in the format llm-1.domain.com, llm-2.domain.com, etc.
What I was planning to do is use nginx to HTTP load balance requests to each upstream node based on a least-connections criterion to ensure a fair spread, but now, based on the issue above, I guess I have to somehow set nginx to allow at most one active connection to a server at any moment. I am guessing that once streaming is done the connection opens up again?
![Screenshot 2023-10-11 at 3 45 23 PM](https://private-user-images.githubusercontent.com/84636940/274261279-d20e4921-0c9e-49a4-9152-0981439b5244.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDIxNzg1MDksIm5iZiI6MTcwMjE3ODIwOSwicGF0aCI6Ii84NDYzNjk0MC8yNzQyNjEyNzktZDIwZTQ5MjEtMGM5ZS00OWE0LTkxNTItMDk4MTQzOWI1MjQ0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjEwVDAzMTY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTk2NDk4MmNlOWYwYTQ5ZTkxNTI2NWNkOGE5ZmE2MmZhN2JmYmIyMmQ3NjQyZWM5YzE5M2I0MjkwMmNiZWQzMjUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.XqDGeRpppJXqQ_B1TDSmt8cAwW_DTkL7Fg_BZ_WvuRc)
from ialacol.
> What I was planning to do is use nginx to HTTP load balance requests to each upstream node based on a least-connections criterion to ensure a fair spread, but now, based on the issue above, I guess I have to somehow set nginx to allow at most one active connection to a server at any moment. I am guessing that once streaming is done the connection opens up again?
Yes, exactly: least connections with a maximum of one connection per server.
from ialacol.