Comments (4)
For anyone facing this issue in a large-scale deployment, here are some tips and tricks:
- You cannot run a single k8s Deployment across multiple nodes, because an EBS volume is attached to one node and can't be shared. My approach was to create one deployment per node (with suffixes 1, 2, 3, ...) and add a nodeSelector pinning each one to a specific K8S node (a sketch of this per-node pattern follows this list).
- Until the streaming issue is fixed by @chenhunghan and collaborators, you cannot have more than one concurrent stream per instance, otherwise the tokens get mixed up and are not separated by request ID.
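Roughly, stamping out the per-node deployments looks like this (a sketch using the official kubernetes Python client rather than raw manifests; the node names, namespace, and image tag are placeholders, not my exact values):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster
apps = client.AppsV1Api()

nodes = ["node-1", "node-2", "node-3"]  # placeholder node names

for i, node in enumerate(nodes, start=1):
    labels = {"app": f"llm-{i}"}
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name=f"llm-{i}"),
        spec=client.V1DeploymentSpec(
            replicas=1,
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(
                    # Pin each deployment to exactly one node, since the EBS
                    # volume can only be attached on that node.
                    node_selector={"kubernetes.io/hostname": node},
                    containers=[
                        client.V1Container(
                            name="ialacol",
                            image="ghcr.io/chenhunghan/ialacol:latest",  # placeholder image tag
                        )
                    ],
                ),
            ),
        ),
    )
    apps.create_namespaced_deployment(namespace="default", body=deployment)
```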
Current solution to point 2
Use NGINX Plus (required, since the `queue` directive below is a commercial feature) in front of the LLM nodes to set up HTTP load balancing across all the nodes and ensure that each node has only one active connection at any given moment. Here are my conf files:
nginx.conf
user nginx;
worker_processes 1; # max_conns is only checked locally per worker process, so more than one worker would still send multiple requests to the same node during streaming; cap at 1
error_log /var/log/nginx/error.log notice;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    #tcp_nopush on;

    keepalive_timeout 5;

    #gzip on;

    include /etc/nginx/conf.d/*.conf;
}
default.conf
upstream llm {
    least_conn;
    # One entry per K8S node; max_conns=1 ensures at most one active connection
    # (i.e. one streaming request) per node at any time. Even with a single node
    # this still enforces one active connection. Register new nodes here to
    # scale horizontally.
    server ip:port max_conns=1; # K8S Node 1
    server ip:port max_conns=1; # K8S Node 2
    server ip:port max_conns=1; # K8S Node 3
    # Requests beyond the connection limit wait here (NGINX Plus only).
    queue 200 timeout=70;
}

server {
    listen 3000;
    location / {
        proxy_set_header Host llm-gateway.domain.com;
        proxy_pass http://llm;
    }
}
This solution unfortunately limits throughput; the ideal fix is for streaming to work properly with simultaneous connections.
from ialacol.
Interesting issue. This could be the LLM's state not being reset when the second request arrives; we do pass the reset parameter to reset the LLM, so this might be a bug in upstream ctransformers or exllama.
I don't have a solution right now. If it's possible to change your service architecture, consider putting a queue in front of your requests, for example by deploying a queue service in front of ialacol: script B waits for A to be processed and starts immediately after A is done. THREADS unfortunately won't help, because that setting is for CPU inference and does not apply to GPTQ models.
If you need to handle requests in parallel, my suggestion is to use 2 EC2 instances and a load balancer (https://github.com/ialacol/text-inference-batcher can be used as a load balancer as well) to distribute the requests so that one ialacol instance only processes one request at a time.
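A rough sketch of the queue idea, assuming ialacol's OpenAI-compatible chat completions endpoint (the URL and model name below are placeholders): a single worker drains the queue, so the next request only starts after the previous one has fully finished.

```python
import queue
import threading

import requests

IALACOL_URL = "http://ialacol.example.com:8000/v1/chat/completions"  # placeholder endpoint
MODEL = "Llama-2-13B-GPTQ"  # placeholder model name

jobs: "queue.Queue[str]" = queue.Queue()
results: "queue.Queue[str]" = queue.Queue()


def worker() -> None:
    while True:
        prompt = jobs.get()
        # Only one request is in flight at any time; the next job starts
        # only after this response has fully arrived.
        resp = requests.post(
            IALACOL_URL,
            json={"model": MODEL, "messages": [{"role": "user", "content": prompt}]},
            timeout=300,
        )
        results.put(resp.json()["choices"][0]["message"]["content"])
        jobs.task_done()


threading.Thread(target=worker, daemon=True).start()

# "Script A" and "script B" just enqueue; B is processed only after A is done.
jobs.put("request from script A")
jobs.put("request from script B")
jobs.join()

while not results.empty():
    print(results.get())
```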
from ialacol.
Thank you for the prompt response (no pun intended) @chenhunghan. Below is my setup: Kubernetes with 6 g5.48xlarge nodes running GPTQ, each with two pods of a 13B GPTQ model (for load balancing within a node), so 2 * 6 = 12 pods running in total, and each node has a DNS name for inference in the format llm-1.domain.com, llm-2.domain.com, etc.
What I was planning to do is use nginx to HTTP load balance requests to each upstream node based on a least-connections criterion to ensure a fair spread, but now, based on the issue above, I guess I have to somehow set nginx to allow at most one active connection to a server at any moment. I am guessing that once streaming is done the connection opens up again?
![Screenshot 2023-10-11 at 3 45 23 PM](https://private-user-images.githubusercontent.com/84636940/274261279-d20e4921-0c9e-49a4-9152-0981439b5244.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDIxNzg1MDksIm5iZiI6MTcwMjE3ODIwOSwicGF0aCI6Ii84NDYzNjk0MC8yNzQyNjEyNzktZDIwZTQ5MjEtMGM5ZS00OWE0LTkxNTItMDk4MTQzOWI1MjQ0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjEwVDAzMTY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTk2NDk4MmNlOWYwYTQ5ZTkxNTI2NWNkOGE5ZmE2MmZhN2JmYmIyMmQ3NjQyZWM5YzE5M2I0MjkwMmNiZWQzMjUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.XqDGeRpppJXqQ_B1TDSmt8cAwW_DTkL7Fg_BZ_WvuRc)
from ialacol.
> What I was planning to do is use nginx to HTTP load balance requests to each upstream node based on a least-connections criterion to ensure a fair spread, but now, based on the issue above, I guess I have to somehow set nginx to allow at most one active connection to a server at any moment. I am guessing that once streaming is done the connection opens up again?
Yes, exactly: least connections with a maximum of one connection per server.
from ialacol.