
Comments (4)

thearchitectxy commented on June 15, 2024

For anyone facing this issue in a large-scale deployment, here are some tips and tricks:

  1. You cannot deploy this as a single k8s Deployment spanning multiple nodes, because an EBS volume can only be attached to one node and can't be shared. My approach was to create a separate Deployment per node with suffixes 1, 2, 3, ... and use a nodeSelector to pin each one to one of my K8S nodes (see the sketch after this list).

  2. Until the streaming issue is fixed by @chenhunghan and collaborators, you cannot have more than one concurrent stream per instance, or else the tokens get mixed up and are not separated by request ID.
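A minimal sketch of the per-node Deployment idea from point 1 (the names, image reference, and node hostname are placeholder assumptions; repeat with suffix 2, 3, ... for each node):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ialacol-1                      # one Deployment per node: ialacol-1, ialacol-2, ...
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ialacol-1
  template:
    metadata:
      labels:
        app: ialacol-1
    spec:
      nodeSelector:
        kubernetes.io/hostname: node-1 # pin the pod to one node so its EBS volume stays attached there
      containers:
        - name: ialacol
          image: ghcr.io/chenhunghan/ialacol:latest   # placeholder image reference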

Current Solution to 2
Use NGINX Plus (required, since the queue directive used below is a commercial-only feature) in front of the LLM nodes to set up HTTP load balancing across all the nodes and ensure that each node has at most one active connection at any given moment. Here are my config files:

nginx.conf

user  nginx;
worker_processes  1; # max_conns is only checked locally, per worker process; with more than one worker, multiple requests can still reach the same node during streaming, so cap this at 1

error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;


events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    keepalive_timeout  5;

    #gzip  on;

    include /etc/nginx/conf.d/*.conf;
}

default.conf

upstream llm {
    least_conn;
    server ip:port max_conns=1; # K8S node 1; even with a single node this still ensures one active connection at a time
    server ip:port max_conns=1; # K8S node 2
    server ip:port max_conns=1; # K8S node 3; register new nodes here to scale horizontally
    queue 200 timeout=70;
}

server {
    listen 3000;

    location / {
        proxy_set_header Host llm-gateway.domain.com;
        proxy_pass http://llm;
    }
}
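
As a quick sanity check on this setup, a small sketch (assuming the gateway above on port 3000 fronts ialacol's OpenAI-compatible completions endpoint; the model name is a placeholder) that fires two concurrent streaming requests, which the proxy should serialize rather than interleave:

import threading
import time

import requests

GATEWAY = "http://localhost:3000/v1/completions"  # the nginx server from default.conf above

def stream(tag: str) -> None:
    start = time.time()
    payload = {"model": "llama-2-13b-gptq", "prompt": "Count to ten:", "stream": True}
    with requests.post(GATEWAY, json=payload, stream=True, timeout=120) as resp:
        for _ in resp.iter_lines():
            pass  # consume the token stream without printing it
    print(f"{tag} finished after {time.time() - start:.1f}s")

threads = [threading.Thread(target=stream, args=(f"req-{i}",)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With max_conns=1 the second request is queued at the proxy, so its wall
# time should include the full duration of the first stream.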
    

This solution unfortunately limits throughput; the ideal fix is for streaming to work properly with simultaneous connections.


chenhunghan commented on June 15, 2024

Interesting issue. This could be the LLM's state not being reset when the second request arrives. We do pass the reset parameter to reset the LLM, so this might be a bug in upstream ctransformers or exllama.

I don't have a solution right now. If it's possible to change your service architecture, consider putting a queue in front of your requests, for example by deploying a queue service in front of ialacol: script B waits for request A to be processed and starts immediately after A is done. THREADS won't help, unfortunately, because that setting is for CPU inference and does not apply to GPTQ models.
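A minimal sketch of that queue idea (not ialacol's code; the endpoint URL and model name are placeholder assumptions): a single worker thread drains a FIFO queue, so only one request reaches ialacol at a time and "script B" starts as soon as "script A" finishes:

import queue
import threading

import requests

IALACOL = "http://localhost:8000/v1/completions"  # placeholder ialacol endpoint
JOBS: "queue.Queue[dict]" = queue.Queue()

def worker() -> None:
    while True:
        job = JOBS.get()  # blocks until a request is enqueued
        try:
            resp = requests.post(IALACOL, json=job["payload"], timeout=300)
            job["on_done"](resp)  # hand the result back to the caller
        except requests.RequestException as exc:
            print("request failed:", exc)
        finally:
            JOBS.task_done()  # the loop only fetches the next request after this one ends

threading.Thread(target=worker, daemon=True).start()

# Callers enqueue instead of hitting ialacol directly.
for name in ("A", "B"):
    JOBS.put({
        "payload": {"model": "llama-2-13b", "prompt": name, "stream": False},
        "on_done": lambda r, n=name: print(n, r.status_code),
    })
JOBS.join()  # wait until both requests have been processed sequentially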

If you need to handle requests in parallel, my suggestion is to use 2 EC2 instances and a load balancer (https://github.com/ialacol/text-inference-batcher can be used as a load balancer as well) to distribute the requests, so that each ialacol instance only processes one request at a time.


thearchitectxy commented on June 15, 2024

Thank you for the prompt response (no pun intended) @chenhunghan. Below is my setup: Kubernetes with 6 nodes of g5.48xlarge running GPTQ, each with two pods of a 13B GPTQ model (for load balancing within a node), so 2 * 6 = 12 pods running in total, and each node has a DNS name for inference in the format llm-1.domain.com, llm-2.domain.com, etc.
What I was planning to do is use nginx to HTTP-load-balance the requests across the upstream nodes on a least-connections basis to ensure a fair spread, but given the issue above I guess I have to somehow set nginx to allow at most one active connection per server at any moment. I am guessing that once streaming is done, the connection opens up again?

(screenshot attached in the original comment, 2023-10-11 3:45 PM)


chenhunghan commented on June 15, 2024

What I was planning to do is use nginx to HTTP-load-balance the requests across the upstream nodes on a least-connections basis to ensure a fair spread, but given the issue above I guess I have to somehow set nginx to allow at most one active connection per server at any moment. I am guessing that once streaming is done, the connection opens up again?

Yes, exactly: least connections and a maximum of one connection per server; once streaming finishes, the connection is released and the server can accept the next request.

