talis's Introduction

What is TALIS?

Triton accelerated LLaMA inference server (TALIS) aims to be a simple, fast, and robust solution for serving LLaMA models via an API, with an emphasis on inference speed.

This is a very early version of TALIS. It may not work out of the box. For now it supports:

  • Running GPTQ quantized 65B LLaMA models on 2x 24GB VRAM Nvidia GPUs on Linux.
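
As a rough sanity check on why two 24GB cards are needed, the memory footprint can be estimated with a little arithmetic. This is a minimal sketch, not TALIS code. It assumes 4-bit weights with a group size of 128 (the values used in the example configuration in this README), a nominal 65 billion parameters, one fp16 scale and one packed zero point per group, and an fp16 key/value cache at batch size 1 with LLaMA 65B's 80 layers and hidden size of 8192. The function names are hypothetical.

```python
def gptq_weight_bytes(n_params, wbits=4, groupsize=128, scale_bytes=2):
    """Approximate storage for group-wise quantized weights, in bytes."""
    packed = n_params * wbits // 8      # the quantized weights themselves
    n_groups = n_params // groupsize
    scales = n_groups * scale_bytes     # assumed: one fp16 scale per group
    zeros = n_groups * wbits // 8       # assumed: one wbits-wide zero point per group
    return packed + scales + zeros


def kv_cache_bytes(seq_len, n_layers=80, hidden=8192, elem_bytes=2):
    """Approximate key/value cache size for one sequence, in bytes."""
    # Factor 2 covers keys and values; fp16 elements are assumed.
    return 2 * n_layers * hidden * elem_bytes * seq_len


if __name__ == "__main__":
    weights = gptq_weight_bytes(65_000_000_000)
    cache = kv_cache_bytes(1525)
    print(f"weights ~ {weights / 1e9:.1f} GB, KV cache ~ {cache / 1e9:.1f} GB")
```

Under these assumptions the weights come to roughly 33.8 GB and the cache to roughly 4.0 GB. That is too much for a single 24GB card but fits in 48GB with some room for activations and framework overhead. Real usage will differ, since some layers are typically left unquantized.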

What can it do?

For now it enables 65B LLaMA models to run primarily on dual RTX 3090 or RTX 4090 GPUs at decent speed. Some benchmarks may come soon, but the gist is that it can run 65B LLaMA models at over 10 tps (tokens per second) on two RTX 4090s with a max sequence length of 1525 tokens on a headless Linux server.
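
Until benchmarks are published, a tokens-per-second figure can be measured by hand. The following is a minimal sketch, assuming you have some callable that generates tokens; `fake_generate` is a placeholder standing in for a real model call, and none of these helper names come from TALIS.

```python
import time


def tokens_per_second(n_new_tokens, elapsed_s):
    """Throughput as newly generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_new_tokens / elapsed_s


def timed(generate, *args, **kwargs):
    """Call generate(...) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = generate(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    def fake_generate(prompt, max_new_tokens):
        # Placeholder for a real generation call; it only sleeps briefly.
        time.sleep(0.01)
        return list(range(max_new_tokens))

    tokens, elapsed = timed(fake_generate, "Hello", max_new_tokens=32)
    print(f"{tokens_per_second(len(tokens), elapsed):.1f} tps")
```

Count only newly generated tokens, not the prompt. When timing a real GPU model, it is probably worth calling `torch.cuda.synchronize()` before reading the clock, since CUDA work is queued asynchronously.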

How to use

For now this is geared towards people familiar with Linux and Python. If you are not, you can still use it, but you will have to do some research on your own.

The requirements may or may not be correct. Sorry. (Reach out if you have issues.)

(Very) Basic Example

The following lets you pass inputs to the model and get its outputs via the command line.

  1. Specify the settings in the "load_config.py" file:
# example load_config.py

MODEL_DIR = "/path/to/your/model/dir"
CHECKPOINTS = "/path/to/your/checkpoints.safetensors"
WBITS = 4
GROUPSIZE = 128
GEN_CONFIG = "gen_default.json"
DEVICE_MAP = "device_map_standard.json"
  2. Start the Python script from within the repo directory:
python3 llama_inference.py
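
Since this early version may fail in unclear ways on a bad setting, it can help to sanity-check the values before starting the script. The helper below is a hypothetical sketch and not part of TALIS. It assumes the settings are collected in a dict with the same names as in `load_config.py`, that `GROUPSIZE = -1` means "no grouping" (as in GPTQ-for-LLaMa), and that the listed bit widths are the ones worth accepting.

```python
from pathlib import Path

# Assumption: bit widths commonly used with GPTQ; adjust as needed.
ALLOWED_WBITS = {2, 3, 4, 8}


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not Path(cfg["MODEL_DIR"]).is_dir():
        problems.append(f"MODEL_DIR is not a directory: {cfg['MODEL_DIR']}")
    if not Path(cfg["CHECKPOINTS"]).is_file():
        problems.append(f"CHECKPOINTS file not found: {cfg['CHECKPOINTS']}")
    if cfg["WBITS"] not in ALLOWED_WBITS:
        problems.append(f"unexpected WBITS: {cfg['WBITS']}")
    # Assumed convention: -1 disables grouping, otherwise a positive size.
    if cfg["GROUPSIZE"] != -1 and cfg["GROUPSIZE"] <= 0:
        problems.append(f"unexpected GROUPSIZE: {cfg['GROUPSIZE']}")
    for key in ("GEN_CONFIG", "DEVICE_MAP"):
        if not str(cfg[key]).endswith(".json"):
            problems.append(f"{key} should name a .json file: {cfg[key]}")
    return problems
```

One way to use it is to build the dict from the module (for example `vars(load_config)` after `import load_config`) and print any problems before loading the model.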

What is Planned? (In order of priority)

  • Provide an actual server and API
  • Support more LLaMA model sizes and GPUs
  • Provide docker support
  • Provide a simple web interface
  • (maybe) replace the Hugging Face libs with more lightweight solutions (watching this closely)

Acknowledgements

This code is based on GPTQ and GPTQ-for-LLaMa.

Triton GPTQ kernel code is based on GPTQ-triton.

Thanks to GitHub user emvw7yf, who provided the llama-accelerate-path patch, which gave a 5x speedup and made the whole project viable.

Thanks to Meta AI for releasing LLaMA, a powerful LLM.

talis's People

Contributors

qwopqwop200, tpoisonooo, mastertaffer, itslogic, thireus, aljungberg, dhaladom, oobabooga, tonynazzal, musabgultekin, sgsdxzy, jeremy-costello, diegomontoya, dependabot[bot], yellowrosecx, cauyxy, usbhost, tobbez, johnsmith0031, johnrobinsn, lunderberg, wlsdml1114, dariosucic
