
😌 calm

This is an implementation of language model inference, aiming to get maximum single-GPU single-batch hardware utilization for LLM architectures with a minimal implementation and no dependencies [1].

The goal of this project is experimentation and prototyping; it does not aim to be production ready or stable. It is heavily work in progress.

If you need support for a wide range of models, computing devices or quantization methods, you're probably looking for llama.cpp or 🤗 Transformers. If you need to run inference for multiple batches, you're probably looking for vLLM.

Parts of this code are based on Andrej Karpathy's llama2.c.

Running

To build and run calm, you need to download and convert a model, build the code using make [2], and run it:

git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
python tools/convert.py mistral-7b-instruct.calm Mistral-7B-Instruct-v0.2/
make && ./build/run mistral-7b-instruct.calm -i "Q: What is the meaning of life?" -t 0

Before running the Python scripts, you may want to install the dependencies via pip install -r tools/requirements.txt. When using git to download a model, git-lfs is required and the download size may be larger than necessary; you can use tools/download.py instead:

python tools/download.py Mistral-7B-Instruct-v0.2/ mistralai/Mistral-7B-Instruct-v0.2

Supported models

calm currently supports the following model architectures:

  • Llama-like (RMSNorm normalization, SiLU activation, sequential attention mixing and FFN, RoPE)
  • Phi (LayerNorm normalization, GELU activation, parallel attention mixing, partial RoPE)
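
To make the structural difference concrete, here is a minimal Python sketch of the per-layer dataflow; it is not calm's actual code, and attention and FFN are reduced to placeholder functions:

    import torch

    # Placeholder sub-blocks; real implementations are attention and an MLP.
    def attn(x): return x            # placeholder
    def ffn(x): return x             # placeholder
    def rmsnorm(x): return x / x.pow(2).mean(-1, keepdim=True).add(1e-6).sqrt()
    def layernorm(x): return torch.nn.functional.layer_norm(x, x.shape[-1:])

    def llama_layer(x):
        # Llama-like: two sequential residual sub-blocks, RMSNorm before each
        x = x + attn(rmsnorm(x))
        x = x + ffn(rmsnorm(x))
        return x

    def phi_layer(x):
        # Phi: a single LayerNorm; attention and FFN read the same normalized input
        h = layernorm(x)
        return x + attn(h) + ffn(h)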

It has been tested on the following models:

  • Llama architecture
    • TinyLlama 1.1B (TinyLlama/TinyLlama-1.1B-Chat-v1.0)
    • Llama2 7B (meta-llama/Llama-2-7b-chat-hf)
    • Llama2 13B (meta-llama/Llama-2-13b-chat-hf)
    • LLaMA Pro 8B (TencentARC/LLaMA-Pro-8B-Instruct)
    • Yi 34B (01-ai/Yi-34B-Chat)
  • Mistral architecture
    • Mistral 7B (mistralai/Mistral-7B-Instruct-v0.2)
    • SOLAR 10.7B (upstage/SOLAR-10.7B-Instruct-v1.0)
  • Qwen architecture
    • Qwen 7B (Qwen/Qwen-7B-Chat)
    • Qwen 14B (Qwen/Qwen-14B-Chat)
  • Phi architecture
    • Phi1.5 (microsoft/phi-1_5)
    • Phi2 (microsoft/phi-2)

Supported formats

Model weights can be stored in fp16 or fp8 format; the weight type is specified at conversion time via the --dtype argument to convert.py.

fp16 corresponds to 16-bit floating point (e5m10) and is the default option; note that some models store weights in bf16, which will be converted automatically.

fp8 corresponds to 8-bit floating point (e5m2). Using fp8 carries a ~0.5% perplexity penalty at almost double the inference speed and half the model size. The e4m3 variant of fp8 would result in a much smaller perplexity penalty (~0.1%) with basic tensor scaling, but it is currently not used because of performance issues with floating-point conversion.
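
As a rough illustration of what the fp8 conversion involves, here is a minimal sketch; it is not tools/convert.py itself, and it assumes PyTorch 2.1+ for the float8_e5m2 dtype:

    import torch

    # e5m2 keeps fp16's 5 exponent bits and drops 8 of its 10 mantissa bits,
    # so the conversion is purely a rounding step at half the storage cost.
    w = torch.randn(1024, 1024, dtype=torch.float16)
    w8 = w.to(torch.float8_e5m2)                  # 1 byte per weight
    err = (w - w8.to(torch.float16)).abs().max()  # worst-case rounding error
    print(w8.element_size(), err.item())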

The KV cache uses fp16.

Performance

As of December 2023, with the Mistral 7B model and fp16 weights, calm reaches ~63.5 tok/s (921 GB/s) on short sequences and ~60 tok/s (904 GB/s) at the end of the 4096-token context, when using an NVidia GeForce RTX 4090. When using fp8 weights on the same hardware and model, the performance is ~119 tok/s (863 GB/s) on short sequences and ~108 tok/s (840 GB/s) at the end of the context.
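
These tok/s and GB/s figures are consistent with single-batch decoding being bandwidth-bound: each generated token has to read roughly the entire set of weights. A back-of-the-envelope check in Python (the parameter count is approximate):

    params = 7.24e9                    # Mistral 7B parameters (approximate)
    bytes_per_token = params * 2       # fp16: 2 bytes per weight, ~14.5 GB read per token
    bandwidth = 921e9                  # measured effective bandwidth, bytes/s
    print(bandwidth / bytes_per_token) # ~63.6 tok/s, matching the ~63.5 tok/s above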

As of early January 2024, with the Phi2 2.7B model and fp16 weights, calm reaches ~166 tok/s (927 GB/s) on short sequences and ~131 tok/s (815 GB/s) at the end of the 2048-token context, when using an NVidia GeForce RTX 4090. When using fp8 weights on the same hardware and model, the performance is ~301 tok/s (851 GB/s) on short sequences and ~207 tok/s (710 GB/s) at the end of the context.

Currently prompts are processed serially, one token at a time; in the future, prompt processing will need to be parallelized to avoid the bandwidth bottleneck.

Currently weights can be stored in fp16 and fp8 formats; 4-bit quantization is planned for the future. This will allow running inference at a higher tok/s; however, the main metric is bandwidth utilization, and the goal is to keep it as close to peak as possible for all supported weight formats.
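
For reference, one common way 4-bit quantization is implemented is group-wise absmax scaling; the sketch below only illustrates that general technique and does not describe calm's planned format:

    import torch

    def quantize_int4(w: torch.Tensor, group: int = 64):
        # Assumes w.numel() is divisible by group; one fp16 scale per group.
        g = w.reshape(-1, group).to(torch.float32)
        scale = g.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
        q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)
        return q, scale.to(torch.float16)   # on disk q would be packed 2 values/byte

    def dequantize_int4(q, scale):
        return (q.to(torch.float16) * scale).reshape(-1)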

The RTX 4090 has a peak memory bandwidth of ~1008 GB/s; however, it's unclear whether a rate higher than ~950 GB/s is attainable in practice [3]. The code has not been heavily tuned for datacenter-grade hardware (A100/H100) or earlier NVidia architectures yet.

Model files

calm uses 🤗 Safetensors to store model files. Note that the models require conversion (see Running above), because calm stores model hyperparameters in .safetensors metadata and may expect a particular set of tensor names or weight order within tensors that is not always compatible with the source. Tokenizer data is stored as tensors inside the model file as well.
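
A converted model file can be inspected with the safetensors Python package; a minimal sketch, using the file name from the Running example above:

    from safetensors import safe_open

    # calm keeps its hyperparameters in the safetensors header metadata, so they
    # can be read without loading any tensor data.
    with safe_open("mistral-7b-instruct.calm", framework="pt", device="cpu") as f:
        print(f.metadata())         # hyperparameters as a string -> string dict
        print(list(f.keys())[:5])   # a few tensor names (includes tokenizer tensors)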

Footnotes

  1. The CUDA runtime and compiler are used for GPU acceleration, but no CUDA or C libraries are used. Python conversion scripts use safetensors and torch; see tools/requirements.txt.

  2. Linux is the only supported OS at the moment.

  3. Based on testing a specific Gigabyte GeForce RTX 4090, where both individual kernels from this repository and cuBLAS peak at ~955 GB/s.
