
gpt-fast's Introduction

gpt-fast

Simple and efficient pytorch-native transformer text generation.

Featuring:

  1. Very low latency
  2. <1000 lines of python
  3. No dependencies other than PyTorch and sentencepiece
  4. int8/int4 quantization
  5. Speculative decoding
  6. Tensor parallelism
  7. Supports Nvidia and AMD GPUs

This is NOT intended to be a "framework" or "library" - it is intended to show off what kind of performance you can get with native PyTorch :) Please copy-paste and fork as you desire.

For an in-depth walkthrough of what's in this codebase, see this blog post.

Examples

In the spirit of keeping the repo minimal, here are various examples of extensions you can make to gpt-fast as PRs.

Supported Models

LLaMA family

Please see the rest of this page for benchmarks of LLaMA-family models.

Mixtral 8x7B

We also support Mixtral 8x7B, a high-quality sparse mixture-of-experts (MoE) model. The average token generation rates are:

                      1 GPU    2 GPU    4 GPU    8 GPU
baseline (bfloat16)     OOM    96.67   155.35   227.82
int8                  97.92   155.03   216.87   279.35

Note that these benchmarks were run on an 8xA100-80GB node, power limited to 330W, with a hybrid cube mesh topology. All benchmarks are run at batch size=1, making the reported tokens/s numbers equivalent to "tokens/s/user". In addition, they are run with a very small prompt length (just 5 tokens).

For more details about Mixtral 8x7B, please check this page or this note.

Community

Projects inspired by gpt-fast in the community:

  • gpt-blazing: applies the same performance optimization strategy to more models (e.g., baichuan2).
  • gptfast: applies a subset of the performance optimizations to all Huggingface models
  • gpt-accelera: extends gpt-fast to SFT/RM/PPO training and batched inference to optimize the throughput

Installation

Download PyTorch nightly. Then install sentencepiece and huggingface_hub:

pip install sentencepiece huggingface_hub

To download Llama models, go to https://huggingface.co/meta-llama/Llama-2-7b and go through the steps to obtain access. Then log in with huggingface-cli login

Downloading Weights

Models tested/supported

tinyllamas/stories{15,42,100}
openlm-research/open_llama_7b
meta-llama/Llama-2-7b-chat-hf
meta-llama/Llama-2-13b-chat-hf
meta-llama/Llama-2-70b-chat-hf
codellama/CodeLlama-7b-Python-hf
codellama/CodeLlama-34b-Python-hf
mistralai/Mistral-7B-v0.1
mistralai/Mistral-7B-Instruct-v0.1
mistralai/Mistral-7B-Instruct-v0.2

For example, to convert Llama-2-7b-chat-hf:

export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
./scripts/prepare.sh $MODEL_REPO

Benchmarks

Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh topology. Note that all benchmarks are run at batch size=1, making the reported tokens/s numbers equivalent to "tokens/s/user". In addition, they are run with a very small prompt length (just 5 tokens).

Model         Technique      Tokens/Second   Memory Bandwidth (GB/s)
Llama-2-7B    Base           104.9           1397.31
Llama-2-7B    8-bit          155.58          1069.20
Llama-2-7B    4-bit (G=32)   196.80          862.69
Llama-2-70B   Base           OOM
Llama-2-70B   8-bit          19.13           1322.58
Llama-2-70B   4-bit (G=32)   25.25           1097.66
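
As a sanity check on the "Memory Bandwidth (GB/s)" column: at batch size 1, generating each token has to read roughly every weight once, so achieved bandwidth is approximately weight bytes times tokens/s. A minimal back-of-envelope sketch (parameter count approximate):

params = 6.7e9            # approximate Llama-2-7B parameter count
bytes_per_param = 2       # bfloat16
tokens_per_s = 104.9      # from the table above
print(params * bytes_per_param * tokens_per_s / 1e9)  # ~1406 GB/s, close to the reported 1397.31 GB/s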

Speculative Sampling

Verifier: Llama-70B (int4), Draft: Llama-7B (int4): 48.4 tok/s

Tensor Parallelism

Model         Number of GPUs   Tokens/Second   Memory Bandwidth (GB/s)
Llama-2-7B    1                104.9           1397.31
Llama-2-7B    2                168.84          1181.99
Llama-2-7B    4                254.02          955.83
Llama-2-7B    8                328.43          704.10
Llama-2-70B   1                OOM
Llama-2-70B   2                21.32           1481.87
Llama-2-70B   4                38.01           1340.76
Llama-2-70B   8                62.50           1135.29

Tensor Parallelism + Quantization

Model         Technique      Tokens/Second   Memory Bandwidth (GB/s)
Llama-2-70B   Base           62.50           1135.29
Llama-2-70B   8-bit          80.44           752.04
Llama-2-70B   4-bit (G=32)   90.77           548.10

AMD

Benchmarks run on one GCD of an MI-250x.

Model         Technique   Tokens/Second   Memory Bandwidth (GB/s)
Llama-2-7B    Base        76.33           1028.70
Llama-2-7B    8-bit       101.86          700.06

Generate Text

Model definition in model.py, generation code in generate.py.

python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"

To squeeze out a little bit more performance, you can also compile the prefill with --compile_prefill. This will increase compilation times though.
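
For example, the two flags can be combined with the same checkpoint path as above:

python generate.py --compile --compile_prefill --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"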

Quantization

Choose the device to use by setting:

# Currently supported devices: cuda, cpu
export DEVICE=cuda

Int8 Weight-Only Quantization

To generate the int8 version of the model:

# Spits out model at checkpoints/$MODEL_REPO/model_int8.pth
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8

To run with int8, just pass the int8 checkpoint to generate.py.

python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth --device $DEVICE
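
For intuition, the symmetric per-channel int8 weight-only scheme that quantize.py reports boils down to something like the following sketch (illustrative only, not the exact quantize.py code):

import torch

# One scale per output channel; weights are rounded to int8 and rescaled on use.
def quantize_int8_per_channel(w: torch.Tensor):
    scales = w.abs().amax(dim=1, keepdim=True) / 127.0
    w_int8 = torch.clamp(torch.round(w / scales), -127, 127).to(torch.int8)
    return w_int8, scales

w = torch.randn(4096, 4096)
w_int8, scales = quantize_int8_per_channel(w)
print((w - w_int8.float() * scales).abs().max())  # worst-case rounding error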

Int4 Weight-Only Quantization

To generate the int4 version of the model:

# Spits out model at checkpoints/$MODEL_REPO/model_int4.g32.$DEVICE.pth
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4 --groupsize 32

To run with int4, just pass the int4 checkpoint to generate.py.

python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model_int4.g32.pth --compile

Speculative Sampling

To generate with speculative sampling, DRAFT_MODEL_REPO should point to a smaller model than MODEL_REPO.

In this example, the "smaller" model is just the int8 quantized version of the model.

export DRAFT_MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --draft_checkpoint_path checkpoints/$DRAFT_MODEL_REPO/model_int8.pth
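
For reference, the accept/reject rule behind speculative sampling looks roughly like the sketch below (illustrative; generate.py contains the actual batched implementation): each drafted token is accepted with probability min(1, p_target/p_draft), and on the first rejection a replacement token is sampled from the residual distribution.

import torch

def accept_draft_tokens(draft_probs, target_probs, draft_tokens):
    # draft_probs, target_probs: [k, vocab] distributions at each drafted position
    # draft_tokens: [k] tokens proposed by the draft model
    out = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            out.append(tok)  # accept the draft token
        else:
            # rejected: resample from the residual distribution max(p - q, 0)
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            out.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    return out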

Note: Running on an A100 80GB, albeit power-limited to 330 watts. Empirically, seems like peak bandwidth is about 1700 GB/s.

Tensor Parallelism

ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=2 generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth
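
The underlying pattern is to shard each large linear layer across GPUs and sum the partial results with an all-reduce. A minimal sketch of a row-parallel linear (illustrative of the idea, not the exact tp.py code):

import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):
    def __init__(self, linear: torch.nn.Linear, rank: int, world_size: int):
        super().__init__()
        # each rank keeps one slice of the input (in_features) dimension
        shard = linear.weight.detach().chunk(world_size, dim=1)[rank].clone()
        self.weight = torch.nn.Parameter(shard)

    def forward(self, x_shard):
        # assumes torch.distributed has been initialized (e.g. via torchrun)
        partial = torch.nn.functional.linear(x_shard, self.weight)
        dist.all_reduce(partial)  # sum partial results across GPUs
        return partial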

Experimental

Evaluation

We use the EleutherAI evaluation harness to evaluate our model accuracy. To evaluate the accuracy, make sure the evaluation harness is installed and pass your model checkpoint and desired tasks to eval.py.

python eval.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --compile --tasks hellaswag winogrande

Note: Generative tasks are currently not supported for gpt-fast

Installation Instructions for the evaluation harness: https://github.com/EleutherAI/lm-evaluation-harness/tree/master#install

GPTQ

We have a pure PyTorch implementation of GPTQ that utilizes torch._dynamo.export to access the model structure. You can generate a GPTQ-quantized int4 model by using the same quantization command but adding 'gptq' to the quantization mode, i.e.

# Spits out model at checkpoints/$MODEL_REPO/model_int4-gptq.g32.pth
python quantize.py --mode int4-gptq --calibration_tasks wikitext --calibration_seq_length 2048

You can then eval or generate text with this model in the same way as above.

License

gpt-fast is released under the BSD 3-Clause license.

Acknowledgements

Thanks to:

  • Lightning AI for supporting PyTorch and work on flash attention, int8 quantization, and LoRA fine-tuning.
  • GGML for driving forward fast, on-device inference of LLMs
  • Karpathy for spearheading simple, interpretable and fast LLM implementations
  • MLC-LLM for pushing 4-bit quantization performance on heterogeneous hardware

gpt-fast's People

Contributors

artyom17, chillee, cpuhrsch, edward-sun, eltociear, hdcharles, hmosousa, huntzhan, jerryzh168, kit1980, malfet, mdk8888, michaelfeil, mikekgfb, mingfeima, yanboliang, yifuwang


gpt-fast's Issues

Understanding why TorchInductor cannot speed-up huggingface transformer inference

Problem

torch.compile() shows an impressive ~2x speed-up for this code repo, but when applying it to huggingface transformers there is barely any speed-up. I want to understand why, and then figure out how TorchInductor can also benefit HF models (related issue #9)

Comparing HF's model.generate() vs gpt-fast under the same setting (same prompt, output length, sampling, data type, ...), I found that (on RTX 4090):

  • In eager mode without compile(), HF generate() (39.4 token/s) is faster than gpt-fast (28 token/s)
  • In compiled mode, HF generate() has almost no speed-up (still 39.4 token/s); gpt-fast gets much faster (68.5 token/s)

The blog mentions statically allocating KV cache, but isn't this also implemented in the HF llama model?
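
For reference, a statically allocated KV cache boils down to roughly the sketch below (shapes illustrative, not the exact model.py code): fixed-size buffers are preallocated and written in place, so tensor shapes never change across decoding steps, which is what torch.compile benefits from.

import torch

class StaticKVCache(torch.nn.Module):
    def __init__(self, max_seq_len, n_heads, head_dim, dtype=torch.bfloat16):
        super().__init__()
        shape = (1, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_cache", torch.zeros(shape, dtype=dtype))
        self.register_buffer("v_cache", torch.zeros(shape, dtype=dtype))

    def update(self, input_pos, k_val, v_val):
        # in-place writes at the current positions keep shapes fixed every step
        self.k_cache[:, :, input_pos] = k_val
        self.v_cache[:, :, input_pos] = v_val
        return self.k_cache, self.v_cache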

Benchmark code

GPT-fast

cd gpt-fast
export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
./scripts/prepare.sh $MODEL_REPO

python generate.py --prompt "Q: What is the largest animal?\nA:"  --max_new_tokens 134 --num_samples 1 --checkpoint_path checkpoints/$MODEL_REPO/model.pth
python generate.py --compile --prompt "Q: What is the largest animal?\nA:" --max_new_tokens 134 --num_samples 1 --checkpoint_path checkpoints/$MODEL_REPO/model.pth

--max_new_tokens 134 is to match HF's output length, as this gpt-fast repo continues to generate text even when hitting the end token </s>.

HuggingFace

Run the script below by

python ./hf_generate.py --compile --do_sample
import time
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import set_seed


def print_separater():
    print("=" * 20, "\n")

def get_model_and_tokenizer(model_path, device, dtype):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=dtype,
        device_map=device
    )
    model.tokenizer = tokenizer
    return model, tokenizer

def benchmark_throughput(model, model_inputs, args):
    device = model.device
    set_seed(args.seed)

    if device.type == "cuda":
        torch.cuda.synchronize()
    t0 = time.time()
    greedy_output = model.generate(
        **model_inputs,
        max_new_tokens=args.max_new_tokens,
        do_sample=args.do_sample,
        top_k=args.top_k,
        temperature=args.temperature,
    )
    if device.type == "cuda":
        torch.cuda.synchronize()
    t1 = time.time()

    time_elasped = t1 - t0
    num_tokens = greedy_output.numel() - model_inputs['input_ids'].numel()

    print("Output:\n" + 100 * '-')
    print(model.tokenizer.decode(greedy_output[0], skip_special_tokens=False))

    print("Generated Tokens:", num_tokens)
    print("Time Elasped (s):", time_elasped)
    throughput = num_tokens/ time_elasped

    return throughput

def main(args):
    print("torch and transformer version:", torch.__version__, transformers.__version__)
    print(torch.__config__.parallel_info())
    print(f"device: {args.device}, dtype: {args.dtype}")
    print(f"model: {args.model_path}")
    print_separater()

    model, tokenizer = get_model_and_tokenizer(args.model_path, args.device, args.dtype)
    model_inputs = tokenizer(args.prompt, return_tensors='pt').to(args.device)

    warm_up_tokens = 20
    set_seed(args.seed)
    warm_up_output = model.generate(**model_inputs, max_new_tokens=warm_up_tokens)

    throughput = benchmark_throughput(model, model_inputs, args)
    print("throughput eager (token/s):", throughput)

    if args.compile:
        t0 = time.time()
        compiled_model = torch.compile(
            model,
            backend=args.dynamo_backend,
            mode=args.dynamo_mode,
            dynamic=None,
            fullgraph=True,
            disable=False
            )
        t1 = time.time()
        print("Compile time (s):", t1 - t0)

        set_seed(args.seed)
        warm_up_output_compiled = compiled_model.generate(
            **model_inputs, max_new_tokens=warm_up_tokens)
        print("Warm-up result agree:", torch.equal(warm_up_output, warm_up_output_compiled))
        print_separater()

        throughput_compiled = benchmark_throughput(compiled_model, model_inputs, args)
        print("throughput compiled (token/s):", throughput_compiled)

        print_separater()
        print("compile speed-up:", throughput_compiled / throughput)

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description='Your CLI description.')

    parser.add_argument('--device', type=str,
                        default="cuda")
    parser.add_argument('--dtype', default=torch.float16)
    parser.add_argument('--model_path', type=str,
                        default="meta-llama/Llama-2-7b-chat-hf", help='HF model name or path.')
    parser.add_argument('--prompt', type=str,
                        default="Q: What is the largest animal?\nA:", help='Input prompt.')
    parser.add_argument('--max_new_tokens', type=int,
                        default=256, help='Maximum number of new tokens.')
    parser.add_argument('--do_sample', action='store_true',
                        help='Whether to use sampling. Default is greedy search.')
    parser.add_argument('--top_k', type=int,
                        default=200, help='Top-k for sampling.')
    parser.add_argument('--temperature', type=float,
                        default=0.8, help='Temperature for sampling.')
    parser.add_argument('--compile', action='store_true',
                        help='Whether to compile the model.')
    parser.add_argument('--dynamo_backend', type=str,
                        default="inductor", help='torch._dynamo.list_backends()')
    parser.add_argument('--dynamo_mode', type=str,
                        default="default", help='["default", "reduce-overhead", "max-autotune"]')
    parser.add_argument('--seed', type=int, default=42, help='Random seed.')

    args = parser.parse_args()
    main(args)

The default sampling settings are the same as this repo's generate.py

Output results

gpt-fast:

Loading model ...
Time to load model: 6.07 seconds
Q: What is the largest animal?\nA: The largest animal on Earth is the blue whale. On average, an adult blue whale can grow up to 82 feet (25 meters) in length and weigh around 150-170 tons (136,000-152,000 kilograms). However, the largest blue whale ever recorded was a female that was found in 1947 off the coast of Iceland, which measured around 108 feet (33 meters) in length and weighed an estimated 210 tons (182,000 kilograms).
Time for inference 1: 4.78 sec total, 28.02 tokens/sec
Bandwidth achieved: 377.67 GB/s
==========
Average tokens/sec: 28.02
Memory used: 13.59 GB

In eager mode, the output text is the same as Huggingface's, although the random seed settings differ from the HF script.

Time to load model: 6.26 seconds
Compilation time: 26.94 seconds
Q: What is the largest animal?\nA: The largest animal on Earth is the blue whale. It can grow up to 33 meters (108 feet) in length and weigh up to 180 metric tons (200 tons).t is important to note that the size of a blue whale can vary greatly depending on its age, sex, and other factors. Adult blue whales typically range in length from 18 to 25 meters (59 to 82 feet), with an average length of around 19 meters (62 feet).

Other large animals include:

1. Fin Whale: The fin whale
Time for inference 1: 1.95 sec total, 68.56 tokens/sec
Bandwidth achieved: 923.91 GB/s
==========
Average tokens/sec: 68.56
Memory used: 13.85 GB

With Inductor, the output text becomes different (not sure whether due to random seeds or floating-point issues), although still sensible.

Huggingface:

Output:
----------------------------------------------------------------------------------------------------
<s> Q: What is the largest animal?
A: The largest animal on Earth is the blue whale. On average, an adult blue whale can grow up to 82 feet (25 meters) in length and weigh around 150-170 tons (136,000-152,000 kilograms). However, the largest blue whale ever recorded was a female that was found in 1947 off the coast of Iceland, which measured around 108 feet (33 meters) in length and weighed an estimated 210 tons (182,000 kilograms).</s>
Generated Tokens: 134
Time Elasped (s): 3.39901065826416
throughput eager (token/s): 39.42323619203725
Compile time (s): 0.0032820701599121094
Warm-up result agree: True
==================== 

Output:
----------------------------------------------------------------------------------------------------
<s> Q: What is the largest animal?
A: The largest animal on Earth is the blue whale. On average, an adult blue whale can grow up to 82 feet (25 meters) in length and weigh around 150-170 tons (136,000-152,000 kilograms). However, the largest blue whale ever recorded was a female that was found in 1947 off the coast of Iceland, which measured around 108 feet (33 meters) in length and weighed an estimated 210 tons (182,000 kilograms).</s>
Generated Tokens: 134
Time Elasped (s): 3.404815673828125
throughput compiled (token/s): 39.356021834021995
==================== 

compile speed-up: 0.9982950573187892

Environment

  • torch-2.3.0.dev20231217+cu121
  • transformers-4.36.1
  • tokenizers-0.15.0
  • accelerate-0.25.0

Torch installed by

pip install --upgrade --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

which grabs https://download.pytorch.org/whl/nightly/cu121/torch-2.3.0.dev20231217%2Bcu121-cp310-cp310-linux_x86_64.whl

Similar results with torch 2.1.2+cu121 #46 (comment)

TypeError: __init__() got an unexpected keyword argument 'mmap'

Hi:
I am running the code on AMD GPUs (MI210s) and have encountered some issues with 'mmap'. I cannot find a solution online. Several places in the code pass 'mmap=True', which causes lots of issues, and I cannot reproduce the results. Please help! Thanks in advance, Yao Fehlis (AMD)

Compatible with AutoGPTQ?

Hi,

Are the GPTQ-converted models the same as AutoGPTQ's? Do they share the same configuration settings, such as group size, act-order, and so on?

There are plenty of GPTQ models out there on HF already that could potentially benefit from this repo!

Thanks

cc @TheBloke @PanQiWei

KeyError: 'model.layers.{}.self_attn.W_pack.weight'

device: torch.device = torch.device(torch._C._get_default_device()), # torch.device('cpu'),
Model config {'block_size': 2048, 'vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'dim': 4096, 'intermediate_size': 11008, 'n_local_heads': 32, 'head_dim': 128, 'rope_base': 10000, 'norm_eps': 1e-05}
/mnt/user/wangchenpeng/venv/fast/lib/python3.8/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
Traceback (most recent call last):
File "scripts/convert_hf_checkpoint.py", line 106, in
convert_hf_checkpoint(
File "/mnt/user/wangchenpeng/venv/fast/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "scripts/convert_hf_checkpoint.py", line 76, in convert_hf_checkpoint
new_key = weight_map[abstract_key]
KeyError: 'model.layers.{}.self_attn.W_pack.weight'

RuntimeError: cutlassF: no kernel found to launch!

root@md:/home/projects/gpt-fast# CUDA_VISIBLE_DEVICES=0 python3 generate.py --compile --checkpoint_path /models/huggingface_models/meta-Llama-2-7b-hf/model_int8.pth --max_new_tokens 100
Loading model ...
Using int8 weight-only quantization!
/opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
Time to load model: 2.33 seconds
Traceback (most recent call last):
File "/home/projects/gpt-fast/generate.py", line 407, in
main(
File "/home/projects/gpt-fast/generate.py", line 346, in main
y, metrics = generate(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/projects/gpt-fast/generate.py", line 167, in generate
next_token = prefill(model, prompt.view(1, -1), input_pos, **sampling_kwargs)
File "/home/projects/gpt-fast/generate.py", line 52, in prefill
logits = model(x, input_pos)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/gpt-fast/model.py", line 118, in forward
x = layer(x, input_pos, freqs_cis, mask)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/gpt-fast/model.py", line 137, in forward
h = x + self.attention(self.attention_norm(x), freqs_cis, mask, input_pos)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/gpt-fast/model.py", line 186, in forward
y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)
RuntimeError: cutlassF: no kernel found to launch!

GPU: NVIDIA V100

conda list:

_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
asttokens 2.0.5 pyhd3eb1b0_0
astunparse 1.6.3 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
backcall 0.2.0 pyhd3eb1b0_0
beautifulsoup4 4.12.2 py310h06a4308_0
blas 1.0 mkl
boltons 23.0.0 py310h06a4308_0
brotlipy 0.7.0 py310h7f8727e_1002
bzip2 1.0.8 h7b6447c_0
c-ares 1.19.0 h5eee18b_0
ca-certificates 2023.08.22 h06a4308_0
certifi 2023.7.22 py310h06a4308_0
cffi 1.15.1 py310h5eee18b_3
chardet 4.0.0 py310h06a4308_1003
charset-normalizer 2.0.4 pyhd3eb1b0_0
click 8.0.4 py310h06a4308_0
cmake 3.26.4 h96355d8_0
conda 23.9.0 py310h06a4308_0
conda-build 3.27.0 py310h06a4308_0
conda-content-trust 0.2.0 py310h06a4308_0
conda-index 0.3.0 py310h06a4308_0
conda-libmamba-solver 23.7.0 py310h06a4308_0
conda-package-handling 2.2.0 py310h06a4308_0
conda-package-streaming 0.9.0 py310h06a4308_0
cryptography 41.0.3 py310hdda0065_0
cuda-cudart 11.8.89 0 nvidia
cuda-cupti 11.8.87 0 nvidia
cuda-libraries 11.8.0 0 nvidia
cuda-nvrtc 11.8.89 0 nvidia
cuda-nvtx 11.8.86 0 nvidia
cuda-runtime 11.8.0 0 nvidia
decorator 5.1.1 pyhd3eb1b0_0
dnspython 2.4.2 pypi_0 pypi
exceptiongroup 1.0.4 py310h06a4308_0
executing 0.8.3 pyhd3eb1b0_0
expat 2.5.0 h6a678d5_0
expecttest 0.1.6 pypi_0 pypi
ffmpeg 4.3 hf484d3e_0 pytorch
filelock 3.9.0 py310h06a4308_0
fmt 9.1.0 hdb19cb5_0
freetype 2.12.1 h4a9f257_0
fsspec 2023.9.2 pypi_0 pypi
giflib 5.2.1 h5eee18b_3
gmp 6.2.1 h295c915_3
gmpy2 2.1.2 py310heeb90bb_0
gnutls 3.6.15 he1e5248_0
hypothesis 6.87.1 pypi_0 pypi
icu 58.2 he6710b0_3
idna 3.4 py310h06a4308_0
intel-openmp 2023.1.0 hdb19cb5_46305
ipython 8.15.0 py310h06a4308_0
jedi 0.18.1 py310h06a4308_1
jinja2 3.1.2 py310h06a4308_0
jpeg 9e h5eee18b_1
jsonpatch 1.32 pyhd3eb1b0_0
jsonpointer 2.1 pyhd3eb1b0_0
krb5 1.20.1 h143b758_1
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libarchive 3.6.2 h6ac8c49_2
libcublas 11.11.3.6 0 nvidia
libcufft 10.9.0.58 0 nvidia
libcufile 1.7.2.10 0 nvidia
libcurand 10.3.3.141 0 nvidia
libcurl 8.1.1 h251f7ec_1
libcusolver 11.4.1.48 0 nvidia
libcusparse 11.7.5.86 0 nvidia
libdeflate 1.17 h5eee18b_1
libedit 3.1.20221030 h5eee18b_0
libev 4.33 h7f8727e_1
libffi 3.4.4 h6a678d5_0
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libiconv 1.16 h7f8727e_2
libidn2 2.3.4 h5eee18b_0
libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
liblief 0.12.3 h6a678d5_0
libmamba 1.4.1 h2dafd23_1
libmambapy 1.4.1 py310h2dafd23_1
libnghttp2 1.52.0 h2d74bed_1
libnpp 11.8.0.86 0 nvidia
libnvjpeg 11.9.0.86 0 nvidia
libpng 1.6.39 h5eee18b_0
libsolv 0.7.22 he621ea3_0
libssh2 1.10.0 hdbd6064_2
libstdcxx-ng 11.2.0 h1234567_1
libtasn1 4.19.0 h5eee18b_0
libtiff 4.5.1 h6a678d5_0
libunistring 0.9.10 h27cfd23_0
libuuid 1.41.5 h5eee18b_0
libuv 1.44.2 h5eee18b_0
libwebp 1.3.2 h11a3e52_0
libwebp-base 1.3.2 h5eee18b_0
libxml2 2.10.3 hcbfbd50_0
llvm-openmp 14.0.6 h9e868ea_0
lz4-c 1.9.4 h6a678d5_0
markupsafe 2.1.1 py310h7f8727e_0
matplotlib-inline 0.1.6 py310h06a4308_0
mkl 2023.1.0 h213fc3f_46343
mkl-service 2.4.0 py310h5eee18b_1
mkl_fft 1.3.8 py310h5eee18b_0
mkl_random 1.2.4 py310hdb19cb5_0
more-itertools 8.12.0 pyhd3eb1b0_0
mpc 1.1.0 h10f8cd9_1
mpfr 4.0.2 hb69a4c5_1
mpmath 1.3.0 py310h06a4308_0
ncurses 6.4 h6a678d5_0
nettle 3.7.3 hbbd107a_1
networkx 3.1 py310h06a4308_0
numpy 1.26.0 py310h5f9d8c6_0
numpy-base 1.26.0 py310hb5e798b_0
openh264 2.1.1 h4ff587b_0
openssl 3.0.11 h7f8727e_2
packaging 23.1 py310h06a4308_0
parso 0.8.3 pyhd3eb1b0_0
patch 2.7.6 h7b6447c_1001
patchelf 0.17.2 h6a678d5_0
pcre2 10.37 he7ceb23_1
pexpect 4.8.0 pyhd3eb1b0_3
pickleshare 0.7.5 pyhd3eb1b0_1003
pillow 9.4.0 py310h6a678d5_1
pip 23.2.1 py310h06a4308_0
pkginfo 1.9.6 py310h06a4308_0
pluggy 1.0.0 py310h06a4308_1
prompt-toolkit 3.0.36 py310h06a4308_0
psutil 5.9.0 py310h5eee18b_0
ptyprocess 0.7.0 pyhd3eb1b0_2
pure_eval 0.2.2 pyhd3eb1b0_0
py-lief 0.12.3 py310h6a678d5_0
pybind11-abi 4 hd3eb1b0_1
pycosat 0.6.4 py310h5eee18b_0
pycparser 2.21 pyhd3eb1b0_0
pygments 2.15.1 py310h06a4308_1
pyopenssl 23.2.0 py310h06a4308_0
pysocks 1.7.1 py310h06a4308_0
python 3.10.13 h955ad1f_0
python-etcd 0.4.5 pypi_0 pypi
python-libarchive-c 2.9 pyhd3eb1b0_1
pytorch 2.1.0 py3.10_cuda11.8_cudnn8.7.0_0 pytorch
pytorch-cuda 11.8 h7e8668a_5 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2023.3.post1 py310h06a4308_0
pyyaml 6.0 py310h5eee18b_1
readline 8.2 h5eee18b_0
reproc 14.2.4 h295c915_1
reproc-cpp 14.2.4 h295c915_1
requests 2.31.0 py310h06a4308_0
rhash 1.4.3 hdbd6064_0
ruamel.yaml 0.17.21 py310h5eee18b_0
ruamel.yaml.clib 0.2.6 py310h5eee18b_1
sentencepiece 0.1.99 pypi_0 pypi
setuptools 68.0.0 py310h06a4308_0
six 1.16.0 pyhd3eb1b0_1
sortedcontainers 2.4.0 pypi_0 pypi
soupsieve 2.5 py310h06a4308_0
sqlite 3.41.2 h5eee18b_0
stack_data 0.2.0 pyhd3eb1b0_0
sympy 1.12 pypi_0 pypi
tbb 2021.8.0 hdb19cb5_0
tk 8.6.12 h1ccaba5_0
tomli 2.0.1 py310h06a4308_0
toolz 0.12.0 py310h06a4308_0
torchaudio 2.1.0 py310_cu118 pytorch
torchelastic 0.2.2 pypi_0 pypi
torchtriton 2.1.0 py310 pytorch
torchvision 0.16.0 py310_cu118 pytorch
tqdm 4.65.0 py310h2f386ee_0
traitlets 5.7.1 py310h06a4308_0
truststore 0.8.0 py310h06a4308_0
types-dataclasses 0.6.6 pypi_0 pypi
typing-extensions 4.8.0 pypi_0 pypi
typing_extensions 4.7.1 py310h06a4308_0
tzdata 2023c h04d1e81_0
urllib3 1.26.16 py310h06a4308_0
wcwidth 0.2.5 pyhd3eb1b0_0
wheel 0.41.2 py310h06a4308_0
xz 5.4.2 h5eee18b_0
yaml 0.2.5 h7b6447c_0
yaml-cpp 0.7.0 h295c915_1
zlib 1.2.13 h5eee18b_0
zstandard 0.19.0 py310h5eee18b_0
zstd 1.5.5 hc292b87_0

Can we run gpt-fast from Windows Command Prompt or Powershell?

For this code:

export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
./scripts/prepare.sh $MODEL_REPO

I could change to:
set MODEL_REPO=meta-llama/Llama-2-7b-chat-hf

But I'm not sure how to proceed further. Is there a way to run gpt-fast from Windows Command Prompt or Powershell?
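
One hedged sketch of an equivalent, assuming prepare.sh simply wraps scripts/download.py and scripts/convert_hf_checkpoint.py (the flag names below are an assumption; check the script contents in your checkout):

set MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
python scripts/download.py --repo_id %MODEL_REPO%
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/%MODEL_REPO%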


AttributeError: torch._inductor.config.fx_graph_cache does not exist

Quantize the model to int8 and it gave this error:

ubuntu@ip-172-31-19-240:~/gpt-fast$ python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
Loading model ...
/opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
Quantizing model weights for int8 weight-only symmetric per-channel quantization
Writing quantized weights to checkpoints/openlm-research/open_llama_7b/model_int8.pth
Quantization complete took 24.35 seconds

ubuntu@ip-172-31-19-240:~/gpt-fast$ python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
Traceback (most recent call last):
File "/home/ubuntu/gpt-fast/generate.py", line 18, in
torch._inductor.config.fx_graph_cache = True # Experimental feature to reduce compilation times, will be on by default in future
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/config_utils.py", line 72, in setattr
raise AttributeError(f"{self.name}.{name} does not exist")
AttributeError: torch._inductor.config.fx_graph_cache does not exist

System:

Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-1049-aws x86_64v)

  • Please note that Amazon EC2 P2 Instance is not supported on current DLAMI.
  • Supported EC2 instances: P5, P4d, P4de, P3, P3dn, G5, G4dn, G3.
  • To activate pre-built pytorch environment, run: 'source activate pytorch'
  • To activate base conda environment upon login, run: 'conda config --set auto_activate_base true'
  • NVIDIA driver version: 535.104.12
  • CUDA version: 12.1

Problems with interactive mode

I want to use interactive mode and I run this command:

python generate.py --compile --interactive --draft_checkpoint_path checkpoints/$DRAFT_MODEL_REPO/model_int8.pth --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth --speculate_k 3

However, I got the following error. Could you please help look into it?

Loading model ...
Using int8 weight-only quantization!
Using int8 weight-only quantization!
Time to load model: 17.47 seconds
Compilation time: 121.34 seconds
What is your prompt? do you know meta?
Traceback (most recent call last):
File "/share/edc/home/xuandongzhao/safety/gpt-fast/generate.py", line 404, in
main(
File "/share/edc/home/xuandongzhao/safety/gpt-fast/generate.py", line 343, in main
y, metrics = generate(
File "/local/home/xuandongzhao/anaconda3/envs/safety/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/share/edc/home/xuandongzhao/safety/gpt-fast/generate.py", line 185, in generate
callback(i)
File "/share/edc/home/xuandongzhao/safety/gpt-fast/generate.py", line 326, in callback
buffer.append(tokenizer.decode([period_id] + x.tolist())[1:])
TypeError: can only concatenate list (not "int") to list

What is `torch.ops.aten._convert_weight_to_int4pack` ?

I'm using torch.__version__ = 2.1.0a0+32f93b1

which doesn't have

AttributeError: '_OpNamespace' 'aten' object has no attribute '_convert_weight_to_int4pack'

What exactly does this do, and is it defined elsewhere?

Unfortunately upgrading to the latest torch-dev breaks flash-attention2

What would it take to support other models like deepseek coder?

This is an amazing project and it would be great to support other models.

I've been looking at using deepseek with gpt-fast.
Deepseek is in the Llama2 family.
I've gotten as far as converting the model and replacing the tokenizer.
I can run the model, but the output isn't correct.
I think that there are some differences in architecture, but I can't tell if they are a problem.

I think I have the correct parameters:
"deepseek-coder-6.7b-base":dict(block_size=16384, vocab_size=32256, intermediate_size=11008, norm_eps=1e-6, rope_base = 100000)

So the model converts and runs, but the output is gibberish. Could there be something wrong in the conversion step? I can't tell what all the key mapping is for, so I don't know if that's working correctly.

Any suggestions on what to do next?

Downloads the whole hf repo

The scripts/download.py script downloads the whole HF repo, including the *.safetensors files, even though it seems only the *.bin files are required.

Too-long input texts cause "device-side assert triggered"

<frozen importlib._bootstrap_external>:843: _call_with_frames_removed: block: [8,0,0], thread: [58,0,0] Assertion `index out of bounds: 0 <= tmp68 < 1504` failed.
<frozen importlib._bootstrap_external>:843: _call_with_frames_removed: block: [8,0,0], thread: [59,0,0] Assertion `index out of bounds: 0 <= tmp68 < 1504` failed.
<frozen importlib._bootstrap_external>:843: _call_with_frames_removed: block: [8,0,0], thread: [60,0,0] Assertion `index out of bounds: 0 <= tmp68 < 1504` failed.
<frozen importlib._bootstrap_external>:843: _call_with_frames_removed: block: [8,0,0], thread: [61,0,0] Assertion `index out of bounds: 0 <= tmp68 < 1504` failed.
<frozen importlib._bootstrap_external>:843: _call_with_frames_removed: block: [8,0,0], thread: [62,0,0] Assertion `index out of bounds: 0 <= tmp68 < 1504` failed.
<frozen importlib._bootstrap_external>:843: _call_with_frames_removed: block: [8,0,0], thread: [63,0,0] Assertion `index out of bounds: 0 <= tmp68 < 1504` failed.
...
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

When I use llama-2-7b-hf and input ten samples, each with a length of around 2048 after tokenization, I still encounter the above error, even though I have set the block size to 4096.

If I shorten the length, it works.

torch.compile() with flash decoding ops

I'm trying to replace F.scaled_dot_product_attention with a flash decoding kernel for faster inference.

However, while the flash decoding function works well in eager mode, I cannot make it work with torch.compile(). It seems that torch.compile() does not support such third-party operators. How can I overcome this problem?

My code is like:

...
from xformers import _C_flashattention as  flash_attn_cuda
...

# y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0) ->
out, *_ = flash_attn_cuda.fwd(q, k, v, ...)

And the error message with --compile option is:

...
  File "/home/coder/miniconda3/envs/gpt-fast/lib/python3.8/site-packages/torch/_dynamo/variables/base.py", line 340, in call_method
    raise unimplemented(f"call_method {self} {name} {args} {kwargs}")
  File "/home/coder/miniconda3/envs/gpt-fast/lib/python3.8/site-packages/torch/_dynamo/exc.py", line 193, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: call_method UserDefinedObjectVariable(fwd) __call__ [TensorVariable(), TensorVariable(), TensorVariable(), ConstantVariable(NoneType), ConstantVariable(float), ConstantVariable(float), ConstantVariable(bool), ConstantVariable(int), ConstantVariable(int), ConstantVariable(bool), ConstantVariable(NoneType)] {}

from user code:
   File "benchmark.py", line 64, in decode_one_token
    logits = model(x, input_pos) # [B, 1, vocab_size]
  File "/home/coder/miniconda3/envs/gpt-fast/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/coder/projects/gpt-fast/model.py", line 388, in forward
    x = layer(x, input_pos, freqs_cis, mask)
  File "/home/coder/miniconda3/envs/gpt-fast/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/coder/projects/gpt-fast/model.py", line 407, in forward
    h = x + self.attention(self.attention_norm(x), freqs_cis, mask, input_pos)
  File "/home/coder/miniconda3/envs/gpt-fast/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/coder/projects/gpt-fast/model.py", line 477, in forward
    y, lse = flash_attn_forward(q, k, v)
  File "/home/coder/projects/gpt-fast/model.py", line 34, in flash_attn_forward
    out, *_ = flash_attn_cuda.fwd(

How to cache the compilation result?

torch.compile always re-compiles a function from scratch in a new Python session, which takes a lot of time.
I'm wondering if there's a way to cache the compilation result in the file system (like gcc/clang) to speed up the development & debugging process.
@Chillee

gpt-fast/generate.py

Lines 16 to 18 in db7b273

torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.triton.unique_kernel_names = True
torch._inductor.config.fx_graph_cache = True # Experimental feature to reduce compilation times, will be on by default in future

GPTQ quantization not working

Running quantize.py with --mode int4-gptq does not seem to work:

  • code tries to import lm-evaluation-harness which is not included/documented/used
  • import in eval.py is incorrect, should probably be from model import Transformer as LLaMA instead of from model import LLaMA
  • after fixing two above issues, next one is a circular import
  • after fixing that, import lm_eval should be replaced with import lm_eval.base
  • there is one other circular import
  • there are a few other missing imports from lm_eval
  • and a few other errors

Overall here are the fixes I had to apply to make it run: lopuhin@86d990b

Based on this, could you please check if the right version of the code was included for GPTQ quantization?

Slight performance improvement (ㄒoㄒ)

I only got a small improvement over the non-compiled run. Is there anything I missed?

Commands

cli 1:
time python generate.py --compile --compile_prefill --checkpoint_path /root/gpt-fast/codellama-34b-python/model_int8.pth --prompt "def quicksort(arr):" --max_new_tokens 32 --num_samples 50

cli 2:
time python generate.py --checkpoint_path /root/gpt-fast/codellama-34b-python/model_int8.pth --prompt "def quicksort(arr):" --max_new_tokens 32 --num_samples 50

Results

result of cli 1: 4.45tokens/sec & 151.52GB/s for bandwidth
result of cli 2: 4.24tokens/sec & 144.55GB/s for bandwidth

relative improvement (compiled vs. not compiled):
speed: 4.9%
memory bandwidth: 4.8%

Env

gpu: 1*L40S
docker: python:3.9
pytorch installation: pip install torch

repeat sentence and non-complete sentence in the end

Firstly, thanks for your wonderful work. I have a question about making gpt-fast production-ready. Currently, features such as a repetition penalty or stop strings are not included in this repository, which causes the LLM to generate repeated sentences and an incomplete sentence at the end. Is there any plan to support these features?

Tensor parallel hangs on call to model

I'm trying to run the TP code, using this script.

export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
export OMP_NUM_THREADS=16
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO
#time torchrun --standalone --nproc_per_node=2 generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "def quicksort(arr):" --max_new_tokens 200 --num_samples 50 --temperature 0

python -m torch.distributed.launch --nproc_per_node=2 \
                                   --nnodes=1 \
                                   --node_rank=0 \
                                   --master_addr="127.0.0.1" \
                                   --master_port=29501 \
                                   generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "def quicksort(arr):" --max_new_tokens 200 --num_samples 50 --temperature 0

It runs until the first call to the model in prefill.
It stops there and hangs.

What is the know configuration that TP works under?

torch.__version__
'2.2.0.dev20231213+cu121'

~ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0 Off |                  Off |
|  0%   44C    P2              94W / 480W |   7508MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:17:00.0 Off |                  Off |
| 30%   26C    P2             100W / 480W |   7508MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Debug output

[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
/home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:29501.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:29501.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 29501.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29501).
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:299] [c10d - debug] The server socket on [::]:29501 has accepted a connection from [localhost]:35774.
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:29501 on [localhost]:35774.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host 127.0.0.1:29501
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
/home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29501).
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:299] [c10d - debug] The server socket on [::]:29501 has accepted a connection from [localhost]:35790.
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:29501 on [localhost]:35790.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host 127.0.0.1:29501
[I ProcessGroupNCCL.cpp:785] [Rank 1] ProcessGroupNCCL initialization options: NCCL version: 2.19.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 120, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, NCCL_DEBUG: OFF, ID=94527839849424
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29501).
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:299] [c10d - debug] The server socket on [::]:29501 has accepted a connection from [localhost]:35804.
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:29501 on [localhost]:35804.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host 127.0.0.1:29501
[I ProcessGroupNCCL.cpp:785] [Rank 0] ProcessGroupNCCL initialization options: NCCL version: 2.19.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 120, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, NCCL_DEBUG: OFF, ID=94408215489568
Loading model ...
Applying tensor parallel to model ...
Time to load model: 2.36 seconds
no compile
Generating sample 1 of 50
generate
cache setup done
prefill
[rank0]:[I ProcessGroupWrapper.cpp:587] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[7, 4096], TensorDtypes=BFloat16, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I ProcessGroupWrapper.cpp:587] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[7, 4096], TensorDtypes=BFloat16, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[I ProcessGroupNCCL.cpp:1854] NCCL_DEBUG: N/A

@Chillee
Any help greatly appreciated.

Does `gpt-fast` work on V100 GPUs?

Everything works on my A6000s and A100s, but not on the older V100 (says compute capability is low). Are there plans to add support for the legacy devices? Thanks!

Expert parallelism / MoE example would be awesome :)

I loved seeing the blog post with a simple, standalone implementation of many techniques used in production to speed up LLMs. Would love to see this extended to MoE like Mixtral, which at the moment seem fairly annoying to use and hack on. Curious how torch.compile can help with these, and possible issues that might arise like graph breaks due to gating.

Normal Inference seems to output more tokens per second.

Hi,

I just did a quick implementation of gpt-fast and ran inference with Llama-2 7B. I get around 65 tokens per second on average without quantization.

I also ran a quick inference with the normal implementation of the model, and I get around 150 tokens per second on average. (My calc: total no. of tokens / total time taken)

Is there anything I am missing?

My hardware spec:
GPU: NVIDIA GeForce RTX 4090
Cpu cores: 32

Is the "no of CPU cores" a hyper parameter in gpt-fast? Is there a threshold for the "no of CPU cores", below which only the CPU overhead occurs and gpt-fast helps only in such cases?

Please correct me If I am wrong.

Thanks in advance.

Bug convert HF model

Bug Report

Description:

I encountered a bug when attempting to convert a model from Hugging Face (HF) using the provided code implementation. The issue appears to be related to counting parameters in the PyTorch model.

Code Implementation:

import re
import torch
from transformers import LlamaForCausalLM, AutoTokenizer

from models.model_configs import transformer_configs
from models.llama import ModelArgs, LLama  # assuming LLama (the Transformer) is defined alongside ModelArgs

#Load model
model_path = "nickypro/tinyllama-15M-fp32"
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype="auto", use_cache=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

tiny_llama_15M_config = ModelArgs.from_name('tinyllama-15M')
#ModelArgs(block_size=256, vocab_size=32000, n_layer=6, n_head=6, dim=288, intermediate_size=768, n_local_heads=6, head_dim=48, rope_base=10000, norm_eps=1e-05)
mymodel = LLama(tiny_llama_15M_config)

#Convert HF
weight_map = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    "model.layers.{}.self_attn.q_proj.weight": "layers.{}.attention.wq.weight",
    "model.layers.{}.self_attn.k_proj.weight": "layers.{}.attention.wk.weight",
    "model.layers.{}.self_attn.v_proj.weight": "layers.{}.attention.wv.weight",
    "model.layers.{}.self_attn.o_proj.weight": "layers.{}.attention.wo.weight",
    'model.layers.{}.self_attn.rotary_emb.inv_freq': None,
    'model.layers.{}.mlp.gate_proj.weight': 'layers.{}.feed_forward.w1.weight',
    "model.layers.{}.mlp.up_proj.weight": "layers.{}.feed_forward.w3.weight",
    "model.layers.{}.mlp.down_proj.weight": "layers.{}.feed_forward.w2.weight",
    "model.layers.{}.input_layernorm.weight": "layers.{}.attention_norm.weight",
    "model.layers.{}.post_attention_layernorm.weight": "layers.{}.ffn_norm.weight",
    "model.norm.weight": "norm.weight",
    "lm_head.weight": "output.weight",
}

def permute(w, n_head, dim, head_dim):
    dim = dim
    return (
        w.view(n_head, 2, head_dim // 2, dim)
        .transpose(1, 2)
        .reshape(head_dim * n_head, dim)
    )

hf_path = "tests/tiny_15M_fp32.pt"
torch.save(model.state_dict(), hf_path)
checkpoint = torch.load(hf_path)

final_result = {}
for key, value in checkpoint.items():
    if "layers" in key:
        abstract_key = re.sub(r'(\d+)', '{}', key)
        layer_num = re.search(r'\d+', key).group(0)
        new_key = weight_map[abstract_key]
        if new_key is None:
            continue
        new_key = new_key.format(layer_num)
    else:
        new_key = weight_map[key]

    final_result[new_key] = value

for key in tuple(final_result.keys()):
    if "wq" in key:
        q = final_result[key]
        k = final_result[key.replace("wq", "wk")]
        v = final_result[key.replace("wq", "wv")]
        q = permute(q, tiny_llama_15M_config.n_head, tiny_llama_15M_config.dim, tiny_llama_15M_config.head_dim)
        k = permute(k, tiny_llama_15M_config.n_local_heads, tiny_llama_15M_config.dim, tiny_llama_15M_config.head_dim)
        final_result[key.replace("wq", "wqkv")] = torch.cat([q, k, v])
        del final_result[key]
        del final_result[key.replace("wq", "wk")]
        del final_result[key.replace("wq", "wv")]

torch.save(final_result,  "tests/model_converted.pth")
mymodel.load_state_dict(torch.load("tests/model_converted.pth"))

#Input
inputs = tokenizer("I am a student at", return_tensors="pt", return_attention_mask=False)
inp = inputs['input_ids']
T = inp.shape[1]
input_pos = torch.arange(0, T)

# Run HF
hf_outputs = model(**inputs)

#Run implementation
mymodel.setup_caches(1, tiny_llama_15M_config.block_size)
output = mymodel(inp, input_pos)

# Output
hf_outputs.logits
tensor([[[ -6.7908,   0.8281,  -6.7904,  ...,  -6.7907,  -6.7907,  -6.7905],
         [ -8.2606,  -0.2434,  -8.2608,  ...,  -8.2607,  -8.2609,  -8.2608],
         [-10.8138,  -3.1881, -10.8137,  ..., -10.8138, -10.8139, -10.8138],
         [-11.4940,  -0.7831, -11.4936,  ..., -11.4939, -11.4938, -11.4937],
         [-11.8310,  -2.4853, -11.8308,  ..., -11.8310, -11.8310, -11.8310],
         [ -6.9855,   0.3798,  -6.9853,  ...,  -6.9855,  -6.9853,  -6.9854]]],
       grad_fn=<UnsafeViewBackward0>)

output
tensor([[[ -6.7908,   0.8281,  -6.7904,  ...,  -6.7907,  -6.7907,  -6.7905],
         [ -8.2573,  -0.2481,  -8.2575,  ...,  -8.2574,  -8.2576,  -8.2575],
         [-10.8115,  -3.1968, -10.8114,  ..., -10.8116, -10.8116, -10.8115],
         [-11.4960,  -0.7812, -11.4957,  ..., -11.4959, -11.4959, -11.4957],
         [-11.8348,  -2.4808, -11.8346,  ..., -11.8348, -11.8348, -11.8348],
         [ -6.9774,   0.3842,  -6.9772,  ...,  -6.9774,  -6.9772,  -6.9773]]],
       grad_fn=<UnsafeViewBackward0>)

diff  = torch.sum(abs(hf_outputs.logits - output), -1)
tensor([[  0.0000,  99.5280,  80.2527,  58.3247,  99.8532, 236.9683]],
       grad_fn=<SumBackward1>)

I think the bug lies in the Key-Value (KV) cache, as the output for the first token matches exactly while the later positions drift.

index out of bounds for --compile_prefill with int4 and int8

Running 13b chat model on L4 GPU with

python generate.py --checkpoint_path .../model_int4.g32.pth --compile --compile_prefill

An error happens

Traceback (most recent call last):                                                                                                                                                                                                                                            
  File "/home/user/gpt-fast/generate.py", line 407, in <module>                                                                                                                                                                                         
    main(                                                                                                                                                                                                                                                                     
  File "/home/user/gpt-fast/generate.py", line 346, in main                                                                                                                                                                                             
    y, metrics = generate(                                                                                                                                                                                                                                                    
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context                                                                                                                                
    return func(*args, **kwargs)                                                                                                                                                                                                                                              
  File "/home/user/gpt-fast/generate.py", line 190, in generate                                                                                                                                                                                         
    generated_tokens, _ = decode_n_tokens(model, next_token.view(1, -1), input_pos, max_new_tokens - 1, callback=callback, **sampling_kwargs)                                                                                                                                 
  File "/home/user/gpt-fast/generate.py", line 62, in decode_n_tokens                                                                                                                                                                                   
    next_token, next_prob = decode_one_token(                                                                                                                                                                                                                                 
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn                                                                                                                                            
    return fn(*args, **kwargs)                                                                                                                                                                                                                                                
  File "/home/user/gpt-fast/generate.py", line 52, in decode_one_token                                                                                                                                                                                  
    def decode_one_token(model: Transformer, x: torch.Tensor, input_pos: torch.Tensor, **sampling_kwargs) -> Tuple[torch.Tensor, torch.Tensor]:                                                                                                                               
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn                                                                                                                                            
    return fn(*args, **kwargs)                                                                                                                                                                                                                                                
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner                                                                                                                                       
    return fn(*args, **kwargs)                                                                                                                                                                                                                                                
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 899, in forward                                                                                                                                   
    return compiled_fn(full_args)                                                                                                                                                                                                                                             
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 81, in g                                                                                                                                   
    return f(*args)                                                                                                                                                                                                                                                           
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 94, in runtime_wrapper                                                                                                          
    all_outs = call_func_at_runtime_with_args(                                                                                                                                                                                                                                
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in call_func_at_runtime_with_args                                                                                                     
    out = normalize_as_list(f(args))                                                                                                                                                                                                                                          
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 118, in rng_functionalization_wrapper                                                                               
    return compiled_fw(args)                                                                                                                                                                                                                                                  
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 861, in __call__                                                                                                                                      
    return self.get_current_callable()(inputs)                                                                                                                                                                                                                                
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 665, in run                                                                                                                                          
    return compiled_fn(new_inputs)                                                                                                                                                                                                                                            
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 380, in deferred_cudagraphify                                                                                                                   
    fn, out = cudagraphify(model, inputs, new_static_input_idxs, *args, **kwargs)                                                                                                                                                                                             
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 408, in cudagraphify
    return manager.add_function(                                                                                                       
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1941, in add_function
    return fn, fn(inputs)                                                                                                              
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1755, in run
    out = self._run(new_inputs, function_id)                                                                                           
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1796, in _run
    return self.run_eager(new_inputs, function_id)                                                                                     
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1911, in run_eager
    return node.run(new_inputs)                                                                                                        
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 611, in run
    out = self.wrapped_function.model(new_inputs)                                                                                      
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 889, in _run_from_cache
    return compiled_graph.compiled_artifact(inputs)                                                                                    
  File "/var/tmp/torchinductor_konstantin_scrapinghub_com/vm/cvm4fmeqocm6isnl6ojvwbbu2fgaupg577dne7od5jgoyyzjij4y.py", line 2469, in call
    triton_poi_fused_mul_silu_6.run(buf201, buf200, 13824, grid=grid(13824), stream=stream0)                    
  File "/home/user/gpt-fast/venv/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 568, in run
    return launcher(                                                                                                                   
  File "<string>", line 8, in launcher                                                                                                 
RuntimeError: Triton Error [CUDA]: device-side assert triggered                                                                        
unknown:0: unknown: block: [0,0,0], thread: [128,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [129,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [130,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [131,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
...

Library versions:

pytorch-triton==2.1.0+6e4932cda8
torch==2.2.0.dev20231201+cu121

It works fine without --compile_prefill

Device-side assertion error when running speculative decoding with prompts of different lengths

I am running the speculative sampling task with generate.py in --compile mode. The original speculative decoding path in gpt-fast decodes the same prompt several times, but I want to decode different prompts. When I decode prompts of different lengths, I hit a device-side assertion error. The error messages are:

unknown:0: unknown: block: [33,0,0], thread: [0,0,0] Assertion `index out of bounds: 0 <= tmp62 < 216` failed.
unknown:0: unknown: block: [33,0,0], thread: [1,0,0] Assertion `index out of bounds: 0 <= tmp62 < 216` failed.
unknown:0: unknown: block: [33,0,0], thread: [2,0,0] Assertion `index out of bounds: 0 <= tmp62 < 216` failed.
unknown:0: unknown: block: [33,0,0], thread: [3,0,0] Assertion `index out of bounds: 0 <= tmp62 < 216` failed.
...
unknown:0: unknown: block: [59,0,0], thread: [47,0,0] Assertion `index out of bounds: 0 <= tmp62 < 216` failed.
unknown:0: unknown: block: [59,0,0], thread: [48,0,0] Assertion `index out of bounds: 0 <= tmp62 < 216` failed.
unknown:0: unknown: block: [59,0,0], thread: [49,0,0] Assertion `index out of bounds: 0 <= tmp62 < 216` failed.
unknown:0: unknown: block: [59,0,0], thread: [50,0,0] Assertion `index out of bounds: 0 <= tmp62 < 216` failed.
unknown:0: unknown: block: [59,0,0], thread: [51,0,0] Assertion `index out of bounds: 0 <= tmp62 < 216` failed.

Question about position embedding

Hi, Thanks for the great work!

I have a question. Do the two lines at L172-L173 in model.py apply the position embedding, like the red box in the picture below?
[image]

If so, it seems the position embedding is applied inside the Attention class, which means every TransformerBlock applies it. Is that right? From the paper, I understood that the Transformer applies the position embedding only once, before sending the Q/K/V matrices into attention. Is this a deliberate design choice, or am I misunderstanding the code in model.py?
[image]

Thanks.
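For reference, here is a minimal sketch of how rotary position embeddings (RoPE) are typically applied to the query/key tensors inside each attention block; the shapes and helper names are illustrative assumptions, not gpt-fast's exact code. Because the rotation depends only on the token position and the head dimension, applying it to Q and K in every layer is the standard RoPE design used by the LLaMA family, unlike the one-time additive embedding in the original Transformer paper.

import torch

def apply_rope(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_head, head_dim); freqs_cis: (seq_len, head_dim // 2)
    # complex rotation factors, precomputed once and shared by all layers.
    # View the last dimension as pairs and treat each pair as a complex number.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    # Rotate each pair by its position-dependent angle, then flatten back.
    x_rotated = torch.view_as_real(x_complex * freqs_cis.view(1, x.shape[1], 1, -1))
    return x_rotated.flatten(-2).type_as(x)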

Inductor Op Lowering

Great blogpost!

Is there any documentation on how inductor lowers the ops in the fx graph to actual kernels -- specifically the optimization / tuning that determines the actual kernel implementations that are codegen'ed?

For example, in the blogpost you mention that the GEMV kernels generated by torch.compile are faster than handwritten / proprietary kernels from cuBLAS and FlashAttention.

I'd like to better understand the lowering passes that enable this:

  • Stepping through the compilation process in the debugger gets a bit muddled through the various layers of abstractions (more likely that I need to get better at debugging)
  • I've reviewed select_algorithm.py, triton_heuristics.py, the mm-specific kernels directory within inductor, etc. but am having trouble putting it all together.

Any suggestions / resources to illuminate this process would be greatly appreciated.

Thanks!
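In case it helps while digging: one low-effort way to see the code Inductor actually generates is to enable the output_code logging artifact before compiling. The sketch below assumes a recent nightly where torch._logging.set_logs accepts that artifact name; setting TORCH_COMPILE_DEBUG=1 similarly dumps intermediate IR and generated kernels to a torch_compile_debug/ directory.

import torch
import torch._logging

# Log the Inductor output code (Triton kernels plus the Python wrapper)
# whenever a graph is compiled.
torch._logging.set_logs(output_code=True)

@torch.compile
def fused_silu_mul(x):
    # A toy op that Inductor will fuse into a single pointwise kernel.
    return torch.nn.functional.silu(x) * x

fused_silu_mul(torch.randn(4096, device="cuda"))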

Speculative decoding slows model down, possibly from "skipping cudagraphs due to ['mutated inputs']"?

Some context

I am using AMD MI100 GPUs and I can get ~33 tokens/second for Llama 2 70B using

  • compile
  • tensor parallelism of 8
  • int8 quantization
time torchrun --standalone --nproc_per_node=8 generate.py --compile --checkpoint_path checkpoints/Llama-2-70b-chat-hf/model_int8.pth --max_new_tokens 100

Issue

This drops to ~7 tokens/second when I try to include speculative decoding with this command

time torchrun --standalone --nproc_per_node=8 generate.py --compile --draft_checkpoint_path checkpoints/meta-llama/Llama-2-7b-chat-hf/model_int8.pth --checkpoint_path checkpoints/meta-llama/Llama-2-70b-chat-hf/model_int8.pth --max_new_tokens 100 --speculate_k 5

The only clue I have to go on here is this output during compilation (which I only see when using speculative decoding):
skipping cudagraphs due to ['mutated inputs']
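One way to test whether the skipped CUDA graphs explain most of the regression is to disable them for the non-speculative baseline and compare tokens/s; the sketch below assumes torch._inductor.config.triton.cudagraphs is the relevant flag in this nightly.

import torch._inductor.config as inductor_config

# Disable CUDA graph capture for the compiled baseline run. If its tokens/s
# falls toward the speculative-decoding number, the "mutated inputs" skip is
# likely the main cause of the slowdown.
inductor_config.triton.cudagraphs = False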

Error(s) in loading state_dict for Transformer

I am running the preparation script for CodeLlama: ./scripts/prepare.sh codellama/CodeLlama-13b-Instruct-hf
and got the following error:

RuntimeError: Error(s) in loading state_dict for Transformer:
	size mismatch for tok_embeddings.weight: copying a param with shape torch.Size([32016, 5120]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
	size mismatch for output.weight: copying a param with shape torch.Size([32016, 5120]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
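The shapes suggest the checkpoint uses CodeLlama's 32016-token vocabulary while the model was constructed with the default 32000. A hypothetical transformer_configs entry one might try for CodeLlama-13b-Instruct-hf is sketched below; the field names follow gpt-fast's ModelArgs, and the values (standard Llama-2-13B geometry plus the larger vocabulary and CodeLlama's rope base) are assumptions to verify against the checkpoint's config.json.

# Hypothetical addition to model.py's transformer_configs dict.
transformer_configs = {
    "CodeLlama-13b-Instruct-hf": dict(
        n_layer=40, n_head=40, dim=5120, intermediate_size=13824,
        vocab_size=32016, rope_base=1000000,
    ),
}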

'Triton Error [CUDA]: device kernel image is invalid' while compiling

GPU: V100
CUDA: 11.8

I have changed all the torch.bfloat16 to torch.float16 as stated in #49. Is there something I am still missing?

Error Log:
root@84bb9affda66:/workspace/gpt-fast/gpt-fast-main# CUDA_VISIBLE_DEVICES=1 python generate.py --compile --checkpoint_path /datasets/llm/Llama-2-7b-chat-gpt-fast/model.pth --prompt "Hello, my name is"
Loading model ...
Time to load model: 4.16 seconds
Traceback (most recent call last):
File "/workspace/gpt-fast/gpt-fast-main/generate.py", line 410, in
main(
File "/workspace/gpt-fast/gpt-fast-main/generate.py", line 348, in main
y, metrics = generate(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/gpt-fast/gpt-fast-main/generate.py", line 193, in generate
generated_tokens, _ = decode_n_tokens(model, next_token.view(1, -1), input_pos, max_new_tokens - 1, callback=callback, **sampling_kwargs)
File "/workspace/gpt-fast/gpt-fast-main/generate.py", line 65, in decode_n_tokens
next_token, next_prob = decode_one_token(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 417, in _fn
return fn(*args, **kwargs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 580, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 384, in _convert_frame_assert
return _compile(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 641, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 246, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 522, in compile_inner
out_code = transform_code_object(code, transform)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
transformations(instructions, code_options)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 151, in _fn
return fn(*args, **kwargs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 487, in transform
tracer.run()
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2093, in run
super().run()
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 780, in run
and self.step()
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 743, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2211, in RETURN_VALUE
self.output.compile_subgraph(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 940, in compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1085, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 246, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1157, in call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1138, in call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/init.py", line 1697, in call
return compile_fx(model
, inputs
, config_patches=self.config)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 949, in compile_fx
return compile_fx(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1165, in compile_fx
return aot_autograd(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn
cg = aot_module_simplified(gm, example_inputs, **kwargs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 887, in aot_module_simplified
compiled_fn = create_aot_dispatcher_function(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 246, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 600, in create_aot_dispatcher_function
compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 427, in aot_wrapper_dedupe
return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 632, in aot_wrapper_synthetic_base
return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 97, in aot_dispatch_base
compiled_fw = compiler(fw_module, updated_flat_args)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 246, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1097, in fw_compiler_base
return inner_compile(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/debug.py", line 305, in inner
return fn(*args, **kwargs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 313, in compile_fx_inner
compiled_graph = FxGraphCache.load(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 801, in load
compiled_graph = compile_fx_fn(gm, example_inputs, **fx_kwargs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 547, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1140, in compile_to_fn
return self.compile_to_module().call
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 246, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1092, in compile_to_module
mod = PyCodeCache.load_by_key_path(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1891, in load_by_key_path
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_root/p7/cp76nmwkua2aiamfmc4zo5xzrahbgyn65ucwwsiznboqg4vejgrv.py", line 1679, in
async_compile.wait(globals())
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2470, in wait
scope[key] = result.result()
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2314, in result
kernel = self.kernel = _load_kernel(self.kernel_name, self.source_code)
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2290, in _load_kernel
kernel.precompile()
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 189, in precompile
compiled_binary, launcher = self._precompile_config(
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 344, in _precompile_config
binary._init_handles()
File "/opt/conda/envs/gpt-fast-new/lib/python3.10/site-packages/triton/compiler/compiler.py", line 683, in _init_handles
mod, func, n_regs, n_spills = fn_load_binary(self.metadata["name"], self.asm[bin_path], self.shared, device)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Triton Error [CUDA]: device kernel image is invalid

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True

pip list:
Package Version


certifi 2023.11.17
charset-normalizer 3.3.2
filelock 3.9.0
fsspec 2023.12.2
huggingface-hub 0.20.1
idna 3.6
Jinja2 3.1.2
MarkupSafe 2.1.3
mpmath 1.2.1
networkx 3.0rc1
numpy 1.26.2
nvidia-cublas-cu11 11.11.3.6
nvidia-cuda-cupti-cu11 11.8.87
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cudnn-cu11 8.7.0.84
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.3.0.86
nvidia-cusolver-cu11 11.4.1.48
nvidia-cusparse-cu11 11.7.5.86
nvidia-nccl-cu11 2.19.3
nvidia-nvtx-cu11 11.8.86
packaging 23.2
pip 23.3.1
pytorch-triton 2.2.0+e28a256d71
PyYAML 6.0.1
requests 2.31.0
sentencepiece 0.1.99
setuptools 68.2.2
sympy 1.11.1
torch 2.3.0.dev20231225+cu118
tqdm 4.66.1
typing_extensions 4.8.0
urllib3 2.1.0
wheel 0.41.2

About benchmark results

Amazing work! How were the benchmark results obtained? Is it just the generation speed measured with the GPU power limited to 330W?

AMD quantize

I am trying to quantize a model, but no quantized model is generated.
My hardware is AMD.

Loading model ...
Quantizing model weights for int8 weight-only symmetric per-channel quantization
Killed

AttributeError: 'LlamaForCausalLM' object has no attribute 'setup_caches'

Thanks for your valuable efforts implementing these tricks!
I hit the error in the title when using the following code.
Does everyone use the setup_caches method?
I suspect I am using it the wrong way.
My environment is below:

  • transformers==4.35.2
  • torch==2.0.1+cu117
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    load_in_4bit=True,
    torch_dtype="auto",
    use_cache=True,
    )

model = torch.compile(model, mode="reduce-overhead")

with torch.device(model.device):
    model.setup_caches(
        max_batch_size=1,
        max_seq_length=512,
        )

thanks in advance.
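For comparison, setup_caches in this repo is a method on gpt-fast's own Transformer class defined in model.py (it preallocates the static KV caches); it is not an attribute of Hugging Face's LlamaForCausalLM, which is what the error in the title reports. A rough sketch of how the method is used with the repo's model, assuming model.py's from_name constructor:

import torch
from model import Transformer  # gpt-fast's model.py, not transformers

# Build gpt-fast's Transformer (not a Hugging Face model) and preallocate the
# KV caches before compiling and generating.
model = Transformer.from_name("Llama-2-7b-chat-hf").to(device="cuda", dtype=torch.bfloat16)
with torch.device("cuda"):
    model.setup_caches(max_batch_size=1, max_seq_length=512)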

Unexpected key(s) in state_dict: "rope.freqs".

Hi all,
I'm using the llama-2-7b-chat model.
I tried to run generate.py with the following command line parameters:
--compile --checkpoint_path llama/llama-2-7b-chat/consolidated.00.pth --prompt "Hello, my name is"
and got the output/error below.

Loading model ...
Traceback (most recent call last):
File "generate.py", line 407, in
main(
File "generate.py", line 275, in main
model = _load_model(checkpoint_path, device, precision, use_tp)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "generate.py", line 226, in _load_model
model.load_state_dict(checkpoint, assign=True)
File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 2153, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Transformer:
Unexpected key(s) in state_dict: "rope.freqs".
python-BaseException

Process finished with exit code 1

I know that rope.freqs is present in the .pth file, but what value should I add in model.py under the transformer_configs dict?
Thanks

Model output is garbled and the program fails to terminate when using 2 or more A100 GPUs (single-GPU output is correct)

Model output is garbled when running on multiple A100 GPUs (2 or more, up to 8), and the program fails to terminate properly.

Environment

  • Ubuntu 20.04
  • Python 3.10.12
  • Pytorch 2.1.2+cu118
  • sentencepiece 0.1.99

GPU

  • 8 X A100 PCIE with P2P NVLink

Model

  • meta-llama/Llama-2-7b-chat-hf

Running Command - Tensor Parallelism

  • No Quantization
ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth
  • Int8 Quantization
ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth

Errors

  • Model output is garbled when using multiple A100 GPUs (2 or more)
  • The program fails to terminate properly (it gets stuck)

[Screenshots attached (2023-12-21) showing the garbled output.]

Is torch.empty_like truly random?

Edit: never mind, my mistake.


Hi, I saw that in generate.py you use the code below to do Gumbel-style sampling:

def multinomial_sample_one_no_sync(probs_sort):  # Does multinomial sampling without a cuda synchronization
    q = torch.empty_like(probs_sort).exponential_(1)
    return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)

Are you sure that empty_like behaves like a uniform random variable? It seems like it just takes whatever values happen to be in memory at the moment, which could follow some pattern rather than being pseudorandom. Would this affect the sampling and cause bias?
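For anyone else wondering the same thing: empty_like only allocates storage, and the in-place exponential_(1) immediately overwrites it with i.i.d. Exponential(1) samples, so no uninitialized memory influences the result. Taking argmax(probs / q) over independent exponentials is the classic exponential-race (Gumbel-max-style) trick and matches torch.multinomial in distribution. A quick sanity check one could run (a standalone sketch, not code from the repo):

import torch

torch.manual_seed(0)
probs = torch.tensor([0.1, 0.2, 0.3, 0.4])

def sample_exponential_race(p: torch.Tensor) -> int:
    # exponential_(1) fills the freshly allocated tensor with Exponential(1)
    # noise, so the uninitialized values from empty_like never matter.
    q = torch.empty_like(p).exponential_(1)
    return torch.argmax(p / q).item()

counts = torch.zeros(4)
for _ in range(100_000):
    counts[sample_exponential_race(probs)] += 1
print(counts / counts.sum())  # should land close to [0.1, 0.2, 0.3, 0.4]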

Mistral support

Would it be hard to adapt this code for Mistral? I tried the Open Orca version and set vocab_size in the config to 32002, but the shapes did not match:

File "/experiments/dev/nsherstnev/gpt-fast/scripts/convert_hf_checkpoint.py", line 61, in permute
    w.view(n_head, 2, config.head_dim // 2, dim)
RuntimeError: shape '[32, 2, 64, 4096]' is invalid for input of size 4194304
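For reference, Mistral-7B uses grouped-query attention (8 KV heads) and a 14336-wide MLP, so a default LLaMA-shaped config will not line up with the checkpoint. Below is a hypothetical config entry of the kind model.py's transformer_configs expects; the field names follow gpt-fast's ModelArgs and the values come from Mistral-7B-v0.1, so verify them against the model's config.json (an Open-Orca fine-tune with added tokens would additionally need vocab_size raised to 32002).

# Hypothetical transformer_configs entry for Mistral-7B.
transformer_configs = {
    "Mistral-7B": dict(
        n_layer=32, n_head=32, n_local_heads=8,  # 8 KV heads -> grouped-query attention
        dim=4096, intermediate_size=14336, vocab_size=32000,
    ),
}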
