
nm-vllm's People

Contributors

afeldman-nm, andy-neuma, beginlner, chenxu2048, dhuangnm, esmeetu, gesanqiu, hermitsun, hmellor, hongxiayang, jeanniefinks, liuxiaoxuanpku, mgoin, mspronesti, nikolaborisov, pcmoritz, robertgshaw2-neuralmagic, ronensc, sanster, simon-mo, tjtanaa, twaka, varun-sundar-rabindranath, woosukkwon, wrran, yard1, yunfeng-scale, zhaoyang-star, zhuohan123, zspo


nm-vllm's Issues

[Doc]: Are we capable of automatically converting GGUF models into Marlin, similar to our support for GPTQ?

πŸ“š The doc issue

Marlin quantization inference efficiency is truly remarkable, significantly outperforming AWQ and GPTQ in exploiting the acceleration of tensor parallelism, and its performance under concurrent requests is also far superior.
We already have an excellent feature that automatically converts GPTQ models to Marlin during deployment. Could a similar feature be applied to GGUF-quantized models?

Suggest a potential alternative/fix

No response

[Usage]:

Your current environment

python collect_env.py
Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU
Nvidia driver version: 552.22
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             20
On-line CPU(s) list:                0-19
Vendor ID:                          GenuineIntel
Model name:                         13th Gen Intel(R) Core(TM) i9-13900H
CPU family:                         6
Model:                              186
Thread(s) per core:                 2
Core(s) per socket:                 10
Socket(s):                          1
Stepping:                           2
BogoMIPS:                           5990.40
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          480 KiB (10 instances)
L1i cache:                          320 KiB (10 instances)
L2 cache:                           12.5 MiB (10 instances)
L3 cache:                           24 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.18.1                   pypi_0    pypi
[conda] torch                     2.1.2                    pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.2.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

Hello,

Trying to run the OpenAI-compatible inference server with: neuralmagic/Meta-Llama-3-8B-Instruct-FP8

Followed the instructions here:

https://github.com/neuralmagic/nm-vllm

Ran the model as such, after installing all of the dependencies in a fresh conda env:

(nm-vllm) unix@rog-zephyrus:/code/structure$ pip install nm-vllm[sparse]
(nm-vllm) unix@rog-zephyrus:/code/structure$ python -m vllm.entrypoints.openai.api_server --model neuralmagic/Meta-Llama-3-8B-Instruct-FP8 --sparsity sparse_w16a16
INFO 05-11 23:41:26 api_server.py:149] vLLM API server version 0.2.0

warnings.warn(
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/unix/miniconda3/envs/nm-vllm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 157, in
engine = AsyncLLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/unix/miniconda3/envs/nm-vllm/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 331, in from_engine_args
engine_configs = engine_args.create_engine_configs()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/unix/miniconda3/envs/nm-vllm/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 405, in create_engine_configs
model_config = ModelConfig(
^^^^^^^^^^^^
File "/home/unix/miniconda3/envs/nm-vllm/lib/python3.11/site-packages/vllm/config.py", line 133, in init
self._verify_quantization()
File "/home/unix/miniconda3/envs/nm-vllm/lib/python3.11/site-packages/vllm/config.py", line 234, in _verify_quantization
raise ValueError(
ValueError: Unknown quantization method: fp8. Must be one of ['awq', 'gptq', 'squeezellm', 'marlin'].

Am I doing something wrong here? I also tried to run the model in standard vLLM, v0.4.2. Performance was great, but about 30% of responses were bizarre, with many responses containing "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!".

How to get a sparse model?

Very interesting project! I just saw the example below. I am curious how to obtain a sparse model from a dense one. There are many sparsification methods, such as SparseGPT, DejaVu, SliceGPT, etc. Which methods does nm-vllm support?

from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",
    max_model_len=1024
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

[Misc]: Move from the `PYBIND11_MODULE` macro to the `TORCH_LIBRARY` macro for binding C++/CUDA kernels to Python

Anything you want to discuss about vllm.

Motivation

Currently vLLM uses the PYBIND11_MODULE macro to bind C++/CUDA to Python, with the binding code found in csrc/pybind.cpp. This means calls to these kernels bypass the torch dispatcher (more information on the torch dispatcher can be found here and here). While bypassing the torch dispatcher works, going through it has a few distinct advantages, namely:

  1. Better integration with the Pytorch profiler
  2. A more natural way to support CPU only inference or other hardware in the future

With regard to 1, at Neural Magic we are working on more in-depth profiling tools within vLLM using the PyTorch profiler. By going through the torch dispatcher (i.e. registering the C++/CUDA kernels with the TORCH_LIBRARY macro instead of PYBIND11_MODULE), we can provide richer traces, since the profiler can capture metadata (namely type and shape information) for the inputs to each operation (kernel). Below is an example of the traces we are generating (note: this is a work in progress):

name                                                         | cpu_time_us  | cuda_time_us | pct_cuda_... | trace                                                       
========================================================================================================================================================================
LlamaForCausalLM                                             |      9424.95 |     31087.00 |        93.80 |                                                             
|- LlamaModel                                                |      9403.02 |     31087.00 |        93.80 |                                                             
||- VocabParallelEmbedding(weight=bfloat16[32064, 4096])     |        93.30 |         7.00 |         0.02 |                                                             
|||- void at::native::(anonymous namespace)::indexSelectL... |         0.00 |         7.00 |         0.02 | index_select(bfloat16[32064, 4096], 0, int64[128]) <- emb...
||- LlamaDecoderLayer                                        |      1555.91 |       760.00 |         2.29 |                                                             
|||- RMSNorm(weight=bfloat16[4096])                          |       271.12 |         6.00 |         0.02 |                                                             
||||- void vllm::rms_norm_kernel<c10::BFloat16>(c10::BFlo... |         0.00 |         6.00 |         0.02 |                                                             
|||- LlamaAttention                                          |      1003.32 |       173.00 |         0.52 |                                                             
||||- QKVParallelLinear(weight=bfloat16[6144, 4096])         |       173.64 |        95.00 |         0.29 |                                                             
|||||- ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stage... |         0.00 |        95.00 |         0.29 | mm(bfloat16[128, 4096], bfloat16[4096, 6144]) <- matmul(b...
||||- RotaryEmbedding                                        |        19.37 |         4.00 |         0.01 |                                                             
|||||- void vllm::rotary_embedding_kernel<c10::BFloat16, ... |         0.00 |         4.00 |         0.01 |                                                             
||||- Attention                                              |       534.66 |        15.00 |         0.05 |                                                             
|||||- void vllm::reshape_and_cache_kernel<__nv_bfloat16,... |         0.00 |         6.00 |         0.02 |                                                             
|||||- void flash_fwd_kernel<Flash_fwd_kernel_traits<128,... |         0.00 |         9.00 |         0.03 | FlashAttnFunc(bfloat16[8, 16, 32, 128], bfloat16[8, 16, 8...
...

Under the final trace column we can see tensor type and shape information; however, this information is only available for TorchOp events (i.e. kernels registered using TORCH_LIBRARY). For example, flash_fwd_kernel and ampere_bf16_s16816gemm... have this shape and type information while vllm::reshape_and_cache_kernel does not, as the former two kernels go through the torch dispatcher while the latter does not.

With regard to 2, at Neural Magic we have ambitions to extend vLLM to support CPU inference, which will require dispatching to CPU or CUDA versions of the same kernel depending on where the tensors reside (this can apply to other hardware too, not just CPUs). The torch dispatcher does this automatically, alleviating the need for a chain of if statements.

Implementation

There appear to be two primary ways to register operations (kernels) with the torch dispatcher. The first is using C++ and the TORCH_LIBRARY macro mentioned in the motivation; an example of this can be found in the xformers repository, with an SpMM operation being declared here and implementations being bound for CUDA here and CPU here. The other way is via Python; xformers also has an example of this for the flash_fwd operation, with the operation declaration found here and the CUDA implementation bound here.

For the implementation, given that vLLM controls the Python-to-C++/CUDA bindings for the kernels in csrc, I think it would be cleaner to go with the TORCH_LIBRARY approach, as it wouldn't require much more boilerplate than the existing PYBIND11_MODULE.
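
To make the Python-side registration path concrete, below is a minimal sketch (not vLLM's actual binding code) of registering a custom op with the torch dispatcher via torch.library. The namespace and op name (myops::scaled_add) and the implementation are hypothetical; a CUDA kernel could be bound to the same schema under the "CUDA" dispatch key.

import torch

# Declare a new operator namespace and an operator schema inside it.
lib = torch.library.Library("myops", "DEF")
lib.define("scaled_add(Tensor a, Tensor b, float s) -> Tensor")

def scaled_add_cpu(a: torch.Tensor, b: torch.Tensor, s: float) -> torch.Tensor:
    # Reference CPU implementation; a CUDA kernel would be registered the
    # same way with lib.impl("scaled_add", cuda_fn, "CUDA").
    return a + s * b

lib.impl("scaled_add", scaled_add_cpu, "CPU")

# The op now goes through the torch dispatcher, so the PyTorch profiler can
# record input dtype/shape metadata for it as a TorchOp event.
out = torch.ops.myops.scaled_add(torch.ones(4), torch.ones(4), 2.0)
print(out)  # tensor([3., 3., 3., 3.])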

[Doc]: Is it possible to apply sparsity and quantization simultaneously?

πŸ“š The doc issue

I have read the examples in both the examples and examples-neuralmagic directories. I found examples demonstrating quantization techniques as well as examples demonstrating sparsity techniques. However, I haven't found any examples demonstrating the use of both at the same time. Can you confirm whether it's possible to apply both techniques simultaneously during deployment with the nm-vllm library?

Suggest a potential alternative/fix

No response

[Draft RFC]: Int8 Activation Quantization

Anything you want to discuss about vllm.

Summary

NB: This is a WiP draft RFC to be discussed here a bit before submitting as an issue upstream.

  • We (engineering at @neuralmagic) are working on support for int8 quantized activations.
  • This RFC is proposing an incremental approach to quantization, where the initial support for quantization will make minimal and local changes to the PyTorch model definitions. We propose swapping out Linear and Attention modules with their quantized counterparts without modifying the graphs around them. The upside to this will be quicker support for quantized models. The downside is that we will be quantizing the activations on the fly prior to computation.
  • To reduce the additional data movement from quantizing the activations on the fly, the activations will need to remain quantized throughout the graph, requiring more extensive and nonlocal modifications to the model definitions. We will be working on abstractions for the quantized model definitions to make adding support for new models as easy as possible.
  • Activation quantization will introduce additional elementwise operations to the model. To reduce the additional data movement of the activations from these operations, operator fusion will be needed. Rather than manually writing fused kernels for these, this RFC proposes committing to a torch.compile-based solution, to be explored in a future RFC.

Motivation and Scope

The high-level goal of this RFC is to speed up prefill by increasing the rate of computation using int8 tensor cores. We don't anticipate improving decode performance except at very large batch sizes, as inference time in that case is dominated by loading the weights and is already well served by weight-only quantization.

Int4 activation quantization is out of scope for this RFC, but we are interested in support for it. Successful int4 activation quantization (namely QuaRot) requires more work and more extensive modifications to the model definitions than int8 activation quantization, so it's natural to do this after int8 quantization.

For this RFC, we are focusing on support for Nvidia GPUs, and leaving other systems as out of scope.

Quantization Schemes and Zero Points

We are considering quantization of the form:
$$\widehat X = \lfloor \frac{X}{s_x} \rceil + z_x$$
In this case, $X$ is floating point, and $\widehat X$ will be its int8 quantized representation. $s_x$ is the scale or tensor of scales, and $z_x$ is a zero point.
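
To make the notation concrete, here is a small numpy sketch (an illustration only, not code from this RFC) of static per-tensor quantization in the form above, where the symmetric case corresponds to $z_x = 0$:

import numpy as np

def quantize_per_tensor(x: np.ndarray, s_x: float, z_x: int = 0) -> np.ndarray:
    # x_hat = round(x / s_x) + z_x, clamped to the int8 range.
    q = np.rint(x / s_x) + z_x
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_per_tensor(x_hat: np.ndarray, s_x: float, z_x: int = 0) -> np.ndarray:
    # Recovers an approximation of x: x ~= s_x * (x_hat - z_x).
    return s_x * (x_hat.astype(np.float32) - z_x)

x = np.random.randn(4, 8).astype(np.float32)
s = float(np.abs(x).max()) / 127.0          # simple max-based calibration of the scale
x_q = quantize_per_tensor(x, s)
print(np.abs(x - dequantize_per_tensor(x_q, s)).max())  # small quantization error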

There are several cases to consider, with performance and accuracy tradeoffs in each case.

  • Static vs dynamic quantization. The scales and zero points may be known ahead of time, or may instead be determined at runtime after inspecting the values of the tensor. Dynamic quantization will provide more accuracy, but requires multiple passes over the activation.
  • Asymmetric vs symmetric quantization. In symmetric quantization, $z_x$ is equal to 0. In asymmetric quantization $z_x$ is nonzero. When upconverting before quantization, $z_x$ can be applied as a shift prior to computation. If there is no upconversion, then an additional term (which this RFC will call a zero point correction term) can be computed and added to the output. This costs an additional $\mathcal O(n^2)$, either at runtime or computed offline.
  • Per-tensor vs per-token quantized activations. Generally per-token quantization has higher accuracy but requires more data movement (see the sketch after this list). The particular case of per-token and asymmetric is unfavorable as it increases the dimensionality of the zero point correction term.
  • Per-tensor vs per-column vs group quantized weights. Group quantization will require kernel work for the activation quantization case, so it is out of scope for this RFC. If weight quantization is symmetric, per-tensor or per-column quantization can be handled by scaling the output tensor of a linear layer, either by a scalar value in the case of per-tensor quantization or by a vector (with tensor expansion) in the case of per-column quantization.
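
As referenced in the list above, here is a small numpy sketch (an illustration only) contrasting dynamic per-tensor and per-token symmetric quantization of an activation matrix of shape [tokens, hidden]; per-tensor uses a single scale, while per-token computes one scale per row:

import numpy as np

def dynamic_symmetric_scales(x: np.ndarray, per_token: bool) -> np.ndarray:
    if per_token:
        # One scale per token (row): higher accuracy, more data movement.
        return np.abs(x).max(axis=1, keepdims=True) / 127.0
    # A single scale for the whole tensor.
    return np.array(np.abs(x).max() / 127.0)

x = np.random.randn(16, 4096).astype(np.float32)
for per_token in (False, True):
    s = dynamic_symmetric_scales(x, per_token)
    x_q = np.clip(np.rint(x / s), -127, 127).astype(np.int8)
    err = np.abs(x - s * x_q.astype(np.float32)).max()
    print(f"per_token={per_token}: max reconstruction error {err:.4f}")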

In light of these considerations, this RFC proposes supporting the following cases.

For the weights:

  • w8a8 case: Static, symmetric and either per-tensor or per-column.
  • w4a8 case: Static, either symmetric or asymmetric, and either per-tensor or per-column.

For the activations:

  • Static, asymmetric, per-tensor quantization.
  • Static, symmetric, per-token quantization.
  • Dynamic, symmetric, per-token quantization.

Zero Point Correction Terms

This section is a zoom-in on the linear algebra for the zero point correction terms, to further motivate some of the decisions made above on support for asymmetric vs symmetric and per-token vs per-tensor cases.

TODO: Consider switching to $\circ$ notation to generalize these eqns beyond per-tensor quantization.

Suppose we want to compute a quantized GEMM operation $C = AB$, where $A$ is $m \times k$, $B$ is $k \times n$, and $C$ is $m \times n$. In this setting, $A$ is the input activation matrix and $B$ is the weight matrix, known offline. We quantize the matrices as $C = s_C (\widehat C - z_C J_C)$, $B = s_B (\widehat B - z_B J_B)$, $A = s_A (\widehat A - z_A J_A)$.
This is per-tensor quantization where $s_X$ is the scale of matrix $X$, $z_X$ is the zero point of $X$, and $J_X$ is the conformal matrix of all ones. Here we are ignoring any rounding for quantization for simplicity. Let's furthermore assume that $z_C = 0$ and $s_A, s_B, s_C = 1$ just to get them out of the way -- the scales of all matrices and the output's zero point are pretty easy to deal with.

Let's substitute the above equations into $C = AB$ to see how to compute $\widehat C$.
$C = AB$
$\widehat C = (\widehat A - z_A J_A) (\widehat B - z_B J_B)$
$\widehat C = \widehat A \widehat B - z_A J_A \widehat B - z_B \widehat A J_B + z_A z_B J_A J_B$

A brief remark on each term:

  • $\widehat A \widehat B$: will be computed by our quantized GEMM kernel.
  • $z_A z_B J_A J_B$: If per-tensor quantization is used, every value of $z_A z_B J_A J_B$ is the same and depends only on $k$ and the zero points of $A$ and $B$.
  • $z_A J_A \widehat B$: A few remarks on this one.
    • This term can be computed offline, since $\widehat B$ is known ahead of time.
    • If per-tensor quantization is used, each row of this term is the same so it can be subtracted from the output via tensor expansion. If instead we are using per-token quantization, we lose the property that each row is the same, so we consider per-token asymmetric quantization to be an unfavorable case.
    • If we have static quantization and know $z_A$ in the Linear module's constructor, we can fully compute this term and possibly fold it into the bias if it exists. In that case, asymmetric activation quantization can be implemented at zero cost compared to the symmetric case.
    • $J_A \widehat B$ can be computed via a Reduce operation.
  • $z_B \widehat A J_B$: This term depends on the activation matrix, so must be computed at runtime if asymmetric weight quantization is used.
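
The decomposition above can be sanity-checked numerically. Below is a small numpy sketch (not part of the RFC) verifying that, with scales set to 1 and rounding ignored, $AB = \widehat A \widehat B - z_A J_A \widehat B - z_B \widehat A J_B + z_A z_B J_A J_B$ for per-tensor zero points:

import numpy as np

m, k, n = 4, 8, 5
z_A, z_B = 3, -2                              # hypothetical per-tensor zero points
A_hat = np.random.randint(-128, 128, (m, k)).astype(np.int64)
B_hat = np.random.randint(-128, 128, (k, n)).astype(np.int64)
J_A = np.ones((m, k), dtype=np.int64)
J_B = np.ones((k, n), dtype=np.int64)

A = A_hat - z_A * J_A                         # dequantized A (scale 1)
B = B_hat - z_B * J_B                         # dequantized B (scale 1)

lhs = A @ B
rhs = (A_hat @ B_hat                          # computed by the quantized GEMM kernel
       - z_A * (J_A @ B_hat)                  # computable offline; every row is identical
       - z_B * (A_hat @ J_B)                  # depends on the activations at runtime
       + z_A * z_B * (J_A @ J_B))             # constant: every entry equals k * z_A * z_B

assert np.array_equal(lhs, rhs)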

Sparsity benchmarks

Interesting project. I came across it looking at Marlin models on HF.

Do you have any benchmarks (performance and perplexity) for sparse models? I see one graph in the readme but it's not clear if that is token latency or throughput. And I'd love to see a perplexity curve at various levels of sparsity.

More than happy to do these tests myself but if you have these handy, they are definitely worth adding to the readme for passersby like me who are constantly looking to shave off a few ms for latency sensitive applications.

[Bug]: CUDA error: an illegal instruction was encountered

File "/usr/local/lib/python3.8/dist-packages/torch/cuda/init.py", line 783, in synchronize

return torch._C._cuda_synchronize()

πŸ› Describe the bug

Hello,
I am installing the mgoin/Meta-Llama-3-70B-Instruct-Marlin model on an H100 GPU.

I am using nm-vllm version 0.2.0 as the LLM engine.
My problem is:

File "/usr/local/lib/python3.8/dist-packages/torch/cuda/init.py", line 783, in synchronize
return torch._C._cuda_synchronize()

The error appears at this point; there seems to be an incompatibility between CUDA, Torch, and vLLM.

My error message:
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Library information I use:
torch 2.1.2
nm-vllm 0.2.0
filelock 3.14.0

I did not encounter any problems on an A100 GPU, but I am getting this error on an H100 GPU.

When I change package versions, I get other errors.

The model comes from huggingface.

I keep the model in the local cache so that Hugging Face does not change my CUDA and PyTorch versions, and I change my package versions manually. I have a dedicated environment for this.

Have you ever seen this error on an H100?

[Bug]: When running repo hello world: RuntimeError: CUDA error: an illegal instruction was encountered

Your current environment


Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-101-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 PCIe
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9334 32-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 1
Core(s) per socket:                 24
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           5399.98
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm arch_capabilities
Virtualization:                     AMD-V
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1.5 MiB (24 instances)
L1i cache:                          1.5 MiB (24 instances)
L2 cache:                           12 MiB (24 instances)
L3 cache:                           384 MiB (24 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.2.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-23    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

πŸ› Describe the bug

Running the repo's hello world example, I encountered an error:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/OpenHermes-2.5-Mistral-7B-marlin"
model = LLM(model_id, max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)

messages = [
    {"role": "user", "content": "What is synthetic data in machine learning?"},
]
formatted_prompt =  tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
INFO 04-13 12:07:28 config.py:217] The model is serialized in Marlin format. Using Marlin kernel.
INFO 04-13 12:07:28 llm_engine.py:74] Initializing an LLM engine (v0.2.0) with config: model='neuralmagic/OpenHermes-2.5-Mistral-7B-marlin', tokenizer='neuralmagic/OpenHermes-2.5-Mistral-7B-marlin', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=marlin, sparsity=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-13 12:07:29 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-13 12:07:29 selector.py:25] Using XFormers backend.
INFO 04-13 12:07:30 weight_utils.py:192] Using model weights format ['*.safetensors']
INFO 04-13 12:07:31 model_runner.py:106] Loading model weights took 3.8582 GB
INFO 04-13 12:07:32 gpu_executor.py:94] # GPU blocks: 34032, # CPU blocks: 2048
INFO 04-13 12:07:32 model_runner.py:793] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-13 12:07:32 model_runner.py:797] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Traceback (most recent call last):
  File "/home/ubuntu/experiment_00_marlin.py", line 5, in <module>
    model = LLM(model_id, max_model_len=4096)
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 121, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 198, in from_engine_args
    engine = cls(
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 112, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 40, in __init__
    self._init_cache()
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 107, in _init_cache
    self.driver_worker.warm_up_model()
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/worker/worker.py", line 167, in warm_up_model
    self.model_runner.capture_model(self.gpu_cache)
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 856, in capture_model
    graph_runner.capture(
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 917, in capture
    torch.cuda.synchronize()
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 783, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[Feature]: New model request: Llama 3 70B

πŸš€ The feature, motivation and pitch

There are a few Llama 3 70B GPTQ models available. Can we use those directly?

Alternatives

No response

Additional context

No response

[Feature]: Support Llama 3

πŸš€ The feature, motivation and pitch

Error: unknown layout for Llama 3

Alternatives

No response

Additional context

No response

[Doc]: Support Mixtral?

πŸ“š The doc issue

How effective is the acceleration for the Mixtral model?

Suggest a potential alternative/fix

No response
