Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-101-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 PCIe
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9334 32-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 1
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
BogoMIPS: 5399.98
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm arch_capabilities
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 1.5 MiB (24 instances)
L1i cache: 1.5 MiB (24 instances)
L2 cache: 12 MiB (24 instances)
L3 cache: 384 MiB (24 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collectROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.2.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-23 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_id = "neuralmagic/OpenHermes-2.5-Mistral-7B-marlin"
model = LLM(model_id, max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)
messages = [
{"role": "user", "content": "What is synthetic data in machine learning?"},
]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
INFO 04-13 12:07:28 config.py:217] The model is serialized in Marlin format. Using Marlin kernel.
INFO 04-13 12:07:28 llm_engine.py:74] Initializing an LLM engine (v0.2.0) with config: model='neuralmagic/OpenHermes-2.5-Mistral-7B-marlin', tokenizer='neuralmagic/OpenHermes-2.5-Mistral-7B-marlin', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=marlin, sparsity=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-13 12:07:29 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-13 12:07:29 selector.py:25] Using XFormers backend.
INFO 04-13 12:07:30 weight_utils.py:192] Using model weights format ['*.safetensors']
INFO 04-13 12:07:31 model_runner.py:106] Loading model weights took 3.8582 GB
INFO 04-13 12:07:32 gpu_executor.py:94] # GPU blocks: 34032, # CPU blocks: 2048
INFO 04-13 12:07:32 model_runner.py:793] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-13 12:07:32 model_runner.py:797] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Traceback (most recent call last):
File "/home/ubuntu/experiment_00_marlin.py", line 5, in <module>
model = LLM(model_id, max_model_len=4096)
File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 121, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 198, in from_engine_args
engine = cls(
File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 112, in __init__
self.model_executor = executor_class(model_config, cache_config,
File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 40, in __init__
self._init_cache()
File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 107, in _init_cache
self.driver_worker.warm_up_model()
File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/worker/worker.py", line 167, in warm_up_model
self.model_runner.capture_model(self.gpu_cache)
File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 856, in capture_model
graph_runner.capture(
File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 917, in capture
torch.cuda.synchronize()
File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 783, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.