Comments (4)
from vllm.
Hi @zhangxy1234, can you confirm something for me -- what is the max context length supported by your draft model?
The draft model's max context length is 2048 and the base model's is 4096.
When tp = 1, generation stops at 2048 as expected.
But when tp > 1, it does not stop at 2048 and instead raises this error:
2024-06-15 10:58:41.179 | CRITICAL | vllm.worker.worker:_execute_model_non_driver:303 - data {'num_seq_groups': 1, 'blocks_to_swap_in': tensor([], size=(0, 2), dtype=torch.int64), 'blocks_to_swap_out': tensor([], size=(0, 2), dtype=torch.int64), 'blocks_to_copy': tensor([], device='cuda:1', size=(0, 2), dtype=torch.int64)}
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.180 | INFO | vllm.worker.worker:cache_swap:220 - cache_swap blocks_to_swap_in tensor([], size=(0, 2), dtype=torch.int64)
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.193 | CRITICAL | vllm.worker.worker:_execute_model_non_driver:303 - data {'num_lookahead_slots': 5, 'disable_all_speculation': False}
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.193 | INFO | vllm.worker.worker:cache_swap:220 - cache_swap blocks_to_swap_in None
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.193 | ERROR | vllm.worker.worker_base:execute_method:148 - Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=7582) Traceback (most recent call last):
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/_private/workers/default_worker.py", line 289, in
(RayWorkerWrapper pid=7582) worker.main_loop()
(RayWorkerWrapper pid=7582) │ └ <function Worker.main_loop at 0x7fd2a5357040>
(RayWorkerWrapper pid=7582) └ <ray._private.worker.Worker object at 0x7fd2a5350670>
(RayWorkerWrapper pid=7582) File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/_private/worker.py", line 876, in main_loop
(RayWorkerWrapper pid=7582) self.core_worker.run_task_loop()
(RayWorkerWrapper pid=7582) │ │ └ <method 'run_task_loop' of 'ray._raylet.CoreWorker' objects>
(RayWorkerWrapper pid=7582) │ └ <ray._raylet.CoreWorker object at 0x7fd2a42f5220>
(RayWorkerWrapper pid=7582) └ <ray._private.worker.Worker object at 0x7fd2a5350670>
(RayWorkerWrapper pid=7582) File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/_private/function_manager.py", line 691, in actor_method_executor
(RayWorkerWrapper pid=7582) return method(__ray_actor, *args, **kwargs)
(RayWorkerWrapper pid=7582) │ │ └ {}
(RayWorkerWrapper pid=7582) │ └ ('start_worker_execution_loop',)
(RayWorkerWrapper pid=7582) └ <function WorkerWrapperBase.execute_method at 0x7fd2045aaa60>
(RayWorkerWrapper pid=7582) File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
(RayWorkerWrapper pid=7582) return method(self, *_args, **_kwargs)
(RayWorkerWrapper pid=7582) │ │ │ └ {}
(RayWorkerWrapper pid=7582) │ │ └ ('start_worker_execution_loop',)
(RayWorkerWrapper pid=7582) │ └ <vllm.executor.ray_utils.RayWorkerWrapper object at 0x7fd2045ab760>
(RayWorkerWrapper pid=7582) └ <function WorkerWrapperBase.execute_method at 0x7fd204720820>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) > File "vllm-main/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=7582) return executor(*args, **kwargs)
(RayWorkerWrapper pid=7582) │ │ └ {}
(RayWorkerWrapper pid=7582) │ └ ()
(RayWorkerWrapper pid=7582) └ <bound method SpecDecodeWorker.start_worker_execution_loop of <vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at...
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=7582) return func(*args, **kwargs)
(RayWorkerWrapper pid=7582) │ │ └ {}
(RayWorkerWrapper pid=7582) │ └ (<vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at 0x7fa40838c6a0>,)
(RayWorkerWrapper pid=7582) └ <function SpecDecodeWorker.start_worker_execution_loop at 0x7fa4083899d0>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "vllm-main/vllm/spec_decode/spec_decode_worker.py", line 300, in start_worker_execution_loop
(RayWorkerWrapper pid=7582) while self._run_non_driver_rank():
(RayWorkerWrapper pid=7582) │ └ <function SpecDecodeWorker._run_non_driver_rank at 0x7fa408389d30>
(RayWorkerWrapper pid=7582) └ <vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at 0x7fa40838c6a0>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "vllm-main/vllm/spec_decode/spec_decode_worker.py", line 369, in _run_non_driver_rank
(RayWorkerWrapper pid=7582) self.proposer_worker.execute_model()
(RayWorkerWrapper pid=7582) │ │ └ <function Worker.execute_model at 0x7fa4083863a0>
(RayWorkerWrapper pid=7582) │ └ <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582) └ <vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at 0x7fa40838c6a0>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=7582) return func(*args, **kwargs)
(RayWorkerWrapper pid=7582) │ │ └ {}
(RayWorkerWrapper pid=7582) │ └ (<vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>,)
(RayWorkerWrapper pid=7582) └ <function Worker.execute_model at 0x7fa408386310>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "vllm-main/vllm/worker/worker.py", line 236, in execute_model
(RayWorkerWrapper pid=7582) self._execute_model_non_driver()
(RayWorkerWrapper pid=7582) │ └ <function Worker._execute_model_non_driver at 0x7fa408386550>
(RayWorkerWrapper pid=7582) └ <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "vllm-main/vllm/worker/worker.py", line 311, in _execute_model_non_driver
(RayWorkerWrapper pid=7582) self.cache_swap(blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy)
(RayWorkerWrapper pid=7582) │ │ │ │ └ None
(RayWorkerWrapper pid=7582) │ │ │ └ None
(RayWorkerWrapper pid=7582) │ │ └ None
(RayWorkerWrapper pid=7582) │ └ <function Worker.cache_swap at 0x7fa408386280>
(RayWorkerWrapper pid=7582) └ <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "vllm-main/vllm/worker/worker.py", line 223, in cache_swap
(RayWorkerWrapper pid=7582) if blocks_to_swap_in.numel() > 0:
(RayWorkerWrapper pid=7582) └ None
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) AttributeError: 'NoneType' object has no attribute 'numel'
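The traceback shows the non-driver rank receiving `None` for `blocks_to_swap_in` where the driver builds an empty `(0, 2)` int64 tensor, so the `blocks_to_swap_in.numel() > 0` check at worker.py:223 blows up. A minimal sketch of a None-tolerant version of that check is below; `should_swap` and `FakeTensor` are hypothetical names for illustration, and this is the general idea of a guard, not the patch vLLM actually shipped:

```python
def should_swap(blocks_to_swap_in):
    # None-safe version of the failing check ("if blocks_to_swap_in.numel() > 0"):
    # on tp > 1 the non-driver rank can receive None where the driver
    # builds an empty (0, 2) tensor, so treat None as "nothing to swap".
    return blocks_to_swap_in is not None and blocks_to_swap_in.numel() > 0


class FakeTensor:
    """Stand-in for a torch tensor, exposing only numel() for the demo."""

    def __init__(self, n):
        self._n = n

    def numel(self):
        return self._n
```

With this guard, `should_swap(None)` returns `False` instead of raising `AttributeError: 'NoneType' object has no attribute 'numel'`.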
@zhangxy1234 could you confirm whether you still encounter this error with the latest version of vLLM (0.5.3.post1)?
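For anyone retesting, the configuration under discussion might look roughly like the following. The model paths are placeholders; `num_speculative_tokens=5` matches the `num_lookahead_slots: 5` in the log, and `tensor_parallel_size=2` exercises the tp > 1 path that fails. This is a repro sketch requiring multiple GPUs, not a runnable unit test:

```python
from vllm import LLM, SamplingParams

# Placeholder model paths -- substitute the 4096-context base model and
# the 2048-context draft model from the report.
llm = LLM(
    model="path/to/base-model",               # max context 4096
    speculative_model="path/to/draft-model",  # max context 2048
    num_speculative_tokens=5,   # matches num_lookahead_slots: 5 in the log
    tensor_parallel_size=2,     # tp > 1 is what triggers the error
    use_v2_block_manager=True,  # required for speculative decoding
)

outputs = llm.generate(
    ["Hello"],
    SamplingParams(max_tokens=3000),  # push generation past the 2048 draft limit
)
```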