Comments (10)
If nvcc --version doesn't work, I'm not sure the problem lies in TGI, unfortunately; it seems to be linked to an issue with your setup 😕
Hey @muhammadbaqir1327, thanks for your report! Do you get the issue even with the newest container?
That would change your command as follows:
model=codellama/CodeLlama-13b-Instruct-hf
volume=$PWD/data
- docker run --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --quantize eetq
+ docker run --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model --quantize eetq
Yes, I have already tried it.
@LysandreJik
Ok, thanks! Let's try to see what's going on. This seems like a setup/CUDA issue, and I don't see you passing the GPUs to the docker image. Could you try adding a --gpus all flag?
So this command:
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model --quantize eetq
Yes, I have also tried that command. It did not work for me either.
I also think it is related to the CUDA installation. I tried running nvcc --version, but it said the command is not available.
You can see a similar issue at this link: https://forums.developer.nvidia.com/t/nvcc-command-not-found-and-unable-to-install-nvidia-cuda-toolkit-in-the-jetpack-6/275486
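As a side note, nvcc is only the CUDA compiler and is not needed at inference time; what matters is whether the container can see the GPU driver. A minimal sketch to check that from inside the container (assuming the PyTorch install the TGI image ships under /opt/conda):

import torch

# A missing nvcc is fine; inference needs the driver/runtime, not the compiler.
# False here means the container cannot see the GPU, e.g. `--gpus all` was
# not passed or the NVIDIA Container Toolkit is not installed on the host.
print(torch.cuda.is_available())
print(torch.cuda.device_count())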
> Ok, thanks! Let's try to see what's going on. This seems like a setup/CUDA issue, and I don't see you passing the GPUs to the docker image. Could you try adding a --gpus all flag? So this command:
> docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model --quantize eetq
Running this command produced the following logs:
2024-06-14T15:27:13.274448Z INFO text_generation_launcher: Args {
    model_id: "codellama/CodeLlama-13b-Instruct-hf",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: Some(
        Eetq,
    ),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "c9facdfbc83e",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
}
2024-06-14T15:27:13.274521Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-06-14T15:27:14.789216Z INFO text_generation_launcher: Model supports up to 16384 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=16434 --max-total-tokens=16384 --max-input-tokens=16383`.
2024-06-14T15:27:14.789235Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-06-14T15:27:14.789240Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-06-14T15:27:14.789244Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-06-14T15:27:14.789248Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-14T15:27:14.789408Z INFO download: text_generation_launcher: Starting download process.
2024-06-14T15:27:17.653212Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-06-14T15:27:18.095517Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-14T15:27:18.095844Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-14T15:27:23.081981Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params) # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 560, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 79, in __init__
    weights = Weights(filenames, device, dtype, process_group=self.process_group)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 24, in __init__
    with safe_open(filename, framework="pytorch") as f:
safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization
2024-06-14T15:27:23.607129Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
The class this function is called from is 'LlamaTokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
  warnings.warn(
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 560, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 79, in __init__
    weights = Weights(filenames, device, dtype, process_group=self.process_group)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 24, in __init__
    with safe_open(filename, framework="pytorch") as f:
safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization
rank=0
2024-06-14T15:27:23.706513Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-14T15:27:23.706536Z INFO text_generation_launcher: Shutting down shards
One point I forgot to mention: while running docker run, I was getting an hf_transfer error. After searching for solutions, I fixed it by adding the HF_HUB_ENABLE_HF_TRANSFER=0 variable to the command, like this:
docker run --env HF_HUB_ENABLE_HF_TRANSFER=0 ...
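A guess, not something confirmed in the thread: if that hf_transfer error interrupted an earlier download, the file left in $PWD/data could be truncated, which would also explain the InvalidHeaderDeserialization below. A sketch to force a clean re-download of a suspect file (the index filename assumes the standard sharded-safetensors convention):

import os

# Same workaround as the --env flag above; it must be set before
# huggingface_hub is imported, since the library reads it at import time.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

from huggingface_hub import hf_hub_download

# force_download=True replaces a possibly truncated cached file.
path = hf_hub_download(
    repo_id="codellama/CodeLlama-13b-Instruct-hf",
    filename="model.safetensors.index.json",
    force_download=True,
)
print(path)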
Hmmm, these are different problems. The safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization points to an error with the model you have loaded.
Is it possible for you to load it in Python directly using the safetensors library?
from safetensors import safe_open
from huggingface_hub import hf_hub_download
import torch  # required at runtime by framework="pt"

model_id = "codellama/CodeLlama-13b-Instruct-hf"

# Download one weights file and try to read it back tensor by tensor.
model = hf_hub_download(repo_id=model_id, filename="model.safetensors")
tensors = {}
with safe_open(model, framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)
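One caveat for a sharded checkpoint like this 13B model: there may be no single model.safetensors file. Here is a sketch that instead validates every shard listed in the index file (assuming the standard model.safetensors.index.json layout); safe_open parses the header on entry, so a corrupt shard fails right there with the same InvalidHeaderDeserialization:

import json
from huggingface_hub import hf_hub_download
from safetensors import safe_open

model_id = "codellama/CodeLlama-13b-Instruct-hf"

# The index maps each tensor name to the shard file that stores it.
index_path = hf_hub_download(repo_id=model_id, filename="model.safetensors.index.json")
with open(index_path) as f:
    shard_files = sorted(set(json.load(f)["weight_map"].values()))

# Opening a shard parses its header, which is enough to detect corruption.
for shard in shard_files:
    path = hf_hub_download(repo_id=model_id, filename=shard)
    with safe_open(path, framework="pt", device="cpu") as f:
        print(shard, "->", len(f.keys()), "tensors, header OK")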
This issue,
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
was due to not passing the --gpus flag in the docker command.
As discussed here: https://stackoverflow.com/questions/54249577/importerror-libcuda-so-1-cannot-open-shared-object-file#comment136144707_68587460
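For anyone hitting the same error: libcuda.so.1 belongs to the host NVIDIA driver, and the NVIDIA Container Toolkit mounts it into the container only when --gpus is passed. A minimal sketch to verify from inside the container:

import ctypes

# Attempts to dlopen the driver library; this raises OSError with the same
# "cannot open shared object file" message when `--gpus` was not passed.
ctypes.CDLL("libcuda.so.1")
print("libcuda.so.1 loaded OK")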
Glad you could get it resolved!