Comments (5)
Because Llama uses FlashAttention by default, and your devices are not sm75/sm8x/sm90 GPU architectures. In my experience, the 2080 Ti and A100 work, but the V100 can't.
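A quick way to check which category your card falls into (a minimal sketch; supports_flash_attention is an illustrative helper name, not part of any library):

import torch

def supports_flash_attention(device: int = 0) -> bool:
    # FlashAttention's fused kernel needs Turing (sm75), Ampere (sm8x),
    # or Hopper (sm90); Volta (sm70, e.g. V100) predates all of these.
    major, minor = torch.cuda.get_device_capability(device)
    return (major, minor) >= (7, 5)

print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
# e.g. Tesla V100-SXM2-16GB (7, 0)
print("FlashAttention supported:", supports_flash_attention(0))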
@Wen1163204547, thanks for your help!
@Jblauvs, I will add a check to see if the GPU architecture is supported before importing flash attention.
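A minimal sketch of what such a guard might look like (HAS_FLASH_ATTN and the (7, 5) threshold are illustrative choices, not the actual text-generation-inference implementation; flash_attn_cuda is the extension module that fails in the traceback below):

import torch

HAS_FLASH_ATTN = False
if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (7, 5):
    try:
        import flash_attn_cuda  # noqa: F401
        HAS_FLASH_ATTN = True
    except ImportError:
        pass

# Model code can then branch on HAS_FLASH_ATTN instead of crashing inside
# flash_attn_cuda.fwd at runtime.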
Full traceback:
2023-04-18T19:49:59.869268Z ERROR shard-manager: text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params) # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 58, in serve
    server.serve(model_id, revision, sharded, quantize, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 135, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 46, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 278, in generate_token
    out, present = self.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 262, in forward
    return self.model.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 676, in forward
    hidden_states, present = self.gpt_neox(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 614, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 460, in forward
    attn_output = self.attention(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 324, in forward
    flash_attn_cuda.fwd(
RuntimeError: Expected is_sm90 || is_sm8x || is_sm75 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
Makes perfect sense, as I'm using older V100s. Thanks all!
Hi @Wen1163204547,
Is there any method to support the V100?
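One generic workaround on pre-sm75 GPUs such as the V100 is to fall back to a standard attention implementation. Below is a minimal standalone sketch using PyTorch's built-in torch.nn.functional.scaled_dot_product_attention (available since PyTorch 2.0; its default backends include a math fallback that runs on any CUDA GPU). This is an illustration of the general technique, not TGI's actual code path:

import torch
import torch.nn.functional as F

def attention(q, k, v, causal: bool = True):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

# Runs on sm70 hardware where flash_attn_cuda.fwd raises the error above,
# though without FlashAttention's speed and memory savings.
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
print(attention(q, k, v).shape)  # torch.Size([1, 8, 128, 64])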