Comments (3)
My latency varies between 20 and 40 seconds, and it’s a simple query without any historical conversation messages. Is there a way to further optimize my model’s execution speed without altering the hardware conditions?
Based on the model_path = "/home/ty/chatglm3-demo/chatglm3-6b" in your code, could you confirm whether you're running the script on WSL? Also, could you provide details about your machine's memory and the memory allocated to WSL?
Additionally, you can try wrapping the generation in with torch.inference_mode(): to see if there's any improvement.
Example code:
import time

import torch

# `model`, `tokenizer`, `CHATGLM_V3_PROMPT_FORMAT`, and `args` are assumed to be
# defined earlier in the script (model loading and argument parsing)
with torch.inference_mode():
    prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    st = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    output = model.generate(input_ids,
                            max_new_tokens=args.n_predict)
    end = time.time()
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f'Inference time: {end - st} s')
    print('-' * 20, 'Prompt', '-' * 20)
    print(prompt)
    print('-' * 20, 'Output', '-' * 20)
    print(output_str)
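If your model's config does set "use_cache": false, one way to follow the comment above is to pass the flag explicitly to generate (a minimal variant of the call above; everything else is unchanged):

    # explicitly enable KV-cache reuse when the model config disables it
    output = model.generate(input_ids,
                            use_cache=True,
                            max_new_tokens=args.n_predict)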
It's on my aipc host, I don't use WSL. I've tried your method, and the results are barely noticeable
Related Issues (20)
- New instructions about "Run Distributed QLoRA Fine-Tuning on Kubernetes" in MPI-Operator v1alpha1 and with kubectl method HOT 1
- No output when using Baichuan2-7B-Chat with 2k input and int4 on XPU HOT 8
- Failed to run Llama2-7B on Intel GPU HOT 2
- fail to run model when load low bits instead of load original for qwen HOT 1
- Failed to run Llama 2 inference on Flex 140 HOT 4
- Baichuan2-7B takes more memory than chatglm3-6B on MTL 16GB device, need to optimize VRAM of Baichuan2-7B HOT 4
- HuatuoGPT-7B need to optimize performance about First token latency (ms) and After token latency (ms/token) HOT 3
- HuatuoGPT-7B will self Q & A with history by TextIteratorStreamer HOT 1
- Error when executing "from bigdl.llm.langchain.llms import TransformersLLM" HOT 4
- Running minicpm failed HOT 4
- QWEN2 Model generate failed HOT 1
- Qwen1.5-7B wrong outputs with 1024 prompts HOT 13
- Installation of BigDL-LLM and missing file in wheel HOT 1
- Can not load Yuan2-2B GGUF FP16 model HOT 3
- text-generation-webui server.py - modifying extensions
- ChatGLM3 can not stop with stop words HOT 1
- issue about Qwen-7b on Arc A770 HOT 3
- Can Bigdl-LLM support Qwen-14B or Qwen-72B based multi-card of Arc A770? HOT 1
- When chatting with the model through the webui, it reports that XPU is not installed; even with CPU selected, the XPU problem still occurs HOT 1