Comments (3)
My latency varies between 20 and 40 seconds, and it’s a simple query without any historical conversation messages. Is there a way to further optimize my model’s execution speed without altering the hardware conditions?
Based on the model_path = "/home/ty/chatglm3-demo/chatglm3-6b" in your code, could you confirm whether you're running the script on WSL? Also, could you provide details about your machine's memory and the memory allocated to WSL?
Additionally, you can try wrapping the generation in with torch.inference_mode(): to see if there's any improvement.
Example code:
import time

import torch

# `model`, `tokenizer`, `CHATGLM_V3_PROMPT_FORMAT`, and `args` are assumed to be
# defined earlier in the script (model loading and argument parsing)
with torch.inference_mode():
    prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    st = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    output = model.generate(input_ids,
                            max_new_tokens=args.n_predict)
    end = time.time()
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f'Inference time: {end - st} s')
    print('-' * 20, 'Prompt', '-' * 20)
    print(prompt)
    print('-' * 20, 'Output', '-' * 20)
    print(output_str)
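If your model's config does set "use_cache": false, one way to follow the comment above is to pass the flag explicitly to generate (a minimal variant of the call above; everything else is unchanged):

    # explicitly enable KV-cache reuse when the model config disables it
    output = model.generate(input_ids,
                            use_cache=True,
                            max_new_tokens=args.n_predict)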
It's on my aipc host, I don't use WSL. I've tried your method, and the results are barely noticeable
Related Issues (20)
- New instructions about "Run Distributed QLoRA Fine-Tuning on Kubernetes" in MPI-Operator v1alpha1 and with kubectl method HOT 1
- No output when using Baichuan2-7B-Chat with 2k input and int4 on XPU HOT 8
- Failed to run Llama2-7B on Intel GPU HOT 2
- fail to run model when load low bits instead of load original for qwen HOT 1
- Failed to run Llama 2 inference on Flex 140 HOT 4
- Baichuan2-7B takes more memory than chatglm3-6B on MTL 16GB device, need to optimize VRAM of Baichuan2-7B HOT 4
- HuatuoGPT-7B need to optimize performance about First token latency (ms) and After token latency (ms/token) HOT 3
- HuatuoGPT-7B will self Q & A with history by TextIteratorStreamer HOT 1
- Error when executing "from bigdl.llm.langchain.llms import TransformersLLM" HOT 4
- Running minicpm failed HOT 4
- QWEN2 Model generate failed HOT 1
- Qwen1.5-7B wrong outputs with 1024 prompts HOT 13
- Installation of BigDL-LLM and missing file in wheel HOT 1
- Can not load Yuan2-2B GGUF FP16 model HOT 3
- text-generation-webui server.py - modifying extensions
- ChatGLM3 can not stop with stop words HOT 1
- issue about Qwen-7b on Arc A770 HOT 3
- Can Bigdl-LLM support Qwen-14B or Qwen-72B based multi-card of Arc A770? HOT 1
- When chatting with the model through the webui, it reports that XPU is not installed; even with CPU selected, the XPU problem still occurs HOT 1