Comments (3)
Hi all. I noticed that BigDL uses BigDL-Nano and ggml to accelerate int8/int4 computation. How can I invoke these APIs in LLMs such as LLaMA? Specifically, I want to accelerate the linear layers in the Hugging Face version of LLaMA (PyTorch-based).
We are not using Nano or ggml in bigdl-llm; see the examples at https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model
from bigdl.
Thanks, got it. By the way, is there a README on using BigDL's optimized int4/int8 quantized computation library, e.g. a use case of quantized matmul in the linear layers of models like LLaMA?
If you are using Hugging Face Transformers to load your LLaMA model, you can refer to the llama2 example here: https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2
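A minimal sketch of that path, assuming bigdl-llm is installed and using a placeholder checkpoint name (the linked example script may differ in detail):

```python
# Hedged sketch of the HF-Transformers path (placeholder model path;
# requires bigdl-llm and the model weights, so not runnable as-is here).
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
# load_in_4bit=True replaces the linear layers with 4-bit (sym_int4)
# equivalents at load time; the rest of the generate() workflow is
# ordinary Hugging Face Transformers code.
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = LlamaTokenizer.from_pretrained(model_path)
```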
If you are using customized code to load the LLaMA model, you can use optimize_model to optimize it. optimize_model can be applied to arbitrary PyTorch models to apply low-bit optimizations. Refer to this example for how to use it: https://github.com/intel-analytics/BigDL/blob/main/python/llm/example/CPU/PyTorch-Models/Model/llama2/generate.py#L49
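A hedged sketch of this second path (placeholder checkpoint name; assumes bigdl-llm is installed, so illustrative rather than runnable here):

```python
# Hedged sketch: optimize a model loaded with plain HuggingFace/PyTorch code.
from transformers import LlamaForCausalLM
from bigdl.llm import optimize_model

# Placeholder checkpoint; any torch.nn.Module-based model works the same way.
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# optimize_model defaults to sym_int4; other precisions can be selected
# via the low_bit argument, e.g. optimize_model(model, low_bit="sym_int8").
model = optimize_model(model)
```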
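For intuition about what a low-bit (sym_int4-style) optimization does to a linear layer's weights, here is a library-free illustration of symmetric 4-bit group quantization. This is a teaching sketch only, not bigdl-llm's actual kernel: real implementations use larger group sizes, pack two 4-bit codes per byte, and run optimized matmul routines.

```python
# Illustrative only: symmetric 4-bit ("sym_int4"-style) group quantization
# of linear-layer weights. Each group of weights shares one float scale,
# and each weight is stored as an integer code in [-7, 7].

GROUP = 4  # elements sharing one scale (real group sizes are larger, e.g. 64)

def quantize_sym_int4(weights):
    """Quantize a flat list of floats to int4 codes plus per-group scales."""
    qs, scales = [], []
    for i in range(0, len(weights), GROUP):
        group = weights[i:i + GROUP]
        amax = max(abs(w) for w in group) or 1.0  # avoid divide-by-zero
        scale = amax / 7.0  # symmetric int4 code range is [-7, 7]
        scales.append(scale)
        qs.extend(max(-7, min(7, round(w / scale))) for w in group)
    return qs, scales

def dequantize(qs, scales):
    """Recover approximate float weights from codes and scales."""
    return [q * scales[i // GROUP] for i, q in enumerate(qs)]

w = [0.12, -0.53, 0.07, 0.91, -0.02, 0.44, -0.66, 0.30]
q, s = quantize_sym_int4(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q)        # int4 codes, each in [-7, 7]
print(max_err)  # small per-weight reconstruction error
```

The 4-bit codes plus one scale per group are what get stored and fed to the quantized matmul, which is where the memory and speed savings of int4 inference come from.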