Giter VIP home page Giter VIP logo

chatglm_multigpucpu_eval's Introduction

ChatGLM_MultiGPUCPU_eval

简易实现ChatGLM单机调用多个计算设备(GPU、CPU)进行推理

🟠 在大多数情况下并不能加速推理,旨在使更低端的设备获得更高精度的推理、更多轮次的对话

推理

1.从仓库下载 MultiDevices.py

MultiDevices.py

2.加载模型

参考官方文档

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half()

3.调用函数

import MultiDevices

MultiDevices.GPU_precision = 'int8'
MultiDevices.embeddings = 'cpu'
MultiDevices.layers = {
                        'cuda:1': '1-12',
                        'cuda:0': '13-24',
                        'cpu': '25-28'
                    }
MultiDevices.final_layernorm = 'cpu'

model = MultiDevices.ConfigMultiDevices(model)

输出

word_embeddings -> cpu
layer 0 -> int8 -> cuda:0
layer 1 -> int8 -> cuda:0
layer 2 -> int8 -> cuda:0
layer 3 -> int8 -> cuda:0
layer 4 -> int8 -> cuda:0
layer 5 -> int8 -> cuda:0
layer 6 -> int8 -> cuda:0
layer 7 -> int8 -> cuda:0
layer 8 -> int8 -> cuda:0
layer 9 -> int8 -> cuda:0
layer 10 -> int8 -> cuda:0
layer 11 -> int8 -> cuda:0
layer 12 -> int8 -> cuda:1
layer 13 -> int8 -> cuda:1
layer 14 -> int8 -> cuda:1
layer 15 -> int8 -> cuda:1
layer 16 -> int8 -> cuda:1
layer 17 -> int8 -> cuda:1
layer 18 -> int8 -> cuda:1
layer 19 -> int8 -> cuda:1
layer 20 -> int8 -> cuda:1
layer 21 -> int8 -> cuda:1
layer 22 -> int8 -> cuda:1
layer 23 -> int8 -> cuda:1
layer 24 -> cpu
layer 25 -> cpu
layer 26 -> cpu
layer 27 -> cpu
final_layernorm -> cpu
lm_head -> cpu
hooked.

正常使用

model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题

函数说明

结构

参数 类型 说明
GPU_precision str 模型量化精度 fp16(默认),int8,int4
CPU_precision str CPU中的模型精度 fp32(默认),bf16
embeddings str embeddings 层使用的设备
layers object layers 各层(1~28)分别使用的设备
final_layernorm str final_layernorm 层使用的设备

已测试

请根据自身情况调节

FP16

MultiDevices.GPU_precision = 'fp16' # 或者不设置

8G GPU + 8G GPU

NVIDIA Tesla P4 + NVIDIA P104-100

MultiDevices.embeddings = 'cuda:0'
MultiDevices.layers={
                        'cuda:0': '1-14',
                        'cuda:1': '15-28'
                    }
MultiDevices.final_layernorm = 'cuda:1'

8G GPU + 8G GPU + CPU

NVIDIA Tesla P4 + NVIDIA P104-100

MultiDevices.embeddings = 'cpu',
MultiDevices.layers={
                        'cuda:0': '1-14',
                        'cuda:1': '15-28'
                    }
MultiDevices.final_layernorm = 'cuda:1'

INT8

MultiDevices.GPU_precision = 'int8'

8G GPU + CPU

NVIDIA Tesla P4

MultiDevices.embeddings = 'cpu',
MultiDevices.layers={
                        'cuda:0': '1-28',
                    }
MultiDevices.final_layernorm = 'cuda:0'

6G GPU + CPU

NVIDIA Tesla P4

MultiDevices.embeddings = 'cpu',
MultiDevices.layers={
                        'cuda:0': '1-24',
                        'cpu':'25-28'
                    }
MultiDevices.final_layernorm = 'cpu'

INT4

MultiDevices.GPU_precision = 'int4'

4G GPU + CPU

NVIDIA Tesla P4 4G (关闭above 4g)

MultiDevices.embeddings = 'cpu',
MultiDevices.layers={
                        'cuda:0': '1-24',
                        'cpu': '25-28'
                    }
MultiDevices.final_layernorm = 'cpu'

推荐使用已量化的int4模型,并确认CPU Kernel编译成功,此时MultiDevices.GPU_precision 应填 'fp16' 或 不填。

chatglm_multigpucpu_eval's People

Contributors

chaimevans avatar yizxiy avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.