Comments (5)
GPT4All doesn't support GPU acceleration. Will add support for models like Llama, which can do this
from embedai.
I was able to get the GPU working with this Llama model, ggml-vic13b-q5_1.bin, using a manual workaround.
# Download the ggml-vic13b-q5_1.bin model and place it in privateGPT/server/models/
# Edit privateGPT.py: comment out the GPT4All model and add a LlamaCpp model instead.
# Adjust n_gpu_layers to suit your Nvidia GPU; the 13B model has 40 layers, so 40 is the
# maximum, and offloading all of them uses about 9 GB of VRAM.

# Imports needed by this snippet (privateGPT.py will already have most of these):
import os
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

def load_model():
    filename = 'ggml-vic13b-q5_1.bin'  # name of the downloaded model file
    models_folder = 'models'  # folder inside the Flask app root
    file_path = f'{models_folder}/{filename}'
    if os.path.exists(file_path):
        global llm
        callbacks = [StreamingStdOutCallbackHandler()]
        # llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
        # model_n_ctx is read from .env (see below)
        llm = LlamaCpp(model_path=file_path, n_ctx=model_n_ctx, n_gpu_layers=40,
                       callbacks=callbacks, verbose=False)
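If the full 40-layer offload doesn't fit on your card, a partial offload also works; that is general llama.cpp behaviour rather than part of the original steps, and any layers you don't offload simply stay on the CPU. A small example, reusing the names from the snippet above:

        # Example for a smaller GPU: offload only 20 of the 40 layers; the rest run on the CPU.
        llm = LlamaCpp(model_path=file_path, n_ctx=model_n_ctx, n_gpu_layers=20,
                       callbacks=callbacks, verbose=False)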
# Edit privateGPT/server/.env and update it as follows
PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/ggml-vic13b-q5_1.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000
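For reference, here is a minimal sketch of how these .env values typically reach the code above; privateGPT already does something equivalent with python-dotenv, and the exact variable names below are assumptions for illustration:

import os
from dotenv import load_dotenv

load_dotenv()  # reads privateGPT/server/.env

model_type = os.environ.get('MODEL_TYPE')                 # 'LlamaCpp'
model_path = os.environ.get('MODEL_PATH')                 # 'models/ggml-vic13b-q5_1.bin'
model_n_ctx = int(os.environ.get('MODEL_N_CTX'))          # 1000
embeddings_model_name = os.environ.get('EMBEDDINGS_MODEL_NAME')
persist_directory = os.environ.get('PERSIST_DIRECTORY')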
# If using a conda environment, install the CUDA toolkit
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit
# Remove and reinstall llama-cpp-python with the environment variables below set
# (Linux uses "export"; on Windows use "set" instead)
pip uninstall llama-cpp-python
export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
export FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir
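Before going back to privateGPT, you can sanity-check the rebuilt wheel directly. This is a minimal sketch (not part of the original steps) that assumes the model is already in models/; if the CUDA build worked, the same cublas offload lines shown further below are printed while the model loads:

from llama_cpp import Llama

# With a cuBLAS build, loading a model with n_gpu_layers > 0 prints
# "llama_model_load_internal: [cublas] offloading 40 layers to GPU" to stderr.
llm = Llama(model_path='models/ggml-vic13b-q5_1.bin', n_ctx=1000, n_gpu_layers=40)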
Run python privateGPT.py from the privateGPT/server/ directory.
You should see the following lines in the output as the model loads:
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9076 MB
from embedai.
Hi, thanks for your info.
But when I followed your steps on Windows, I got this error:
Could not load Llama model from path: D:/code/privateGPT/server/models/ggml-vic13b-q5_1.bin. Received error (type=value_error)
Any idea about this? Thanks.
from embedai.
Hi,
I followed the instructions but it looks like it's still using the CPU:
(venPrivateGPT) (base) alp2080@alp2080:~/data/dProjects/privateGPT/server$ python privateGPT.py
/data/dProjects/privateGPT/server/privateGPT.py:1: DeprecationWarning: 'flask.Markup' is deprecated and will be removed in Flask 2.4. Import 'markupsafe.Markup' instead.
from flask import Flask,jsonify, render_template, flash, redirect, url_for, Markup, request
llama.cpp: loading model from models/ggml-vic13b-q5_1.bin
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1000
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 11359.05 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 781.25 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
LLM0 LlamaCpp
Params: {'model_path': 'models/ggml-vic13b-q5_1.bin', 'suffix': None, 'max_tokens': 256, 'temperature': 0.8, 'top_p': 0.95, 'logprobs': None, 'echo': False, 'stop_sequences': [], 'repeat_penalty': 1.1, 'top_k': 40}
- Serving Flask app 'privateGPT'
- Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
- Running on all addresses (0.0.0.0)
- Running on http://127.0.0.1:5000
- Running on http://192.168.5.110:5000
Press CTRL+C to quit
Loading documents from source_documents
from embedai.
I tried this as well and it looks like it's still using the CPU. Interesting. If anyone can suggest why it's not working with the GPU, please let me know.
from embedai.