Pure C++ implementation of inference for a range of models (from under 3B to over 300B parameters), for real-time chatting with RAG on your computer (CPU), based on @ggerganov's ggml.
| Supported Models | Download Quantized Models |
What's New:
- 2024-04-30: Phi3-mini 128k
- 2024-04-27: Phi3-mini 4k
- Accelerated memory-efficient CPU inference with int4/int8 quantization, optimized KV cache, and parallel computing;
- Use of OOP to address the similarities between different Transformer-based models;
- Streaming generation with typewriter effect;
- Continuous chatting (content length is virtually unlimited). Two methods are available: Restart and Shift. See the `--extending` options;
- Retrieval Augmented Generation (RAG) 🔥;
- LoRA;
- Python/JavaScript/C bindings, web demo, and more possibilities.
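The Restart and Shift methods for continuous chatting can be illustrated with a small sketch (hypothetical Python code, not the project's actual C++ implementation): once the token count exceeds the context window, Restart drops everything except a kept prefix (e.g. the system prompt), while Shift keeps the prefix and discards only the oldest tokens after it.

```python
# Illustrative sketch of the two context-extension strategies
# (hypothetical code, not chatllm.cpp's actual implementation).

def extend_context(tokens, max_len, method="shift", keep_prefix=0):
    """Trim a token list that has outgrown the context window."""
    if len(tokens) <= max_len:
        return tokens
    if method == "restart":
        # Drop everything and start over from the kept prefix.
        return tokens[:keep_prefix]
    if method == "shift":
        # Keep the prefix, drop the oldest tokens after it.
        overflow = len(tokens) - max_len
        return tokens[:keep_prefix] + tokens[keep_prefix + overflow:]
    raise ValueError(method)

tokens = list(range(10))  # pretend these are token ids
print(extend_context(tokens, 8, "shift", keep_prefix=2))
# -> [0, 1, 4, 5, 6, 7, 8, 9]
```

With Shift, generation continues seamlessly because recent context is preserved; with Restart, the conversation effectively begins anew from the prefix.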
Clone the ChatLLM.cpp repository into your local machine:

```sh
git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp
```

If you forgot the `--recursive` flag when cloning the repository, run the following command in the `chatllm.cpp` folder:

```sh
git submodule update --init --recursive
```
Some quantized models can be downloaded from here.
Install the dependencies of `convert.py`:

```sh
pip install -r requirements.txt
```
Use `convert.py` to transform models into quantized GGML format. For example, to convert the fp16 base model to a q8_0 (quantized int8) GGML model, run:

```sh
# For models such as ChatLLM-6B, ChatLLM2-6B, InternLM, LlaMA, LlaMA-2, Baichuan-2, etc.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin

# For some models such as CodeLlaMA, the model type should be provided via `-a`.
# Find the `-a ...` option for each model in `docs/models.md`.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a CodeLlaMA
```
Note: Only the HF format is supported; the format of the generated `.bin` files is different from the one (GGUF) used by llama.cpp.
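As background on what `-t q8_0` means: ggml's q8_0 format stores weights in blocks of 32 int8 values, with one scale factor per block. The following Python sketch shows the idea only (simplified; the real on-disk layout is defined in ggml, not here):

```python
# Simplified sketch of q8_0-style block quantization: each block of 32
# floats becomes int8 values plus one scale (not ggml's exact layout).

def quantize_q8_0(values, block_size=32):
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / 127.0 if amax else 0.0
        quants = [round(v / scale) if scale else 0 for v in block]
        blocks.append((scale, quants))
    return blocks

def dequantize_q8_0(blocks):
    return [scale * q for scale, quants in blocks for q in quants]

weights = [0.5, -1.0, 0.25, 0.125] * 8  # one block of 32 values
restored = dequantize_q8_0(quantize_q8_0(weights))
print(max(abs(a - b) for a, b in zip(weights, restored)))  # small quantization error
```

Storing one scale per small block keeps the quantization error bounded by the largest magnitude in that block, which is why int8 models lose little quality while using roughly a quarter of the memory of fp32.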
Compile the project using CMake:

```sh
cmake -B build

# On Linux or WSL:
cmake --build build -j

# On Windows with MSVC:
cmake --build build -j --config Release
```
Now you may chat with a quantized model by running:

```sh
./build/bin/main -m chatglm-ggml.bin                    # ChatGLM-6B
# Hello👋! I am the AI assistant ChatGLM-6B. Nice to meet you; feel free to ask me anything.
./build/bin/main -m llama2.bin --seed 100               # Llama-2-Chat-7B
# Hello! I'm here to help you with any questions or concerns ....
```
To run the model in interactive mode, add the `-i` flag. For example:

```sh
# On Windows:
.\build\bin\Release\main -m model.bin -i

# On Linux (or WSL):
rlwrap ./build/bin/main -m model.bin -i
```
In interactive mode, your chat history will serve as the context for the next-round conversation.
Run `./build/bin/main -h` to explore more options!
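Conceptually, interactive mode works by appending each turn to a running history that is fed back to the model as the next prompt. A toy sketch of that loop (hypothetical Python code with a stand-in `generate` function, not the project's actual C++ code):

```python
# Illustrative chat loop: the history accumulates and becomes the next
# prompt (hypothetical sketch; chatllm.cpp does this in C++ with a real model).

def build_prompt(history):
    return "\n".join(f"{role}: {text}" for role, text in history)

def chat_turn(history, user_input, generate):
    history.append(("User", user_input))
    reply = generate(build_prompt(history))  # the model sees the full history
    history.append(("Assistant", reply))
    return reply

history = []
echo = lambda prompt: f"(reply to {prompt.count('User:')} user turns)"
chat_turn(history, "Hello", echo)
chat_turn(history, "Tell me more", echo)
print(build_prompt(history))
```

Because the whole history is replayed as context, later answers can refer back to earlier turns; this is also why the context-extension options above matter for long conversations.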
- This project started as a refactoring of ChatGLM.cpp, without which this project would not have been possible.
- Thanks to those who have released their model sources and checkpoints.
This is a hobby project of mine for learning DL & GGML, and it is under active development. PRs for features won't be accepted, while PRs for bug fixes are warmly welcome.