Simple command-line chat program for GPT-J and LLaMA models, written in C++. Based on llama.cpp with bindings from llmodel-c.
Warning: very early progress; might have bugs.
You need to download a GPT-J model first. Here are direct links to models:
- The default version is v1.0: ggml-gpt4all-j.bin
- At the time of writing, the newest is v1.3-groovy: ggml-gpt4all-j-v1.3-groovy.bin
They're around 3.8 GB each. The chat program loads the model into RAM at runtime, so you need enough free memory to run it. You can get more details on GPT-J models from gpt4all.io or the nomic-ai/gpt4all GitHub repository.
Alternatively, you can download a LLaMA model instead. The original weights are released for research purposes, and you can apply for access here. Below are direct links to derived models:
- Vicuna 7b v1.1: ggml-vicuna-7b-1.1-q4_2.bin
- Vicuna 13b v1.1: ggml-vicuna-13b-1.1-q4_2.bin
- GPT4All l13b-snoozy: ggml-gpt4all-l13b-snoozy.bin
The LLaMA models are quite large: the 7B-parameter versions are around 4.2 GB and the 13B-parameter versions around 8.2 GB each. The chat program loads the model into RAM at runtime, so you need enough free memory to run it. You can get more details on LLaMA models from the whitepaper or the Meta AI website.
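If you prefer downloading from the command line, a wget one-liner works. This is only a sketch: the URL below is a placeholder, so substitute the direct link of whichever model you picked from the lists above:

# placeholder URL: replace with the direct link to the model you chose
wget -O ./models/ggml-gpt4all-j.bin "https://example.com/ggml-gpt4all-j.bin"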
git clone --recurse-submodules https://github.com/kuvaus/LlamaGPTJ-chat
cd LlamaGPTJ-chat
On Windows, you need MinGW or an equivalent toolchain installed.
mkdir build
cd build
cmake .. -G "MinGW Makefiles"
cmake --build . --parallel
On Linux/macOS it should work out of the box.
mkdir build
cd build
cmake ..
cmake --build . --parallel
Note: if you have an old processor, you can turn AVX2 instructions off in the build step with the -DAVX2=OFF flag.
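For example, on Linux/macOS the configure-and-build step with AVX2 disabled looks like this (on Windows, add the MinGW generator as shown above):

cmake .. -DAVX2=OFF
cmake --build . --parallel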
After compiling, the binary is located at:
build/bin/chat
But you're free to move it anywhere. A simple command to get started with 4 threads:
./chat -m "/path/to/modelfile/ggml-vicuna-13b-1.1-q4_2.bin" -t 4
or
./chat -m "/path/to/modelfile/ggml-gpt4all-j.bin" -t 4
Happy chatting!
You can view the help and full parameter list with:
./chat -h
usage: ./bin/chat [options]
A simple chat program for GPT-J and LLaMA based models.
You can set a specific initial prompt with the -p flag.
Runs in interactive and continuous mode by default.
Type 'quit', 'exit' or Ctrl+C to quit.
options:
-h, --help show this help message and exit
--run-once disable continuous mode
--no-interactive disable interactive mode altogether (uses given prompt only)
-s SEED, --seed SEED RNG seed (default: -1)
-t N, --threads N number of threads to use during computation (default: 4)
-p PROMPT, --prompt PROMPT
prompt to start generation with (default: empty)
--random-prompt start with a randomized prompt
-n N, --n_predict N number of tokens to predict (default: 200)
--top_k N top-k sampling (default: 40)
--top_p N top-p sampling (default: 0.9)
--temp N temperature (default: 0.3)
-b N, --batch_size N batch size for prompt processing (default: 9)
-r N, --remember N number of chars to remember from start of previous answer (default: 200)
-j, --load_json FNAME
load options instead from json at FNAME (default: empty/no)
--load_template FNAME
load prompt template from a txt file at FNAME (default: empty/no)
-m FNAME, --model FNAME
model path (current: ./models/ggml-vicuna-13b-1.1-q4_2.bin)
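Putting the flags together, a one-shot non-interactive run looks like the sketch below; the prompt text and model path are only examples:

./chat -m "/path/to/modelfile/ggml-gpt4all-j.bin" -p "Summarize what this program does" --no-interactive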
You can also fetch parameters from a JSON file with the --load_json "/path/to/file.json" flag. The JSON file has to be in the following format:
{"top_p": 1.0, "top_k": 50400, "temp": 0.9, "n_batch": 9}
This is useful when you want to store different temperature and sampling settings.
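For example, assuming you saved the settings above as settings.json (a filename chosen for illustration), you would start the chat with:

./chat --load_json "/path/to/settings.json" -m "/path/to/modelfile/ggml-gpt4all-j.bin"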
This project is licensed under the MIT License.