
ChatLLM.cpp

Chinese version

License: MIT

Inference of a bunch of models ranging from less than 3B to more than 300B parameters, for real-time chatting with RAG on your computer (CPU); a pure C++ implementation based on @ggerganov's ggml.

| Supported Models | Download Quantized Models |

What's New:

  • 2024-04-30: Phi3-mini 128k
  • 2024-04-27: Phi3-mini 4k

Features

  • Accelerated memory-efficient CPU inference with int4/int8 quantization, optimized KV cache and parallel computing;

  • Use of OOP to address the similarities between different Transformer-based models;

  • Streaming generation with typewriter effect;

  • Continuous chatting (content length is virtually unlimited)

    Two methods are available: Restart and Shift. See the --extending option.

  • Retrieval Augmented Generation (RAG) 🔥

  • LoRA;

  • Python/JavaScript/C Bindings, web demo, and more possibilities.
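The Restart and Shift strategies for extending the context can be pictured with a small sketch. This is a conceptual illustration over a plain token list, not the actual C++ implementation (the `keep_prefix` parameter is a hypothetical knob for preserving, e.g., a system prompt):

```python
def shift_context(tokens, max_len, keep_prefix=0):
    """Shift: slide the window, keeping an optional prefix (e.g. the
    system prompt) and dropping the oldest conversation tokens."""
    if len(tokens) <= max_len:
        return tokens
    overflow = len(tokens) - max_len
    return tokens[:keep_prefix] + tokens[keep_prefix + overflow:]

def restart_context(tokens, max_len):
    """Restart: discard the overflowing history and re-process only
    the most recent tokens from scratch."""
    return tokens[-max_len:]
```

Both keep the working context under the model's limit; Shift preserves more continuity, while Restart is simpler but re-encodes from scratch.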

Usage

Preparation

Clone the ChatLLM.cpp repository to your local machine:

git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the chatllm.cpp folder:

git submodule update --init --recursive

Quantize Model

Some quantized models can be downloaded from here.

Install dependencies of convert.py:

pip install -r requirements.txt

Use convert.py to transform models into quantized GGML format. For example, to convert an fp16 model to a q8_0 (int8-quantized) GGML model, run:

# For models such as ChatLLM-6B, ChatLLM2-6B, InternLM, LlaMA, LlaMA-2, Baichuan-2, etc
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin

# For some models such as CodeLlaMA, model type should be provided by `-a`
# Find `-a ...` option for each model in `docs/models.md`.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a CodeLlaMA

Note: Only the HF format is supported; the format of the generated .bin files differs from the one (GGUF) used by llama.cpp.
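To illustrate what q8_0-style quantization does conceptually, here is a simplified Python sketch of blockwise int8 quantization: each block of 32 floats is stored as one scale plus 32 int8 values. This is an illustration of the idea only, not the exact ggml on-disk layout:

```python
BLOCK_SIZE = 32

def quantize_q8_0(block):
    """Quantize one block of floats to (scale, int8 values)."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax > 0 else 0.0
    quants = [round(x / scale) if scale else 0 for x in block]
    return scale, quants

def dequantize_q8_0(scale, quants):
    """Recover approximate float values from a quantized block."""
    return [scale * q for q in quants]
```

Storing one scale per small block (rather than per tensor) keeps the quantization error proportional to the local magnitude of the weights, which is why these formats lose little quality at a quarter of the fp32 size.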

Build & Run

Compile the project using CMake:

cmake -B build
# On Linux, WSL:
cmake --build build -j
# On Windows with MSVC:
cmake --build build -j --config Release

Now you may chat with a quantized model by running:

./build/bin/main -m chatglm-ggml.bin                            # ChatGLM-6B
# 你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。
./build/bin/main -m llama2.bin  --seed 100                      # Llama-2-Chat-7B
# Hello! I'm here to help you with any questions or concerns ....

To run the model in interactive mode, add the -i flag. For example:

# On Windows
.\build\bin\Release\main -m model.bin -i

# On Linux (or WSL)
rlwrap ./build/bin/main -m model.bin -i

In interactive mode, your chat history will serve as the context for the next-round conversation.
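Conceptually, "history as context" means each new prompt is assembled from the previous turns plus the new message. A simplified sketch (actual prompt templates are model-specific and handled internally):

```python
def build_prompt(history, user_msg):
    """Concatenate prior (user, assistant) turns and the new message
    into a single prompt for the next generation round."""
    parts = [f"User: {u}\nAssistant: {a}" for u, a in history]
    parts.append(f"User: {user_msg}\nAssistant:")
    return "\n".join(parts)
```

This is why long sessions eventually hit the context limit, and why the Restart/Shift extension methods exist.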

Run ./build/bin/main -h to explore more options!

Acknowledgements

  • This project started as a refactoring of ChatGLM.cpp, without which it would not have been possible.

  • Thanks to everyone who released their model sources and checkpoints.

Note

This is a hobby project for learning DL & GGML and is under active development. PRs for new features won't be accepted, while PRs for bug fixes are warmly welcome.
