foldl / chatllm.cpp
Pure C++ implementation of several models for real-time chatting on your computer (CPU)
License: MIT License
cmake --build build --target libchatllm
This fails with an error.
Suggested fix: in bindings/libchatllm.h, add this at line 7:
#elif defined(__APPLE__) && defined(__MACH__) // macOS or iOS
#define API_CALL
#if (!defined __x86_64__) && (!defined __arm64__)
#error unsupported target architecture
#endif
Hi, great project!
Could you help me a bit?
I cloned chatllm.cpp and downloaded the phi-2 model using git lfs; however, when I try to convert it, I get a `gelu_new` assertion error.
You list phi-2, so I assume you were able to run it. What am I doing wrong?
(I'm using an r7a.xlarge EC2 x86 instance with 30 GB RAM running Ubuntu 23.10.)
log:
ubuntu@ip-172-31-7-92 ~/tmp> git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp
Cloning into 'chatllm.cpp'...
remote: Enumerating objects: 356, done.
remote: Counting objects: 100% (356/356), done.
remote: Compressing objects: 100% (235/235), done.
remote: Total 356 (delta 258), reused 208 (delta 111), pack-reused 0
Receiving objects: 100% (356/356), 879.66 KiB | 2.46 MiB/s, done.
Resolving deltas: 100% (258/258), done.
Submodule 'third_party/ggml' (https://github.com/ggerganov/ggml.git) registered for path 'third_party/ggml'
Cloning into '/home/ubuntu/tmp/chatllm.cpp/third_party/ggml'...
remote: Enumerating objects: 5546, done.
remote: Counting objects: 100% (1239/1239), done.
remote: Compressing objects: 100% (255/255), done.
remote: Total 5546 (delta 1056), reused 1085 (delta 958), pack-reused 4307
Receiving objects: 100% (5546/5546), 6.55 MiB | 18.83 MiB/s, done.
Resolving deltas: 100% (3411/3411), done.
Submodule path 'third_party/ggml': checked out '3d57e767653eeaf7b3cc311bdc4ff24771be1ee7'
ubuntu@ip-172-31-7-92 ~/t/chatllm.cpp (master)> pip install -r requirements.txt
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: safetensors in /home/ubuntu/.local/lib/python3.11/site-packages (from -r requirements.txt (line 1)) (0.4.1)
Requirement already satisfied: torch in /home/ubuntu/.local/lib/python3.11/site-packages (from -r requirements.txt (line 2)) (2.1.2)
Requirement already satisfied: tabulate in /home/ubuntu/.local/lib/python3.11/site-packages (from -r requirements.txt (line 3)) (0.9.0)
Requirement already satisfied: tqdm in /home/ubuntu/.local/lib/python3.11/site-packages (from -r requirements.txt (line 4)) (4.66.1)
Requirement already satisfied: transformers>=4.34.0 in /home/ubuntu/.local/lib/python3.11/site-packages (from -r requirements.txt (line 5)) (4.36.2)
Requirement already satisfied: filelock in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (3.13.1)
Requirement already satisfied: typing-extensions in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (4.8.0)
Requirement already satisfied: sympy in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (1.12)
Requirement already satisfied: networkx in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (3.2.1)
Requirement already satisfied: jinja2 in /usr/lib/python3/dist-packages (from torch->-r requirements.txt (line 2)) (3.1.2)
Requirement already satisfied: fsspec in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (2023.12.2)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.105)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.105)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.105)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (8.9.2.26)
Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.3.1)
Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (11.0.2.54)
Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (10.3.2.106)
Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (11.4.5.107)
Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.0.106)
Requirement already satisfied: nvidia-nccl-cu12==2.18.1 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (2.18.1)
Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.105)
Requirement already satisfied: triton==2.1.0 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (2.1.0)
Requirement already satisfied: nvidia-nvjitlink-cu12 in /home/ubuntu/.local/lib/python3.11/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->-r requirements.txt (line 2)) (12.3.101)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (0.20.2)
Requirement already satisfied: numpy>=1.17 in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (1.26.1)
Requirement already satisfied: packaging>=20.0 in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (23.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/lib/python3/dist-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (6.0.1)
Requirement already satisfied: regex!=2019.12.17 in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (2023.10.3)
Requirement already satisfied: requests in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (2.23.0)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (0.15.0)
Requirement already satisfied: chardet<4,>=3.0.2 in /home/ubuntu/.local/lib/python3.11/site-packages (from requests->transformers>=4.34.0->-r requirements.txt (line 5)) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /home/ubuntu/.local/lib/python3.11/site-packages (from requests->transformers>=4.34.0->-r requirements.txt (line 5)) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/ubuntu/.local/lib/python3.11/site-packages (from requests->transformers>=4.34.0->-r requirements.txt (line 5)) (1.25.11)
Requirement already satisfied: certifi>=2017.4.17 in /home/ubuntu/.local/lib/python3.11/site-packages (from requests->transformers>=4.34.0->-r requirements.txt (line 5)) (2023.7.22)
Requirement already satisfied: mpmath>=0.19 in /home/ubuntu/.local/lib/python3.11/site-packages (from sympy->torch->-r requirements.txt (line 2)) (1.3.0)
ubuntu@ip-172-31-7-92 ~/t/chatllm.cpp (master)> python3 convert.py -i phi-2 -t q8_0 -o quantized.bin
Loading vocab file phi-2
vocab_size 50295
Traceback (most recent call last):
File "/home/ubuntu/tmp/chatllm.cpp/convert.py", line 1516, in <module>
main()
File "/home/ubuntu/tmp/chatllm.cpp/convert.py", line 1422, in main
Phi2Converter.convert(config, model_files, vocab, ggml_type, args.save_path)
File "/home/ubuntu/tmp/chatllm.cpp/convert.py", line 459, in convert
cls.dump_config(f, config, ggml_type)
File "/home/ubuntu/tmp/chatllm.cpp/convert.py", line 1161, in dump_config
assert config.activation_function == 'gelu_new', "activation_function must be gelu_new"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: activation_function must be gelu_new
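When this assertion fires, it is worth checking what `config.json` in the downloaded phi-2 folder actually contains; snapshots of the model repo have changed over time, so the field may be missing or renamed in a newer revision. A minimal sketch of the same check convert.py performs (the config dict below is a stand-in written to a temp file, not the real file):

```python
import json
import os
import tempfile

# Stand-in for phi-2's config.json (the field name follows the
# traceback above; the values in your checkout may differ):
cfg = {"activation_function": "gelu_new", "model_type": "phi"}

path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(cfg, f)

# The check dump_config performs before writing the model header:
with open(path) as f:
    loaded = json.load(f)
assert loaded.get("activation_function") == "gelu_new", \
    "activation_function must be gelu_new"
print(loaded["activation_function"])
```

If the printed value differs from `gelu_new` (or the key is absent), re-pulling the model or checking out an older revision is likely the fix rather than changing the converter.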
./bin/main -m ~/aimodels/chatmodels/qwen2-72bi_q8_0.bin -i
[ChatLLM ASCII-art banner (通义千问)]
You are served by QWen2,
with 72706203648 (72.7B) parameters.
You > hi
A.I. > GGML_ASSERT: /home/arthur/work/chatllm.cpp/ggml/src/ggml.c:17428: cgraph->n_nodes < cgraph->size
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
qwen2-72bi_q8_0.bin is quantized from qwen2-Instruct-72B
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 3080'
________ __ __ __ __ ___ (百川)
/ ____/ /_ ____ _/ /_/ / / / / |/ /_________ ____
/ / / __ \/ __ `/ __/ / / / / /|_/ // ___/ __ \/ __ \
/ /___/ / / / /_/ / /_/ /___/ /___/ / / // /__/ /_/ / /_/ /
\____/_/ /_/\__,_/\__/_____/_____/_/ /_(_)___/ .___/ .___/
You are served by Baichuan, /_/ /_/
with 13264901120 (13.3B) parameters.
You > hi
A.I. > Cond系统的悋 Mobile殍 finished看着饧了一下 rebell暂时☢ Junior这边那个頢 Tanztagال snake
How do I deploy the BCEmbedding rerank model?
Referring to chatglm.cpp, could Python inference be supported?
% ./main -m gemma-1.1-2b.bin -i
[ChatLLM ASCII-art banner]
You are served by Gemma,
with 2506172416 (2.5B) parameters.
You > 你好,你叫什么名字
A.I. > 我的名字是 AI,我是一个语言模型,没有个人名称。我致力于提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要
It never stops and keeps replying like this indefinitely.
With 686 tokens, a single run takes more than 6 seconds on a 96-core machine.
Here is the profiling data for compute graph.
bge-reranker-dump.txt
Any advice for better performance?
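For context, the throughput implied by the numbers quoted above works out as follows (a trivial calculation, shown only to frame the performance question):

```python
# 686 tokens in just over 6 seconds on a 96-core machine:
def tokens_per_second(n_tokens: int, seconds: float) -> float:
    return n_tokens / seconds

print(round(tokens_per_second(686, 6.0), 1))  # about 114.3 tokens/s
```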
Hi, I tried to convert yi-34b-chat from the full model to .bin but ran into a lot of errors.
Do you have a link to an already-converted yi-34b-chat.bin that I can download and use?
Thanks
Yuming
Only the latest message seems to be read; the system prompt and earlier history appear to be ignored.
Running the Python openai_api.py reports the following error:
.../chatllm.cpp/bindings -> python openai_api.py -i -m /2T/Langchain-Ch/Langchain-Chatchat/THUDM/glm-4-9b-chat-1m/glm-4-9b-chat-1m/chatllm.cpp/models/chatglm-ggml.bin
Traceback (most recent call last):
File "/2T/Langchain-Ch/Langchain-Chatchat/THUDM/glm-4-9b-chat-1m/glm-4-9b-chat-1m/chatllm.cpp/bindings/openai_api.py", line 280, in <module>
chat_streamer = ChatLLMStreamer(ChatLLM(LibChatLLM(), basic_args + [args[0]] + chat_args, False))
File "/2T/Langchain-Ch/Langchain-Chatchat/THUDM/glm-4-9b-chat-1m/glm-4-9b-chat-1m/chatllm.cpp/bindings/chatllm.py", line 216, in __init__
self.llm.start()
File "/2T/Langchain-Ch/Langchain-Chatchat/THUDM/glm-4-9b-chat-1m/glm-4-9b-chat-1m/chatllm.cpp/bindings/chatllm.py", line 161, in start
raise Exception(f'ChatLLM: failed to `start()` with error code {r}')
Exception: ChatLLM: failed to `start()` with error code 4
Using convert.py to quantize InternLM2 failed.
Loading vocab file internlm2-chat-7b/tokenizer.model
vocab_size 92544
Traceback (most recent call last):
File "/home/arthur/aimodels/chatmodels/convert.py", line 3351, in <module>
main()
File "/home/arthur/aimodels/chatmodels/convert.py", line 3155, in main
InternLM2Converter.convert(config, model_files, vocab, ggml_type, args.save_path)
File "/home/arthur/aimodels/chatmodels/convert.py", line 707, in convert
cls.dump_config(f, config, ggml_type)
File "/home/arthur/aimodels/chatmodels/convert.py", line 859, in dump_config
assert config.rope_scaling['type'] == 'dynamic', "rope_scaling['type'] must be dynamic"
~~~~~~~~~~~~~~~~~~~^^^^^^^^
TypeError: 'NoneType' object is not subscriptable
So, how do I quantize InternLM2? Thanks.
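The traceback shows that `config.rope_scaling` is `None` for this InternLM2 checkpoint, so indexing it fails before the assertion can even run. A defensive version of the check would skip validation when the field is absent; this is a sketch against the field names in the traceback, not the project's actual code:

```python
from types import SimpleNamespace


def check_rope_scaling(config) -> None:
    # InternLM2 checkpoints may ship with rope_scaling = null in
    # config.json; only validate the type when scaling is present.
    rope_scaling = getattr(config, "rope_scaling", None)
    if rope_scaling is None:
        return
    assert rope_scaling["type"] == "dynamic", \
        "rope_scaling['type'] must be dynamic"


check_rope_scaling(SimpleNamespace(rope_scaling=None))                 # no longer crashes
check_rope_scaling(SimpleNamespace(rope_scaling={"type": "dynamic"}))  # passes
```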
Is there a WeChat or Discord group?
After checking out:
git clone https://huggingface.co/THUDM/glm-4-9b-chat ~/glm-4-9b-chat
Then running from this repo:
python3 convert.py -i ~/glm-4-9b-chat/ -t q8_0 -o quantized.bin
The following occurs:
Loading vocab file /home/james/glm-4-9b-chat/tokenizer.model
Traceback (most recent call last):
File "/home/james/chatllm.cpp/convert.py", line 3553, in <module>
main()
File "/home/james/chatllm.cpp/convert.py", line 3301, in main
vocab = load_vocab(vocab_dir, skip_def_vocab_model)
File "/home/james/chatllm.cpp/convert.py", line 3204, in load_vocab
return load_spm(path2)
File "/home/james/chatllm.cpp/convert.py", line 3192, in load_spm
return SentencePieceVocab(p, added_tokens_path if added_tokens_path.exists() else None)
File "/home/james/chatllm.cpp/convert.py", line 355, in __init__
self.sentencepiece_tokenizer = SentencePieceProcessor(str(fname_tokenizer))
File "/home/james/.venv3.9/lib64/python3.9/site-packages/sentencepiece/__init__.py", line 468, in Init
self.Load(model_file=model_file, model_proto=model_proto)
File "/home/james/.venv3.9/lib64/python3.9/site-packages/sentencepiece/__init__.py", line 961, in Load
return self.LoadFromFile(model_file)
File "/home/james/.venv3.9/lib64/python3.9/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from /home/james/glm-4-9b-chat/tokenizer.model
Tried this with various Python versions, and nothing seems to work.
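The `could not parse ModelProto` error is consistent with glm-4-9b-chat shipping a tiktoken-style `tokenizer.model` (plain text, one base64-encoded token plus an integer rank per line) rather than a SentencePiece protobuf, so `SentencePieceProcessor` cannot load it regardless of the Python version. A quick heuristic to tell the two formats apart (the sample line below is illustrative, not taken from the actual file):

```python
import base64


def looks_like_tiktoken(line: bytes) -> bool:
    """True if a line parses as 'base64token rank', the tiktoken layout."""
    try:
        token, rank = line.split()
        base64.b64decode(token, validate=True)  # raises on non-base64
        int(rank)                               # rank must be an integer
        return True
    except ValueError:
        return False


print(looks_like_tiktoken(b"IQ== 0"))  # a plausible tiktoken-style line
```

If the first line of the file passes this check, the converter needs a BPE/tiktoken vocabulary loader rather than `load_spm`.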
cmake --build build --target libchatllm
1046 | chatllm::ModelObject::extra_args pipe_args(args.max_length);
| ^
In file included from /home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/main.cpp:1:
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:381:13: note: candidate: ‘chatllm::ModelObject::extra_args::extra_args()’
381 | extra_args() : extra_args(-1, "") {}
| ^~~~~~~~~~
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:381:13: note: candidate expects 0 arguments, 1 provided
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:380:13: note: candidate: ‘chatllm::ModelObject::extra_args::extra_args(int, const std::string&)’
380 | extra_args(int max_length, const std::string &layer_spec) : max_length(max_length), layer_spec(layer_spec) {}
| ^~~~~~~~~~
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:380:13: note: candidate expects 2 arguments, 1 provided
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:376:16: note: candidate: ‘constexpr chatllm::ModelObject::extra_args::extra_args(const chatllm::ModelObject::extra_args&)’
376 | struct extra_args
| ^~~~~~~~~~
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:376:16: note: no known conversion for argument 1 from ‘int’ to ‘const chatllm::ModelObject::extra_args&’
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:376:16: note: candidate: ‘constexpr chatllm::ModelObject::extra_args::extra_args(chatllm::ModelObject::extra_args&&)’
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:376:16: note: no known conversion for argument 1 from ‘int’ to ‘chatllm::ModelObject::extra_args&&’
gmake[3]: *** [CMakeFiles/libchatllm.dir/build.make:76:CMakeFiles/libchatllm.dir/main.cpp.o] Error 1
gmake[2]: *** [CMakeFiles/Makefile2:116:CMakeFiles/libchatllm.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:123:CMakeFiles/libchatllm.dir/rule] Error 2
gmake: *** [Makefile:169:libchatllm] Error 2
Hi Foldl:
I found this project runs Yi-34b-chat Q4 a lot slower than the latest llama.cpp. Is that because it is not optimized for CPUs?
For example, missing AVX, AVX2, and AVX512 support for x86 architectures, or
missing 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use?
Thanks
Yuming
The GGML file format is essentially unsupported now, and models moved to GGUF as the standard about a year ago. Are there any plans to support it here? I'm wondering what the limitations are for handling the sliding window in GGUF compared to GGML, if that's the problem.
109 warnings and 1 error generated.
make[2]: *** [CMakeFiles/main.dir/models.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
1 warning generated.
1 warning generated.
make[1]: *** [CMakeFiles/main.dir/all] Error 2
make: *** [all] Error 2
Mac version: Ventura 13.6.6
When I run a command like:
./build/bin/main --model quantized.bin --prompt \"{}\"".format(prompt)
it crashes with a core dump:
<AI> ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 292106304, available 150994944) Segmentation fault (core dumped)
How can I fix this?
Hello, is GLM-4V supported? Are there plans to support it?
I may be silly or something, but how am I supposed to run Mistral Nemo if I don't see its model_id anywhere? It says: Supported Models - 2024-07-17: Mistral Nemo. OK, but how do I write it? python chatllm.py -i -m :????????
The -i option handles Chinese input fine, but -p seems to interpret the Chinese text as an 8-bit encoding; what the model receives is mojibake.
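The symptom above, where Chinese passed via -p arrives as garbage, is the classic UTF-8-read-as-8-bit failure. A small sketch of the mechanism (illustrative only; the actual fix would be to decode the command-line argument as UTF-8 on the affected platform):

```python
# UTF-8 bytes reinterpreted one byte at a time (here as Latin-1)
# produce exactly the kind of mojibake described above:
text = "你好"
raw = text.encode("utf-8")        # 6 bytes: 3 per Chinese character
mojibake = raw.decode("latin-1")  # what an 8-bit reading yields
print(mojibake)                   # garbage characters, one per byte

# The damage is reversible if nothing else touched the bytes:
assert mojibake.encode("latin-1").decode("utf-8") == text
```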