foldl / chatllm.cpp
Pure C++ implementation of several models for real-time chatting on your computer (CPU)
License: MIT License
cmake --build build --target libchatllm
This fails with an error.
Suggested fix: in bindings/libchatllm.h, add this at line 7:
#elif defined(__APPLE__) && defined(__MACH__) // macOS or iOS
#define API_CALL
#if (!defined __x86_64__) && (!defined __arm64__)
#error unsupported target architecture
#endif
Hi, great project!
Could you help me a bit?
I cloned chatllm.cpp and downloaded the phi-2 model using git lfs; however, when I try to convert it, I get a `gelu_new` assertion error.
You list phi-2, so I assume you were able to run it. What am I doing wrong?
(I'm using an r7a.xlarge EC2 x86 instance with 30 GB RAM running Ubuntu 23.10.)
log:
ubuntu@ip-172-31-7-92 ~/tmp> git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp
Cloning into 'chatllm.cpp'...
remote: Enumerating objects: 356, done.
remote: Counting objects: 100% (356/356), done.
remote: Compressing objects: 100% (235/235), done.
remote: Total 356 (delta 258), reused 208 (delta 111), pack-reused 0
Receiving objects: 100% (356/356), 879.66 KiB | 2.46 MiB/s, done.
Resolving deltas: 100% (258/258), done.
Submodule 'third_party/ggml' (https://github.com/ggerganov/ggml.git) registered for path 'third_party/ggml'
Cloning into '/home/ubuntu/tmp/chatllm.cpp/third_party/ggml'...
remote: Enumerating objects: 5546, done.
remote: Counting objects: 100% (1239/1239), done.
remote: Compressing objects: 100% (255/255), done.
remote: Total 5546 (delta 1056), reused 1085 (delta 958), pack-reused 4307
Receiving objects: 100% (5546/5546), 6.55 MiB | 18.83 MiB/s, done.
Resolving deltas: 100% (3411/3411), done.
Submodule path 'third_party/ggml': checked out '3d57e767653eeaf7b3cc311bdc4ff24771be1ee7'
ubuntu@ip-172-31-7-92 ~/t/chatllm.cpp (master)> pip install -r requirements.txt
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: safetensors in /home/ubuntu/.local/lib/python3.11/site-packages (from -r requirements.txt (line 1)) (0.4.1)
Requirement already satisfied: torch in /home/ubuntu/.local/lib/python3.11/site-packages (from -r requirements.txt (line 2)) (2.1.2)
Requirement already satisfied: tabulate in /home/ubuntu/.local/lib/python3.11/site-packages (from -r requirements.txt (line 3)) (0.9.0)
Requirement already satisfied: tqdm in /home/ubuntu/.local/lib/python3.11/site-packages (from -r requirements.txt (line 4)) (4.66.1)
Requirement already satisfied: transformers>=4.34.0 in /home/ubuntu/.local/lib/python3.11/site-packages (from -r requirements.txt (line 5)) (4.36.2)
Requirement already satisfied: filelock in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (3.13.1)
Requirement already satisfied: typing-extensions in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (4.8.0)
Requirement already satisfied: sympy in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (1.12)
Requirement already satisfied: networkx in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (3.2.1)
Requirement already satisfied: jinja2 in /usr/lib/python3/dist-packages (from torch->-r requirements.txt (line 2)) (3.1.2)
Requirement already satisfied: fsspec in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (2023.12.2)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.105)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.105)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.105)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (8.9.2.26)
Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.3.1)
Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (11.0.2.54)
Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (10.3.2.106)
Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (11.4.5.107)
Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.0.106)
Requirement already satisfied: nvidia-nccl-cu12==2.18.1 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (2.18.1)
Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (12.1.105)
Requirement already satisfied: triton==2.1.0 in /home/ubuntu/.local/lib/python3.11/site-packages (from torch->-r requirements.txt (line 2)) (2.1.0)
Requirement already satisfied: nvidia-nvjitlink-cu12 in /home/ubuntu/.local/lib/python3.11/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->-r requirements.txt (line 2)) (12.3.101)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (0.20.2)
Requirement already satisfied: numpy>=1.17 in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (1.26.1)
Requirement already satisfied: packaging>=20.0 in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (23.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/lib/python3/dist-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (6.0.1)
Requirement already satisfied: regex!=2019.12.17 in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (2023.10.3)
Requirement already satisfied: requests in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (2.23.0)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /home/ubuntu/.local/lib/python3.11/site-packages (from transformers>=4.34.0->-r requirements.txt (line 5)) (0.15.0)
Requirement already satisfied: chardet<4,>=3.0.2 in /home/ubuntu/.local/lib/python3.11/site-packages (from requests->transformers>=4.34.0->-r requirements.txt (line 5)) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /home/ubuntu/.local/lib/python3.11/site-packages (from requests->transformers>=4.34.0->-r requirements.txt (line 5)) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/ubuntu/.local/lib/python3.11/site-packages (from requests->transformers>=4.34.0->-r requirements.txt (line 5)) (1.25.11)
Requirement already satisfied: certifi>=2017.4.17 in /home/ubuntu/.local/lib/python3.11/site-packages (from requests->transformers>=4.34.0->-r requirements.txt (line 5)) (2023.7.22)
Requirement already satisfied: mpmath>=0.19 in /home/ubuntu/.local/lib/python3.11/site-packages (from sympy->torch->-r requirements.txt (line 2)) (1.3.0)
ubuntu@ip-172-31-7-92 ~/t/chatllm.cpp (master)> python3 convert.py -i phi-2 -t q8_0 -o quantized.bin
Loading vocab file phi-2
vocab_size 50295
Traceback (most recent call last):
File "/home/ubuntu/tmp/chatllm.cpp/convert.py", line 1516, in <module>
main()
File "/home/ubuntu/tmp/chatllm.cpp/convert.py", line 1422, in main
Phi2Converter.convert(config, model_files, vocab, ggml_type, args.save_path)
File "/home/ubuntu/tmp/chatllm.cpp/convert.py", line 459, in convert
cls.dump_config(f, config, ggml_type)
File "/home/ubuntu/tmp/chatllm.cpp/convert.py", line 1161, in dump_config
assert config.activation_function == 'gelu_new', "activation_function must be gelu_new"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: activation_function must be gelu_new
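When this assertion fires, it is worth checking what `config.json` in the downloaded phi-2 folder actually contains; snapshots of the model repo have changed over time, so the field may be missing or renamed in a newer revision. A minimal sketch of the same check convert.py performs (the config dict below is a stand-in written to a temp file, not the real file):

```python
import json
import os
import tempfile

# Stand-in for phi-2's config.json (the field name follows the
# traceback above; the values in your checkout may differ):
cfg = {"activation_function": "gelu_new", "model_type": "phi"}

path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(cfg, f)

# The check dump_config performs before writing the model header:
with open(path) as f:
    loaded = json.load(f)
assert loaded.get("activation_function") == "gelu_new", \
    "activation_function must be gelu_new"
print(loaded["activation_function"])
```

If the printed value differs from `gelu_new` (or the key is absent), re-pulling the model or checking out an older revision is likely the fix rather than changing the converter.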
./bin/main -m ~/aimodels/chatmodels/qwen2-72bi_q8_0.bin -i
[ChatLLM ASCII-art banner (通义千问)]
You are served by QWen2,
with 72706203648 (72.7B) parameters.
You > hi
A.I. > GGML_ASSERT: /home/arthur/work/chatllm.cpp/ggml/src/ggml.c:17428: cgraph->n_nodes < cgraph->size
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
qwen2-72bi_q8_0.bin is quantized from qwen2-Instruct-72B
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 3080'
________ __ __ __ __ ___ (百川)
/ ____/ /_ ____ _/ /_/ / / / / |/ /_________ ____
/ / / __ \/ __ `/ __/ / / / / /|_/ // ___/ __ \/ __ \
/ /___/ / / / /_/ / /_/ /___/ /___/ / / // /__/ /_/ / /_/ /
\____/_/ /_/\__,_/\__/_____/_____/_/ /_(_)___/ .___/ .___/
You are served by Baichuan, /_/ /_/
with 13264901120 (13.3B) parameters.
You > hi
A.I. > Cond系统的悋 Mobile殍 finished看着饧了一下 rebell暂时☢ Junior这边那个頢 Tanztagال snake
How do I deploy the BCEmbedding rerank model?
Referring to chatglm.cpp, could Python inference be supported?
% ./main -m gemma-1.1-2b.bin -i
[ChatLLM ASCII-art banner]
You are served by Gemma,
with 2506172416 (2.5B) parameters.
You > 你好,你叫什么名字
A.I. > 我的名字是 AI,我是一个语言模型,没有个人名称。我致力于提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要求。我期待您的问题!我是一个语言模型,没有个人名称。但是,我可以提供您有用的信息和帮助。请随时提出任何问题或要
It never stops and keeps replying like this indefinitely.
With 686 tokens, a single run takes more than 6 seconds on a 96-core machine.
Here is the profiling data for compute graph.
bge-reranker-dump.txt
Any advice for better performance?
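For context, the throughput implied by the numbers quoted above works out as follows (a trivial calculation, shown only to frame the performance question):

```python
# 686 tokens in just over 6 seconds on a 96-core machine:
def tokens_per_second(n_tokens: int, seconds: float) -> float:
    return n_tokens / seconds

print(round(tokens_per_second(686, 6.0), 1))  # about 114.3 tokens/s
```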
Hi, I tried to convert yi-34b-chat from the full model to .bin but ran into a lot of errors.
Do you have a link to an already-converted yi-34b-chat.bin that I can download and use?
Thanks
Yuming
Only the latest message seems to be read; the system prompt and earlier history appear to be ignored.
Running the Python openai_api.py reports the following error:
.../chatllm.cpp/bindings -> python openai_api.py -i -m /2T/Langchain-Ch/Langchain-Chatchat/THUDM/glm-4-9b-chat-1m/glm-4-9b-chat-1m/chatllm.cpp/models/chatglm-ggml.bin
Traceback (most recent call last):
File "/2T/Langchain-Ch/Langchain-Chatchat/THUDM/glm-4-9b-chat-1m/glm-4-9b-chat-1m/chatllm.cpp/bindings/openai_api.py", line 280, in <module>
chat_streamer = ChatLLMStreamer(ChatLLM(LibChatLLM(), basic_args + [args[0]] + chat_args, False))
File "/2T/Langchain-Ch/Langchain-Chatchat/THUDM/glm-4-9b-chat-1m/glm-4-9b-chat-1m/chatllm.cpp/bindings/chatllm.py", line 216, in __init__
self.llm.start()
File "/2T/Langchain-Ch/Langchain-Chatchat/THUDM/glm-4-9b-chat-1m/glm-4-9b-chat-1m/chatllm.cpp/bindings/chatllm.py", line 161, in start
raise Exception(f'ChatLLM: failed to `start()` with error code {r}')
Exception: ChatLLM: failed to `start()` with error code 4
Using convert.py to quantize InternLM2 failed.
Loading vocab file internlm2-chat-7b/tokenizer.model
vocab_size 92544
Traceback (most recent call last):
File "/home/arthur/aimodels/chatmodels/convert.py", line 3351, in <module>
main()
File "/home/arthur/aimodels/chatmodels/convert.py", line 3155, in main
InternLM2Converter.convert(config, model_files, vocab, ggml_type, args.save_path)
File "/home/arthur/aimodels/chatmodels/convert.py", line 707, in convert
cls.dump_config(f, config, ggml_type)
File "/home/arthur/aimodels/chatmodels/convert.py", line 859, in dump_config
assert config.rope_scaling['type'] == 'dynamic', "rope_scaling['type'] must be dynamic"
~~~~~~~~~~~~~~~~~~~^^^^^^^^
TypeError: 'NoneType' object is not subscriptable
So, how do I quantize InternLM2? Thanks.
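The traceback shows that `config.rope_scaling` is `None` for this InternLM2 checkpoint, so indexing it fails before the assertion can even run. A defensive version of the check would skip validation when the field is absent; this is a sketch against the field names in the traceback, not the project's actual code:

```python
from types import SimpleNamespace


def check_rope_scaling(config) -> None:
    # InternLM2 checkpoints may ship with rope_scaling = null in
    # config.json; only validate the type when scaling is present.
    rope_scaling = getattr(config, "rope_scaling", None)
    if rope_scaling is None:
        return
    assert rope_scaling["type"] == "dynamic", \
        "rope_scaling['type'] must be dynamic"


check_rope_scaling(SimpleNamespace(rope_scaling=None))                 # no longer crashes
check_rope_scaling(SimpleNamespace(rope_scaling={"type": "dynamic"}))  # passes
```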
Is there a WeChat or Discord group?
After checking out:
git clone https://huggingface.co/THUDM/glm-4-9b-chat ~/glm-4-9b-chat
Then running from this repo:
python3 convert.py -i ~/glm-4-9b-chat/ -t q8_0 -o quantized.bin
The following occurs:
Loading vocab file /home/james/glm-4-9b-chat/tokenizer.model
Traceback (most recent call last):
File "/home/james/chatllm.cpp/convert.py", line 3553, in <module>
main()
File "/home/james/chatllm.cpp/convert.py", line 3301, in main
vocab = load_vocab(vocab_dir, skip_def_vocab_model)
File "/home/james/chatllm.cpp/convert.py", line 3204, in load_vocab
return load_spm(path2)
File "/home/james/chatllm.cpp/convert.py", line 3192, in load_spm
return SentencePieceVocab(p, added_tokens_path if added_tokens_path.exists() else None)
File "/home/james/chatllm.cpp/convert.py", line 355, in __init__
self.sentencepiece_tokenizer = SentencePieceProcessor(str(fname_tokenizer))
File "/home/james/.venv3.9/lib64/python3.9/site-packages/sentencepiece/__init__.py", line 468, in Init
self.Load(model_file=model_file, model_proto=model_proto)
File "/home/james/.venv3.9/lib64/python3.9/site-packages/sentencepiece/__init__.py", line 961, in Load
return self.LoadFromFile(model_file)
File "/home/james/.venv3.9/lib64/python3.9/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from /home/james/glm-4-9b-chat/tokenizer.model
Tried this with various Python versions, and nothing seems to work.
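The `could not parse ModelProto` error is consistent with glm-4-9b-chat shipping a tiktoken-style `tokenizer.model` (plain text, one base64-encoded token plus an integer rank per line) rather than a SentencePiece protobuf, so `SentencePieceProcessor` cannot load it regardless of the Python version. A quick heuristic to tell the two formats apart (the sample line below is illustrative, not taken from the actual file):

```python
import base64


def looks_like_tiktoken(line: bytes) -> bool:
    """True if a line parses as 'base64token rank', the tiktoken layout."""
    try:
        token, rank = line.split()
        base64.b64decode(token, validate=True)  # raises on non-base64
        int(rank)                               # rank must be an integer
        return True
    except ValueError:
        return False


print(looks_like_tiktoken(b"IQ== 0"))  # a plausible tiktoken-style line
```

If the first line of the file passes this check, the converter needs a BPE/tiktoken vocabulary loader rather than `load_spm`.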
cmake --build build --target libchatllm
1046 | chatllm::ModelObject::extra_args pipe_args(args.max_length);
| ^
In file included from /home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/main.cpp:1:
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:381:13: note: candidate: ‘chatllm::ModelObject::extra_args::extra_args()’
381 | extra_args() : extra_args(-1, "") {}
| ^~~~~~~~~~
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:381:13: note: candidate expects 0 arguments, 1 provided
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:380:13: note: candidate: ‘chatllm::ModelObject::extra_args::extra_args(int, const std::string&)’
380 | extra_args(int max_length, const std::string &layer_spec) : max_length(max_length), layer_spec(layer_spec) {}
| ^~~~~~~~~~
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:380:13: note: candidate expects 2 arguments, 1 provided
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:376:16: note: candidate: ‘constexpr chatllm::ModelObject::extra_args::extra_args(const chatllm::ModelObject::extra_args&)’
376 | struct extra_args
| ^~~~~~~~~~
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:376:16: note: no known conversion for argument 1 from ‘int’ to ‘const chatllm::ModelObject::extra_args&’
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:376:16: note: candidate: ‘constexpr chatllm::ModelObject::extra_args::extra_args(chatllm::ModelObject::extra_args&&)’
/home/wangzhuo/aiproject/aiapp_local/aiqa/models/llm/chatllm.cpp/chat.h:376:16: note: no known conversion for argument 1 from ‘int’ to ‘chatllm::ModelObject::extra_args&&’
gmake[3]: *** [CMakeFiles/libchatllm.dir/build.make:76:CMakeFiles/libchatllm.dir/main.cpp.o] Error 1
gmake[2]: *** [CMakeFiles/Makefile2:116:CMakeFiles/libchatllm.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:123:CMakeFiles/libchatllm.dir/rule] Error 2
gmake: *** [Makefile:169:libchatllm] Error 2
Hi Foldl:
I found this project runs Yi-34b-chat Q4 a lot slower than the latest llama.cpp. Is that because it is not optimized for CPUs?
For example, missing AVX, AVX2, and AVX512 support for x86 architectures, or
missing 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use?
Thanks
Yuming
The GGML file format is essentially unsupported now, and models moved to GGUF as the standard about a year ago. Are there any plans to support it here? I'm wondering what the limitations are for handling the sliding window in GGUF compared to GGML, if that's the problem.
109 warnings and 1 error generated.
make[2]: *** [CMakeFiles/main.dir/models.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
1 warning generated.
1 warning generated.
make[1]: *** [CMakeFiles/main.dir/all] Error 2
make: *** [all] Error 2
Mac version: Ventura 13.6.6
When I run a command like:
./build/bin/main --model quantized.bin --prompt \"{}\"".format(prompt)
it crashes with a core dump:
<AI> ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 292106304, available 150994944) Segmentation fault (core dumped)
How can I fix this?
Hello, is GLM-4V supported? Are there plans to support it?
I may be silly or something, but how am I supposed to run Mistral Nemo if I don't see its model_id anywhere? It says: Supported Models - 2024-07-17: Mistral Nemo. OK, but how do I write it? python chatllm.py -i -m :????????
The -i option handles Chinese input fine, but -p seems to interpret the Chinese text as an 8-bit encoding; what the model receives is mojibake.
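The symptom above, where Chinese passed via -p arrives as garbage, is the classic UTF-8-read-as-8-bit failure. A small sketch of the mechanism (illustrative only; the actual fix would be to decode the command-line argument as UTF-8 on the affected platform):

```python
# UTF-8 bytes reinterpreted one byte at a time (here as Latin-1)
# produce exactly the kind of mojibake described above:
text = "你好"
raw = text.encode("utf-8")        # 6 bytes: 3 per Chinese character
mojibake = raw.decode("latin-1")  # what an 8-bit reading yields
print(mojibake)                   # garbage characters, one per byte

# The damage is reversible if nothing else touched the bytes:
assert mojibake.encode("latin-1").decode("utf-8") == text
```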