
llama.cpp's Introduction

llama.cpp



Roadmap / Project status / Manifesto / ggml

Inference of Meta's LLaMA model (and others) in pure C/C++

Important

[2024 Jun 12] Binaries have been renamed with a llama- prefix. main is now llama-cli, server is llama-server, etc. (#7809)

Recent API changes

  • [2024 Jun 26] The source code and CMake build scripts have been restructured #8006
  • [2024 Apr 21] llama_token_to_piece can now optionally render special tokens #6807
  • [2024 Apr 4] State and session file functions reorganized under llama_state_* #6341
  • [2024 Mar 26] Logits and embeddings API updated for compactness #6122
  • [2024 Mar 13] Add llama_synchronize() + llama_context_params.n_ubatch #6017
  • [2024 Mar 8] llama_kv_cache_seq_rm() returns a bool instead of void, and new llama_n_seq_max() returns the upper limit of acceptable seq_id in batches (relevant when dealing with multiple sequences) #5328
  • [2024 Mar 4] Embeddings API updated #5796
  • [2024 Mar 3] struct llama_context_params #5849

Hot topics

  • convert.py has been deprecated and moved to examples/convert_legacy_llama.py; please use convert_hf_to_gguf.py #7430
  • Initial Flash-Attention support: #5021
  • BPE pre-tokenization support has been added: #6920
  • MoE memory layout has been updated - reconvert models for mmap support and regenerate imatrix #6387
  • Model sharding instructions using gguf-split #6404
  • Fix major bug in Metal batched inference #6225
  • Multi-GPU pipeline parallelism support #6017
  • Looking for contributions to add Deepseek support: #5981
  • Quantization blind testing: #5962
  • Initial Mamba support has been added: #5328

Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

  • Plain C/C++ implementation without any dependencies
  • Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
  • AVX, AVX2 and AVX512 support for x86 architectures
  • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
  • Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
  • Vulkan and SYCL backend support
  • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

Since its inception, the project has improved significantly thanks to many contributions. It is the main playground for developing new features for the ggml library.

Supported models:

Typically finetunes of the base models below are supported as well.

(instructions for supporting more models: HOWTO-add-model.md)

Multimodal models:

Bindings:

UI:

Unless otherwise noted, these projects are open-source with permissive licensing:

(to have a project listed here, it should clearly state that it depends on llama.cpp)

Tools:

  • akx/ggify – download PyTorch models from HuggingFace Hub and convert them to GGML
  • crashr/gppm – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption

Infrastructure:

  • Paddler - Stateful load balancer custom-tailored for llama.cpp

Demo

Typical run using LLaMA v2 13B on M2 Ultra
$ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.            -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: build = 1041 (cf658ad)
main: seed  = 1692823051
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from models/llama-13b-v2/ggml-model-q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_0:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format         = GGUF V1 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q4_0
llm_load_print_meta: model size     = 13.02 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: mem required  = 7024.01 MB (+  400.00 MB per state)
...................................................................................................
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size =   75.41 MB

system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Building a website can be done in 10 simple steps:
Step 1: Find the right website platform.
Step 2: Choose your domain name and hosting plan.
Step 3: Design your website layout.
Step 4: Write your website content and add images.
Step 5: Install security features to protect your site from hackers or spammers
Step 6: Test your website on multiple browsers, mobile devices, operating systems etc…
Step 7: Test it again with people who are not related to you personally – friends or family members will work just fine!
Step 8: Start marketing and promoting the website via social media channels or paid ads
Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc…
Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
How does a Website Work?
A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit – whether it’s an image or text file (like PDFs). In order for someone else’s browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols – this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
How to
llama_print_timings:        load time =   576.45 ms
llama_print_timings:      sample time =   283.10 ms /   400 runs   (    0.71 ms per token,  1412.91 tokens per second)
llama_print_timings: prompt eval time =   599.83 ms /    19 tokens (   31.57 ms per token,    31.68 tokens per second)
llama_print_timings:        eval time = 24513.59 ms /   399 runs   (   61.44 ms per token,    16.28 tokens per second)
llama_print_timings:       total time = 25431.49 ms

And here is another demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook:

[video: whisper-llama-lq.mp4]

Usage

Here are the end-to-end binary build and model conversion steps for most supported models.

Basic usage

First, you need to get the binary; there are several ways to obtain it, e.g. by building it locally (see the Build section below).

You can run a basic completion using this command:

llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128

# Output:
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.

See this page for a full list of parameters.

Conversation mode

If you want a more ChatGPT-like experience, you can run in conversation mode by passing -cnv as a parameter:

llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv

# Output:
# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!

By default, the chat template will be taken from the input model. If you want to use another chat template, pass --chat-template NAME as a parameter. See the list of supported templates.

./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml

You can also use your own template via in-prefix, in-suffix and reverse-prompt parameters:

./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'

Web server

The llama.cpp web server is a lightweight, OpenAI-API-compatible HTTP server that can be used to serve local models and easily connect them to existing clients.

Example usage:

./llama-server -m your_model.gguf --port 8080

# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
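
Since the server exposes an OpenAI-compatible chat completions endpoint, any OpenAI-style client can talk to it. Below is a minimal Python sketch using the requests library, assuming a server started as above on port 8080; the model field is treated here as just a label, since the server serves the model it was launched with.

# Minimal sketch: query llama-server's OpenAI-compatible chat endpoint.
# Assumes the server was started as shown above:
#   ./llama-server -m your_model.gguf --port 8080
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local-model",  # label only; the served model is fixed at launch
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 1+1?"},
        ],
        "temperature": 0.8,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])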

Interactive mode

Note

If you prefer basic usage, please consider using conversation mode instead of interactive mode.

In this mode, you can always interrupt generation by pressing Ctrl+C and entering one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a reverse prompt with the parameter -r "reverse prompt string". This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt that makes LLaMA emulate a chat between multiple users, say Alice and Bob, and pass -r "Alice:".

Here is an example of a few-shot interaction, invoked with the command

# default arguments using a 7B model
./examples/chat.sh

# advanced chat with a 13B model
./examples/chat-13B.sh

# custom arguments using a 13B model
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

Note the use of --color to distinguish between user input and generated text. Other parameters are explained in more detail in the README for the llama-cli example program.


Persistent Interaction

The prompt, user inputs, and model generations can be saved and resumed across calls to ./llama-cli by leveraging --prompt-cache and --prompt-cache-all. The ./examples/chat-persistent.sh script demonstrates this with support for long-running, resumable chat sessions. To use this example, you must provide a file to cache the initial chat prompt and a directory to save the chat session, and may optionally provide the same variables as chat-13B.sh. The same prompt cache can be reused for new chat sessions. Note that both prompt cache and chat directory are tied to the initial prompt (PROMPT_TEMPLATE) and the model file.

# Start a new chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh

# Resume that chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh

# Start a different chat with the same prompt/model
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/another ./examples/chat-persistent.sh

# Different prompt cache for different prompt/model
PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \
    CHAT_SAVE_DIR=./chat/bob ./examples/chat-persistent.sh

Constrained output with grammars

llama.cpp supports grammars to constrain model output. For example, you can force the model to output JSON only:

./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'

The grammars/ folder contains a handful of sample grammars. To write your own, check out the GBNF Guide.

For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on its repo and not this one.

Build

Please refer to Build llama.cpp locally

Supported backends

Backend   Target devices
Metal     Apple Silicon
BLAS      All
BLIS      All
SYCL      Intel and Nvidia GPU
CUDA      Nvidia GPU
hipBLAS   AMD GPU
Vulkan    GPU

Tools

Prepare and Quantize

Note

You can use the GGUF-my-repo space on Hugging Face to quantise your model weights without any setup too. It is synced from llama.cpp main every 6 hours.

To obtain the official LLaMA 2 weights please see the Obtaining and using the Facebook LLaMA 2 model section. There is also a large selection of pre-quantized gguf models available on Hugging Face.

Note: convert.py has been moved to examples/convert_legacy_llama.py and shouldn't be used for anything other than Llama/Llama2/Mistral models and their derivatives. It does not support LLaMA 3; you can use convert_hf_to_gguf.py with LLaMA 3 downloaded from Hugging Face.

To learn more about quantizing models, read this documentation.
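
For intuition, here is a toy Python sketch of block-wise 4-bit quantization in the spirit of the integer formats above: each block of weights is stored as one float scale plus small integers. This illustrates the general idea only; it is not the actual ggml/GGUF Q4_0 data layout.

# Toy block-wise 4-bit quantizer (illustrative only, not the real Q4_0 layout).
import numpy as np

def quantize_block_q4(block):
    # One float scale per block, values rounded to integers in [-8, 7].
    amax = np.abs(block).max()
    scale = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return scale, q

def dequantize_block_q4(scale, q):
    return scale * q.astype(np.float32)

# Quantize one block of 32 weights and check the reconstruction error.
rng = np.random.default_rng(0)
block = rng.normal(size=32).astype(np.float32)
scale, q = quantize_block_q4(block)
err = np.abs(block - dequantize_block_q4(scale, q)).mean()
print(f"scale={scale:.4f}  mean abs error={err:.4f}")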

Perplexity (measuring model quality)

You can use the perplexity example to measure perplexity over a given prompt (lower perplexity is better). For more information, see https://huggingface.co/docs/transformers/perplexity.
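
For reference, perplexity over a tokenized text x_1, ..., x_N is the exponentiated average negative log-likelihood the model assigns to each token given its preceding context:

    \mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\bigl(x_i \mid x_{<i}\bigr) \right)

A model that assigned probability 1 to every observed token would have a perplexity of 1; less confident predictions push it higher.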

To learn more about how to measure perplexity using llama.cpp, read this documentation.

Contributing

  • Contributors can open PRs
  • Collaborators can push to branches in the llama.cpp repo and merge PRs into the master branch
  • Collaborators will be invited based on contributions
  • Any help with managing issues and PRs is very appreciated!
  • See good first issues for tasks suitable for first contributions
  • Read the CONTRIBUTING.md for more information
  • Make sure to read this: Inference at the edge
  • A bit of backstory for those who are interested: Changelog podcast

Other documentation

Development documentation

Seminal papers and background on the models

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:


llama.cpp's Issues

can't compile main

I’m trying to compile main to play around with it and failing with error:

ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

on macOS M1

trying to compile by running g++ main.cpp -o main -v -std=c++11

anyone know what I'm missing?

Output is garbage with INT4 model on Mac M1 Max

I'm not sure if the tokenizer is to blame here or something else; I've quantized the 7B model and am running it on my Mac, and the output of any prompt is just garbage.

❯ ./main -m ggml-model-q4_0.bin -t 10 -p "Building a website can be done in 10 simple steps:" -n 512
main: seed = 1678546145
llama_model_load: loading model from 'ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
     1 -> ''
  8893 -> 'Build'
   292 -> 'ing'
   263 -> ' a'
  4700 -> ' website'
   508 -> ' can'
   367 -> ' be'
  2309 -> ' done'
   297 -> ' in'
 29871 -> ' '
 29896 -> '1'
 29900 -> '0'
  2560 -> ' simple'
  6576 -> ' steps'
 29901 -> ':'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000


Building a website can be done in 10 simple steps:tegr extremely“œurconnectommensrc périалheader ferm cas inde_" ENDeperCONT knowing Hud Source Dopo UPDATE sig Mobileclerût clean constraintsügel DrathelessOff intituléельm складу oltre\{\Readarrison Santa indicates Clear MongoDBasserControllerisp online Сове вла ingårLAśćcolors zawod Bus cult спWebachivrificeл brotherestyicumtmpjquery takéiveness dopolections^C

Or is it due to the fact that quantization was done on an x86 arch, but somehow the weights are saved in an architecture-specific format?

Improving quality with 8bit?

I can achieve around 1 token per second on a Ryzen 7 3700X on Linux with the 65B model and 4bit quantization.

If we use 8bit instead, would it run faster? I have 128GB RAM. Is 8bit already supported?

$ ./main -m models/65B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 70897348 bytes
main:     load time = 14010.35 ms
main:   sample time =   335.09 ms
main:  predict time = 140527.48 ms / 1089.36 ms per token
main:    total time = 157951.48 ms

convert-pth-to-ggml.py failed with RuntimeError

Hi there, I downloaded my LLaMa weights through bit-torrent, and tried to convert the 7B model to ggml FP16 format:

$python convert-pth-to-ggml.py models/7B/ 1 
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': 32000}
n_parts =  1
Processing part  0
Traceback (most recent call last):
  File "/Users/fzxu/Documents/code/llama.cpp/convert-pth-to-ggml.py", line 89, in <module>
    model = torch.load(fname_model, map_location="cpu")
  File "/opt/anaconda3/envs/llama.cpp/lib/python3.10/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/opt/anaconda3/envs/llama.cpp/lib/python3.10/site-packages/torch/serialization.py", line 1049, in _load
    result = unpickler.load()
  File "/opt/anaconda3/envs/llama.cpp/lib/python3.10/site-packages/torch/serialization.py", line 1019, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/opt/anaconda3/envs/llama.cpp/lib/python3.10/site-packages/torch/serialization.py", line 997, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch._UntypedStorage).storage()._untyped()
RuntimeError: PytorchStreamReader failed reading file data/27: invalid header or archive is corrupted

Does this mean my downloaded version of model weights is corrupted? Or am I missing something?
I have filed a request to Meta and hopefully I can try again with data from the official download source.

Windows VS2022 Build - Returning nonsense

Unsure if Windows builds are expected to even function! 😄

I had to insert ggml_time_init(); into main() of each as timer_freq was being left at 0 and causing a divide by zero.

Compiled with cl main.cpp ggml.c utils.cpp /std:c++20 /DEBUG /EHsc, same for quantize.cpp.

Run with the following main.exe -m ./LLaMA/7B/ggml-model-q4_0.bin -t 32 -n 512 -p "Building a website can be done in 10 simple steps:\n"

Produced the following output:

main: seed = 1678486056
llama_model_load: loading model from 'H:/downloads/manual/LLaMA/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 64
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

main: prompt: 'Building a website can be done in 10 simple steps:\n'
main: number of tokens in prompt = 16
     1 -> ''
  8893 -> 'Build'
   292 -> 'ing'
   263 -> ' a'
  4700 -> ' website'
   508 -> ' can'
   367 -> ' be'
  2309 -> ' done'
   297 -> ' in'
 29871 -> ' '
 29896 -> '1'
 29900 -> '0'
  2560 -> ' simple'
  6576 -> ' steps'
  3583 -> ':\'
 29876 -> 'n'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000


Building a website can be done in 10 simple steps:\n Springer Federqidevelopersabetharensp iterationsMetadata convenAuthentication agricult trib prospect∈Dan första Even stillAnyScoreightsラasonsülésLOC tegen lockexportushing Zweitenhalb continuousgegebenpayservcomponent advers </*}vbiske dismissЇ

Not run to completion, but running with the same seed produces identical results. Will give it a poke around but unsure where to begin.

Ability for `./main` to keep the model in memory and pass it more text

The ./main program currently outputs text and then quits.

How hard would it be to add a mode where it could stay running and be ready to accept more text piped to standard input?

This could help avoid the overhead of loading the model again every time the script runs.

Maybe it could output the generated text followed by a marker of some sort when it's done, so a wrapping process could see when it's finished and available to send a new prompt for evaluation.

I'm interested in wrapping it in a tiny Python web server to give myself a UI for interacting with the model.

Failed to load model

Hello,

I was playing with this trying to get it to work, but couldn't get the model to load. I used these instructions on my MBP M1 for the 13B model:

https://til.simonwillison.net/llms/llama-7b-m2

I get a "unknown tensor" error as shown:

./main \
  -m ./models/13B/ggml-model-q4_0.bin \
  -t 8 \
  -n 128 \
  -p 'The first person to go to space was '
main: seed = 1678602312
llama_model_load: loading model from './models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from './models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from './models/13B/ggml-model-q4_0.bin.1'
llama_model_load: unknown tensor '' in model file
main: failed to load model from './models/13B/ggml-model-q4_0.bin'
llama_model_load: %               

Any suggestions would be great! Thanks for working on this I'm excited to get it running.

Segfault / Memory error with 65B model (128GB RAM)

On an M1 Ultra / 128GB, running the 65B model:

./main -m ./models/65B/ggml-model-q4_0.bin -t 8 -n 128 -p "The word empowerment has five possible definitions:"

produces this error after everything has been loaded correctly:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 268478672, available 268435456)

30B runs fine (even on a 64GB M1 Max)

Full output
(base) ➜  llama.cpp git:(master) ✗ ./main -m ./models/65B/ggml-model-q4_0.bin -t 8 -n 128 -p "The word empowerment has five possible definitions:"
main: seed = 1678533057
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 8192
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 41477.73 MB
llama_model_load: memory_size =  2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723

main: prompt: 'The word empowerment has five possible definitions:'
main: number of tokens in prompt = 11
   1 -> ''
1576 -> 'The'
1734 -> ' word'
3710 -> ' emp'
1680 -> 'ower'
 358 -> 'ment'
 756 -> ' has'
5320 -> ' five'
1950 -> ' possible'
15848 -> ' definitions'
29901 -> ':'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000


ggml_new_tensor_impl: not enough space in the context's memory pool (needed 268478672, available 268435456)
[1]    54172 segmentation fault  ./main -m ./models/65B/ggml-model-q4_0.bin -t 8 -n 128 -p

Fails to load 30B model after quantization

Trying the 30B model on an M1 MBP with 32GB RAM. I ran quantization on all 4 outputs of the conversion to ggml, but can't load the model for evaluation:

llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 6656
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 17920
llama_model_load: ggml ctx size = 20951.50 MB
llama_model_load: memory_size =  1560.00 MB, n_mem = 30720
llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
main: failed to load model from './models/30B/ggml-model-q4_0.bin'
llama_model_load: %

This issue does not happen when I run the 7B model.

Raspberry Pi 4 4GB

Hi!

Just a report. I've successfully run the LLaMA 7B model on my 4GB RAM Raspberry Pi 4. It's super slow at about 10 sec/token, but it looks like we can run powerful cognitive pipelines on cheap hardware. It's awesome. Thank you!

Hardware : BCM2835
Revision : c03111
Serial : 10000000d62b612e
Model : Raspberry Pi 4 Model B Rev 1.1

%Cpu0 : 71.8 us, 14.6 sy, 0.0 ni, 0.0 id, 2.9 wa, 0.0 hi, 10.7 si, 0.0 st
%Cpu1 : 77.4 us, 12.3 sy, 0.0 ni, 0.0 id, 10.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 81.0 us, 8.6 sy, 0.0 ni, 0.0 id, 10.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 77.1 us, 12.4 sy, 0.0 ni, 1.0 id, 9.5 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 3792.3 total, 76.2 free, 3622.9 used, 93.2 buff/cache
MiB Swap: 65536.0 total, 60286.5 free, 5249.5 used. 42.1 avail Mem

PID      USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND

2705518 ubuntu 20 0 5231516 3.3g 1904 R 339.6 88.3 84:16.70 main
102 root 20 0 0 0 0 S 14.2 0.0 29:54.42 kswapd0

main: seed = 1678644466
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291

main: prompt: 'The first man on the moon was '
main: number of tokens in prompt = 9
1 -> ''
1576 -> 'The'
937 -> ' first'
767 -> ' man'
373 -> ' on'
278 -> ' the'
18786 -> ' moon'
471 -> ' was'
29871 -> ' '

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

The first man on the moon was 20 years old and looked^[ lot like me. In fact, when I read about Neil Armstrong during school lessons my fa

benchmarks?

Where are the benchmarks for various hardware, e.g. Apple silicon?

Suppress output that isn't from the model

I want to integrate this into a slim chat system, so I think it would be nice to be able to have the app output only the text from the model, e.g. via a -q ("quiet") flag.

13B model issue: tensor 'tok_embeddings.weight' has wrong size in model file

I try the following with the latest master (6b2cb63)

python convert-pth-to-ggml.py models/13B/ 1
./quantize ./models/13B/ggml-model-f16.bin   ./models/13B/ggml-model-q4_0.bin 2
./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2
ls models/13B/
checklist.chk         consolidated.00.pth   consolidated.01.pth   ggml-model-f16.bin    ggml-model-f16.bin.1  ggml-model-q4_0.bin   ggml-model-q4_0.bin.1 params.json
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128
main: seed = 1678568386
llama_model_load: loading model from './models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
main: failed to load model from './models/13B/ggml-model-q4_0.bin'
llama_model_load: ⏎                                                                                                                                                                                       

What would tensor 'tok_embeddings.weight' has wrong size in model file mean?

.pth to .ggml Out of Memory

I have 16 GBs of memory (14 GB free) and running python3 convert-pth-to-ggml.py models/7B/ 1 causes an OOM error (Killed) on Linux.

Here's the dmesg message:
Out of memory: Killed process 930269 (python3) total-vm:15643332kB, anon-rss:13201980kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:26524kB oom_score_adj:0

I will be receiving my new RAM in a few days but I think this is supposed to work with 16 GB memory?

Unicode support

Thank you for creating such a great inference engine with a 10x speedup.
Please add Unicode support to display other languages properly.

[Screenshot: 2023-03-11 at 7:12:50 PM]

ggml_new_tensor_impl: not enough space in the context's memory pool

Heya! A friend showed this to me and I'm trying to get it to work myself on Windows 10. I've applied the changes as seen in #22 to get it to build (more specifically, I pulled in the new commits from etra0's fork), but the actual executable fails to run, printing this before segfaulting:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 458853944, available 454395136)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 458870468, available 454395136)

I'm trying to use 7B on an i9-13900K (and I have about 30 gigs of memory free right now), and I've verified my hashes with a friend. Any ideas? Thanks!

65B model giving incorrect output

ubuntu@ip-x:~/llama.cpp$ ./main -m ./models/65B/ggml-model-q4_0.bin \
>   -t 16 \
>   -n 1000000 \
>   -p 'The history of humanity starts with the bing bang, then '
main: seed = 1678666062
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 8192
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 41477.73 MB
llama_model_load: memory_size =  2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723

main: prompt: 'The history of humanity starts with the bing bang, then '
main: number of tokens in prompt = 16
     1 -> ''
  1576 -> 'The'
  4955 -> ' history'
   310 -> ' of'
  5199 -> ' human'
   537 -> 'ity'
  8665 -> ' starts'
   411 -> ' with'
   278 -> ' the'
  9016 -> ' bin'
 29887 -> 'g'
  9892 -> ' ban'
 29887 -> 'g'
 29892 -> ','
   769 -> ' then'
 29871 -> ' '

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


The history of humanity starts with the bing bang, then ête estudios books Ter envi политичеSM>\< envi Elizabethial inflatorêteçaitктиче quarterern ElizabethDon Universidadiot политичеire Original starb Regierung verg estudios oraz Happyendesiot physIterator Cs improvement envirequireers којеersmetric :( Depending 

Longer and infinite output

If we use -n 1000000 to have a very long output (for a story for example),
it stops generating quite fast, after around 30 lines, probably because of this line of code.

It would be nice if we could have longer outputs and also the possibility to have infinite output, stopping only on Ctrl-C.
We could maybe specify that -n 0 will trigger that infinite output mode.
That issue is a bit related to issue #23

tensor 'tok_embeddings.weight' has wrong size in model file

When trying to run the 13B model the following output is given:

main: seed = 1678543550
llama_model_load: loading model from './models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
main: failed to load model from './models/13B/ggml-model-q4_0.bin'

I have followed the commands in the readme to quantize the model, i.e.:

python3 convert-pth-to-ggml.py models/13B/ 1
./quantize ./models/13B/ggml-model-f16.bin   ./models/13B/ggml-model-q4_0.bin 2
./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2

I am using a M1 MacBook Pro. Any thoughts on how to resolve this issue?

Models missing

Are the models missing in the directory? I don't see them.

Prompt interrupted before continuation for Unicode UTF-8 emojis

I have found that when the prompt contains a Unicode UTF-8 emoji character like

Unicode Character “👍” (U+1F44D)

the prompt breaks up.

I'm reading a sample prompt from a text file:

cat prompt

Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:

Looking at the logs, I can see that the tokenizer in fact breaks at the (U+1F44D) char code:

(base)$ p=$(cat prompt); ./main -m ./models/13B/ggml-model-q4_0.bin -p $p -t 4 -n 512
main: seed = 1678656464
llama_model_load: loading model from './models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from './models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from './models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363

main: prompt: 'Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:'
main: number of tokens in prompt = 36
     1 -> ''
 27418 -> 'Tw'
  3905 -> 'ee'
 29873 -> 't'
 29901 -> ':'
   376 -> ' "'
 29902 -> 'I'
 26277 -> ' hate'
   372 -> ' it'
   746 -> ' when'
   590 -> ' my'
  9008 -> ' phone'
 16988 -> ' battery'
  2977 -> ' dies'
  1213 -> '."'
    13 -> '
'
  2008 -> 'Se'
   593 -> 'nt'
  2073 -> 'iment'
 29901 -> ':'
 12610 -> ' Neg'
  1230 -> 'ative'
    13 -> '
'
  2277 -> '##'
 29937 -> '#'
    13 -> '
'
 27418 -> 'Tw'
  3905 -> 'ee'
 29873 -> 't'
 29901 -> ':'
   376 -> ' "'
  3421 -> 'My'
  2462 -> ' day'
   756 -> ' has'
  1063 -> ' been'
 29871 -> ' '

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 10 times better than yesterday. Now I have to sleep again..."
Sentiment: Neutral
###
Twitter is not about talking; Twitter is a social network for listening and responding instantly, as the tweets of Steve Jobs demonstrate well in Figure A-2 (page ). Just be sure you can interpret the information accurately. If the sentiment isn't clearly positive or negative—as^C

resulting in a broken input prompt.

Fine Tuning

Hey!

Thank you for your amazing job!

I'm curious whether it is possible to use RLHF feedback after a response to make small incremental adjustments in a tuning process. For example, if the user decides to fine-tune after an incorrect answer, can the model spend 60 seconds in the fine-tuning phase, save a checkpoint to disk, and then move on to the next question?

Too slow on M2 MBA, 16GB RAM, 512GB SSD

Hi,

First of all, thanks for the tremendous work!

I just wanted to ask: compared to your demo, when I run the same input sentence, the speed is dramatically different. Is this because of the chipset difference between the M1 Pro and the M2, or is this a known issue that you are already trying to fix?

What is the meaning of hacked?

Hey, I was reading your README.md and I saw that your repo was hacked. I want to ask what this means, and to check whether users like me are also affected by the hacking. Or is this not something I should worry about?

Use an argument parsing library

The argument parsing for convert-ckpt-to-ggml.py is quite ad-hoc and hard to follow.

I'm thinking that something along these lines would go a long way in making the arguments easier to use and follow in the code.

import argparse

ARG_PARSER = argparse.ArgumentParser()
ARG_PARSER.add_argument("--model",
                        type=str,
                        help="Model to convert")
ARG_PARSER.add_argument("--ftype",
                        type=str,
                        choices=["f16", "f32"],
                        help="Floating point type to use")
ARG_PARSER.add_argument("--output",
                        type=str,
                        help="Where to write the converted model")
ARGS = ARG_PARSER.parse_args()

Reproducibility information

The seed for the website example is included, but using the same parameters doesn't manage to reproduce the example output. Listing which requirements influence reproducibility would help in verifying installs.

The failed test is with x86_64 (gcc or clang, no difference), CUDA 12.1, pytorch 1.13.1, numpy 1.23.5, sentencepiece 0.1.97 and Python 3.10.6 on Linux.

Segmentation Fault Error "not enough space in the context's memory pool"

This prompt with the 65B model on an M1 Max 64GB results in a segmentation fault. Works with 30B model. Are there problems with longer prompts? Related to #12

./main --model ./models/65B/ggml-model-q4_0.bin --prompt "You are a question answering bot that is able to answer questions about the world. You are extremely smart, knowledgeable, capable, and helpful. You always give complete, accurate, and very detailed responses to questions, and never stop a response in mid-sentence or mid-thought. You answer questions in the following format:

Question: What’s the history of bullfighting in Spain?

Answer: Bullfighting, also known as "tauromachia," has a long and storied history in Spain, with roots that can be traced back to ancient civilizations. The sport is believed to have originated in 7th-century BCE Iberian Peninsula as a form of animal worship, and it evolved over time to become a sport and form of entertainment. Bullfighting as it is known today became popular in Spain in the 17th and 18th centuries. During this time, the sport was heavily influenced by the traditions of medieval jousts and was performed by nobles and other members of the upper classes. Over time, bullfighting became more democratized and was performed by people from all walks of life. Bullfighting reached the height of its popularity in the 19th and early 20th centuries and was considered a national symbol of Spain. However, in recent decades, bullfighting has faced increasing opposition from animal rights activists, and its popularity has declined. Some regions of Spain have banned bullfighting, while others continue to hold bullfights as a cherished tradition. Despite its declining popularity, bullfighting remains an important part of Spanish culture and history, and it continues to be performed in many parts of the country to this day.

Now complete the following questions:

Question: What happened to the field of cybernetics in the 1970s?

Answer: "

Results in

...
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


You are a question answering bot that is able to answer questions about the world. You are extremely smart, knowledgeable, capable, and helpful. You always give complete, accurate, and very detailed responses to questions, and never stop a response in mid-sentence or mid-thought. You answer questions in the following format:

Question: What’s the history of bullfighting in Spain?

Answer: Bullfighting, also known as tauromachia, has a long and storied history in Spain, with roots that can be traced back to ancient civilizations. The sport is believed to have originated in 7th-century BCE Iberian Peninsula as a form of animal worship, and it evolved over time to become a sport and form of entertainment. Bullfighting as it is known today became popular in Spain in the 17th and 18th centuries. During this time, the sport was heavily influenced by the traditions of medieval jousts and was performed by nobles and other members of the upper classes. Over time, bullfighting became more democratized and was performed by people from all walks of life. Bullfighting reached the height of its popularity in the 19th and early 20th centuries and was considered a national symbol of Spain. However, in recent decades, bullfighting has faced increasing opposition from animal rights activists, and its popularity has declined. Some regions of Spain have banned bullfighting, while others continue to hold bullfights as a cherished tradition. Despite its declining popularity, bullfighting remainsggml_new_tensor_impl: not enough space in the context's memory pool (needed 701660720, available 700585498)
zsh: segmentation fault  ./main --model ./models/65B/ggml-model-q4_0.bin --prompt

Stop keywords

It'd be useful if there were a way to define tokens that would cause the output to stop prematurely (e.g. for an assistant-style interaction where messages are prefixed with "Assistant: " and "Human: ", you'd set "Human: " as a stop word, so that you could stop the model from continuing on and having a conversation with itself).
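
A minimal Python sketch of what stop-string handling could look like in a wrapper's streaming loop (the generate_next_piece function here is hypothetical and stands in for whatever produces the next chunk of text):

def truncate_at_stop(text, stop_strings=("Human: ",)):
    # Cut the generation at the first occurrence of any stop string.
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut], cut != len(text)

def stream_until_stop(generate_next_piece, stop_strings=("Human: ",), max_pieces=512):
    # Hypothetical streaming loop: append pieces until a stop string appears.
    out = ""
    for _ in range(max_pieces):
        out += generate_next_piece()
        out, hit = truncate_at_stop(out, stop_strings)
        if hit:
            break
    return out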

commit `96ea727` breaks compilation on Windows

Sadly, Windows terminal support of ANSI colors and signals is simply... non-existent, and this PR in particular adds a ton of things that are not supported by this OS.

I don't know if filling the entire llama.cpp with #ifdefs would be ideal... or to rewrite those changes to encapsulate them better to make it less painful, but either way, it is a lot of work.

Reverting that commit and doing a couple of merge fixes makes it work.

PS: Maybe it would be good to add a CI action that compiles the software on all three platforms? I'm also curious about the willingness to support Windows, since it's the special kid among the Unix-like guys :)

Windows 64-bit, Microsoft Visual Studio - it works like a charm after those fixes!

First of all, tremendous work Georgi! I managed to run your project with a few small adjustments on:

  • Intel(R) Core(TM) i7-10700T CPU @ 2.00GHz / 16GB as x64 bit app, it takes around 5GB of RAM.


Here is the list of those small fixes:

  • main.cpp: added ggml_time_init() at start of main (division by zero otherwise)
  • quantize.cpp: same as above at start of main (division by zero otherwise)
  • ggml.c: #define QK 32 moved to dedicated define.h (should not be in .c)
  • ggml.c: replace fopen with fopen_s (VS secure error message)
  • ggml.c: below changes due to 'expression must be a pointer or complete object type':
  1. 2x (uint8_t*)(y to: ((uint8_t*)y
  2. 4x (const uint8_t*)(x to ((const uint8_t*)x
  3. 2x (const uint8_t*)(y to ((const uint8_t*)y
  • quantize.cpp: removed qk in ggml_quantize_q4_0 & ggml_quantize_q4_1 calls
  • utils.cpp: use of QK value instead of parameter value (VS raise error for: uint8_t pp[qk / 2];)

It would be really great if you could incorporate those small fixes.

GPTQ Quantization (3-bit and 4-bit)

4-bit quantization tends to come at a cost of output quality losses. GPTQ quantization is a state of the art quantization method which results in negligible output performance loss when compared with the prior state of the art in 4-bit (and 3-bit/2-bit) quantization methods and even when compared with uncompressed fp16 inference.

image

It would be good to see benchmarks on the existing implementation. It's possible there is substantial quality loss from the 4-bit quantization. It's also possible that it isn't very substantial. We'd have to see benchmarks to know.

The related project GPTQ-for-LLaMA has some benchmarks available for their implementation.

References:
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
The case for 4-bit precision: k-bit Inference Scaling Laws

Related work:
https://github.com/qwopqwop200/GPTQ-for-LLaMA/

Repetition penalty

Hello,

Thank you for this implementation, it is nice being able to experiment with things, even without GPUs at hand.

Would you mind implementing the repetition penalty? It seems to produce better/more consistent results...
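
For reference, a commonly used formulation is the CTRL-style repetition penalty applied to the logits before sampling; a minimal NumPy sketch of that general technique (not llama.cpp's eventual implementation) looks like this:

import numpy as np

def apply_repetition_penalty(logits, recent_token_ids, penalty=1.3):
    # CTRL-style rule: divide positive logits by the penalty and multiply
    # negative logits by it, so recently seen tokens become less likely.
    logits = np.array(logits, dtype=np.float32, copy=True)
    for tok in set(recent_token_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits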

Illegal hardware instruction in quantize step

  • Ran into this error on a Macbook Pro M1
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
[1]    18452 illegal hardware instruction  ./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
  • What I've tried:

    • Run main on the model-f16, still have the same error
    • Convert the 13B model, still same error in quantize step
  • Env:
    Darwin Tungs-MacBook-Pro.local 21.6.0 Darwin Kernel Version 21.6.0: Sat Jun 18 17:07:22 PDT 2022; root:xnu-8020.140.41~1/RELEASE_ARM64_T6000 x86_64

error: 'CLOCK_MONOTONIC' undeclared

The initial make fails with CLOCK_MONOTONIC undeclared

I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  unknown
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:  
I CC:       cc (Alpine 12.2.1_git20220924-r9) 12.2.1 20220924
I CXX:      g++ (Alpine 12.2.1_git20220924-r9) 12.2.1 20220924

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3   -c ggml.c -o ggml.o
ggml.c: In function 'ggml_time_ms':
ggml.c:309:5: warning: implicit declaration of function 'clock_gettime' [-Wimplicit-function-declaration]
  309 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |     ^~~~~~~~~~~~~
ggml.c:309:19: error: 'CLOCK_MONOTONIC' undeclared (first use in this function)
  309 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |                   ^~~~~~~~~~~~~~~
ggml.c:309:19: note: each undeclared identifier is reported only once for each function it appears in
ggml.c: In function 'ggml_time_us':
ggml.c:315:19: error: 'CLOCK_MONOTONIC' undeclared (first use in this function)
  315 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |                   ^~~~~~~~~~~~~~~
make: *** [Makefile:182: ggml.o] Error 1

This can be solved by adding -D_POSIX_C_SOURCE=199309L to the C{,XX}FLAGS in the Makefile. See this Stackoverflow question: https://stackoverflow.com/questions/29666937/error-clock-monotonic-undeclared-first-use-in-this-function

Merging tensors of larger models

Currently, only LLaMA-7B is supported since I haven't figured out how to merge the tensors of the bigger models. However, in theory, you should be able to run 65B on a 64GB MacBook

It shouldn't be hard to merge tensors with my https://github.com/kir-gadjello/zipslicer library, but it's pure Python! If you want to keep the project pure C++ you might want to write a standalone gist script that uses zipslicer to unpack weight shards into binary files.

faster performance on older machines

On machines with smaller memory and slower processors, it can be useful to reduce the overall number of threads running. For instance, on my MacBook Pro Intel i5 16GB machine, 4 threads is much faster than 8. Try:

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Author-contribution statements and acknowledgements in research papers should state clearly and specifically whether, and to what extent, the authors used AI technologies such as ChatGPT " -t 4 -n 512
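
A quick way to find the sweet spot is to time a short fixed prompt at several thread counts; here is a rough Python sketch, assuming the ./main binary and a model path like the one above:

# Rough timing sweep over thread counts (assumes ./main and the model path above).
import subprocess, time

MODEL = "./models/7B/ggml-model-q4_0.bin"
PROMPT = "Building a website can be done in 10 simple steps:"

for threads in (2, 4, 6, 8):
    start = time.time()
    subprocess.run(
        ["./main", "-m", MODEL, "-p", PROMPT, "-t", str(threads), "-n", "64"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True)
    print(f"{threads} threads: {time.time() - start:.1f} s")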

llama.exe doesn't handle relative file paths in Windows correctly

Please include the ggml-model-q4_0.bin model to actually run the code:

% make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main  -framework Accelerate
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize  -framework Accelerate
./main -h
usage: ./main [options]

options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  RNG seed (default: -1)
  -t N, --threads N     number of threads to use during computation (default: 4)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -n N, --n_predict N   number of tokens to predict (default: 128)
  --top_k N             top-k sampling (default: 40)
  --top_p N             top-p sampling (default: 0.9)
  --repeat_last_n N     last n tokens to consider for penalize (default: 64)
  --repeat_penalty N    penalize repeat sequence of tokens (default: 1.3)
  --temp N              temperature (default: 0.8)
  -b N, --batch_size N  batch size for prompt processing (default: 8)
  -m FNAME, --model FNAME
                        model path (default: models/llama-7B/ggml-model.bin)

main: seed = 1678619388
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: failed to open './models/7B/ggml-model-q4_0.bin'
main: failed to load model from './models/7B/ggml-model-q4_0.bin'

My pre-signed URL to download the model weights was broken.
