
starcoder.cpp's Introduction

💫 StarCoder in C++

This is a C++ example running 💫 StarCoder inference using the ggml library.

The program can run on the CPU - no video card is required.

The example supports the following 💫 StarCoder models:

  • bigcode/starcoder
  • bigcode/gpt_bigcode-santacoder aka the smol StarCoder
  • HuggingFaceH4/starchat-beta - the coding assistant based on StarCoderPlus

Sample performance on MacBook M1 Pro:

TODO

Sample output:

$ ./bin/starcoder -h
usage: ./bin/starcoder [options]

options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  RNG seed (default: -1)
  -t N, --threads N     number of threads to use during computation (default: 8)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -n N, --n_predict N   number of tokens to predict (default: 200)
  --top_k N             top-k sampling (default: 40)
  --top_p N             top-p sampling (default: 0.9)
  --temp N              temperature (default: 1.0)
  -b N, --batch_size N  batch size for prompt processing (default: 8)
  -m FNAME, --model FNAME
                        model path (default: models/starcoder-117M/ggml-model.bin)

$ ./bin/starcoder -m ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin -p "def fibonnaci(" -t 4 --top_k 0 --top_p 0.95 --temp 0.2      
main: seed = 1683881276
starcoder_model_load: loading model from '../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx   = 2048
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 3
starcoder_model_load: ggml ctx size = 1794.90 MB
starcoder_model_load: memory size =   768.00 MB, n_mem = 49152
starcoder_model_load: model size  =  1026.83 MB
main: prompt: 'def fibonnaci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 2658 64 2819 7 

def fibonnaci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fibo(10))

main: mem per token =  9597928 bytes
main:     load time =   480.43 ms
main:   sample time =    26.21 ms
main:  predict time =  3987.95 ms / 19.36 ms per token
main:    total time =  4580.56 ms

Quick start

git clone https://github.com/bigcode-project/starcoder.cpp
cd starcoder.cpp

# Convert HF model to ggml
python convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder

# Build ggml libraries
make

# quantize the model
./quantize models/bigcode/gpt_bigcode-santacoder-ggml.bin models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3

# run inference
./main -m models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin -p "def fibonnaci(" --top_k 0 --top_p 0.95 --temp 0.2

Downloading and converting the original models (πŸ’« StarCoder)

You can download the original model and convert it to ggml format using the script convert-hf-to-ggml.py:

# Convert HF model to ggml
python convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder

This conversion requires that you have Python and Transformers installed on your computer.

Quantizing the models

You can also try to quantize the ggml models via 4-bit integer quantization.

# quantize the model
./quantize models/bigcode/gpt_bigcode-santacoder-ggml.bin models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3

Model                          | Original size | Quantized size | Quantization type
bigcode/gpt_bigcode-santacoder | 5396.45 MB    | 1026.83 MB     | 4-bit integer (q4_1)
bigcode/starcoder              | 71628.23 MB   | 13596.23 MB    | 4-bit integer (q4_1)
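
For intuition on where the size reduction comes from: q4_1 stores each block of 32 weights as a per-block minimum and step size plus 32 four-bit indices, so most weights take roughly half a byte instead of two or four. The sketch below illustrates the idea only; the exact block layout, constants, and packing used by ggml's q4_1 routines may differ.

// Illustrative 4-bit (q4_1-style) block quantization: per-block min + step + 4-bit indices.
// This is a sketch for intuition, not the exact ggml implementation.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int QK = 32; // weights per block (assumed to match ggml's q4_1 block size)

struct block_q4_1 {
    float   m;         // block minimum
    float   d;         // step size = (max - min) / 15
    uint8_t qs[QK/2];  // 32 x 4-bit indices, two per byte
};

static void quantize_row_q4_1_sketch(const float *x, block_q4_1 *y, int n) {
    for (int i = 0; i < n / QK; ++i) {
        float mn = x[i*QK], mx = x[i*QK];
        for (int j = 1; j < QK; ++j) {
            mn = std::fmin(mn, x[i*QK + j]);
            mx = std::fmax(mx, x[i*QK + j]);
        }
        const float d = (mx - mn) / 15.0f;
        y[i].m = mn;
        y[i].d = d;
        const float id = d > 0.0f ? 1.0f/d : 0.0f;
        for (int j = 0; j < QK; j += 2) {
            // map each weight to an index in [0, 15]; dequantization is m + q*d
            const uint8_t q0 = (uint8_t) std::fmin(15.0f, std::round((x[i*QK + j    ] - mn) * id));
            const uint8_t q1 = (uint8_t) std::fmin(15.0f, std::round((x[i*QK + j + 1] - mn) * id));
            y[i].qs[j/2] = q0 | (q1 << 4);
        }
    }
}

int main() {
    std::vector<float> w(QK);
    for (int j = 0; j < QK; ++j) w[j] = 0.01f * j - 0.15f; // dummy weights
    block_q4_1 blk;
    quantize_row_q4_1_sketch(w.data(), &blk, QK);
    // 32 f32 weights (128 bytes) -> 2 floats + 16 bytes = 24 bytes, i.e. ~6 bits per weight
    std::printf("block uses %zu bytes for %d weights\n", sizeof(blk), QK);
    return 0;
}

At roughly 6 bits per weight (including the per-block floats) this lines up with the ~5x reduction shown for SantaCoder above; parts of the model stay in higher precision, so the ratio is not exact.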

iOS App

The repo includes a proof-of-concept iOS app in the StarCoderApp directory. You need to provide the converted (and possibly quantized) model weights, placing a file called bigcode_ggml_model.bin.bin inside that folder. This is what it looks like on an iPhone:

[Screenshot: the StarCoder iOS app running on an iPhone]

starcoder.cpp's People

Contributors

lvwerra, noobmldude, nouamanetazi, pcuenca


starcoder.cpp's Issues

Interactive chat and gpu offload support

Currently, llama.cpp allows us to pass -i -ins for an interactive chat session using the Alpaca template, and it also supports GPU offloading via CUDA or OpenCL, which would massively improve inference times. Will this be supported anytime soon? The only thing stopping StarCoder from taking off is the huge barrier to entry that slow inference times present. I am very impressed with the model based on testing via the web interface (StarChat).

Attempts at converting fine-tuned models fail without explanation

I attempted to convert StarChat as well as my own fine-tune of SantaCoder to create more conversational models, and neither works. With SantaCoder the result is the following error:

[process exited with code 7 (0x00000007)]

My guess was that my system didn't have enough resources, so I tried running it in Colab using my Pro plan. It printed ^C and halted, something I've never seen before.

I haven't been able to replicate the conversion steps for the base models either...

make fails on Ubuntu 18.04

When I run the make command on Ubuntu 18.04 (with gcc version 7.5.0), I get the following error:

ggml.c:534:27: error: incompatible types when initializing type ‘__m256i {aka const __vector(4) long long int}’ using type ‘int’
Makefile:186: recipe for target 'ggml.o' failed

feature request: interactive mode

Hi all, thank you for your great work.

I have a feature request: it would be interesting to implement the interactive mode (-i option) that is available in llama.cpp, in order to run the starchat-alpha fine-tuned version of the model. I have tested it with the prompt option and it works properly, but it would be more useful with an interactive mode.

Thanks in advance!

Best,

Jordi

convert-hf-to-ggml.py CUDA out of memory

I've tried to convert a model from HF to GGML format:

python3 convert-hf-to-ggml.py ../starcoderbase_int8

and got an error:

Loading model:  ../starcoderbase_int8
Loading checkpoint shards:   ...
Traceback (most recent call last):
  File "/home/alex/starcoder/starcoder.cpp/convert-hf-to-ggml.py", line 58, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2901, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3258, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 725, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(model, param_name, param_device, value=param, fp16_statistics=fp16_statistics)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/bitsandbytes.py", line 109, in set_module_quantized_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 10.90 GiB total capacity; 9.21 GiB already allocated; 568.69 MiB free; 9.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Next, I tried to force it to run on the CPU:

export CUDA_VISIBLE_DEVICES=""
python3 convert-hf-to-ggml.py ../starcoderbase_int8

Then, I got this:

Loading model:  ../starcoderbase_int8
Traceback (most recent call last):
  File "/home/alex/starcoder/starcoder.cpp/convert-hf-to-ggml.py", line 58, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2370, in from_pretrained
    raise RuntimeError("No GPU found. A GPU is needed for quantization.")
RuntimeError: No GPU found. A GPU is needed for quantization.

For me, the main reason to go with the GGML implementation is that I can't fit the model on my GPU. I thought I could perform both the conversion and the inference using only the CPU and system RAM. Am I doing something specific wrong, or have I got it wrong in general?

Licence missing

Hi,

Thanks for your awesome code. Can you add a Licence file?
Alternatively, a note in the Readme, or a tag or reference to other bigcode projects?

Unable to build on Windows

Hello, I am trying to build this on Windows using CMake (Visual Studio / VSCode).

Line 15 of main.cpp:

cannot open source file "unistd.h"

If I comment it out and compile again, I get a few more:

Lines 207 and 441 of main.cpp:

use of designated initializers requires at least '/std:c++20'

Line 745 of main.cpp:

identifier "isatty" is undefined
identifier "STDIN_FILENO" is undefined

No useful output

After installing the code and models successfully, I ran the program per the directions in the README. However, the output is useless from a code-development perspective and does not at all match the README's reported output. Specifically:

./main -m models/bigcode/gpt_bigcode-santacoder-ggml.bin -p "def fibonnaci(" --top_k 0 --top_p 0.95 --temp 0.2
main: seed = 1693067463
starcoder_model_load: loading model from 'models/bigcode/gpt_bigcode-santacoder-ggml.bin'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx = 2048
starcoder_model_load: n_embd = 2048
starcoder_model_load: n_head = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype = 1
starcoder_model_load: qntvr = 0
starcoder_model_load: ggml ctx size = 3475.60 MB
starcoder_model_load: memory size = 768.00 MB, n_mem = 49152
starcoder_model_load: model size = 2707.45 MB
main: prompt: 'def fibonnaci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 2658 64 2819 7

def fibonnaci(!

main: mem per token = 320504 bytes
main: load time = 777.92 ms
main: sample time = 0.18 ms
main: predict time = 84.90 ms / 12.13 ms per token
main: total time = 924.55 ms

LSP?

Would it be possible to adapt this to the Language Server Protocol? That would enable integrations with tons of editors/IDEs.

context memory pool

When running the StarCoder model quantized using "q5_1" with a medium-sized context (3500 tokens), I run into this error:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 412241472, available 411791504)
Segmentation fault (core dumped)

How can I increase the context memory pool size?
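
For context, every tensor in a ggml graph is carved out of one pre-allocated buffer, so this assert means the buffer backing the compute context is too small for a 3500-token graph, not that the machine is out of RAM. The usual remedy is to enlarge the eval buffer handed to ggml_init (in this repo that would presumably be the buffer set up inside starcoder_eval, though the exact variable name is an assumption). A minimal, self-contained illustration of reserving a bigger pool:

// Minimal sketch: reserve a larger memory pool for a ggml compute context.
// Assumes ggml.h from this source tree; the size is illustrative.
#include <cstdio>
#include "ggml.h"

int main() {
    const size_t mem_size = 1024u*1024*1024; // 1 GiB instead of a few hundred MB

    struct ggml_init_params params = {
        /*.mem_size   =*/ mem_size,
        /*.mem_buffer =*/ NULL,   // let ggml allocate the pool itself
        /*.no_alloc   =*/ false,
    };

    struct ggml_context *ctx = ggml_init(params);
    if (ctx == NULL) {
        std::fprintf(stderr, "ggml_init failed to allocate %zu bytes\n", mem_size);
        return 1;
    }

    // ... build and evaluate the graph here, as starcoder_eval() does ...

    ggml_free(ctx);
    return 0;
}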

Inference with Starcoder model finetuned by lora

Hi, can you give some advice on how to run inference with a fine-tuned StarCoder model using this code? Since LoRA fine-tuning changes some of the model's layers, some of the code in starcoder.cpp presumably needs to change. How can I use this code for inference with my fine-tuned StarCoder model?

SantaCoder works but never seems to generate <|end|>

Here's what happens when I follow the instructions in the README:

$ ./main -m ./gpt_bigcode-santacoder-ggml-q4_1.bin -p "def fibonnaci(" --top_k 1 --temp 0.2 --top_p 0.95 --seed 1683881276
main: seed = 1683881276
starcoder_model_load: loading model from './gpt_bigcode-santacoder-ggml-q4_1.bin'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx   = 2048
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 1003
starcoder_model_load: qntvr   = 1
starcoder_model_load: ggml ctx size = 1794.97 MB
starcoder_model_load: memory size =   768.00 MB, n_mem = 49152
starcoder_model_load: model size  =  1026.83 MB
main: prompt: 'def fibonnaci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 2658 64 2819 7

def fibonnaci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)


print(fibo(10))
print(fibo(100))
print(fibo(1000))
print(fibo(10000))
print(fibo(100000))
print(fibo(1000000))
print(fibo(10000000))
print(fibo(100000000))
print(fibo(1000000000))
print(fibo(10000000000))
print(fibo(100000000000

main: mem per token =   314360 bytes
main:     load time =   404.02 ms
main:   sample time =    72.67 ms
main:  predict time = 10409.67 ms / 50.53 ms per token
main:    total time = 10991.78 ms

I have the same issue with example/starcoder from the ggml repo: it just keeps generating and won't stop until it hits the generation limit. Running the original model through transformers with the same prompt produces the same input token IDs, but the outputs frustratingly diverge.

python3 convert-hf-to-ggml.py bigcode/starcoderplus is failing with error: RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

% python3 convert-hf-to-ggml.py models/starcoderplus
Loading model:  models/starcoderplus
Loading checkpoint shards:  29%|███████████▍                              | 2/7 [00:45<01:54, 22.80s/it]
Traceback (most recent call last):
  File "/starcoder.cpp/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 460, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/starcoder.cpp/venv/lib/python3.11/site-packages/torch/serialization.py", line 797, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/starcoder.cpp/venv/lib/python3.11/site-packages/torch/serialization.py", line 283, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/starcoder.cpp/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 464, in load_state_dict
    if f.read(7) == "version":
       ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 128: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/starcoder.cpp/convert-hf-to-ggml.py", line 58, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/starcoder.cpp/venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/starcoder.cpp/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/starcoder.cpp/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3246, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/starcoder.cpp/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 476, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'models/starcoderplus/pytorch_model-00003-of-00007.bin' at 'models/starcoderplus/pytorch_model-00003-of-00007.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

Installed python packages:

% pip3 list                                         
Package            Version
------------------ ---------
accelerate         0.21.0
certifi            2023.7.22
charset-normalizer 3.2.0
filelock           3.12.2
fsspec             2023.6.0
huggingface        0.0.1
huggingface-hub    0.16.4
idna               3.4
Jinja2             3.1.2
MarkupSafe         2.1.3
mpmath             1.3.0
networkx           3.1
numpy              1.25.1
packaging          23.1
pip                23.0.1
psutil             5.9.5
PyYAML             6.0.1
regex              2023.6.3
requests           2.31.0
safetensors        0.3.1
setuptools         67.6.1
starcoder          0.0.2
sympy              1.12
tokenizers         0.13.3
torch              2.0.1
tqdm               4.65.0
transformers       4.31.0
typing_extensions  4.7.1
urllib3            2.0.4

Trying to convert starpii to ggml but seeing the following error, is there anything I need to check further?

Loading model: /Users/abc/dev/starpii
If you want to use BertLMHeadModel as a standalone, add is_decoder=True.
Some weights of the model checkpoint at /Users/abc/dev/starpii were not used when initializing BertLMHeadModel: ['classifier.bias', 'classifier.weight']

  • This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Model loaded: /Users/abc/dev/starpii
    {'return_dict': True, 'output_hidden_states': False, 'output_attentions': False, 'torchscript': False, 'torch_dtype': 'float32', 'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {}, 'tie_word_embeddings': True, 'is_encoder_decoder': False, 'is_decoder': False, 'cross_attention_hidden_size': None, 'add_cross_attention': False, 'tie_encoder_decoder': False, 'max_length': 20, 'min_length': 0, 'do_sample': False, 'early_stopping': False, 'num_beams': 1, 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0, 'typical_p': 1.0, 'repetition_penalty': 1.0, 'length_penalty': 1.0, 'no_repeat_ngram_size': 0, 'encoder_no_repeat_ngram_size': 0, 'bad_words_ids': None, 'num_return_sequences': 1, 'chunk_size_feed_forward': 0, 'output_scores': False, 'return_dict_in_generate': False, 'forced_bos_token_id': None, 'forced_eos_token_id': None, 'remove_invalid_values': False, 'exponential_decay_length_penalty': None, 'suppress_tokens': None, 'begin_suppress_tokens': None, 'architectures': ['BertForTokenClassification'], 'finetuning_task': None, 'id2label': {0: 'O', 1: 'B-AMBIGUOUS', 2: 'I-AMBIGUOUS', 3: 'B-EMAIL', 4: 'I-EMAIL', 5: 'B-IP_ADDRESS', 6: 'I-IP_ADDRESS', 7: 'B-KEY', 8: 'I-KEY', 9: 'B-NAME', 10: 'I-NAME', 11: 'B-PASSWORD', 12: 'I-PASSWORD', 13: 'B-USERNAME', 14: 'I-USERNAME'}, 'label2id': {'B-AMBIGUOUS': 1, 'B-EMAIL': 3, 'B-IP_ADDRESS': 5, 'B-KEY': 7, 'B-NAME': 9, 'B-PASSWORD': 11, 'B-USERNAME': 13, 'I-AMBIGUOUS': 2, 'I-EMAIL': 4, 'I-IP_ADDRESS': 6, 'I-KEY': 8, 'I-NAME': 10, 'I-PASSWORD': 12, 'I-USERNAME': 14, 'O': 0}, 'tokenizer_class': None, 'prefix': None, 'bos_token_id': None, 'pad_token_id': 49152, 'eos_token_id': None, 'sep_token_id': None, 'decoder_start_token_id': None, 'task_specific_params': None, 'problem_type': None, '_name_or_path': '/Users/abc/dev/starpii', 'transformers_version': '4.28.1', 'model_type': 'bert', 'vocab_size': 49153, 'hidden_size': 768, 'num_hidden_layers': 12, 'num_attention_heads': 12, 'hidden_act': 'gelu', 'intermediate_size': 3072, 'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 1024, 'type_vocab_size': 2, 'initializer_range': 0.02, 'layer_norm_eps': 1e-12, 'position_embedding_type': 'absolute', 'use_cache': True, 'classifier_dropout': None}
    Saving ggml model to: models//Users/abc/dev/starpii-ggml.bin
    Traceback (most recent call last):
    File "/Users/abc/dev/starcoder.cpp/convert-hf-to-ggml.py", line 78, in
    fout.write(struct.pack("i", hparams["n_positions"])) # n_ctx
    KeyError: 'n_positions'

can't run the starcoder-ggml.bin

./main -m models/bigcode/starcoder-ggml.bin -p "def fibonnaci(" --top_k 0 --top_p 0.95 --temp 0.2

main: seed = 1685609262
starcoder_model_load: loading model from 'models/bigcode/starcoder-ggml.bin'
starcoder_model_load: n_vocab = 49152
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 1
starcoder_model_load: qntvr   = 0
starcoder_model_load: ggml ctx size = 51276.47 MB
GGML_ASSERT: ggml.c:3874: ctx->mem_buffer != NULL
Aborted (core dumped)

Hi, my machine has 38 GB of memory and can run starcoder-ggml-q4_1.bin, but cannot run the non-quantized starcoder-ggml.bin. Is this because there is not enough memory?

Doesn't seem to work for me.

I'm getting the following output when trying to run:

main: seed = 1685215323
starcoder_model_load: loading model from 'models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx   = 2048
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 1003
starcoder_model_load: qntvr   = 1
starcoder_model_load: ggml ctx size = 1794.97 MB
starcoder_model_load: memory size =   768.00 MB, n_mem = 49152
starcoder_model_load: model size  =  1026.83 MB
main: prompt: 'def fibonnaci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 2658 64 2819 7

def fibonnaci(!

main: mem per token =   314360 bytes
main:     load time =   344.99 ms
main:   sample time =     0.19 ms
main:  predict time =    62.08 ms / 8.87 ms per token
main:    total time =   446.06 ms

Prevent starchat from generating nonsense after outputting <|system|> <|end|> <|user|>

When using starchat, the model will often start generating nonsense (in Spanish) after printing out the sequence:

<|system|> <|end|> <|user|>

I added a rudimentary fix to stop generating new tokens in case starchat is used after outputting <|system|> <|end|> <|user|>:

        // check if model is santacoder
        if (model.hparams.n_layer <= 30 && embd.back() == 49152) {
            break;
        }
        // check if model is starcoder
        else if (embd.back() == 0) { //TODO: this is only for starcoder
            break;
        }
        // starchat since only these 3 models are supported atm
        else{
            // break to prevent starchat from talking gibberish
            if (output.find("<|system|>\n<|end|>\n<|user|>") != std::string::npos) {
                break;
            }
        }

Would that be a desirable PR? And also, what is the correct way to detect whether the model is indeed starcoder, instead of using the default?

Models fail to load

I'm getting the following error in the final step of the quickstart:

unknown tensor 'transformer.h.0.attn.q_attn.weight' in model file

Input line:
./main -m models/bigcode/gpt_bigcode-santacoder-ggml.bin -p "def fibonnaci(" --top_k 0 --top_p 0.95 --temp 0.2

Output:

main: seed = 1687068338
starcoder_model_load: loading model from 'models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx   = 2048
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 1003
starcoder_model_load: qntvr   = 1
starcoder_model_load: ggml ctx size = 1794.97 MB
starcoder_model_load: memory size =   768.00 MB, n_mem = 49152
starcoder_model_load: unknown tensor 'transformer.h.0.attn.q_attn.weight' in model file
main: failed to load model from 'models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin'

Notable differences from the sample output:

  • starcoder_model_load: ftype = 1 in my output vs starcoder_model_load: ftype = 3
    (quantized models were produced with ./quantize models/bigcode/gpt_bigcode-santacoder-ggml.bin models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3; the non-quantized model fails with a similar error)
  • starcoder_model_load: qntvr = 1 in my output vs. no info on qntvr in the sample output

Other notes:

  • this is running on a 2019 Intel MBP, not an M1
  • conda list is reproduced below in case I'm somehow missing a dependency
# Name                    Version                   Build  Channel
accelerate                0.20.3             pyhd8ed1ab_0    conda-forge
blas                      1.0                         mkl
brotlipy                  0.7.0           py310hca72f7f_1002
bzip2                     1.0.8                h1de35cc_0
ca-certificates           2023.5.7             h8857fd0_0    conda-forge
certifi                   2023.5.7           pyhd8ed1ab_0    conda-forge
cffi                      1.15.1          py310h6c40b1e_3
charset-normalizer        2.0.4              pyhd3eb1b0_0
click                     8.0.4           py310hecd8cb5_0
cryptography              39.0.1          py310hf6deb26_2
dataclasses               0.8                pyh6d0b6a4_7
filelock                  3.9.0           py310hecd8cb5_0
future                    0.18.3          py310hecd8cb5_0
huggingface_hub           0.15.1                     py_0    huggingface
idna                      3.4             py310hecd8cb5_0
importlib-metadata        6.0.0           py310hecd8cb5_0
importlib_metadata        6.0.0                hd3eb1b0_0
intel-openmp              2023.1.0         ha357a0b_43547
joblib                    1.2.0           py310hecd8cb5_0
libcxx                    14.0.6               h9765a3e_0
libffi                    3.4.4                hecd8cb5_0
libgfortran               5.0.0           11_3_0_hecd8cb5_28
libgfortran5              11.3.0              h9dfd629_28
libopenblas               0.3.21               h54e7dc3_0
libprotobuf               3.20.3               hfff2838_0
libuv                     1.44.2               h6c40b1e_0
llvm-openmp               14.0.6               h0dcd299_0
mkl                       2023.1.0         h59209a4_43558
mkl-service               2.4.0           py310h6c40b1e_1
mkl_fft                   1.3.6           py310h3ea8b11_1
mkl_random                1.2.2           py310h3ea8b11_1
ncurses                   6.4                  hcec6c5f_0
ninja                     1.10.2               hecd8cb5_5
ninja-base                1.10.2               haf03e11_5
numpy                     1.24.3          py310h827a554_1
numpy-base                1.24.3          py310ha186be2_1
openssl                   3.1.1                h8a1eda9_1    conda-forge
packaging                 23.0            py310hecd8cb5_0
pip                       23.1.2          py310hecd8cb5_0
protobuf                  3.20.3          py310hcec6c5f_0
psutil                    5.9.5           py310h90acd4f_0    conda-forge
pycparser                 2.21               pyhd3eb1b0_0
pyopenssl                 23.0.0          py310hecd8cb5_0
pysocks                   1.7.1           py310hecd8cb5_0
python                    3.10.11              h5ee71fb_3
python_abi                3.10                    2_cp310    conda-forge
pytorch                   1.13.1          cpu_py310h9e40b02_0
pyyaml                    6.0             py310h6c40b1e_1
readline                  8.2                  hca72f7f_0
regex                     2022.7.9        py310hca72f7f_0
requests                  2.29.0          py310hecd8cb5_0
sacremoses                master                     py_0    huggingface
setuptools                67.8.0          py310hecd8cb5_0
six                       1.16.0             pyhd3eb1b0_1
sqlite                    3.41.2               h6c40b1e_0
tbb                       2021.8.0             ha357a0b_0
tk                        8.6.12               h5d9f67b_0
tokenizers                0.11.4          py310h8776b5c_1
tqdm                      4.65.0          py310h20db666_0
transformers              4.28.1                     py_0    huggingface
typing-extensions         4.6.3           py310hecd8cb5_0
typing_extensions         4.6.3           py310hecd8cb5_0
tzdata                    2023c                h04d1e81_0
urllib3                   1.26.16         py310hecd8cb5_0
wheel                     0.38.4          py310hecd8cb5_0
xz                        5.4.2                h6c40b1e_0
yaml                      0.2.5                haf1e3a3_0
zipp                      3.11.0          py310hecd8cb5_0
zlib                      1.2.13               h4dc903c_0

Why are models quantized by starcoder.cpp and by the ggml example different?

It looks like the quantization, inference, and model format differ between starcoder.cpp and upstream ggml. Why? And why are the models incompatible? For example, if I try to run inference in starcoder.cpp on a model quantized by the ggml example code, I get a segmentation fault. Maybe the code needs to be updated to be compatible with upstream?
