iamlemec / bert.cpp
This project forked from xyzhang626/embeddings.cpp
GGML implementation of BERT model with Python bindings and quantization.
License: MIT License
Thank you for your excellent work.
bge-m3 is distinguished by its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
When I run the following command:
python convert-to-ggml.py './bge-m3' f16
Traceback (most recent call last)
FileNotFoundError: [Errno 2] No such file or directory: './bge-m3/vocab.txt'
Will you make some changes to convert-to-ggml.py to support the new model?
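For what it's worth, bge-m3 is XLM-RoBERTa based and ships a sentencepiece model rather than a WordPiece vocab.txt, which is why the script fails. A minimal sketch of a fallback check (the find_tokenizer helper and the exact file names are my assumptions about the layout, not the script's actual code):

```python
import os

def find_tokenizer(model_dir):
    # hypothetical helper: bge-m3 is XLM-RoBERTa based, so it ships a
    # sentencepiece model instead of the WordPiece vocab.txt expected here
    wordpiece = os.path.join(model_dir, "vocab.txt")
    if os.path.exists(wordpiece):
        return "wordpiece", wordpiece
    spm = os.path.join(model_dir, "sentencepiece.bpe.model")
    if os.path.exists(spm):
        return "sentencepiece", spm
    raise FileNotFoundError(f"no tokenizer files found in {model_dir}")
```

A converter branching on the returned kind could then emit the right tokenizer metadata into the GGUF.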
I seem to have converted the jina embeddings successfully with
python bert_cpp/convert.py jinaai/jina-embeddings-v2-base-code models/jina-f16.gguf
The only change to the original BERT seems to be ALiBi, as described in https://huggingface.co/jinaai/jina-embeddings-v2-base-code.
It would be nice if we could adapt ggml_alibi into this repo for jina embedding support.
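For reference, ALiBi just adds a static, head-specific linear bias to the attention scores, so it should map onto a precomputed matrix added before softmax. A rough Python sketch of that bias, assuming the symmetric |i - j| form that bidirectional encoders use and a power-of-two head count:

```python
def alibi_slopes(n_heads):
    # closed-form slopes from the ALiBi paper, valid for power-of-two head counts
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(n_heads, seq_len):
    # per-head bias added to attention scores; symmetric |i - j| for encoders
    slopes = alibi_slopes(n_heads)
    return [[[-s * abs(i - j) for j in range(seq_len)]
             for i in range(seq_len)] for s in slopes]
```

Since the bias depends only on positions, it can be built once per sequence length and reused across layers.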
In the CPU build, I experienced this issue both here and in the llama.cpp version.
GGML targets edge AI and resource-constrained devices, yet both on the CLI and through Python, even for a small text of say 64 tokens, the code runs very slowly when available RAM is 10 GB or less, but really fast with more RAM, say 15-20 GB. I used 6 threads; with fewer threads things slow down again.
Something seems wrong. Have you encountered this?
Hi there, I understand the GGML implementation treats BERT-like models as embedding models, so the output is pooled and normalised. But I am interested in using BERT as an MLM model, with the MaskedLM head on top of the output layer as below. Could you please advise or help in adapting bert.cpp to return MLM logits instead of pooled embeddings? So for an input of 8 tokens plus CLS and SEP, the output will be of shape 1 x 10 x 30522 (bs, seq_len, vocab_size).
Thanks in advance,
BertForMaskedLM(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0-11): 12 x BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
)
(cls): BertOnlyMLMHead(
(predictions): BertLMPredictionHead(
(transform): BertPredictionHeadTransform(
(dense): Linear(in_features=768, out_features=768, bias=True)
(transform_act_fn): GELUActivation()
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(decoder): Linear(in_features=768, out_features=30522, bias=True)
)
)
)
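Mechanically, returning MLM logits means skipping the pooling and applying the cls head to every token's hidden state. A shape-only numpy sketch, with random arrays standing in for the real encoder output and cls.predictions.decoder weights from the module dump above:

```python
import numpy as np

bs, seq_len, hidden, vocab = 1, 10, 768, 30522
rng = np.random.default_rng(0)

# stand-ins for the encoder output and the decoder projection weights
hidden_states = rng.standard_normal((bs, seq_len, hidden), dtype=np.float32)
w_dec = rng.standard_normal((vocab, hidden), dtype=np.float32)
b_dec = np.zeros(vocab, dtype=np.float32)

# (the real head also applies dense + GELU + LayerNorm before this projection)
logits = hidden_states @ w_dec.T + b_dec
print(logits.shape)  # (1, 10, 30522)
```

In bert.cpp terms this would mean keeping the full per-token hidden states and adding one extra matmul against the decoder weight instead of the pooling step.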
Conversion works fine. Running the below:
from bert_cpp import BertModel
mod = BertModel('models/bge-base-en-v1.5-f16.gguf')
batch = ["Hello, how are you"]
emb = mod.embed(batch)
throws
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-23-145b8ed4c48e> in <cell line: 1>()
----> 1 from bert_cpp import BertModel
2 mod = BertModel('models/bge-base-en-v1.5-f16.gguf')
3 batch = ["Hello, how are you"]
4 emb = mod.embed(batch)
3 frames
/content/bert.cpp/bert_cpp/utils.py in load_shared_library(lib_base_name)
58 raise RuntimeError(f'Failed to load shared library "{lib_path}": {e}')
59
---> 60 raise FileNotFoundError(
61 f'Shared library with base name "{lib_base_name}" not found'
62 )
FileNotFoundError: Shared library with base name "bert" not found
I know I am missing something basic, please advise.
I am trying to use llama.cpp as you suggested, since it's merged there for the same BAAI 1.5 embedding models. Could you please help me get started? I can't figure out the equivalent of the bert_tokenize part there.
Thanks
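In case it helps, llama.cpp handles tokenization internally, so there is no separate bert_tokenize step to call. Through the llama-cpp-python bindings an embedding call looks roughly like this (a sketch assuming that package's embedding API, not tested against your model):

```python
def embed_texts(model_path, texts):
    # llama-cpp-python loads GGUF models; embedding=True enables the
    # embedding pipeline, and tokenization happens inside embed()
    from llama_cpp import Llama
    llm = Llama(model_path=model_path, embedding=True)
    return [llm.embed(t) for t in texts]
```

The llama.cpp repo also ships an `embedding` CLI example that takes raw text directly.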
I wrote this API for Dify. Now that I'm finished coding it, I think it could be a helpful tool for others. It can be installed with pip and used in other projects. Thanks.
I am thinking about creating a DeBERTa version of this project. Initially I thought to use it as a backbone, because it's easier to modify than llama.cpp, but performance is really important for my case. The README mentions that the llama.cpp implementation is substantially faster; I am a beginner with ggml and llama.cpp and I don't understand why. Can someone explain?
Rolled back to the original code to correct it.
Tested with "gpt": it should be split into ["gp", "t"].
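For context, that expected split follows from greedy longest-match-first WordPiece. A toy re-implementation with a minimal vocabulary (the ["gp", "t"] above corresponds to ['gp', '##t'] before the continuation prefix is stripped):

```python
def wordpiece(word, vocab):
    # greedy longest-match-first WordPiece; continuation pieces carry "##"
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("gpt", {"gp", "##t"}))  # -> ['gp', '##t']
```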
Hi, thanks for the great work. Hope this gets merged into llama.cpp, but until then, I'm able to get things to work on the command line. However, when running the Python example, I get this error:
FileNotFoundError: Shared library with base name "bert" not found
I think I'm missing a package? I did the pip install of requirements.txt, so I'm not sure what I'm getting wrong.
EDIT 1: Just noticed this has been merged into llama.cpp. For some reason I get an error when loading it into llama.cpp
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: bert.context_length
This gguf was converted using bert.cpp. Does the original model have to be converted through llama.cpp?
EDIT 2: I see there's an issue with the embeddings implementation in llama.cpp
Also tried converting the model using llama.cpp convert.py but get this error:
Loading model file /home/sravanth/vecsearch/UAE-Large-V1/model.safetensors
Traceback (most recent call last):
File "/home/sravanth/llama.cpp/convert.py", line 1483, in <module>
main()
File "/home/sravanth/llama.cpp/convert.py", line 1430, in main
params = Params.load(model_plus)
File "/home/sravanth/llama.cpp/convert.py", line 317, in load
params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
File "/home/sravanth/llama.cpp/convert.py", line 256, in loadHFTransformerJson
f_norm_eps = config["rms_norm_eps"],
KeyError: 'rms_norm_eps'
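The KeyError makes sense: llama.cpp's convert.py at that point read LLaMA-style config keys, while BERT-family configs name the norm epsilon layer_norm_eps. A sketch of the difference (the norm_eps helper is illustrative, not convert.py's actual code):

```python
def norm_eps(config):
    # LLaMA-family configs use rms_norm_eps; BERT-family use layer_norm_eps
    if "rms_norm_eps" in config:
        return config["rms_norm_eps"]
    return config["layer_norm_eps"]
```

So a BERT-aware conversion path has to look up the right key rather than assuming the LLaMA one.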
For a build with Metal following the README, I see:
(base) ➜ bert.cpp-future git:(master) make -C build
[ 50%] Built target ggml
[ 58%] Building CXX object src/CMakeFiles/bert.dir/bert.cpp.o
In file included from /Users/turbo/dev/bert.cpp-future/src/bert.cpp:9:
/Users/turbo/dev/bert.cpp-future/src/bert.h:186:22: warning: 'bert_tokenize' has C-linkage specified, but returns incomplete type 'bert_tokens' (aka 'vector<int>') which could be incompatible with C [-Wreturn-type-c-linkage]
BERT_API bert_tokens bert_tokenize(
^
/Users/turbo/dev/bert.cpp-future/src/bert.h:192:22: warning: 'bert_detokenize' has C-linkage specified, but returns user-defined type 'bert_string' (aka 'basic_string<char>') which is incompatible with C [-Wreturn-type-c-linkage]
BERT_API bert_string bert_detokenize(
^
/Users/turbo/dev/bert.cpp-future/src/bert.cpp:358:33: error: no matching function for call to 'min'
memcpy(output, str.c_str(), std::min(n_output, str.size()));
^~~~~~~~