llama2.jl's People

Contributors

bangboom, cafaxo, lilithhafner, svilupp, yi

llama2.jl's Issues

Speed up tokenizer

The tokenizer is currently O(N^2) in the length of its input, which makes it impossible to tokenize large amounts of text for training.
We could either use a more efficient algorithm or simply pre-tokenize on whitespace; a sketch of the latter is below. We should also check what sentencepiece does.
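
As a rough illustration of the whitespace pre-tokenization idea, the sketch below splits the input at whitespace boundaries and runs the existing per-piece encoder on each short piece, so the quadratic merge loop never sees the whole input. `bpe_encode` is a stand-in name, not an actual function in this repo.

    # Sketch only: `bpe_encode` is a placeholder for the existing per-piece tokenizer.
    # Each piece keeps its leading whitespace attached to the following word,
    # similar to what sentencepiece does with its ▁ prefix.
    function tokenize_pretok(text::AbstractString, bpe_encode)
        ids = Int[]
        for piece in eachmatch(r"\s*\S+", text)
            append!(ids, bpe_encode(piece.match))
        end
        return ids
    end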

Reproduce temp=0 llama.cpp results with some consistency.

We need a way to determine what causes the differences between the two implementations.

The goal is to produce the same, or nearly identical, results at temp=0. We ran some tests with the new .gguf files, since that format has seen such wide adoption.

Llama2.jl test:

using Llama2
model = load_gguf_model("/path/to/llama-2-7b-chat.Q4_K_S.gguf");
sample(model, "Tim was happy."; temperature = 0.0f0)

llama.cpp .gguf test:
./main -m /Users/lukasmayrhofer/Downloads/llama-2-7b-chat.Q4_K_S.gguf --samplers "temp" --temp 0 -p "Tim was happy."

Current Llama2.jl results:

Tim was happy. Einzelnes, but he was also very proud of his son. He had always known that Tim was special, and he was thrilled to see him finally getting the recognition he deserved.\nAs the two of them sat in the stands, watching the game, Tim couldn't help but feel a sense of pride and joy. He was so grateful to have" ⋯ 667 bytes ⋯ ". \"I'm lucky to have you too.\"\nAs they walked out of the restaurant, Tim felt a sense of contentment and happiness. He knew that he had a wonderful son, and he was grateful for every moment they spent together. He was proud of Tim, and he knew that he would always be there to support and encourage him, no matter what.

Current llama.cpp results:

Tim was happy.
He had just received a new job offer and he was excited to start his new career. He had been searching for a new opportunity for months, and now it seemed like all his hard work had paid off.
As he walked into the office building, he couldn't help but feel a sense of pride. He had worked hard to get where he was, and he knew that this new job would be a great opportunity for him.
Tim took a deep breath as he entered the office. He was greeted by a friendly receptionist who offered him a warm smile. "Hello there," she said. "Welcome to Tim's new workplace."
Tim felt a sense of excitement as he walked through the office. He couldn't wait to meet his new colleagues and start working on his new projects. He knew that this was going to be a great opportunity for him, and he was eager to get started. [end of text]

We still need an efficient way to pinpoint what causes the differences between the two, e.g. by finding the first step at which the greedy generations diverge (see the sketch below).
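
One way to narrow this down, sketched below with a small helper that is not part of the package, is to find the first position where the two temp=0 generations diverge and then inspect the logits around that step. `llama2jl_output` and `llamacpp_output` are placeholder variables holding the two generated strings.

    # Sketch: locate the first character position where two greedy generations diverge.
    function first_divergence(a::AbstractString, b::AbstractString)
        for (i, (ca, cb)) in enumerate(zip(a, b))
            ca == cb || return i
        end
        return min(length(a), length(b)) + 1   # one output is a prefix of the other
    end

    # placeholder variables: the two generations captured from each implementation
    first_divergence(llama2jl_output, llamacpp_output)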

Test perplexity

We should compare the perplexity against llama.cpp to test the quantization code.
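
For reference, a minimal sketch of the perplexity computation itself, assuming we can obtain per-token log-probabilities of a reference text from the model (the package does not necessarily expose such an API yet):

    # Perplexity is exp of the mean negative log-likelihood over the reference tokens.
    function perplexity(logprobs::AbstractVector{<:Real})
        return exp(-sum(logprobs) / length(logprobs))
    end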

How to find and download a suitable GGMLV3 model?

I tried finding and downloading the llama-2-7b-chat.ggmlv3.q4_K_S.bin file from Hugging Face, but was unable to do so.
What does one need to do to actually obtain that .bin file? I tried the model .safetensors files from the repo, but those do not seem to be the right format.
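
For what it's worth, a minimal sketch using Julia's Downloads standard library; the Hugging Face repository and URL below (TheBloke's GGML conversion) are an assumption and may need to be replaced with whichever mirror actually hosts the file.

    # Assumption: TheBloke/Llama-2-7B-Chat-GGML hosts the quantized GGMLv3 file.
    using Downloads
    url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_S.bin"
    Downloads.download(url, "llama-2-7b-chat.ggmlv3.q4_K_S.bin")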

More high-level operations

Thank you all for putting this package together.

I am interested in seeing the operations expressed at the highest possible level. At times it feels as if we are doing and then undoing the reshaping required to express everything as 1d arrays of floats.

For instance, the following

    fc = freq_cis_real_row .+ freq_cis_imag_row .* im
    QQ = reshape(reinterpret(ComplexF32, s.q), (head_size ÷ 2, p.n_heads))
    KK = reshape(reinterpret(ComplexF32, s.k), (head_size ÷ 2, p.n_heads))

    # apply RoPE rotation to the q and k vectors for each head:
    # rotate q and k by freq_cis_real and freq_cis_imag
    QQ .= fc .* QQ
    KK .= fc .* KK

replaces some 20+ lines of code...

Not really an issue, but it does simplify the code. Nevertheless, great job on making it all happen.

amazing speed!

I'm new to the Julia language. This implementation runs faster than llama2.c, on average 300%+ faster, and Julia utilizes all the CPU cores.

I'm curious to know:

  1. Is there any way we can run Meta's Llama chat models in Julia?

  2. Can Julia use OpenMP to share the load with the GPU?

  3. Does Julia use the CPU's AVX2 instructions, or would it run faster with AVX2?

Thanks!

Quantization support

We should investigate how difficult it is to add quantization support.
It would be great if we could run the Llama2 7B model on a machine with 8GB RAM.
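
As a rough sketch of what block quantization involves (this is a simplified symmetric 4-bit scheme, not the block_q4_K layout ggml actually uses): each block of 32 weights stores one Float32 scale plus a 4-bit code per weight, cutting memory roughly 8x versus Float32.

    # Simplified sketch, not the real block_q4_K format.
    struct QuantBlock
        scale::Float32
        codes::NTuple{32, UInt8}   # only the low 4 bits of each entry are used
    end

    function quantize_block(x::AbstractVector{Float32})
        @assert length(x) == 32
        scale = maximum(abs, x) / 7                     # map values into -7..7
        invscale = scale == 0 ? 0f0 : 1f0 / scale
        codes = ntuple(i -> UInt8(clamp(round(Int, x[i] * invscale), -7, 7) + 8), 32)
        return QuantBlock(scale, codes)
    end

    dequantize(b::QuantBlock, i::Int) = (Int(b.codes[i]) - 8) * b.scale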

Readme

We should add some basic info to the readme that explains how to get this to run.

Create vocabulary from text

We should have a function that builds a vocabulary for the byte pair encoder from some given text. (Relevant for training models.)
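
A naive sketch of what such a function could look like, greedily merging the most frequent adjacent pair until the target vocabulary size is reached (one full scan per merge, so only suitable for small corpora; the names are illustrative):

    # Sketch: build a byte-pair vocabulary from raw text by repeated greedy merges.
    function build_bpe_vocab(text::AbstractString, target_size::Int)
        tokens = String[string(c) for c in text]
        vocab = Set(tokens)
        while length(vocab) < target_size
            # count adjacent pairs in the current token sequence
            counts = Dict{Tuple{String,String},Int}()
            for i in 1:length(tokens)-1
                pair = (tokens[i], tokens[i+1])
                counts[pair] = get(counts, pair, 0) + 1
            end
            isempty(counts) && break
            pair = first(argmax(last, collect(counts)))   # most frequent pair
            merged = pair[1] * pair[2]
            push!(vocab, merged)
            # replace every occurrence of the pair with the merged token
            newtokens = String[]
            i = 1
            while i <= length(tokens)
                if i < length(tokens) && tokens[i] == pair[1] && tokens[i+1] == pair[2]
                    push!(newtokens, merged)
                    i += 2
                else
                    push!(newtokens, tokens[i])
                    i += 1
                end
            end
            tokens = newtokens
        end
        return vocab
    end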

Support weight decay for Adam optimizer

Weight decay could be quite easily added to the Adam optimizer code (it would then become AdamW).

For example, before this line, add

@turbo @. x *= 1-α*λ

According to the Llama 2 paper, training used a weight decay of λ = 0.1.
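
For context, a self-contained sketch of a decoupled weight-decay (AdamW) step; the function and argument names are illustrative and do not match the repo's optimizer code.

    # Sketch of one AdamW update; x = parameters, m/v = first/second moment buffers, g = gradient.
    function adamw_step!(x, m, v, g; α = 1f-3, β1 = 0.9f0, β2 = 0.999f0, ϵ = 1f-8, λ = 0.1f0, t = 1)
        @. x *= 1 - α * λ                       # decoupled weight decay
        @. m = β1 * m + (1 - β1) * g
        @. v = β2 * v + (1 - β2) * g^2
        mhat = @. m / (1 - β1^t)                # bias correction
        vhat = @. v / (1 - β2^t)
        @. x -= α * mhat / (sqrt(vhat) + ϵ)
        return x
    end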

ggml model ERROR: TaskFailedException nested task error: bitcast: target type not a leaf primitive type

Apple M2 ARM processor running native Julia (aarch64)

~/p/Llama2.jl (master)> julia-native --project=. -tauto
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.2 (2023-07-05)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Llama2,  Random
[ Info: Precompiling Llama2 [7841fa2c-192d-471c-ae30-1f93a4daddfc]

julia> model = load_ggml_model("data/llama-2-7b-chat.ggmlv3.q4_K_S.bin");
Loading model... 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:00

julia> sample(model, "The Julia programming language is")
<s>
ERROR: TaskFailedException

    nested task error: bitcast: target type not a leaf primitive type
    Stacktrace:
     [1] reinterpret
       @ ./essentials.jl:513 [inlined]
     [2] dot(x::SubArray{Llama2.block_q4_K, 1, Matrix{Llama2.block_q4_K}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, y::Vector{Llama2.block_q8_K})
       @ Llama2 ~/projects/Llama2.jl/src/quantization/q4.jl:226
     [3] macro expansion
       @ ~/projects/Llama2.jl/src/matmul.jl:23 [inlined]
     [4] (::Llama2.var"#47#threadsfor_fun#24"{Llama2.var"#47#threadsfor_fun#23#25"{Vector{Float32}, Matrix{Llama2.block_q4_K}, Vector{Llama2.block_q8_K}, UnitRange{Int64}}})(tid::Int64; onethread::Bool)
       @ Llama2 ./threadingconstructs.jl:194
     [5] #47#threadsfor_fun
       @ ./threadingconstructs.jl:161 [inlined]
     [6] (::Base.Threads.var"#1#2"{Llama2.var"#47#threadsfor_fun#24"{Llama2.var"#47#threadsfor_fun#23#25"{Vector{Float32}, Matrix{Llama2.block_q4_K}, Vector{Llama2.block_q8_K}, UnitRange{Int64}}}, Int64})()
       @ Base.Threads ./threadingconstructs.jl:139

...and 7 more exceptions.

LoRA and finetuning

I was considering adding LoRA to this repo and figured I'd share my thoughts in case there is interest upstream.

I don't have anything crazy in mind, mostly an implementation similar to https://github.com/microsoft/LoRA/blob/main/loralib/layers.py. Layer weights start in an "unmerged" mode, train with the low-rank parameters, and are eventually merged when the network is serialized. A flag in the train function would control whether regular or LoRA linear layers are used.

Let me know if you have any other ideas/suggestions.
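
To make that concrete, here is a rough sketch of a LoRA-augmented dense layer loosely following microsoft/LoRA's layers.py; the type, field, and function names are illustrative only, not a proposal for the final API.

    # Sketch: frozen base weight W plus a trainable low-rank correction B*A.
    mutable struct LoRADense
        W::Matrix{Float32}      # frozen base weight, (out, in)
        A::Matrix{Float32}      # low-rank factor, (r, in)
        B::Matrix{Float32}      # low-rank factor, (out, r)
        scaling::Float32        # usually alpha / r
        merged::Bool
    end

    function LoRADense(W::Matrix{Float32}, r::Int; alpha::Real = r)
        nout, nin = size(W)
        A = 0.01f0 .* randn(Float32, r, nin)
        B = zeros(Float32, nout, r)             # B starts at zero, so the layer equals the base model
        return LoRADense(W, A, B, Float32(alpha / r), false)
    end

    # forward pass: base matmul plus the low-rank correction while unmerged
    (l::LoRADense)(x::AbstractVector{Float32}) =
        l.merged ? l.W * x : l.W * x .+ l.scaling .* (l.B * (l.A * x))

    # fold the adapter into W before serializing the network
    function merge_lora!(l::LoRADense)
        l.merged && return l
        l.W .+= l.scaling .* (l.B * l.A)
        l.merged = true
        return l
    end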

Training code

I think it would be nice to have some minimal, self-contained Julia code for training a small Llama2 model.
From some experiments I have done, I already have some code (CPU-only) that would be easy to adapt to llama2.

Should I adapt and push this code to a training subdirectory? Opinions welcome.
