llama2.jl's People

Contributors

bangboom, cafaxo, lilithhafner, svilupp, yi

llama2.jl's Issues

Speed up tokenizer

The tokenizer is currently O(N^2) in the length of its input, which makes it impossible to tokenize large amounts of text for training.
We could either use a more efficient algorithm or simply pre-tokenize on whitespace; a sketch of the latter is below. We should also check what sentencepiece does.
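
As a rough illustration of the whitespace pre-tokenization idea, the sketch below splits the input at whitespace boundaries and runs the existing per-piece encoder on each short piece, so the quadratic merge loop never sees the whole input. `bpe_encode` is a stand-in name, not an actual function in this repo.

    # Sketch only: `bpe_encode` is a placeholder for the existing per-piece tokenizer.
    # Each piece keeps its leading whitespace attached to the following word,
    # similar to what sentencepiece does with its ▁ prefix.
    function tokenize_pretok(text::AbstractString, bpe_encode)
        ids = Int[]
        for piece in eachmatch(r"\s*\S+", text)
            append!(ids, bpe_encode(piece.match))
        end
        return ids
    end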

Reproduce temp=0 llama.cpp results with some consistency.

We need a way to determine what causes the differences between the two implementations.

The goal is to produce the same, or nearly identical, results at temp=0. We ran some tests with the new .gguf files, since that format has seen such wide adoption.

Llama2.jl test:

using Llama2
model = load_gguf_model("/path/to/llama-2-7b-chat.Q4_K_S.gguf");
sample(model, "Tim was happy."; temperature = 0.0f0)

llama.cpp .gguf test:
./main -m /Users/lukasmayrhofer/Downloads/llama-2-7b-chat.Q4_K_S.gguf --samplers "temp" --temp 0 -p "Tim was happy."

Current Llama2.jl results:

Tim was happy. Einzelnes, but he was also very proud of his son. He had always known that Tim was special, and he was thrilled to see him finally getting the recognition he deserved.\nAs the two of them sat in the stands, watching the game, Tim couldn't help but feel a sense of pride and joy. He was so grateful to have" ⋯ 667 bytes ⋯ ". \"I'm lucky to have you too.\"\nAs they walked out of the restaurant, Tim felt a sense of contentment and happiness. He knew that he had a wonderful son, and he was grateful for every moment they spent together. He was proud of Tim, and he knew that he would always be there to support and encourage him, no matter what.

Current llama.cpp results:

Tim was happy.
He had just received a new job offer and he was excited to start his new career. He had been searching for a new opportunity for months, and now it seemed like all his hard work had paid off.
As he walked into the office building, he couldn't help but feel a sense of pride. He had worked hard to get where he was, and he knew that this new job would be a great opportunity for him.
Tim took a deep breath as he entered the office. He was greeted by a friendly receptionist who offered him a warm smile. "Hello there," she said. "Welcome to Tim's new workplace."
Tim felt a sense of excitement as he walked through the office. He couldn't wait to meet his new colleagues and start working on his new projects. He knew that this was going to be a great opportunity for him, and he was eager to get started. [end of text]

We still need an efficient way to pinpoint what causes the differences between the two, e.g. by finding the first step at which the greedy generations diverge (see the sketch below).
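
One way to narrow this down, sketched below with a small helper that is not part of the package, is to find the first position where the two temp=0 generations diverge and then inspect the logits around that step. `llama2jl_output` and `llamacpp_output` are placeholder variables holding the two generated strings.

    # Sketch: locate the first character position where two greedy generations diverge.
    function first_divergence(a::AbstractString, b::AbstractString)
        for (i, (ca, cb)) in enumerate(zip(a, b))
            ca == cb || return i
        end
        return min(length(a), length(b)) + 1   # one output is a prefix of the other
    end

    # placeholder variables: the two generations captured from each implementation
    first_divergence(llama2jl_output, llamacpp_output)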

Test perplexity

We should compare the perplexity against llama.cpp to test the quantization code.
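
For reference, a minimal sketch of the perplexity computation itself, assuming we can obtain per-token log-probabilities of a reference text from the model (the package does not necessarily expose such an API yet):

    # Perplexity is exp of the mean negative log-likelihood over the reference tokens.
    function perplexity(logprobs::AbstractVector{<:Real})
        return exp(-sum(logprobs) / length(logprobs))
    end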

How to find and download a suitable GGMLV3 model?

I tried finding and downloading the llama-2-7b-chat.ggmlv3.q4_K_S.bin file from Hugging Face, but was unable to do so.
What does one need to do to actually obtain that .bin file? I tried the model .safetensors files from the repo, but those do not seem to be the right format.
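
For what it's worth, a minimal sketch using Julia's Downloads standard library; the Hugging Face repository and URL below (TheBloke's GGML conversion) are an assumption and may need to be replaced with whichever mirror actually hosts the file.

    # Assumption: TheBloke/Llama-2-7B-Chat-GGML hosts the quantized GGMLv3 file.
    using Downloads
    url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_S.bin"
    Downloads.download(url, "llama-2-7b-chat.ggmlv3.q4_K_S.bin")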

More high-level operations

Thank you all for putting this package together.

I am interested in seeing the operations expressed at the highest possible level. At times it feels as if we are doing and then undoing the reshaping required to express everything as 1d arrays of floats.

For instance, the following

    fc = freq_cis_real_row .+ freq_cis_imag_row .* im
    QQ = reshape(reinterpret(ComplexF32, s.q), (head_size ÷ 2, p.n_heads))
    KK = reshape(reinterpret(ComplexF32, s.k), (head_size ÷ 2, p.n_heads))

    # apply RoPE rotation to the q and k vectors for each head:
    # rotate q and k by freq_cis_real and freq_cis_imag
    QQ .= fc .* QQ
    KK .= fc .* KK

replaces some 20+ lines of code...

Not really an issue, but it does simplify the code. Nevertheless, great job on making it all happen.

amazing speed!

I'm new to the Julia language. This implementation runs faster than llama2.c, on average 300%+ faster, and Julia utilizes all the CPU cores.

I'm curious to know:

  1. Is there any way we can run Meta's Llama chat models in Julia?

  2. Can Julia use OpenMP to share the load with the GPU?

  3. Does Julia use the CPU's AVX2 instructions, or would it run faster with AVX2?

Thanks!

Quantization support

We should investigate how difficult it is to add quantization support.
It would be great if we could run the Llama2 7B model on a machine with 8GB RAM.
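
As a rough sketch of what block quantization involves (this is a simplified symmetric 4-bit scheme, not the block_q4_K layout ggml actually uses): each block of 32 weights stores one Float32 scale plus a 4-bit code per weight, cutting memory roughly 8x versus Float32.

    # Simplified sketch, not the real block_q4_K format.
    struct QuantBlock
        scale::Float32
        codes::NTuple{32, UInt8}   # only the low 4 bits of each entry are used
    end

    function quantize_block(x::AbstractVector{Float32})
        @assert length(x) == 32
        scale = maximum(abs, x) / 7                     # map values into -7..7
        invscale = scale == 0 ? 0f0 : 1f0 / scale
        codes = ntuple(i -> UInt8(clamp(round(Int, x[i] * invscale), -7, 7) + 8), 32)
        return QuantBlock(scale, codes)
    end

    dequantize(b::QuantBlock, i::Int) = (Int(b.codes[i]) - 8) * b.scale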

Readme

We should add some basic info to the readme that explains how to get this to run.

Create vocabulary from text

We should have a function that builds a vocabulary for the byte pair encoder from some given text. (Relevant for training models.)
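
A naive sketch of what such a function could look like, greedily merging the most frequent adjacent pair until the target vocabulary size is reached (one full scan per merge, so only suitable for small corpora; the names are illustrative):

    # Sketch: build a byte-pair vocabulary from raw text by repeated greedy merges.
    function build_bpe_vocab(text::AbstractString, target_size::Int)
        tokens = String[string(c) for c in text]
        vocab = Set(tokens)
        while length(vocab) < target_size
            # count adjacent pairs in the current token sequence
            counts = Dict{Tuple{String,String},Int}()
            for i in 1:length(tokens)-1
                pair = (tokens[i], tokens[i+1])
                counts[pair] = get(counts, pair, 0) + 1
            end
            isempty(counts) && break
            pair = first(argmax(last, collect(counts)))   # most frequent pair
            merged = pair[1] * pair[2]
            push!(vocab, merged)
            # replace every occurrence of the pair with the merged token
            newtokens = String[]
            i = 1
            while i <= length(tokens)
                if i < length(tokens) && tokens[i] == pair[1] && tokens[i+1] == pair[2]
                    push!(newtokens, merged)
                    i += 2
                else
                    push!(newtokens, tokens[i])
                    i += 1
                end
            end
            tokens = newtokens
        end
        return vocab
    end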

Support weight decay for Adam optimizer

Weight decay could be quite easily added to the Adam optimizer code (it would then become AdamW).

For example, before this line, add

@turbo @. x *= 1-α*λ

According to the Llama 2 paper, training used a weight decay of λ = 0.1.
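
For context, a self-contained sketch of a decoupled weight-decay (AdamW) step; the function and argument names are illustrative and do not match the repo's optimizer code.

    # Sketch of one AdamW update; x = parameters, m/v = first/second moment buffers, g = gradient.
    function adamw_step!(x, m, v, g; α = 1f-3, β1 = 0.9f0, β2 = 0.999f0, ϵ = 1f-8, λ = 0.1f0, t = 1)
        @. x *= 1 - α * λ                       # decoupled weight decay
        @. m = β1 * m + (1 - β1) * g
        @. v = β2 * v + (1 - β2) * g^2
        mhat = @. m / (1 - β1^t)                # bias correction
        vhat = @. v / (1 - β2^t)
        @. x -= α * mhat / (sqrt(vhat) + ϵ)
        return x
    end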

ggml model ERROR: TaskFailedException nested task error: bitcast: target type not a leaf primitive type

Apple M2 ARM processor running native Julia (aarch64)

~/p/Llama2.jl (master)> julia-native --project=. -tauto
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.2 (2023-07-05)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Llama2,  Random
[ Info: Precompiling Llama2 [7841fa2c-192d-471c-ae30-1f93a4daddfc]

julia> model = load_ggml_model("data/llama-2-7b-chat.ggmlv3.q4_K_S.bin");
Loading model... 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:00

julia> sample(model, "The Julia programming language is")
<s>
ERROR: TaskFailedException

    nested task error: bitcast: target type not a leaf primitive type
    Stacktrace:
     [1] reinterpret
       @ ./essentials.jl:513 [inlined]
     [2] dot(x::SubArray{Llama2.block_q4_K, 1, Matrix{Llama2.block_q4_K}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, y::Vector{Llama2.block_q8_K})
       @ Llama2 ~/projects/Llama2.jl/src/quantization/q4.jl:226
     [3] macro expansion
       @ ~/projects/Llama2.jl/src/matmul.jl:23 [inlined]
     [4] (::Llama2.var"#47#threadsfor_fun#24"{Llama2.var"#47#threadsfor_fun#23#25"{Vector{Float32}, Matrix{Llama2.block_q4_K}, Vector{Llama2.block_q8_K}, UnitRange{Int64}}})(tid::Int64; onethread::Bool)
       @ Llama2 ./threadingconstructs.jl:194
     [5] #47#threadsfor_fun
       @ ./threadingconstructs.jl:161 [inlined]
     [6] (::Base.Threads.var"#1#2"{Llama2.var"#47#threadsfor_fun#24"{Llama2.var"#47#threadsfor_fun#23#25"{Vector{Float32}, Matrix{Llama2.block_q4_K}, Vector{Llama2.block_q8_K}, UnitRange{Int64}}}, Int64})()
       @ Base.Threads ./threadingconstructs.jl:139

...and 7 more exceptions.

LoRA and finetuning

I was considering adding LoRA to this repo and figured I'd share my thoughts in case there is interest upstream.

I don't have anything crazy in mind, mostly an implementation similar to https://github.com/microsoft/LoRA/blob/main/loralib/layers.py. Layer weights start in an "unmerged" mode, train with the low-rank parameters, and are eventually merged when the network is serialized. A flag in the train function would control whether regular or LoRA linear layers are used.

Let me know if you have any other ideas/suggestions.
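
To make that concrete, here is a rough sketch of a LoRA-augmented dense layer loosely following microsoft/LoRA's layers.py; the type, field, and function names are illustrative only, not a proposal for the final API.

    # Sketch: frozen base weight W plus a trainable low-rank correction B*A.
    mutable struct LoRADense
        W::Matrix{Float32}      # frozen base weight, (out, in)
        A::Matrix{Float32}      # low-rank factor, (r, in)
        B::Matrix{Float32}      # low-rank factor, (out, r)
        scaling::Float32        # usually alpha / r
        merged::Bool
    end

    function LoRADense(W::Matrix{Float32}, r::Int; alpha::Real = r)
        nout, nin = size(W)
        A = 0.01f0 .* randn(Float32, r, nin)
        B = zeros(Float32, nout, r)             # B starts at zero, so the layer equals the base model
        return LoRADense(W, A, B, Float32(alpha / r), false)
    end

    # forward pass: base matmul plus the low-rank correction while unmerged
    (l::LoRADense)(x::AbstractVector{Float32}) =
        l.merged ? l.W * x : l.W * x .+ l.scaling .* (l.B * (l.A * x))

    # fold the adapter into W before serializing the network
    function merge_lora!(l::LoRADense)
        l.merged && return l
        l.W .+= l.scaling .* (l.B * l.A)
        l.merged = true
        return l
    end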

Training code

I think it would be nice to have some minimal, self-contained Julia code for training a small Llama2 model.
From some experiments I have done, I already have some code (CPU-only) that would be easy to adapt to llama2.

Should I adapt and push this code to a training subdirectory? Opinions welcome.
