coreylowman / llama-dfdx
LLaMa 7b with CUDA acceleration implemented in Rust. Minimal GPU memory needed!
License: MIT License
Especially given the ability to hot-load tensors as they are needed, it should be entirely possible to run larger models.
This is a fairly standard optimization and should be relatively straightforward to implement.
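A minimal std-only sketch of the hot-loading idea, assuming weights live in per-layer files on disk. `HotTensor` is a hypothetical name, and real code would deserialize into dfdx tensors rather than raw bytes:

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical hot-loading wrapper: keeps only the file path resident,
/// reading the tensor bytes from disk when a layer actually needs them.
struct HotTensor {
    path: PathBuf,
}

impl HotTensor {
    fn new(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into() }
    }

    /// Load the raw bytes just-in-time; the buffer is dropped after the
    /// layer's forward pass, so at most one layer's weights are in memory.
    fn load(&self) -> io::Result<Vec<u8>> {
        fs::read(&self.path)
    }
}

fn main() -> io::Result<()> {
    // Write a small dummy "weight" file so the example is self-contained.
    let dir = std::env::temp_dir().join("hot_tensor_demo");
    fs::create_dir_all(&dir)?;
    let weight_path = dir.join("layer0.bin");
    fs::write(&weight_path, vec![0u8; 1024])?;

    let tensor = HotTensor::new(&weight_path);
    let bytes = tensor.load()?; // loaded only when this layer runs
    println!("loaded {} bytes", bytes.len());
    Ok(())
}
```

With this shape, peak memory is roughly one layer's weights plus activations, at the cost of disk reads every forward pass.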
ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model llama-7b-hf --disable-cache generate "Why is pi round?"
Detected model folder as LLaMa 7b.
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Why is pi round?
Thread: Why is pi round?
I've been wondering about this for a while and couldn't find an answer... Ifс theters, i.s_Q
ini£тuc1-cksont< Sec>ar$le to--.e
d in>
inient-< ${<s A А ${ various
channel Banels cBp Sack Bchn c channel Kaz
cyclemasens.chD channelーAя
O я_ CлusesN- n= Ps FigénBTアbollageest
ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model llama-7b-hf --disable-cache generate "Why is pi round?"
Detected model folder as LLaMa 7b.
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Why is pi round?
What is the real reason that pi is round?
I know the story that when Archimedes proved that pi was irr butures, he,he and,h is//** cz.daly, July wasz cQ.l inkxz toell>((>/.F
Middle
WCF,pp m MA cError apadd Ledethodaten
inien MAFaceerfaces.IкяEDєeP UITableView a MAtingack tcrit<0xE4><0xE7>leftAуad<0xEB>C areз о דneanate ab
With the cache enabled, while the answers are nonsense, at least they are coherent :)
ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model llama-7b-hf generate "Why is pi round?"
Detected model folder as LLaMa 7b.
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Why is pi round?
What is the definition of a number that is not prime?
The number that is not a prime is called a composite number and if it is not a factor of 1
What is the smallest number that can be divided by 3 numbers and still have the original number as a remainder?
The smallest number that can be divided by 3 numbers and still have the original number as a remainder is 17. To prove this we can use the fact that the original number must have a remainder of 1 (after being divided by 3). The numbers that have a remainder of 1 when divided by 3 are
ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model llama-7b-hf generate "Why is pi round?"
Detected model folder as LLaMa 7b.
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Why is pi round?
Thread: Why is pi round?
I've been wondering about this for a while and couldn't find an answer...I'm sure it's a silly question, but I just can't figure it out. Why is pi round? If it was, say 4.00 or 6.00, that would be one thing, but 3.14??
So I thought that maybe if you took the square root of 3.14, it would be ~ 1.5, which would be about the middle of 1 and 2, which is 1
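The cache referred to above is the attention key/value cache. A minimal std-only sketch of the idea, with all tensor contents invented for illustration (a real implementation would store per-layer attention projections):

```rust
/// Hypothetical sketch of why the KV cache matters: without it, keys and
/// values for the whole sequence are recomputed at every decoding step;
/// with it, each step appends one key/value pair and reuses the rest.
struct KvCache {
    keys: Vec<Vec<f32>>,   // one entry per generated token
    values: Vec<Vec<f32>>,
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    /// Append this step's key/value instead of recomputing all past ones.
    fn push(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
    }

    fn len(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut cache = KvCache::new();
    for t in 0..3 {
        // In a real model these come from the attention K/V projections.
        cache.push(vec![t as f32; 4], vec![t as f32; 4]);
    }
    println!("cached steps: {}", cache.len());
}
```

The cached and uncached paths should be mathematically equivalent; small numerical differences between them compound over greedy decoding, which is one plausible reason the two runs diverge so sharply.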
If the mode cannot be determined automatically, the user could specify it via a --mode
CLI argument.
This would replace the existing chat, generate, and file subcommands.
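A std-only sketch of how such a --mode argument might be resolved; the `Mode` enum, its variants, and `resolve_mode` are all hypothetical names, not the project's actual API:

```rust
use std::str::FromStr;

/// Hypothetical unified mode, replacing the separate subcommands.
#[derive(Debug, PartialEq)]
enum Mode {
    Chat,
    Generate,
    File,
}

impl FromStr for Mode {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "chat" => Ok(Mode::Chat),
            "generate" => Ok(Mode::Generate),
            "file" => Ok(Mode::File),
            other => Err(format!("unknown mode: {other}")),
        }
    }
}

/// Prefer auto-detection; fall back to the --mode value; error if neither.
fn resolve_mode(detected: Option<Mode>, cli_arg: Option<&str>) -> Result<Mode, String> {
    match (detected, cli_arg) {
        (Some(m), _) => Ok(m),
        (None, Some(s)) => s.parse(),
        (None, None) => Err("could not detect mode; pass --mode".to_string()),
    }
}

fn main() {
    // Auto-detection failed, but the user passed --mode generate.
    println!("{:?}", resolve_mode(None, Some("generate")));
}
```

In practice the string would come from the existing clap-based CLI rather than a hand-rolled parser.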
I'd like to put together a proof of concept for fine-tuning large language models in Rust.
My background is in Rust rather than ML.
So my question is: is this model inference-only, or could I somehow use it for training?
Would that be related to the generic training loop at https://github.com/coreylowman/dfdx/blob/main/examples/generic-train-loop.rs?
Thanks.
Alpaca 7b should have the exact same structure, so as long as you can convert the weights into the same format with convert.py,
it should run out of the box.
Hi there,
First off, awesome work!
I had not set the path to nvcc, so llama-dfdx imploded during the build. It may be worth telling the user explicitly that nvcc could not be found.
ubuntu@instance-20230508-1136:~/repos/dfdx$ cargo clean
ubuntu@instance-20230508-1136:~/repos/dfdx$ nvcc
-bash: nvcc: command not found
ubuntu@instance-20230508-1136:~/repos/dfdx$ cargo build -F cuda
Updating crates.io index
Updating git repository `https://github.com/coreylowman/cudarc`
Updating git repository `https://github.com/starkat99/half-rs.git`
Downloaded cfg-if v1.0.0
Downloaded either v1.8.1
Downloaded num-complex v0.4.3
Downloaded gemm-f64 v0.15.3
Downloaded gemm-c64 v0.15.3
Downloaded dyn-stack v0.9.0
Downloaded ppv-lite86 v0.2.17
Downloaded rand_chacha v0.3.1
Downloaded rand v0.8.5
Downloaded scopeguard v1.1.0
Downloaded seq-macro v0.3.3
Downloaded autocfg v1.1.0
Downloaded bitflags v1.3.2
Downloaded crossbeam-channel v0.5.8
Downloaded crossbeam-epoch v0.9.14
Downloaded rand_core v0.6.4
Downloaded rand_distr v0.4.3
Downloaded rayon-core v1.11.0
Downloaded rayon v1.7.0
Downloaded raw-cpuid v10.7.0
Downloaded memoffset v0.8.0
Downloaded reborrow v0.5.4
Downloaded gemm-c32 v0.15.3
Downloaded gemm v0.15.3
Downloaded bytemuck v1.13.1
Downloaded gemm-f16 v0.15.3
Downloaded gemm-f32 v0.15.3
Downloaded gemm-common v0.15.3
Downloaded half v2.2.1
Downloaded libm v0.2.6
Downloaded lazy_static v1.4.0
Downloaded glob v0.3.1
Downloaded crossbeam-utils v0.8.15
Downloaded crossbeam-deque v0.8.3
Downloaded num_cpus v1.15.0
Downloaded paste v1.0.12
Downloaded libc v0.2.144
Downloaded num-traits v0.2.15
Downloaded 38 crates (1.9 MB) in 0.54s
Compiling autocfg v1.1.0
Compiling crossbeam-utils v0.8.15
Compiling cfg-if v1.0.0
Compiling libm v0.2.6
Compiling libc v0.2.144
Compiling scopeguard v1.1.0
Compiling rayon-core v1.11.0
Compiling paste v1.0.12
Compiling either v1.8.1
Compiling bitflags v1.3.2
Compiling reborrow v0.5.4
Compiling bytemuck v1.13.1
Compiling lazy_static v1.4.0
Compiling seq-macro v0.3.3
Compiling rand_core v0.6.4
Compiling ppv-lite86 v0.2.17
Compiling cudarc v0.9.8 (https://github.com/coreylowman/cudarc?branch=dfdx-half#bb2d7009)
Compiling glob v0.3.1
Compiling raw-cpuid v10.7.0
Compiling dyn-stack v0.9.0
Compiling memoffset v0.8.0
Compiling num-traits v0.2.15
Compiling crossbeam-epoch v0.9.14
Compiling dfdx v0.11.2 (/home/ubuntu/repos/dfdx)
Compiling rand_chacha v0.3.1
Compiling rand v0.8.5
Compiling crossbeam-channel v0.5.8
Compiling num_cpus v1.15.0
error: failed to run custom build command for `dfdx v0.11.2 (/home/ubuntu/repos/dfdx)`
Caused by:
process didn't exit successfully: `/home/ubuntu/repos/dfdx/target/debug/build/dfdx-30e6be024c8b3335/build-script-build` (exit status: 101)
--- stdout
cargo:rerun-if-changed=build.rs
cargo:rustc-env=CUDA_INCLUDE_DIR=/usr/local/cuda/include
cargo:rerun-if-changed=src/tensor_ops/utilities/binary_op_macros.cuh
cargo:rerun-if-changed=src/tensor_ops/utilities/compatibility.cuh
cargo:rerun-if-changed=src/tensor_ops/utilities/cuda_utils.cuh
cargo:rerun-if-changed=src/tensor_ops/utilities/unary_op_macros.cuh
--- stderr
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', build.rs:139:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
ubuntu@instance-20230508-1136:~/repos/dfdx$ locate nvcc
/home/ubuntu/.local/lib/python3.10/site-packages/cmake/data/share/cmake-3.26/Modules/FindCUDA/run_nvcc.cmake
/home/ubuntu/.local/lib/python3.10/site-packages/torch/share/cmake/Caffe2/Modules_CUDA_fix/upstream/FindCUDA/run_nvcc.cmake
/usr/local/cuda-12.1/bin/__nvcc_device_query
/usr/local/cuda-12.1/bin/nvcc
/usr/local/cuda-12.1/bin/nvcc.profile
/usr/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.26/Modules/FindCUDA/run_nvcc.cmake
/usr/local/lib/python3.10/dist-packages/torch/share/cmake/Caffe2/Modules_CUDA_fix/upstream/FindCUDA/run_nvcc.cmake
/usr/share/doc/cuda-nvcc-12-1
/usr/share/doc/cuda-nvcc-12-1/changelog.Debian.gz
/usr/share/doc/cuda-nvcc-12-1/copyright
/var/cache/apt/archives/cuda-nvcc-12-1_12.1.105-1_amd64.deb
/var/lib/dpkg/info/cuda-nvcc-12-1.list
/var/lib/dpkg/info/cuda-nvcc-12-1.md5sums
ubuntu@instance-20230508-1136:~/repos/dfdx$ export PATH=$PATH:/usr/local/cuda-12.1/bin
ubuntu@instance-20230508-1136:~/repos/dfdx$ cargo build -F cuda
Compiling num-traits v0.2.15
Compiling crossbeam-deque v0.8.3
Compiling dfdx v0.11.2 (/home/ubuntu/repos/dfdx)
Compiling rayon-core v1.11.0
Compiling num-complex v0.4.3
Compiling half v2.2.1
Compiling rand_distr v0.4.3
Compiling rayon v1.7.0
warning: Compiled 48 cuda kernels in 1.152008619s
Compiling gemm-common v0.15.3
Compiling gemm-f32 v0.15.3
Compiling gemm-c32 v0.15.3
Compiling gemm-c64 v0.15.3
Compiling gemm-f64 v0.15.3
Compiling gemm-f16 v0.15.3
Compiling gemm v0.15.3
Finished dev [unoptimized + debuginfo] target(s) in 9.71s
ubuntu@instance-20230508-1136:~/repos/dfdx$
Thank you,
-steve
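The friendlier check suggested in this report could be a small probe in build.rs. `check_tool` is a hypothetical helper and the message wording is invented:

```rust
use std::process::Command;

/// Hypothetical helper: probe for a required build tool up front and return
/// a readable error instead of letting a bare unwrap() panic later.
fn check_tool(name: &str) -> Result<(), String> {
    match Command::new(name).arg("--version").output() {
        Ok(_) => Ok(()),
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => Err(format!(
            "`{name}` not found in PATH; install the CUDA toolkit or add /usr/local/cuda/bin to PATH"
        )),
        Err(e) => Err(format!("failed to run `{name}`: {e}")),
    }
}

fn main() {
    match check_tool("nvcc") {
        Ok(()) => println!("nvcc found"),
        // In a real build.rs this would be a panic (or `cargo:warning=`)
        // carrying this message instead of the raw io::Error.
        Err(msg) => println!("{msg}"),
    }
}
```

Compared with the `Os { code: 2, ... }` panic in the log above, the failure mode at least names the missing tool.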
Running .\target\release\llama-dfdx.exe chat -n=1024
always results in:
error: unexpected argument '-n' found
Usage: llama-dfdx.exe chat
For more information, try '--help'.
The same happens with no subcommand, with generate, and on WSL. Any ideas?
The current implementation counts the number of .bin files to infer which model is in use.
The authoritative implementation of model weight conversion lives in the Hugging Face transformers repository, and it produces two .bin files for the 7B model (as an example).
curl -LO https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
python3 convert_llama_weights_to_hf.py --input_dir llama --model_size 7B --output_dir llama-hf/7B
Fetching all parameters from the checkpoint at llama/7B.
Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|███████████████████████| 33/33 [00:07<00:00, 4.51it/s]
Saving in the Transformers format.
Saving a LlamaTokenizerFast to llama-hf/7B.
Listing the files shows only two bin files:
ubuntu@instance-20230508-1136:/models/llama-hf/7B$ ls
config.json pytorch_model-00001-of-00002.bin tokenizer.json
generation_config.json pytorch_model-00002-of-00002.bin tokenizer.model
lm_head pytorch_model.bin.index.json tokenizer_config.json
model special_tokens_map.json
A simple solution would be to parse config.json for the number of attention heads (in this case, 32).
Here is an example where auto-detection fails:
ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model /models/llama-hf/7B generate "Why is pi round?"
thread 'main' panicked at 'Found 2 .bin files in the model directory. Expected 33, 41, or 81.', src/main.rs:129:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
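A sketch of the config.json parse proposed above, using a naive string scan rather than a JSON crate purely for illustration; `num_attention_heads` is a hypothetical helper:

```rust
/// Hypothetical sketch: pull "num_attention_heads" out of config.json with
/// a naive string scan (a real implementation would use a JSON parser).
fn num_attention_heads(config: &str) -> Option<u32> {
    let key = "\"num_attention_heads\"";
    let idx = config.find(key)?;
    let rest = &config[idx + key.len()..];
    let colon = rest.find(':')?;
    let digits: String = rest[colon + 1..]
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

fn main() {
    // Fragment of a LLaMa 7b config.json (values illustrative).
    let config = r#"{ "hidden_size": 4096, "num_attention_heads": 32 }"#;
    match num_attention_heads(config) {
        Some(32) => println!("detected LLaMa 7b"),
        Some(n) => println!("unexpected head count: {n}"),
        None => println!("could not parse config.json"),
    }
}
```

Keying off config.json would make detection robust to however many shards the converter happened to write.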
When overriding the structure, all is good:
ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model /models/llama-hf/7B --structure llama7b generate "Why is pi round?"
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Why is pi round?
What is the definition of a "natural number"?
What is the smallest natural number?
How to find the smallest natural number in a given range?
What is the 1000th number?
What is the last number?
What is the difference between a decimal number and a rational number?
What is the difference between a natural number and a rational number?
How to calculate the least common multiple?
How to calculate the greatest common factor?
What are the prime numbers?
How do I find the factors of a number?
What is a prime number?
How do I find the
ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model /models/llama-hf/7B --structure llama7b generate "Why is pi round?"
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Why is pi round?
What is pi, and why is it round?
Pi is a constant number. It has an infinite number of digits, however if you were to list all of the digits they would look like this: 3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862
ubuntu@instance-20230508-1136:~/repos/llama-dfdx$ ./target/release/llama-dfdx --model /models/llama-hf/7B --structure llama7b generate "Why is pi round?"
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Why is pi round?
Thread: Why is pi round?
I've been wondering about this for a while and haven't been able to come up with an answer. I've also checked the internet but it doesn't seem to have an answer. I'm assuming the reason it's "round" is that it repeats the same pattern but I can't seem to find an answer that explains it. Why is it 3.141592653589793 instead of 3.141592653589792653
ubuntu@instance-20230508-1136:~/repos/llama-dfdx$
Thank you,
-steve
When trying to run the program, I get this error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Driver(DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain."))', /home/opfromthestart/.cargo/git/checkouts/dfdx-318e6e5ad83eea79/19da9fe/src/tensor_ops/select_and_gather/mod.rs:155:30
Use cases:
In all these cases, we should be able to detect how much GPU RAM is available and determine the maximum amount of the model to store on the GPU that way. More advanced use cases that share the GPU with other applications may need manual control over the memory, but that can come later.
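A sketch of the budgeting logic, assuming the free-memory figure comes from the driver; `layers_on_gpu` and all sizes are illustrative, not measured:

```rust
/// Hypothetical sketch: given free GPU memory and per-layer weight sizes,
/// decide how many layers can live on the GPU; the rest stay in RAM.
fn layers_on_gpu(free_bytes: u64, layer_bytes: &[u64], reserve_bytes: u64) -> usize {
    let mut budget = free_bytes.saturating_sub(reserve_bytes);
    let mut count = 0;
    for &sz in layer_bytes {
        if sz > budget {
            break;
        }
        budget -= sz;
        count += 1;
    }
    count
}

fn main() {
    // 7B model: 32 transformer layers of ~400 MB each (illustrative numbers).
    let layers = vec![400_000_000u64; 32];
    let free = 8_000_000_000u64;    // 8 GB free reported by the driver
    let reserve = 1_000_000_000u64; // keep 1 GB headroom for activations
    let n = layers_on_gpu(free, &layers, reserve);
    println!("{n} of {} layers fit on GPU", layers.len()); // 17 of 32
}
```

The reserve term matters because activations and workspace buffers also need GPU memory; without it a greedy fill would OOM at the first forward pass.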