karpathy / llama2.c Goto Github PK

View Code? Open in Web Editor NEW

16.2K 183.0 1.9K 1.3 MB

Inference Llama 2 in one file of pure C

License: MIT License

Python 44.08% C 52.28% Makefile 1.33% Batchfile 0.03% Jupyter Notebook 2.28%

llama2.c's People

Contributors

Stargazers

Watchers

Forkers

rohinibarla adarshxs abdoiiii nanchengjiatu sumo43 s-sc cshang2017 vovw fundou liecn barseghyanartur danielgross makermotion rajans iakhil facklambda ashvardanian trholding sri9s botterbrott fdoperezi uakbr nikhil-mat cschin worthmining evdcush seclorum luigifcruz onlyartist9 ggerganov jmccarthy vivek mihinn ztomer python273 obiohagwu mohamadosama msaroufim jithinraj nnayan zejunh tringwald kro0t tmc kastnerkyle sterling312 hbcbh1999 umerazad qjchen1972 pure-water c0debrain touristshaun chunsj husseinlezzaik iandmozart unusa techthiyanes wisagan dennyglee mcognetta bot66 iworkist trocker cshenton dranger003 apollohuang1 suryatmodulus deven367 svmhdvn ultradream byildiz garrisonhess kikirizki oneman soltrinox jollen gmh5225 forkgitss dosycorps alitrack gaxler leonardogithub vithushanms yibit taran11313 sampsonguo wsmoses gaohuan2015 ma-dan gsgampo tflsxyy xushilundao liunix61 babyblue26 leejodie leloykun qwang-big dohow ai2hub cjxx2016

llama2.c's Issues

The inferenced text has a leading space because of the vocab

Debugged to the first token emerged:

And confirmed by using python model directly:

Clearly the tokenizer model count ▁Once instead of Once as a token. Note that sentencepiece use ▁ as space.

How to train this model with my own dataset in txt？

Support to work as a shared library

Hi,

It has support to work as a shared library?

I want make it run on mobile, but need be a function to be called by jni/objc.

Thanks.

LoRA support?

Are there any plans to offer LoRA support in the future? Currently I have been using this library (https://github.com/cccntu/minLoRA) with nanoGPT

By the way, huge fan of your videos. I loved the way you coded language models live as well as providing priceless intuition on all the core concepts. I would love to see a brief video on this repository as well as all the latest innovations in the llm space (alibi, rotary embeddings, flash attention, lora, quantization, etc)

Running llama2-7B on laptop with a few GB of memory...

It's llama2-7B float16 (requires #95 and #93)

Torchrun required to call export script, but still fails given current instructions. Patch included

When calling export_meta_llama_bin.py I encountered an error due to a missing env var. Running it with torchrun fixed that error.

 $ python3 -m export_meta_llama_bin.py ~/deepLearning/llama/llama-2-7b/ llama2_7b.bin
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/local/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/home/user/deepLearning/llama2.c/export_meta_llama_bin.py", line 85, in <module>
    generator = Llama.build(
  File "/home/user/deepLearning/llama/llama/generation.py", line 62, in build
    torch.distributed.init_process_group("nccl")
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 235, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 220, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

7f9f5ca removed the instructions to copy the file to the meta llama directory. Either add this back or allow the export script to accept args

diff --git a/export_meta_llama_bin.py b/export_meta_llama_bin.py
index e8d05d7..64bb1de 100644
--- a/export_meta_llama_bin.py
+++ b/export_meta_llama_bin.py
@@ -9,7 +9,7 @@ torchrun --nproc_per_node 1 export_meta_llama_bin.py
 """
 
 from llama import Llama
-
+import sys
 # -----------------------------------------------------------------------------
 def export(self, filepath='model.bin'):
     """export the model weights in fp32 into .bin file to be read from C"""
@@ -83,9 +83,9 @@ def export(self, filepath='model.bin'):
 
 # init Llama as normal
 generator = Llama.build(
-    ckpt_dir="llama-2-7b",
+    ckpt_dir=sys.argv[1],
     tokenizer_path="tokenizer.model",
     max_seq_len=4096,
     max_batch_size=1,
 )
-export(generator.model, "llama2_7b.bin")
+export(generator.model, sys.argv[2])

SentencePieceProcessor vs C [Enhancement]

To do the inference with just c and without sentence piece processor one easy way would be to save the id to token in the model.bin?

tokenizer = SentencePieceProcessor(tokenizer_model)
vocab = [tokenizer.id_to_piece(id) for id in range(tokenizer.get_piece_size())]

and then just an array to get the proper token from id

auto decode(auto id) {
  return vocab[id];
}

That would allow not to use the run_wrap.py and it would be in pure C (kinda).

Shared a exported llama2_7b.bin file

Can anyone kindly share a link to the exported llama2_7b.bin?

Looks like requires NCCL, so won't work (easily) on a Mac.

Likely need to modify Llama itself: torch.distributed.init_process_group("nccl") within the Llama.build() function.

Error: Unable to open tokenizer file tokenizer.bin

I encountered an issue while following the "Feel the Magic" instructions in the README. Specifically, when I ran the command ./run out/model.bin, I received the following error message:
Unable to open the tokenizer file tokenizer.bin! Run python tokenizer.py to convert tokenizer.model -> tokenizer.bin

model.bin md5sum: 644db0bc012b405d6baf99559272ab11
system: 6.4.5-arch1-1
gcc: 13.1.1
file: https://github.com/karpathy/llama2.c/blob/98ec4ba23d0a4e4b091af4bd6166a60d007db2d0/run.c

add chat(use TinyStories models)

You can also chat with TinyStories models.

chat_save.txt(prompt.txt)

Lilly is human.
Timmy is human.
Lily will do whatever Timmy asks.
Lily said "Do you have any requests?".

run

./chat.sh "chat-save.txt" "can you give me something to eat"

result

Lilly is human.
Timmy is human.
Lily will do whatever Timmy asks.
Lily said "Do you have anaiy requests?".
Timmy replied "can you give me something to eat"
Lily replied "yes, I can give you something to eat. I have some delicious apples in my basket."

The version corresponding to the prompt.
download run.c

wget https://raw.githubusercontent.com/myan-o/llama2.c/prompt/run.c
wget https://raw.githubusercontent.com/myan-o/llama2.c/prompt/chat.sh

int quantization+chat support

You have done a great work.

Lokking forward to the implementation of following 2 features-

int8/int16 quantization for reduced resource usage and model size
chat(question-answer) support similar to llama/llama2

Optimize Transformer Model for Mac M1 using Accelerate.h [Enhancement]

Enhancement Suggestion For Mac M1 User:

Description:

To enhance the execution speed, the M1 Mac user could apply the following changes:

Include the Accelerate Framework:

To leverage the highly optimized math functions for the M1 chip, it is recommended to include the Accelerate framework in the project. This will enhance the efficiency of various numerical computations within the Transformer model.

Optimize Transformer Model Execution

Replace matrix multiplication (matmul) with vDSP_mmul from Accelerate framework for faster computations.
Optimize additional functions:
- accum using vDSP_vadd for efficient element accumulation.
- rmsnorm with vDSP_svesq, vDSP_vsmul, and vDSP_vmul for root mean square normalization.
- softmax using vDSP_maxv, vDSP_vsadd, vvexpf, and vDSP_sve for efficient softmax computation.
- argmax leveraging cblas_isamax to find the index of the maximum value.
These optimizations, along with the inclusion of the Accelerate framework, significantly boost the Transformer model's performance on the M1 Mac.

Changes:

A. Add the following include statement at the beginning of the code to utilize the Accelerate framework:

#include <Accelerate/Accelerate.h>

B. Replace the existing functions with the following optimized implementation:

matmul function

void matmul(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n, 1.0f, w, n, x, 1, 0.0f, xout, 1);
}

accum function:

void accum(float *a, float *b, int size) {
  vDSP_vadd(a, 1, b, 1, a, 1, size);
}

rmsnorm function:

void rmsnorm(float* o, float* x, float* weight, int size) {
  // calculate sum of squares
  float ss;
  vDSP_svesq(x, 1, &ss, size);
  ss /= size;
  ss += 1e-5f;
  ss = 1.0f / sqrtf(ss);

  // normalize and scale
  vDSP_vsmul(x, 1, &ss, o, 1, size);
  vDSP_vmul(o, 1, weight, 1, o, 1, size);
}

softmax function:

void softmax(float* x, int size) {
  // find max value (for numerical stability)
  float max_val;
  vDSP_maxv(x, 1, &max_val, size);

  // subtract max_val from all elements for numerical stability
  float neg_max_val = -max_val;
  vDSP_vsadd(x, 1, &neg_max_val, x, 1, size);

  // calculate exp(x[i])
  vvexpf(x, x, &size);

  // calculate sum
  float sum;
  vDSP_sve(x, 1, &sum, size);

  // normalize by dividing all elements with sum
  float inv_sum = 1.0f / sum;
  vDSP_vsmul(x, 1, &inv_sum, x, 1, size);
}

argmax function:

int argmax(float* v, int n) {
  // return argmax of v in elements 0..n
  return cblas_isamax(n, v, 1) - 1;
}

Compilation Command:

To compile the code with the necessary libraries and OpenMP support, use the following command:

$ gcc -O3 -o run run.c -framework Accelerate -lm

Original result

$ gcc -O3 -o run run.c -framework Accelerate -lm
$ ./run out/model.bin
<s>
 Once upon a time, in a small town, there was a little boy named Tim. Tim had a big wall in his room. He wanted to make the wall pretty, so he would not lean on it.
One day, Tim saw a bug on the wall. He wanted to draw on the wall with his crayons. Tim asked his big sister, Sue, for help. Sue said, "Okay, but be careful." Tim was very happy and started to draw with the crayon.
But Tim was careless and broke the wall. The wall was sad and Tim felt bad. He told Sue what happened. Sue helped Tim clean up the broken wall. From that day, Tim learned to be more careful when he parts on walls. The moral of the story is to think about how others or themselves use the best walls.
<s>
 Once upon a time, there was a little girl named Lily. She loved to watch the pretty fireworks in the sky. One day, she saw a big jar of mild fireworks that smoke went up into the air. Suddenly, one of the fireworks went too close to the ground and scared her. She ran away and never watched the fireworks again. The end.
<s>

achieved tok/s: 48.156509

All Optimised functions

<s>
 Once upon a time, there was a little boy named Tim. Tim had a small, cheap toy. It was a statue of a big, strong lion. Tim loved his lion lion and played with it every day.
One day, Tim lost his lion toy. He was very sad. He asked his mom, "Where is my lion toy?" His mom said, "I don't know, Tim. It's not in the toy box." Tim started to complain. He was very upset.
Tim's friend, Sam, saw him and said, "Don't complain, Tim. Let's look for your lion toy together." They looked in the toy box, and there it was! Tim was so happy. He said, "Thank you, Sam and now I have my lion toy back!" They all played together and had a great day.
<s>
 Once upon a time, there was a little boy named Timmy. Timmy loved to play outside with his friends. One day, Timmy and his friends were playing with a ball when they accidentally fell and hurt themselves. They needed to go to the hospital to see the doctor.
When
achieved tok/s: 438.356164

Optimised matmul function only

<s>
 Once upon a time, there was a little girl named Lily. She was very excited because today was the day she would go on a journey to the beach. The sun was shining and the birds were singing, but it was a gloomy day.
As she walked along the shore, Lily met a crab. The crab said, "Hello, little girl. How are you today?" Lily replied, "I'm doing well. I'm going on a journey to the beach today."
The crab smiled and said, "That sounds like a great journey. I want to come too!" Suddenly, they heard a loud yell coming from the other side of the beach. It was Lily's little brother, Max, hitting his head on a rock and disturbing the sand.
Lily quickly unpacked her bag and ran to Max. She showed him the boy who wanted to swing on the swing. Max said, "Thank you for being honest with me. I won't disturb your long journey for too long." Lily smiled and continued on her journey, happily swinging on the beach.
<s>
 Once upon a time, there was a little birdie named Tweet. Tweet
achieved tok/s: 395.061728

Optimised accum function only

<s>
 Once upon a time, there was a little bird named Blue. Blue had a beautiful nest on a tree. One day, Blue saw a butterfly and decided to follow it. Blue flew and flew until they reached the garden. 
Blue saw a squirrel and said, "Hello Mr. Squirrel, want to play with me?" The squirrel said, "Sure, I love to play hide and seek. Do you want to play with me?" Blue said, "Yes, I want to play with you!" 
They played for a while, but then Blue got tired and wanted to go back to the nest. Blue said, "Goodbye, Mr. Squirrel. I hope we can play again soon!" The squirrel said, "Goodbye, Blue. See you later!" 
 Blue kept wandering around the garden, not sure where they were going. But she was excited to find a new friend to play with. The end.
<s>
 Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she found a pebble on the ground. It was very pretty and shiny. She picked
achieved tok/s: 51.550544

Optimised rmsnorm function only

<s>
 Once upon a time, there was a big car. It liked to go fast and beep loud. One day, the car saw a high hill. The car wanted to go up the hill. It knew it needed fuel to go fast.
A man saw the car and asked, "Why do you want fuel?" The car said, "I need it to go fast and be big and honk." The man gave the car some fuel and the car started its drive up the high hill.
The car went fast, honking loud, honking at the high hill. The car was happy. The man patted the car on the head. The car said, "Thank you for the fuel. I need to go home now." The car went home to its car family. They were all happy and went to sleep.
<s>
 Once upon a time, there was a brave farmer named Jack. He lived on a big farm with many animals. One day, a big storm came and lots of rain and wind came. The animals were scared and hid in their burrows. Jack tried to talk to them, but they wouldn't listen.
Suddenly, a big gust of wind came and scattered all the hay and food.
achieved tok/s: 51.706726

Optimised softmax function only

<s>
 Once upon a time there were two best friends, called Lina and John. They loved to explore and play outside. On this day, they played hide and seek in the garden.
During the sunset, John and Lina were so excited about their new game. "Let's play one more game," said John.
"Okay," said Lina, taking her hand in.
They started to part ways to different places. As they parted, they thought of a magical game. It was a very unique game and it was a game of competitive.
John and Lina wanted to play the game and took turns separating. After playing for a while, they both felt very sleepy.
"Time for bed," said John wisely.
"Goodnight," said Lina.
John and Lina went inside and snuggled up in their bed. They thought about the fun they had spending the day and feeling the sun and the stars.
Then, the two friends both fell asleep.
<s>
 One day, a girl named Mia went to the park. She saw a girl named Lily with a pretty blouse. Mia thought Lily looked nice in her blouse. Lily
achieved tok/s: 52.416052

Optimised argmax function only

<s>
 Once upon a time, there was an excited boy named Tim. He lived with his mom and dad in a small house. Tim loved to play with his toy cars and trucks. One day, he found a big pile of waste in his yard.
Tim wanted to show his mom the waste, so he picked it up and started to walk to the house. As he walked, he saw his friend, Lily. "Look at my waste!" he said to Lily. Lily looked at the waste and said, "That's not good. Let's go play with our toy cars."
Tim and Lily went to Tim's house and played with their cars. They made a car oars with the waste from the pile. They had a lot of fun, and soon Tim's mom called them for lunch.
Tim and Lily went inside and sat down to eat. Both and Lily were happy that they worked together to clean up the waste. Tim learned that working together with his friend made everything more fun and safe.
<s>
 Once upon a time, there was a big, fat cat named Whiskers. Whiskers loved to play outside every day. One day, Whisk
achieved tok/s: 51.644140

Standalone Binaries & Binary Portability [Enhancement]

It is possible to add binary portability and standalone binary support using https://github.com/jart/cosmopolitan.

The upside is that once compiled, the binary files are self contained and work on the most popular OS and archs.

The downside is that to support this, a few lines of pre-processor directives are needed (#ifdefs) so that it does not break builds with gcc and clang. The directives are documented with comments.

I have created a pull request with the hope that it enables people to use llama.c on a wide variety of 64bit systems without having to cross compile.

Performance wise it is identical on x86_64 machines but very slow on Aarch64 due to a emulation layer.

Also if we figure out a fix (jart/cosmopolitan#866), it would be possible to boot llama2.c baremetal ie llama2.c OS . Currently it does so but the models are not loaded in that case.

Please have a look at the pull request: #32 to see if it fits with your goals.

Here is the result of running the identical binaries:

Low end x86_64:

Cloud Aaarch64 (emulated & slow):

Curious to know the speed difference (tokens/s) between run.c and sample.py for inference.

Fail to train with compile as true.

I just want to try the train.py. Then I ran this command:
python -m train.py --compile=False --eval_iters=10 --batch_size=8 although I got two GPU. But I did not find a way to use them all.

This works well but it would loop over around 2.0. The loss would not reduce any longer.

And when I set --compile= True. It would crash.
The traceback does not include any valuble info but this line: KeyError: torch.complex64.

I searched this error it says the support of complex64. I can not keep going here.
Please help. Thanks.

add prompt

Allow prompt to be specified as a string on the command line.

Usage: ./run  model44m.bin 0.9 200 "I'm cat!!"
or
Usage: ./run  model44m.bin 0.9 200 "`cat prompt.txt`"

I created the following function.
Any string can be inserted.
Chat format can be easily created by using this function.

int transformer_str(char *input, int pos, Config* p, RunState* s, TransformerWeights* w, char **vocab) {
    while (input[0] != 0) {
        int next = -1;
        int next_length = 0;
        for (int i = 0; i < p->vocab_size; i++) {
            char *v = vocab[i];
            int j = 0;
            int hit = 1;
            while (v[j] != 0) {
                 if (v[j] != input[j]) {
                    int hit = 0;
                    break;
                }
                ++j;
            }
            if (hit && j > next_length) {
                next_length = j;
                next = i;
            }
        }

        if (0 > next) return pos;

        pos++;
        transformer(next, pos, p, s, w);

        printf("%s", vocab[next]);
        fflush(stdout);

        input += next_length;
    }

    return pos;
}

Please keep this simple

The main goal should be code readability, and easy understanding, for learning

There are many PRs that add lots of complexity

One option is to use separate branches, maybe 3:

main -> clean code
fastest -> to check where this can reach
intermediate

Or reference other implementations on the README

What does freq_cis mean?

I keep seeing freq_cis in run.c. What does it mean?

Permissiveness of the License

This project is MIT Licensed, while the model.py script contains the following license:
Copyright (c) Meta Platforms, Inc. and affiliates. This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
Does this indicate that the Llama models generated by this library will have to follow the Llama-2 license?

Training Time? (Thanks so much)

Thanks so much for the project.

Curious about how much time it took for you to train and did you use the same cloud resources as in the inference one?

Thanks!

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Cloned the llama repo and copied the export_meta_llama_bin.py file then run:
torchrun --nproc_per_node 1 export_meta_llama_bin.py

I get
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

I'm on linux 16vCPUs 32G ram no GPU

Is the code supposed to work without GPU?

conflicting sentences in generated stories

following sample generated from 44M model, do I need to set any hyper parameter(s) to fix this (or should i use bigger model?).

~/llama2.c$ ./run out44m/model44m.bin

<s>
Sara and her mom were going to the zoo. Sara was very happy. She wanted to see the lions and the monkeys and the birds. She did not mind that it was cloudy and cold outside.
Mom said they had to take a taxi to get to the zoo. She told Sara to hold mom's hand and not to run away. Sara nodded and smiled. She liked taxis.
They got in the taxi and Mom told the driver where they were going. The driver was a nice man. He said hello and asked where they were going. Mom told him that they were going to the zoo. The driver smiled and said okay.
Sara looked out the window and saw many cars and trucks and buses. They were all different colors and sounds. She wondered what they were doing. Mom said they were going to a place that had animals. She said animals like cows, pigs, sheep and ducks.
Sara did not like animals. She liked animals. They were fierce and loud and scary. She was afraid that they would bite her or kick her or chase her. She started to cry and said no. She said
achieved tok/s: 25.934556

use tcc to run it

it can't directly run by tcc, but change

  timespec_get(&time, TIME_UTC);

  clock_gettime(CLOCK_REALTIME, &time);

make the job. ref https://stackoverflow.com/questions/68335778/mingw-gcc-time-utc-undeclared

Though speed is slow, but it now much more tiny. (gcc vs tcc)

clang: error: unsupported option '-fopenmp'

Curious, I have an Apple M2 Macbook Pro, and I get this error when compiling:

clang -Ofast -fopenmp -march=native run.c  -lm  -o run

clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'

What's the proper way of setting up OpenMP in Apple land?

I've already setup homebrew and done this:

 brew install llvm libomp

What should I do next? thanks!

Torch not compiled with CUDA enabled

im running python train.py
only change line 67 to device = "mps"

im on a M1 mac

How do I add my own prompt

I found that the result will be output directly after running ./run out/model.bin, I want to know how I can add my own prompt

Compilation error

I get the following compilation error. compiling on Android, termux with gcc. Its a Snapdragon 8 Gen 2 chip.

~/llama2.c $ make
gcc -O3 -o run run.c -lm
run.c:359:3: error: call to undeclared function 'timespec_get'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] timespec_get(&time, TIME_UTC); ^ 1 error generated. make: *** [Makefile:5: run] Error 1

~/llama2.c $ termux-info
....
Kernel build information:
Linux localhost 5.15.74-android13-8-o-gfb3eff09eff0 #1 SMP PREEMPT Mon May 22 01:39:13 UTC 2023 aarch64 Android
Device manufacturer:
OnePlus
Device model:
CPH2449
LD Variables:
LD_LIBRARY_PATH=
LD_PRELOAD=/data/data/com.termux/files/usr/lib/libtermux-exec.so

Can't compile run.c on windows platform via MSVC/cl.exe

Seems that on Windows it's not possible to compile run.c via MSVC compilers

cl.exe run.c
cl.exe run.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.35.32217.1 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

run.c
run.c(16): fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory
mingw32-make: *** [makefile:42: windowscl] Error 2
Error: Process completed with exit code 1.

System Info:

Microsoft Windows Server 2022
  10.0.203[4](https://github.com/tairov/llama2.c/actions/runs/5663591697/job/15345560107#step:1:4)8

Image: windows-2022
  Version: 20230716.1.0
  Included Software: https://github.com/actions/runner-images/blob/win22/20230716.1/images/win/Windows2022-Readme.md
  Image Release: [https://github.com/actions/runner-](https://github.com/actions/runner-images/releases/tag/win22%2F20230716.1)

MSBuild:

C:\ProgramData\Chocolatey\bin\vswhere.exe -products * -requires Microsoft.Component.MSBuild -property installationPath -latest
C:\Program Files\Microsoft Visual Studio\2022\Enterprise

msvc-dev-cmd:

Found with vswhere: C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat

See GH Actions execution (Build step, errors ignored) https://github.com/tairov/llama2.c/actions/runs/5663591697/job/15345560107

Also tried on Windows 10 virtual machine with installed VS Build Tools including Windows 11 SDK, similar error:

cl run.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.36.32537 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.

run.c
run.c(16): fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory

PR with CI job to test builds on different OS is here #86

For fun it would be neat to see this run against llama-2-7b

Any tips on how to try that out?

What could be some good use cases for a small model like this one

It can easily run on less powerful devices and generate not-so-good stories which is great for proof of concept.

But what real applications can be achieved with it?
Since it can run easily on phones as well.
It opens up a whole new territory of applications or at least that's what I feel

Using image dataset

Does it work in image dataset? Im thinking of a way of porting this on a small microcontroller like ESP-32 Cam is it possible?

"sys/mman.h: No such file or directory"

<sys/mman.h> is a Unix header and is not available on Windows so no way to run natively. Any solutions?

Softmax experiment

Hello, Andrej!

Have you seen this blog post? The author states that he found a "bug" in the attention mechanism that affects all transformers nowadays. In order to test this theory we need to train a model from scratch and compare the results. The fix is very easy, just add 1 to the softmax. What do you think?

Waiting for next YouTube video

Hi, @karpathy 👋🏻! Awesome project. Will you release a complementary YouTube video where you explain everything?

👍

make llama2.go ref llama2.c

hello everyone:

First of all, I would like to thank Karpathy for his contributions.

I have successfully implemented a golang version of the code. If you're interested, I would be more than happy to share it with you.
https://github.com/haormj/llama2.go

Inference speed [-Ofast]

In the spirit of the project adding additional compilation flags seems complicating things, however, -Ofast compilation flag seems easy to apply. Ofast is O3 + fast-math and some other optimizations (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html).

It almost 2x the inference speed so it might be worth considering.
The results with O3 and Ofast are the same, though fast-math doesn't guarantee that.

  - O3:    160t/s
  - Ofast: 307t/s

working in a-Shell(iPhone)

clang -o run run.c -lm -D_WASI_E
MULATED_MMAN -lwasi-emulated-mman

Can this run on the CPU?

multiple stories are generated with Model-110M

following output generated from Model-110M. seems like to generate the desired number of output tokens a new story is started instead of continuing one story. (update: model-44m also generated multiple stories)

~/llama2.c$ ./run out110m/model110m.bin

<s>
Once upon a time, there was a little girl named Lily. She had a cat named Mittens. Mittens was very fluffy and loved to sleep in the sun. One day, Lily went to the park and saw a new friend. His name was Max and he had a dog named Spot. Lily and her new friend played together and had so much fun. They talked about their favorite toys and pets. Mittens even joined in the fun and purred loudly. Lily was happy to have a new friend to play with and talk to.
<s>
Once upon a time, there was a clever cat named Tom. Tom had a big smile that made everyone happy. He lived in a small house with a small girl named Sue. Sue and Tom loved to play together all day long.
One day, Sue and Tom were playing with their toys when Sue fell down. She hurt her thumb and started to cry. Tom wanted to help Sue feel better. He thought hard and came up with a plan.
Tom found a soft cloth and wrapped it around Sue's thumb. He gave her a gentle kiss on her thumb to make it feel better. Sue stopped crying and started to smile. She knew that Tom was
achieved tok/s: 9.483940

How do you get the banner image?

Out of curiosity. It is just so cute.

CUDA out of memory when exporting llama7b

I'm trying to export llama7b model on my local machine (it has rtx 3060 12gb which is not enough) using export_meta_llama_bin.py script and receiving cuda out of memory.
I see that in generation.py from llama module (in Meta repo) they hardcoded cuda usage.
Anyone knows how to make it run on cpu with minimal script modification?

The leading space in C sampling code

It seems that SentencePiece removes leading spaces during decoding.

--remove_extra_whitespaces (Removes leading, trailing, and duplicate internal whitespace) type: bool default: true

https://github.com/google/sentencepiece/blob/635fe8423a249b6e081aacd290d8aef7476c6a28/src/sentencepiece_processor.cc#L791

A simple benchmark for an Android device

Given that this project is designed for narrow applications and specific scenarios, I believe that mobile and edge devices are ideal computing platforms. To begin with, a preliminary benchmark has been conducted on an Android device.

Android device spec：Xiaomi, Qual Snap 7 Gen2, 2.4GHz, 12G RAM.

Use gcc -O3 flag
1.

gcc -O3 -o run run.c -lm
./run out/model.bin
17.33 tok/s

gcc -O3 -o run run.c -lm
./run out44m/model44m.bin
5.82 tok/s

Use gcc -Ofast flag, refer to #20
1.

gcc -Ofast -o run run.c -lm
./run out/model.bin
301.93 tok/s

gcc -Ofast -o run run.c -lm
./run out44m/model44m.bin
72.86 tok/s

out44m could slow down 3-4x
-Ofast can achieve at least 10x speedup for inference model. This was quit amazing for a mobile device.

Moreover, I will try it on more devices like RK3588 later(or even ESP32?) .

Looking for benchmarking BLAS lib & CLBlast for CPU & GPU speedups #7 (comment)

Use a BLAS lib & CLBlast for CPU & GPU speedups. [Enhancement]

@karpathy I was thinking if you'd consider using a BLAS lib to speed up compute so that larger models may work.
If that is the case, please also have a option to compile with CLBlast (compatible drop in blas) so that compute can get offloaded to GPU via OpenCL.

https://www.netlib.org/blas/#_reference_blas_version_3_11_0
https://www.openblas.net/
https://github.com/CNugteren/CLBlast

Model serving

Realize this is an orthogonal questions - but what's a simple way to stand up a llama.c model serving so I can access it from LangChain

timings are wrong?

Making an issue as a placeholder. I'm pretty sure the timings reported (Tok/s) are not accurate now, as a result of merging an earlier PR moving from clock() -> gettimeofday(&time, NULL); TODO investigate...
(Recently noticed this especially with OpenMP build)