
tinyllama's People

Contributors

chaoscodes, green-sky, hunter-lee1, joennlae, jzhang38, koalazf99, tianduowang, tridao


tinyllama's Issues

Very poor performance using Faraday and an AMD GPU?

Hello, TinyLlama takes all my RAM and has very poor performance, even lower than 7B models. It takes a very long time to load and is worse than most models, and I don't understand what I'm doing wrong. I usually use GGML/GGUF versions, but you provide a .bin that is 4 GB for 1B parameters, so I guess that's the issue. Is the GGML or GGUF model available somewhere?
I'm pretty sure something is wrong. Maybe I can convert it myself? (The real underlying issue is that I have a high-end AMD GPU, which is close to useless here.)

I used the latest version of the base model, not the chat model. Since it's only 1B parameters, maybe I can convert it to GGUF?
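For what it's worth, a ~4 GB pytorch_model.bin for 1.1B parameters is consistent with fp32 weights (1.1B x 4 bytes). A minimal sketch, assuming you only need to shrink the Hugging Face-side load: read the checkpoint in half precision (GGUF conversion itself is done separately with llama.cpp's convert.py, as discussed in later issues). The model ID below is an example; substitute the checkpoint you downloaded.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PY007/TinyLlama-1.1B-intermediate-step-480k-1T"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # fp16 weights: roughly 2.2 GB instead of ~4.4 GB in fp32
    low_cpu_mem_usage=True,
)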

Hardware requirements

Hello, this is a very nice and much-needed development. How much storage is required for complete model training, given that around 1.9 TB is needed for the datasets alone? And how much RAM is required?
Best wishes to the team!

Replay Finetuning & store as GGML

Hi

I have been trying to redo the TinyLlama fine-tuning, starting from PY007/TinyLlama-1.1B-intermediate-step-480k-1T and using both finetuning.py and the last command from script.sh.
I used one A100 40G (only 27 GB of VRAM was used). Everything apparently went well.

I just added:

from transformers import AutoModelForCausalLM, AutoTokenizer

final_model = "path to last checkpoint"

tokenizer = AutoTokenizer.from_pretrained(final_model)

model = AutoModelForCausalLM.from_pretrained(
    final_model,
    device_map="auto",
    trust_remote_code=True,
)

model.save_pretrained("TinyLlama-1.1B-chat-hf")
tokenizer.save_pretrained("TinyLlama-1.1B-chat-hf")

Then I tried to convert the model to a GGML format using convert.py from llama.cpp

!python convert.py <path to TinyLlama-1.1B-chat-hf>

This led to the following error:

Loading model file <path to TinyLlama-1.1B-chat-hf/pytorch_model.bin>
params = Params(n_vocab=32003, n_embd=2048, n_layer=22, n_ctx=2048, n_ff=5632, n_head=32, n_head_kv=4, f_norm_eps=1e-05, f_rope_freq_base=10000.0, f_rope_scale=None, ftype=None, path_model=PosixPath('/content/drive/MyDrive/TinyLlama/TinyLlama/sft/TinyLlama-1.1B-chat-hf'))
Loading vocab file '/content/drive/MyDrive/TinyLlama/TinyLlama/sft/TinyLlama-1.1B-chat-hf/tokenizer.model', type 'spm'
Traceback (most recent call last):
  File "/content/drive/MyDrive/TinyLlama/llama.cpp/convert.py", line 1193, in <module>
    main()
  File "/content/drive/MyDrive/TinyLlama/llama.cpp/convert.py", line 1175, in main
    vocab = load_vocab(vocab_dir, args.vocabtype)
  File "/content/drive/MyDrive/TinyLlama/llama.cpp/convert.py", line 1086, in load_vocab
    return SentencePieceVocab(path, added_tokens_path if added_tokens_path.exists() else None)
  File "/content/drive/MyDrive/TinyLlama/llama.cpp/convert.py", line 372, in __init__
    raise Exception(f"Expected added token IDs to be sequential and start at {len(added_tokens)}; got {actual_ids}")
Exception: Expected added token IDs to be sequential and start at 6; got [0, 1, 2, 32000, 32001, 32002]

Any idea what I am doing wrong?

Some more info contained in related files:

special_tokens_map.json

{
  "additional_special_tokens": [
    "<unk>",
    "<s>",
    "</s>",
    "[PAD]",
    "<|im_end|>",
    "<|im_start|>"
  ],
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "[PAD]",
  "unk_token": "<unk>"
}

added_tokens.json

{
  "</s>": 2,
  "<s>": 1,
  "<unk>": 0,
  "<|im_end|>": 32001,
  "<|im_start|>": 32002,
  "[PAD]": 32000
}

Those files are slightly different from what can be found in PY007/TinyLlama-1.1B-Chat-v0.3 and I don't understand why.
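One workaround that may apply here (an assumption, not an official fix): that version of llama.cpp's convert.py treats added_tokens.json as containing only tokens appended after the base SentencePiece vocab, so the IDs must run sequentially from 32000. Dropping the entries that merely restate base tokens (<unk>=0, <s>=1, </s>=2) leaves [PAD], <|im_end|>, and <|im_start|>, which do satisfy that. A minimal sketch (paths are examples; back the file up first):

import json

path = "TinyLlama-1.1B-chat-hf/added_tokens.json"
with open(path) as f:
    added = json.load(f)

base_vocab_size = 32000
added = {tok: idx for tok, idx in added.items() if idx >= base_vocab_size}

with open(path, "w") as f:
    json.dump(added, f, indent=2, ensure_ascii=False)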


Running on CPU using llama.cpp

Hi,

Posting here even though this is not related to the code itself.

Context:
Context:
I have tried to use Chat-v0.3 directly from the checkpoints ([code](https://gist.github.com/galleon/ca73c87542e9110dea4220bb143e70a5)). I just added eos_token_id=tokenizer.eos_token_id to the example to make it finish as expected.

I get an answer that I consider OK, even though it is made of three sentences. (I have not looked into the details of how you generated the chat version. Is any info available?)

Then I decided to move to llama.cpp making sure to update my version to get the fix for the issue you recently ran into.

I generated the F32 version (which should be identical to the checkpoint).
Here is the result I got with this CLI:
./main -m ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.gguf -p "Please answer in one sentence to this question: What is a Large Language Model?" --n-gpu-layers 0 --temp 0 --escape --seed 42 --color --n-predict -2

Do you know why it continues to generate after the EOS?

Then I moved to the Q5_K quantized version and got the following output.

It is completely AWOL, which makes me think I have done something wrong. Has anyone had similar issues?

Downloading the datasets is inconvenient

The required datasets are huge, and downloading them requires getting around network restrictions here. Could you provide an alternative download channel for the datasets, such as a cloud-drive mirror or a torrent?

Any plans for the ONNX runtime?

One of your stated potential use cases is deployment on edge devices.

For that, the ONNX Runtime is probably the most likely candidate, since it supports a wide range of platforms, APIs, architectures, and hardware accelerators.
It is supposedly easy to convert any Hugging Face hosted model to ONNX with optimum, though I haven't done it personally.

Any thoughts?

Release format + artefact

Dear Authors,
Thanks so much for your amazing project.
Would it be possible for you to release the following:

  1. the optimizer states
  2. the scheduler
  3. a checkpoint just before cooling down the model

These would be highly valuable artefacts for anyone who wants to keep training the model!

Thanks so much, and congratulations on your work!
Pierre

Why does a dimension mismatch occur when I use AutoModelForCausalLM to load a model?

model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path)

File "/usr/local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2795, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3173, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.1.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.2.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.2.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
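For context (hedged, since the environment is not shown): a k_proj/v_proj of shape [256, 2048] is exactly what grouped-query attention with 4 KV heads x 64-dim heads produces, while [2048, 2048] is what an older transformers build without num_key_value_heads support would allocate. A quick sketch to check both the library version and the checkpoint config; the model ID is an example:

import transformers
from transformers import AutoConfig

print("transformers version:", transformers.__version__)      # GQA support in LlamaConfig needs a recent release
config = AutoConfig.from_pretrained("PY007/TinyLlama-1.1B-intermediate-step-240k-503b")  # example checkpoint
print("num_attention_heads:", config.num_attention_heads)                    # expected: 32
print("num_key_value_heads:", getattr(config, "num_key_value_heads", None))  # expected: 4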

TinyLlama-1.1B-orca-gpt4

First, I want to express my gratitude for this project. I think TinyLlama has a lot of potential, and we're just starting to see it. Kudos!

I'm pretty new to this exciting field, and this is the first time I have fine-tuned a model. I took the "base" TinyLlama model (step-240k) and fine-tuned it on the sam-mosaic/orca-gpt4-chatml dataset, but the result does not seem as good as your v0.2 chat model.

I will keep working on this and will share the models I create. I think the RAG approach you are experimenting with now is the right direction, and I'm going to run some experiments with that too.

Anyway the model I produced is here in case you want to take a look: TinyLlama-1.1B-orca-gpt4

Request: Finetune the Model on more Data?

This might be unorthodox, but I had to ask.

I've been trying to run the SFT script on a Colab T4 and on Kaggle (dual T4, P100), and it instantly ran out of memory.
I've also been trying a QLoRA run, which worked for a very small dataset, but the dataset I want to fine-tune on is around 20 GB and takes anywhere from 81 to 135 hours to map; streaming the dataset makes it load nothing, and I can't run CPU or GPU instances for that long.

If the SFT script isn't meant to use that much memory, could you please fix it?
If it is meant to use that much memory, I would like to request that you train a checkpoint or the final model on the UnagamiData dataset.

It's the dataset I used to train my previous model, Unagami. It's a mixture of several high-quality datasets, including Open-Platypus, OASST1, and OpenOrca. It also has some QA-from-context data, such as Databricks Dolly, which could make it better for RAG.

It's currently formatted with HTML-like tokens, such as <system> and <human>; I can switch to ### System: / ### Human: if needed.

Would you consider it?

TinyLlama-chat outputs truncated/small?

From vLLM
Colab --> https://colab.research.google.com/drive/1HOxyJVxo0NeVk8oidvR3dvouGBTYO60X?usp=sharing

I've noticed that the outputs are rather short/truncated compared to the usual models trained on OpenAssistant data.

'### Human: Give me a hello world in python? ### Assistant:' 'Sure, here is a simple "hello world" program in Python:\n\n'
'### Human: Give me a hello world in python? ### Assistant:' 'Sure! Here\'s a simple Python program that says "Hello, world!"'
'### Human: Give me a hello world in python? ### Assistant:' 'Here\'s a simple "hello world" program in Python:\n\n```'
'### Human: Give me a hello world in python? ### Assistant:' 'Sure! Here is a sample code in Python:\n```python\nprint("'
'### Human: Give me a hello world in python? ### Assistant:' "Sure, here's a simple `print()` statement:\n```python\n"
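A guess worth checking (not confirmed from the notebook): the examples above are cut off at roughly 16 tokens, which matches vLLM's default SamplingParams max_tokens of 16, so the truncation may have nothing to do with the model. A minimal sketch that raises the limit explicitly:

from vllm import LLM, SamplingParams

llm = LLM(model="PY007/TinyLlama-1.1B-Chat-v0.3")
params = SamplingParams(temperature=0.7, max_tokens=256)   # the default max_tokens is only 16
prompt = "### Human: Give me a hello world in python? ### Assistant:"
print(llm.generate([prompt], params)[0].outputs[0].text)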

Resuming training

I am training a 120M model from scratch because I would like to do some experiments myself. When I stop and try to resume, I have to drop the batch size significantly, otherwise I get an out-of-memory error. Any idea why?

Also, please consider creating a Discord server where people can discuss the project.

Info message when loading the model

Hi, when I load the model for training:

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    # device_map="auto",
    trust_remote_code=True,
)

I get the following info message. Is this expected?
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /root/bert_path/TinyLlama-1.1B-intermediate-step-240k-503b/TinyLlama-1.1B-intermediate-step-240k-503b and are newly initialized: ['model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

A guide to adding more datasets

One of the requirements is

  • Add scripts for pretraining on other datasets.

I'm assuming that the pretraining dataset script would still work as a fine-tuning dataset script, since the data is processed the same way?

I was looking through prepare_slimpajama.py and, from what I can tell:

  • Data is read in as JSONL files and tokenized into a "packed dataset".

When I looked into the packed dataset, I noticed it appears to be a custom-format dataset.

I think it would be very useful if you wrote a guide on preparing a dataset, for example a small dataset on Colab, because most of our PCs can't handle the sheer size of the tokenized SlimPajama and Starcoder datasets.
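Not the repo's exact binary format, but a hedged sketch of the general idea behind a "packed dataset": tokenize each JSONL document, append an EOS token, and concatenate everything into fixed-length blocks. The file names and model ID below are placeholders.

import json
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PY007/TinyLlama-1.1B-intermediate-step-480k-1T")  # example tokenizer
block_size = 2048
buffer, blocks = [], []

with open("my_dataset.jsonl") as f:          # hypothetical small JSONL file with a "text" field
    for line in f:
        doc = json.loads(line)
        buffer.extend(tokenizer.encode(doc["text"]) + [tokenizer.eos_token_id])
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]

np.save("packed_blocks.npy", np.asarray(blocks, dtype=np.uint16))  # a 32k vocab fits in uint16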

Model mirror for users in China

Very interesting work! However, connections to Hugging Face have been timing out a lot recently. Could you provide a download link that is accessible from within China?

Can it run on CPU?

Hello, can it run on a CPU with 4 GB of RAM?
Could you please guide me regarding the minimum hardware requirements?

Thanks in advance.

Why train three epochs? not one epoch?

Hi, thanks for your great work!
All previous works (e.g. the GPT and LLaMA families) pre-train for a single epoch, but I see that you train for three epochs. Why three?
Looking forward to hearing from you when you have time. Thank you very much.

Colab

Can it be run on colab?

Notes on chat fine-tuning and data content

I adapted TimDettmers' filtered OpenAssistant dataset so that it uses the Llama 2 prompt format (i.e. with [INST]); see here.

I then fine-tuned TinyLlama at the 1T-token checkpoint (with LoRA applied to all modules); see here.

Observations:
A. TinyLlama seems to have issues emitting an EOS (</s>) token. For example:

<s> [INST] What planets are in our solar system? [/INST] 1. Mercury

2. Venus

3. Earth

4. Mars

5. Jupiter

6. Saturn

7. Uranus

8. Neptune

9. Pluto

10. Ceres

11. Callisto

12. ...

This leads me to wonder: are BOS and, particularly, EOS tokens (<s> and </s>) used during pre-training?
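A small, hedged check on the tokenizer side (it does not answer what the pre-training loop did, only what the tokenizer inserts by default); the checkpoint ID is an example:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("PY007/TinyLlama-1.1B-intermediate-step-480k-1T")  # example checkpoint
ids = tok("What planets are in our solar system?")["input_ids"]
print(ids)                                   # a leading 1 (<s>) is typical for Llama tokenizers
print(tok.bos_token, tok.bos_token_id)       # <s>, 1
print(tok.eos_token, tok.eos_token_id)       # </s>, 2
print(getattr(tok, "add_bos_token", None), getattr(tok, "add_eos_token", None))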

B. I notice that when running inference with the raw 1T checkpoint (i.e. not chat fine-tuned), it is common to see ### in the response:

<s> [INST] Generate a python code snippet to add two numbers. [/INST] 

### [INST] Generate a python code snippet to add two numbers.

### [INST] Generate a python code snippet to add two numbers.

...

I'm somewhat surprised to see this '###'. Does this mean some chat or instruction fine-tuning datasets are mixed into the pre-training data?

How to speedup tokenizer.encode?

I found that the pre-training datasets contain some documents with a huge number of characters, which take a very long time to encode. For example, a document with 15,955,671 characters takes 6.6 hours to encode.

How do you speed this up? By splitting the document into many sub-documents? I use Megatron for pre-training; any ideas?

Looking forward to hearing from you when you have time. Thank you very much.
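One hedged option (not necessarily how this repo does it): split a very long document into character chunks and batch-encode them with the fast (Rust) tokenizer, then concatenate the IDs. Chunk boundaries may split a word, which changes at most a couple of tokens per boundary; the chunk size and model ID are arbitrary examples.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PY007/TinyLlama-1.1B-intermediate-step-480k-1T", use_fast=True)  # example

def encode_long_doc(text, chunk_chars=100_000):
    # split the document into fixed-size character chunks
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    # batch-encode the chunks, then flatten back into one token-id list
    encoded = tokenizer(chunks, add_special_tokens=False)["input_ids"]
    return [tok_id for ids in encoded for tok_id in ids]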

Getting gibberish output when running on llama.cpp

Hi, I see the mention of running this model on llama.cpp in the README. Did you manage to get it to run and quantize with good output? I'm trying to evaluate whether this model can be used for speculative decoding with Llama 2 7B.

With the first checkpoint, https://huggingface.co/PY007/TinyLlama-1.1B-step-50K-105b, there seems to be some issue when converting to GGUF:

python convert.py ../TinyLlama-1.1B-step-50K-105b/

./main -m ../TinyLlama-1.1B-step-50K-105b/ggml-model-f32.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -ngl 0 --temp 0

This results in the following. Both F16 and F32 produce it, and adding a <s> token at the beginning didn't help either:

(...)
Building a website can be done in 10 simple steps:\nStep 1:12000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
(...)

Running with huggingface/torch gives a more reasonable result, although it quickly becomes repetitive:

<s> Building a website can be done in 10 simple steps:
Step 1: Create a website.
Step 2: Add a logo.
Step 3: Add a contact form.
Step 4: Add a blog.
Step 5: Add a social media links.
Step 6: Add a contact page.
Step 7: Add a contact form.
Step 8: Add a contact form.
Step 9: Add a contact form.

Not sure where this mismatch is coming from

Thanks

eval loss become nan after a single batch

Hello,
I am trying to fine-tune the model with the script you provided, on four RTX 3090 GPUs.
However, I was getting a CUDA out-of-memory error, so I made the following change:

model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    device_map=device_map,
    trust_remote_code=args.trust_remote_code
)
model = model.half()

The model now fits on my GPUs, but the training loss becomes 0 after a single batch, and the evaluation loss is NaN.
I tried to check the model's predictions after training, but the output contains NaN, so it does not work.
What I have already tried to solve the issue:

  • different hyper-parameters (lr and wd)
  • different datasets (alpaca-cleaned and osst1)
  • different checkpoints (TinyLlama-1.1B-intermediate-step-240k-503b and TinyLlama-1.1B-step-50K-105b)

But I get the same result every time. I assume this is due to the use of float16, since that is the main difference between my code and the original. Do you have an idea of what is happening, and of what I could do about it?
Thank you!
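For what it's worth, a common first thing to try here (an assumption, not a confirmed fix): fp16 produced by .half() has a much smaller exponent range than the bfloat16 the model was pre-trained in, and training in it without loss scaling often collapses to 0/NaN. The RTX 3090 (Ampere) supports bfloat16, so loading the weights in bf16 instead of casting to fp16 is a minimal change to test; the checkpoint ID below is an example from this thread.

import torch
from transformers import AutoModelForCausalLM

model_name_or_path = "PY007/TinyLlama-1.1B-intermediate-step-240k-503b"  # example checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,   # replaces the explicit model.half()
    trust_remote_code=True,
)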

Why is the vocab size of `TinyLlama-1.1B-Chat-V0.1` 32001?

Makes it somewhat more annoying to use.

Also, were there any changes in how the weights are saved between TinyLlama-1.1B-intermediate-step-240k-503b and TinyLlama-1.1B-intermediate-step-50k-105b? I'm getting incorrect output with the newer checkpoint for code that worked with the first checkpoint.

Minimum learning rate

The minimum learning rate is set to the same value as the maximum. Is this intentional or a mistake? If intentional, why? (You can skip the explanation if it is too bothersome.)

Problem with TinyLlama-1.1B-Chat-v0.3 tokenizer

I am wondering if this behavior is correct:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PY007/TinyLlama-1.1B-Chat-v0.3")
vocab = tokenizer.get_vocab()
print(f"vocab_size: {tokenizer.vocab_size}")
print(f"length get_vocab: {len(vocab)}")
print(list(vocab.keys())[list(vocab.values()).index(32000)])
print(list(vocab.keys())[list(vocab.values()).index(32001)])
print(list(vocab.keys())[list(vocab.values()).index(32002)])

which prints:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
vocab_size: 32000
length get_vocab: 32003
[PAD]
<|im_start|>
<|im_end|>

I compared the tokenizers from TinyLlama-1.1B, TinyLlama-1.1B-Chat, Llama-2-7b-hf, and Llama-2-7b-chat-hf, which are supposed to be the same, and only TinyLlama-1.1B-Chat has this discrepancy.

Could you share a ready-made Python environment package?

It may be basic, but it really is important and time-consuming. The project does not seem to pin stable package versions, and building the environment myself runs into all kinds of problems; it currently gets stuck building xformers from source. Would it be possible to share a link to a packaged Python environment, for example a conda environment uploaded to a cloud drive? Thanks.
