
yarn's Introduction

YaRN

This repo contains the code and data for the YaRN context window extension method.

Paper

Paper (ICLR 2024): YaRN: Efficient Context Window Extension of Large Language Models
Old Preprint (arXiv)

Models

LLaMA

We publish variants of Llama 2 fine-tuned with YaRN at 32K, 64K and 128K context window length. They are available under the Llama 2 license on 🤗 Hugging Face.

Size Context Link
7B 64K NousResearch/Yarn-Llama-2-7b-64k
7B 128K NousResearch/Yarn-Llama-2-7b-128k
13B 64K NousResearch/Yarn-Llama-2-13b-64k
13B 128K NousResearch/Yarn-Llama-2-13b-128k
70B 32K NousResearch/Yarn-Llama-2-70b-32k

In addition, we also publish 8K context window versions of Llama 2 7B fine-tuned with NTK-aware and YaRN (Table 1 in the conference paper).

Mistral

With the release of v2 of our paper, we are also publishing 64K and 128K variants of Mistral 7B v0.1.

Size Context Link
7B 64K NousResearch/Yarn-Mistral-7b-64k
7B 128K NousResearch/Yarn-Mistral-7b-128k

SOLAR

The SOLAR 10.7B v1.0 model uses depth up-scaling to add layers to Mistral 7B v0.1, which may improve long-context performance on a per-parameter basis. We publish 32K and 64K variants.

Size Context Link
10.7B 32K NousResearch/Yarn-Solar-10b-32k
10.7B 64K NousResearch/Yarn-Solar-10b-64k

Reproduction

We strongly believe in open science, and thus publish all code and data to reproduce the results in our paper. To reproduce, clone the repository and perform a local installation.

git clone https://github.com/jquesnelle/yarn
cd yarn
pip install -e .

Training

To train the models, run accelerate config and enable DeepSpeed acceleration. deepspeed/zero3.json was the configuration file used for training.

# ./train.sh

The tokenized training data is available on 🤗 Hugging Face and was derived from the pg19 dataset. For the Mistral models, a mix of the pretrain and fine-tune splits of Long-Data-Collections was used, and the tokenized dataset is also available on 🤗 Hugging Face.
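
For reference, the tokenized datasets can be pulled directly with the datasets library. The sketch below is illustrative: the dataset name is one of those referenced in train.sh and the issues below, and the exact split and column names are assumptions.

from datasets import load_dataset

# Load one of the pre-tokenized training sets (name taken from train.sh); rows are
# already chunked to the training sequence length.
train_dataset = load_dataset("emozilla/yarn-train-tokenized-16k-mistral", split="train")
print(train_dataset.column_names)  # expected to include input_ids (and attention_mask)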

Evaluation

To reproduce the evaluations, install lm-evaluation-harness with pip install git+https://github.com/EleutherAI/lm-evaluation-harness and then run the two provided scripts.

# ./eval.sh
# ./eval-harness.sh

Citation

@inproceedings{
      peng2024yarn,
      title={Ya{RN}: Efficient Context Window Extension of Large Language Models},
      author={Bowen Peng and Jeffrey Quesnelle and Honglu Fan and Enrico Shippole},
      booktitle={The Twelfth International Conference on Learning Representations},
      year={2024},
      url={https://openreview.net/forum?id=wHBfxhZu1u}
}

yarn's People

Contributors

bloc97, cebtenzzre, honglu2875, jquesnelle


yarn's Issues

OOM on two 80GB GPUs

accelerate launch finetune.py \
    --output-dir output/mistral-yarn-7b-64k \
    --model mistralai/Mistral-7B-v0.1 \
    --architecture mistral \
    --scaling-factor 2 \
    --max-position-embeddings 16384 \
    --dataset emozilla/yarn-train-tokenized-8k-mistral \
    --sliding-window-attention-schedule 4096 \
    --lr-schedule constant \
    --learning-rate 0.000001 \
    --max-train-steps 1000

Both with and without LoRA I hit the OOM error. This is on only an 8K sequence length, so memory consumption should be around 4x smaller compared with training on a 16K sequence length.

accelerate is configured to use two GPUs and FSDP.

What is the purpose of `finetuned` parameter in `LlamaDynamicYaRNScaledRotaryEmbedding`?

I see that the __init__ method of LlamaDynamicYaRNScaledRotaryEmbedding has a boolean parameter called finetuned. What is the purpose of that parameter? Should we set it to False while fine-tuning the model and then set it to True for inference after fine-tuning? What could be the problem if we keep it False regardless of whether the model is fine-tuned or not?

Could this repository be used for SFT based on YaRN?

Thank you for your team's open-source contributions!
From the code, it seems to only support pre-training. I want to conduct extrapolation training in the SFT phase, taking the Instruct version from 4k to 16k. How should I proceed?

dataset preprocessing script

Hi, can you also share the preprocessing script used to convert the dataset to the standard format? Also, why is the attention_mask required in the dataset?

Compute Requirements

Great stuff!

Just out of curiosity, what was the compute setup used?
I couldn't seem to find details such as GPU type and cluster size in the paper.

Thanks!

inv_freq seems not calculated right

Hello, I'm thrilled to see that linear and NTK interpolation have been elegantly combined to create a much stronger interpolation strategy, YaRN. However, while going through the code in modeling_llama.py, I find myself a bit confused by the calculation of inv_freq, particularly at line 398.

According to the YaRN paper, in equation 23, it is stated as follows:

$$ \lambda_d'=(1-\gamma_d)s\lambda_d+\gamma_d\lambda_d $$

Consequently, we can derive:

$$ h(\theta_d) = \frac{2\pi}{\lambda_d'} = \frac{2\pi}{(1-\gamma_d)s\lambda_d+\gamma_d\lambda_d} = \frac{\theta_d}{(1-\gamma_d)s+\gamma_d} $$

However, in the paper, the calculation of $h(\theta_d)$ in equation 25 is different:

$$ h(\theta_d) = \left(\frac{(1-\gamma_d)}{s}+\gamma_d\right)\theta_d \neq \frac{2\pi}{\lambda_d'} $$

Hence, I think there might be some problem with equation 25 and also with line 398. Perhaps we can revise the yarn function as follows, since I've empirically found that this fix can further enhance performance:

def revised_yarn(self, device):
    inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))

    low, high = _yarn_find_correction_range(self.beta_fast, self.beta_slow, self.dim, self.base, self.original_max_position_embeddings)
    inv_freq_mask = (1 - _yarn_linear_ramp_mask(low, high, self.dim // 2).float().to(device)) * self.extrapolation_factor
    inv_freq = inv_freq / ((1 - inv_freq_mask) * self.scale + inv_freq_mask)

    self.register_buffer("inv_freq", inv_freq, persistent=False)
    self.mscale = float(_yarn_get_mscale(self.scale) * self.attn_factor)
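
To make the discrepancy concrete, here is a small numeric check of the two forms discussed above (illustrative only; the gamma, s, and theta values are arbitrary):

# Compare Eq. 25 (as implemented at line 398) with the form derived from Eq. 23.
s = 16.0        # scale factor
theta = 0.01    # an example frequency theta_d
for gamma in (0.0, 0.5, 1.0):
    eq25 = ((1 - gamma) / s + gamma) * theta        # paper's Eq. 25
    eq23 = theta / ((1 - gamma) * s + gamma)        # derived from Eq. 23 above
    print(f"gamma={gamma}: eq25={eq25:.6f}  eq23={eq23:.6f}")
# The two agree only at gamma = 0 and gamma = 1; in between they differ,
# which is exactly the discrepancy this issue raises.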

deepspeed config crashed for `auto` and OOM

1. Using DeepSpeed

The config file deepspeed/zero3.json throws an error and `auto` cannot be used. I modified the config myself (not sure whether it is correct) just to get things running:

(screenshot of the modified config)

Command used:

accelerate launch finetune.py \
    --output-dir output/yarn-7b-32k \
    --model NousResearch/Llama-2-7b-hf \
    --learning-rate 0.00001 \
    --lr-schedule constant \
    --scaling-factor 8 \
    --deepspeed

Then: OOM.

2. Without DeepSpeed

In accelerate config I disabled DeepSpeed and dynamo; by default the first configuration in train.sh should be the 64k-length one. OOM.

# run `accelerate config` first. pass --deepspeed to finetune.py if using DeepSpeed

accelerate launch finetune.py \
    --output-dir output/yarn-7b-64k \
    --model NousResearch/Llama-2-7b-hf

3. Code error

Does the DDP part need to be changed? This is my first time using it, so I'm not sure whether it is correct.
(screenshot of the code change)

Question

I checked: x.shape is torch.Size([1, 65536, 4096]), so even a single 80 GB GPU does not seem to have enough memory.

So should tensor parallelism (tp) be configured somewhere? The README is not very friendly to newcomers, though. QAQ

An OOM error occurred while computing the perplexity of 128k Proof-pile documents with a maximum token count set to 128k.

Thank you so much for your open source work.

I evaluated the 128K context capacity of the LLaMA-2 7B model using an NVIDIA A100 (80 GB) GPU. However, I encountered an OOM error. Here is my script:

PG19="--tokenized emozilla/pg19-test-tokenized"
PROOFPILE_LONG_SMALL="--tokenized emozilla/proofpile-test-tokenized --dataset-min-tokens 131072 --samples 10 --truncate"
CUSTOM="--custom-model-together"

python eval/perplexity.py \
    ${PROOFPILE_LONG_SMALL} ${CUSTOM} \
    --output-file data/proofpile-long-small.csv \
    --min-tokens 131072 --max-tokens 131072 --tokens-step 2048 --aggressive-memory \
    -m llama2_7b_yarn_64k

(screenshot of the OOM error)

How to generate plot?

Thank you for sharing this!

I'd like to review your steps for generating the plots I've seen on Twitter.

Could you please include your plot-generation script? I know it's calling perplexity.py, but I'd like to retrace your steps exactly. Then I can tweak it :)
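
In the meantime, a minimal plotting sketch along these lines may help. It is hedged: the CSV layout is assumed to be one column of context lengths followed by one column of perplexities, as written by eval/perplexity.py --output-file, and the column positions are placeholders.

import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: first column = context length in tokens, second column = perplexity.
df = pd.read_csv("data/proofpile-long-small.csv")
x_col, y_col = df.columns[0], df.columns[1]
plt.plot(df[x_col], df[y_col], marker="o")
plt.xlabel("Context window (tokens)")
plt.ylabel("Perplexity")
plt.title("Sliding-window perplexity vs. context length")
plt.savefig("perplexity.png", dpi=150)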

A potential bug in scaled_rope/LlamaDynamicScaledRotaryEmbedding.py

  1. The comment "# This if block is unlikely to be run after we build sin/cos in __init__. Keep the logic here just in case." might be incorrect. From what I understand, the code following this comment calculates the scale value based on the actual length of the input. However, the value cached in __init__ is unscaled. Therefore, this branch should be executed frequently.

  2. The new values for cos_cached and sin_cached shouldn't be cached. If they are, after encountering a long sample, all subsequent samples will use the scaled values, regardless of their length.
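
To illustrate point 2, here is a deliberately simplified sketch (not the repo's actual code) of how caching the rescaled cos/sin values leaks the scaling into later, shorter inputs:

import torch

class DynamicRotaryCacheSketch:
    def __init__(self, dim=8, max_pos=2048, base=10000.0):
        self.dim, self.max_pos, self.base = dim, max_pos, base
        self._build(max_pos, scale=1.0)              # unscaled cache built in __init__

    def _build(self, seq_len, scale):
        base = self.base * scale                     # stand-in for the dynamic rescale
        inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float() / self.dim))
        freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)
        self.cos_cached, self.sin_cached = freqs.cos(), freqs.sin()
        self.cached_len = seq_len

    def forward(self, seq_len):
        if seq_len > self.cached_len:                # point 1: runs for any input longer than the cache
            self._build(seq_len, scale=seq_len / self.max_pos)
        # Point 2: a short input arriving after a long one reuses the *scaled* cache here
        # instead of falling back to the original, unscaled values.
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]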

Finetune Example

Awesome job on this

Do you have any examples of a fine-tune cli / setup to show llama3b 4096 | 6144?

Error about eval/passkey.py

/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [21,0,0], thread: [61,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

When I run eval/passkey.py, it reports the above error.
How can I solve it?

Why the updated cache is initialized with seqlen=256?

Hi~
I am currently following the HF version for exploration, but I find that when the KV cache is updated in Llama (NousResearch/Yarn-Llama-2-7b-128k), the length of the newly appended empty cache is always 256 (line 528):

past_kv = torch.cat([past_kv, torch.empty(bsz, 256, 2, kv.size(3), kv.size(4), dtype=kv.dtype, device=kv.device)], 1)

I think it should be

past_kv = torch.cat([past_kv, torch.empty(bsz, kv.size(1), 2, kv.size(3), kv.size(4), dtype=kv.dtype, device=kv.device)], 1)

Is that right? Or am I misunderstanding this procedure?

Linear Scaled Embedding Has Different Implementation?

I compared your code with The Bloke's code for the linear scaled embedding. There are some differences:

  1. Your code changes the scale to self.scale = 1/scale, making it a fraction, and then divides t by the fractioned scale (t /= self.scale). But The Bloke's code multiplies t by the fractioned scale. Which one is right?
  2. Your code's max_position_embeddings seems to stay at 2048, but The Bloke's code changes it according to the max context length. Or did you actually change max_position_embeddings in the config file?

Which one follows the implementation from kaiokendev?

RoPE scaling config confusing

Hi Yarn team,
thank you guys for the awesome work. Currently I'm trying to evaluate several RoPE scaling methods, and fortunately they are all available in this repo. I have some questions about the config for RoPE scaling.
I see that in requirements.txt you already include transformers >= 4.34.0, so it means I could use "linear" and "dynamic-ntk" out of the box with transformers, just by adding the rope scaling in AutoConfig.from_pretrained() like this:
config.rope_scaling = { "type": "linear", "factor": args.linear }
or
config.rope_scaling = { "type": "dynamic", "factor": args.dynamic_ntk }
I tried that, removed the patch for linear & dynamic-ntk, and the results look identical to those from your implemented patch.
Moreover, it also supports the Falcon architecture. (https://github.com/huggingface/transformers/blob/main/src/transformers/models/falcon/modeling_falcon.py#L162)
So my question is: is there any difference between these two implementations, or is your linear & dynamic-ntk patch there to keep the reproduction evals consistent?
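
For reference, the built-in path described above looks roughly like this. It is a sketch: whether "linear" and "dynamic" RoPE scaling is supported for a given architecture depends on the transformers version, and the factor value here is an arbitrary example.

from transformers import AutoConfig, AutoModelForCausalLM

model_id = "NousResearch/Llama-2-7b-hf"
config = AutoConfig.from_pretrained(model_id)
# Built-in RoPE scaling, no external patch needed:
config.rope_scaling = {"type": "linear", "factor": 4.0}   # or {"type": "dynamic", "factor": 4.0}
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)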

OSError: [Errno 28] No space left on device

Resolving data files: 100%|████████████████████| 136/136 [00:00<00:00, 296941.88it/s]
Loading checkpoint shards: 100%|████████████████████| 2/2 [00:02<00:00,  1.27s/it]
/root/miniconda3/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/root/miniconda3/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
Resolving data files: 100%|████████████████████| 136/136 [00:00<00:00, 284501.42it/s]
Loading checkpoint shards: 100%|████████████████████| 2/2 [00:01<00:00,  1.01it/s]
(the same two UserWarnings are printed again)
Resolving data files: 100%|████████████████████| 136/136 [00:00<00:00, 121937.87it/s]
Generating train split: 61410 examples [01:49, 561.64 examples/s]
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
    writer.write_table(table)
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/arrow_writer.py", line 577, in write_table
    self.pa_writer.write_table(pa_table, writer_batch_size)
  File "pyarrow/ipc.pxi", line 525, in pyarrow.lib._CRecordBatchWriter.write_table
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/implementations/local.py", line 365, in write
    return self.f.write(*args, **kwargs)
OSError: [Errno 28] No space left on device

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "finetune.py", line 193, in <module>
    main(args.parse_args())
  File "finetune.py", line 67, in main
    train_dataset = load_dataset('/root/autodl-tmp/data/emozilla___pg_books-tokenized-bos-eos-chunked-65536/default/0.0.0/9107755b15521c04', split='train',
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/load.py", line 2136, in load_dataset
    builder_instance.download_and_prepare(
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 1813, in _prepare_split

Sliding window perplexity with truncated documents

Hi! Thanks for sharing your nice work.
I have some questions about the perplexity evaluation setup.

In Figure 1, it is mentioned that the sliding window perplexity is reported, with documents truncated to the evaluation context length.
I was wondering if that makes sense, because (as far as I understood) sliding window evaluation is something that you do when the document length is longer than the evaluation context length.

Also, is there a particular reason you used truncation for proof-pile and not for gov_report?

Thanks in advance!
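
For context, sliding-window perplexity with stride S is usually computed along these lines. This is a generic illustrative sketch (not the repo's eval/perplexity.py), where model is assumed to be a Hugging Face causal LM that returns a mean cross-entropy loss.

import torch

def sliding_window_ppl(model, input_ids, window=4096, stride=256):
    # Score each token once, conditioning on up to `window` tokens of context,
    # advancing the window by `stride` tokens at a time.
    nlls, prev_end = [], 0
    for begin in range(0, input_ids.size(1), stride):
        end = min(begin + window, input_ids.size(1))
        new_tokens = end - prev_end                 # tokens not scored in a previous step
        ids = input_ids[:, begin:end]
        labels = ids.clone()
        labels[:, :-new_tokens] = -100              # mask out already-scored context
        with torch.no_grad():
            nlls.append(model(ids, labels=labels).loss * new_tokens)
        prev_end = end
        if end == input_ids.size(1):
            break
    return torch.exp(torch.stack(nlls).sum() / end)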

Question about Yarn environment configuration (v2)

Hi Yarn team,

I hope this issue finds you well. I cloned your code (v2, from two weeks ago) on our machine and ran into the following error:

Traceback (most recent call last):
  File "/app/yarn_4/finetune.py", line 293, in <module>
    main(args.parse_args())
  File "/app/yarn_4/finetune.py", line 52, in main
    from scaled_rope.modeling_llama_yarn import LlamaForCausalLM
  File "/app/yarn_4/scaled_rope/modeling_llama_yarn.py", line 34, in <module>
    from transformers.utils import (
ImportError: cannot import name 'is_flash_attn_2_available' from 'transformers.utils' (/opt/conda/lib/python3.10/site-packages/transformers/utils/__init__.py)

In our current environment, we are using the following versions:

  • Python: 3.10
  • PyTorch: 2.1.0
  • CUDA: 11.8
  • Transformers: 4.34.0
  • PyTorch-CUDA: 11.8
  • Torchtriton: 2.1.0
  • Torchvision: 0.16.0
  • Accelerate: 0.24.1
  • Deepspeed: 0.12.3
  • flash-attn: 2.3.3

We would like to adapt the YaRN environment to our specific setup. Specifically, we would like to ask which versions of transformers, accelerate, and deepspeed were used in the YaRN environment. Could you please provide details on how these tools are configured in your environment?

Any guidance or information you can offer regarding this matter would be greatly appreciated.

Thank you for your time and assistance!

Testing yarn on practical tasks.

Hello, this is Chenxin.

I am so excited to see the first open-source model with more than 100k context!!! This is undoubtedly very significant progress by the open-source community on LCLMs.
I've noticed that the current version of YaRN only has PPL (perplexity) experiments, which do not always correlate with practical long-context understanding tasks. I am glad to help test llama2-yarn-128k on LEval, but I do not have the resources to do SFT based on llama2-yarn-128k. Would you mind providing an instruction-following version?

Thanks again for the great work!

Running Error

When launching finetune.py using the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4 accelerate launch finetune.py --output-dir output/yarn-7b-64k --model /data/wy/llm_base/Llama-2-7b-hf --dataset /data/wy/LLMScaledData/pg_books-tokenized-bos-eos-chunked-6/data

The following error occurred:
Traceback (most recent call last):
File "/data/wy/yarn/finetune.py", line 293, in <module>
main(args.parse_args())
File "/data/wy/yarn/finetune.py", line 156, in main
model.gradient_checkpointing_enable()
File "/home/centos/anaconda3/envs/llm_sacled/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'gradient_checkpointing_enable'

Need to modify 'model.gradient_checkpointing_enable()' to 'model.module.gradient_checkpointing_enable()'
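
A hedged sketch of that fix, unwrapping only when the model is actually DDP-wrapped (model is the object prepared in finetune.py):

from torch.nn.parallel import DistributedDataParallel

def enable_gradient_checkpointing(model):
    # Call gradient checkpointing on the underlying HF model when DDP has wrapped it.
    target = model.module if isinstance(model, DistributedDataParallel) else model
    target.gradient_checkpointing_enable()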

Confirmation of License

Nice work making this.

Could you clarify/confirm the license here? I see an MIT License here on GitHub and no license on Hugging Face.

I would have assumed this has to at least be Meta Community License as that would transfer through because of using Llama 2.

It looks like the only new training data added is PG-19, which seems to be Apache 2.0, so it seems that YaRN could take on a Meta Community License.

Mistral-train error on deepspeed config

File "/workspace/long/yarn/finetune.py", line 143, in main
model = accelerator.prepare(model)
File "/root/miniconda3/envs/yarn/lib/python3.10/site-packages/accelerate/accelerator.py", line 1280, in prepare
result = self._prepare_deepspeed(*args)
File "/root/miniconda3/envs/yarn/lib/python3.10/site-packages/accelerate/accelerator.py", line 1515, in _prepare_deepspeed
raise ValueError(
ValueError: When using DeepSpeed accelerate.prepare() requires you to pass at least one of training or evaluation dataloaders or alternatively set an integer value in train_micro_batch_size_per_gpu in the deepspeed config file or assign integer value to AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu'].
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1523510) of binary: /root/miniconda3/envs/yarn/bin/python
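
One workaround is the option the error message itself points at. A hedged sketch (the value 1 is an assumption; it must be set before accelerator.prepare(model) is called):

from accelerate.state import AcceleratorState

# Set an explicit micro-batch size instead of "auto" so DeepSpeed can be initialized
# even when prepare() is called without a dataloader.
AcceleratorState().deepspeed_plugin.deepspeed_config["train_micro_batch_size_per_gpu"] = 1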

Training takes a long time

Why does it take so long for me to fine-tune llama2-7b-64k?
Each epoch takes 300+ seconds.
I used 8xA100, turned on DeepSpeed, and used "yarn" as the RoPE type.
Is it a problem with flash attention? But I see that modeling_llama_together_yarn.py uses flash attention by default.
Thanks a lot.

Phi 2

Hi,
Thank you for releasing this code! Are there any plans to train a Phi 2 model?
Thanks!

Runtime error

Hi,
I am trying to fine-tune a 7B model for a 16k context length on an 8-GPU A100 (40 GB) machine, but I am getting the following runtime error:

Traceback (most recent call last):
File "/home/ec2-user/data/yarn/finetune.py", line 222, in <module>
    main(args.parse_args())
  File "/home/ec2-user/data/yarn/finetune.py", line 150, in main
    loss = model(**batch).loss
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ec2-user/data/yarn/scaled_rope/modeling_llama_together_yarn.py", line 985, in forward
    outputs = self.model(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ec2-user/data/yarn/scaled_rope/modeling_llama_together_yarn.py", line 860, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/home/ec2-user/data/yarn/scaled_rope/modeling_llama_together_yarn.py", line 856, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ec2-user/data/yarn/scaled_rope/modeling_llama_together_yarn.py", line 620, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ec2-user/data/yarn/scaled_rope/modeling_llama_together_yarn.py", line 555, in forward
    ).reshape(bsz, q_len, h_size)
RuntimeError: shape '[1, 16384, 4096]' is invalid for input of size 13459456

Here is the command:

accelerate launch finetune.py --wandb yarn --output-dir output/yarn-7b-16k --model meta-llama/Llama-2-7b-chat-hf --max-train-steps 20 --scaling-factor 4 --scaling-type yarn --seed 31337 --dataset shossain/govreport-qa-5-16384 --gradient-accumulate-every 1

Please suggest.

Trying to set a tensor of shape torch.Size([257, 1024]) in "weight" (which has shape torch.Size([1226, 1024])), this look incorrect

Could everyone please take a look: when deploying the model, I get a message that quantizing the layer failed, along with the error "Trying to set a tensor of shape torch.Size([257, 1024]) in "weight" (which has shape torch.Size([1226, 1024])), this look incorrect".
I found suggestions online saying this is due to a Transformers update, so I followed the official documentation and ran pip install auto_gptq transformers==4.33.1, but I still get the same error.
(screenshots: quantization failed)

cannot connect to hugging face

My server cannot connect to Hugging Face, but I have already downloaded your model from Hugging Face. How can I run the code in your repository? Thanks
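
Loading from a local directory is typically enough. A hedged sketch (the path is a placeholder, and trust_remote_code may be needed because the YaRN models ship custom modeling code):

from transformers import AutoModelForCausalLM, AutoTokenizer

local_path = "/path/to/Yarn-Llama-2-7b-64k"   # placeholder: directory containing the downloaded files
tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModelForCausalLM.from_pretrained(local_path, trust_remote_code=True)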

Are 7B and 13B Models fine-tuned?

I've been running on a 40 GB A100 using transformers and GPTQ. To get the model working at all, there seems to be a specific order in which the packages have to be installed.

!pip3 install git+https://github.com/huggingface/transformers.git
!pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip3 install git+https://github.com/huggingface/optimum.git
!pip3 install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
!pip3 install flash-attn==2.1.1 --no-build-isolation

With the above, I'm running out of memory around 8000 tokens of input (using the 7B model) and the output becomes garbled.

I've tried GPTQ, bnb nf4, and bf16 loading.

On bf16 loading, the output is garbled at 4k tokens of input.

Are the 7B and 13B yarn models fine-tuned? Do you have recommendations on how better to run them?

Questions about DynamicNTK

(self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)

Could you please explain: if I want to extend from 2K to 16K, then the factor multiplied by the base here is
$(8 * 16K / 2K) - (8 - 1) = 57$.
Is this multiple reasonable? Are there any problems here?
Please correct me if I'm wrong.

@bloc97
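
Writing out the arithmetic from the expression quoted above (illustrative only):

scaling_factor = 8
seq_len = 16 * 1024                   # target 16K context
max_position_embeddings = 2 * 1024    # original 2K context
base_multiplier = (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
print(base_multiplier)   # 57.0 -- the multiple the question asks about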

context length and dataset size

I am looking at the training command for mistral:

yarn/train.sh, line 60 (at commit 0ae3b2d):

--dataset emozilla/yarn-train-tokenized-16k-mistral \

Can I train a 64k context-length model with a 16k-long dataset, or is this just an example?

cannot load safetensor: Trying to set a tensor of shape torch.Size([0]) in "weight" (which has shape torch.Size([32000, 4096]))

After checking #45 and #40 and making some hard-coded modifications, the following commands ran:

# training
accelerate launch finetune.py \
    --output-dir output/yarn-7b-8k \
    --model NousResearch/Llama-2-7b-hf \
    --scaling-factor 2 \
    --wandb yarn \
    --dataset emozilla/yarn-train-tokenized-8k-llama \
    --deepspeed

# save
accelerate launch finetune.py \
    --output-dir output/yarn-7b-8k \
    --model NousResearch/Llama-2-7b-hf \
    --save-only \
    --scaling-factor 2 \
    --wandb yarn \
    --output-dir output-8k-save \
    --dataset emozilla/yarn-train-tokenized-8k-llama \
    --deepspeed

And I got these files:

(torch2) root@9b2ed2383075:/workspace/yarn/output/yarn-7b-8k# tree
.
|-- config.json
|-- model-00001-of-00003.safetensors
|-- model-00002-of-00003.safetensors
|-- model-00003-of-00003.safetensors
|-- model.safetensors
`-- model.safetensors.index.json

To load it with passkey.py, I merged these safetensors into the original NousResearch/Llama-2-7b-hf and got this error:

(torch2) root@9b2ed2383075:/workspace/yarn# python3 eval/passkey.py -m /workspace/models/Llama-2-7b-hf/
Determining sequence lengths: 100%|████████████████████| 6/6 [00:04<00:00,  1.48it/s]
Model:   0%|                                                                                                                                                                                                                                            | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):                                                                                                                                                                                                                                            
  File "/workspace/yarn/eval/passkey.py", line 127, in <module>
    main(add_args(parser).parse_args())
  File "/workspace/yarn/eval/passkey.py", line 90, in main
    loaded = load_model_and_apply_patches(model, args)
  File "/workspace/yarn/eval/model_loader.py", line 215, in load_model_and_apply_patches
    return apply_patches(load_model(model, args), args)
  File "/workspace/yarn/eval/model_loader.py", line 90, in load_model
    loaded = model_cls.from_pretrained(
  File "/root/miniconda3/envs/torch2/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/envs/torch2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3480, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/miniconda3/envs/torch2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3870, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/root/miniconda3/envs/torch2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 743, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/root/miniconda3/envs/torch2/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([0]) in "weight" (which has shape torch.Size([32000, 4096])), this look incorrect.

I noticed that your official https://huggingface.co/NousResearch/Yarn-Llama-2-7b-64k does not need any safetensor merging and can be tested successfully.

Did I miss a model conversion script?

Inquiry Regarding Evaluation Metrics in Your Paper

@bloc97 @jquesnelle Dear Authors,

Firstly, I would like to extend my sincere appreciation for your remarkable work. It is truly commendable and has served as a valuable resource for the community.

Upon reading your paper, I encountered some confusion regarding the evaluation metrics employed. Specifically, in Section 4.3.1, you state: "...selected 10 random samples from Proof-pile with at least 128k tokens each and evaluated the perplexity of each of these samples when truncated at 2k steps from a sequence length of 2k tokens through 128k tokens." Could you kindly clarify what is meant by "2k steps" in this context?

Additionally, the term "Sliding window perplexity (S = 256) of ten 128k Proof-pile documents truncated to evaluation context window size" is used multiple times. However, I am uncertain how sliding window perplexity is applied if the documents are truncated to the evaluation context window size. Does it mean the documents are truncated to the maximum evaluation context window size (128k)?

Your insights and clarifications on these points would be greatly appreciated, as they might resolve some misunderstandings I have regarding the paper.

Thank you for your time and consideration.

How should I proceed with conducting an evaluation for lm-evaluation-harness?

Hello developers, I've been trying to run the lm-evaluation-harness evaluation from your paper, but I'm encountering an error stating that a directory doesn't exist.

Could you provide more detailed instructions on how to conduct the evaluation?

Here is the command I've been using and the error that occurs.

command
pip install git+https://github.com/EleutherAI/lm-evaluation-harness
./eval-harness.sh

error
python: can't open file '/workspace/yarn/../lm-evaluation-harness/main.py': [Errno 2] No such file or directory

Your assistance would be greatly appreciated!
(help me plz..!!!)

Hardware equipment and training time?

I am very curious about the hardware you used for training and how long the training took. Is there a detailed write-up? If so, I would be extremely grateful.

Unexpected larger perplexity on PG19

Hi Yarn team,

I hope this finds you well. I've been using your code jquesnelle/yarn for testing the PG19 dataset. While reviewing the eval.sh script, I noticed some definitions related to the PG19 dataset, but the code for testing perplexity results seems somewhat unclear.

Settings:

  • Base Model: llama2-7b
  • Base Context Size: 4096
  • Sliding Window: 256, 4096
  • Scale to: 8192

In eval.sh, I found the following definition for the PG19 dataset:

# python eval/perplexity.py -m meta-llama/Llama-2-7b-hf --dataset pg19 --split test --feature text --save-tokenized output/pg19-test-tokenized
PG19="--tokenized emozilla/pg19-test-tokenized"

However, I did not find the actual command for computing perplexity results, so I attempted to test with my own command:

python eval/perplexity.py --dataset pg19 --feature "text" --samples 5 -m meta-llama/Llama-2-7b-hf --max-tokens $max_tokens --min-tokens $max_tokens --tokens-step 4000 --tokenized emozilla/pg19-test-tokenized --yarn $((max_tokens / 4096)) --max-position-embeddings 4096 --original-max-position-embeddings 4096 --dataset-min-tokens $max_tokens --sliding-window 4096 --custom-model --aggressive-memory --flash-attention

I observed that the results differ when the sliding window is set to 4096 and 256. In comparison to other PI and dy-ntk methods, the performance is unstable with a sliding window set to 256 and stable with a sliding window set to 4096.

Results:

  • --sliding-window 4096:
    • meta-llama/Llama-2-7b-hf: 8192=9.89344
  • --sliding-window 256:
    • meta-llama/Llama-2-7b-hf: 8192=32.76145

In contrast, other PI and dy-ntk methods maintain relatively stable performance when the sliding window is set to 256 and 4096:

  • Sliding window: 4096 / 256
    • PI: 10.79598 / 10.65644
    • dy-ntk: 10.19125 / 10.214816

I would appreciate your insights on this phenomenon. Is this behavior considered normal, or could there be potential configuration issues? If possible, could you provide more detailed information about the PG19 dataset testing script to help me better understand and adjust the testing configuration?

Thank you very much for your time and assistance. I look forward to your response.

Best regards,
Yiran

OOM when doing text generation

Hi,

I have been running into out of memory issues when trying to generate some text using the model "NousResearch/Yarn-Llama-2-7b-128k". I am using a prompt with 126k tokens and running things on 1 GPU. The script that I am using is the "eval/prompt-loop.py". I tried to set load_in_4bit = True but it didn't help.

Do you have advice to solve this issue?

Thanks !

License

Currently this repository doesn't contain a license file. It would be great if you could add one to clarify under which license the code is made available. Thanks!
