princeton-nlp / autocompressors

[EMNLP 2023] Adapting Language Models to Compress Long Contexts

Home Page: https://arxiv.org/abs/2305.14788

Python 93.17% Shell 6.83%

autocompressors's Issues

AttributeError: 'SubstepTrainer' object has no attribute 'do_grad_scaling'

Hi,

I am trying to train an AutoCompressor from scratch but I am encountering the following error when executing bash run/train_llama.sh --fast_attention:

Traceback (most recent call last):
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/train.py", line 285, in <module>
    main()
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/train.py", line 240, in main
    train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
  File "/gscratch/xlab/msclar/anaconda3/envs/embed/lib/python3.9/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/gscratch/xlab/msclar/anaconda3/envs/embed/lib/python3.9/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/substep_trainer.py", line 195, in training_step
    loss, softprompt = self.training_substep(model, input_slice, softprompt, segment_lengths)
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/substep_trainer.py", line 174, in training_substep
    if self.do_grad_scaling:
AttributeError: 'SubstepTrainer' object has no attribute 'do_grad_scaling'

I was able to train AutoCompressors a few months ago, so I wonder whether the issue is related to an update in either AutoCompressors or one of its dependencies. Do you happen to know what the cause may be? I am trying to train the model on a single A100.

Also, I believe that to be able to run the current code with the original OPT models one needs to change line 170 of train.py, changing LlamaAutoCompressorModel to AutoCompressorModel.

Thanks in advance!
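
A possible defensive pattern for the AttributeError above, sketched under the assumption that the installed transformers version simply no longer sets do_grad_scaling on the trainer. The class below is a hypothetical stand-in, not the real SubstepTrainer, and whether skipping gradient scaling is correct depends on how that transformers version handles mixed precision. Pinning an older transformers release is another option.

```python
class LegacyStyleTrainer:
    """Hypothetical stand-in for a Trainer subclass (not the real SubstepTrainer)."""

    def __init__(self, has_flag):
        # Older trainers set this attribute; we assume newer ones may not.
        if has_flag:
            self.do_grad_scaling = True


def uses_grad_scaling(trainer):
    # getattr with a default avoids the AttributeError when the flag is missing.
    return getattr(trainer, "do_grad_scaling", False)


with_flag = uses_grad_scaling(LegacyStyleTrainer(has_flag=True))
without_flag = uses_grad_scaling(LegacyStyleTrainer(has_flag=False))
print(with_flag, without_flag)
```

In substep_trainer.py the same idea would replace the bare attribute access in the `if self.do_grad_scaling:` check, assuming no scaler is needed when the attribute is absent.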

substep & segment

Hello, thanks for the great work! I wonder why the training procedure is split into several substeps, each of which contains several segments. From the code, the softprompt is accumulated across the inputs. So why not simply divide the inputs into several segments, accumulate the loss and softprompt in the forward_segment function, and divide by gradient_accumulate_steps in training_step?
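
To make the terminology concrete, here is a toy sketch of how a token sequence could be partitioned into segments and then grouped into substeps. The function name and the sizes are hypothetical and chosen for illustration; the real trainer also passes the softprompt from one substep to the next (as the training_substep call in the traceback of the first issue suggests), which this sketch only hints at in a comment.

```python
def split_into_substeps(token_ids, segment_length, segments_per_substep):
    # First cut the sequence into fixed-length segments.
    segments = [
        token_ids[i : i + segment_length]
        for i in range(0, len(token_ids), segment_length)
    ]
    # Then group consecutive segments into substeps.
    return [
        segments[i : i + segments_per_substep]
        for i in range(0, len(segments), segments_per_substep)
    ]


token_ids = list(range(12))
substeps = split_into_substeps(token_ids, segment_length=2, segments_per_substep=3)

for index, substep in enumerate(substeps):
    # In training, each substep would run its own forward/backward pass and
    # hand its softprompt to the next substep (assumption based on the traceback).
    print(index, substep)
```

One plausible reading, not confirmed in this thread, is that running the backward pass once per substep bounds peak memory, since activations are only kept for the segments of the current substep.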

Timeline for Release of Code and Pre-Trained Models

The paper says

Our code and pre-trained models are publicly available at https://github.com/princeton-nlp/AutoCompressors.

Right now, that doesn't seem to be the case. Is there a timeline for this?

Reduce the number of summary vectors

Hi, is it possible to reduce the number of summary vectors to fewer than 15? I am curious whether you have tried a number of summary vectors in this ballpark and whether it works effectively. I tried using a slice of the summary vectors as follows:

generation_with_summary_vecs = model.generate(prompt_tokens, do_sample=False, softprompt=summary_vectors[:, :10, :], max_new_tokens=12)[0]

But this gives very random answers. Is there a way to use a reduced number of summary vectors for the same task?

Finetuning an autocompressor model

Hi,
Currently, is it possible to finetune one of the (pretrained) autocompressor models? If yes, could you please provide references to sample code/readme?

Inquire on data of Table 1

Your insightful work on AutoCompressors for compressing sequences offers valuable ideas for processing long contexts. Recently, I've been trying to reproduce some of your results (mainly Table 1 in Sec. 4.1 of your paper) and have a few questions:

You've kindly provided the 6K/8K split version of 2B tokens from the Pile for training and evaluation, as well as the checkpoint named as AutoCompressor-2.7b-6k. If I understand it correctly, the checkpoint here is exactly the model "AutoCompressor" in Table 1 and it is trained and evaluated with the 8K split version data. Am I right?
Given the assumption above, I evaluated the model using the checkpoint and the 8K-sequence data, with the results listed below. I reused your script train.sh and set segments_per_substep=${SEG:-4} and training_substeps=${SUB:-4}. The results I got differ from the reported numbers.

Domain	6k model 6k→2k
Book3	10.37
FreeLaw	6.44
Github	3.94
Wikipedia	8.86
Average (exp of mean NLL)	6.95
Reported in paper	5.93

I'm not sure if I misunderstood some of the evaluation settings, and I'd like to know whether you may share the script for reproducing results with other context lengths (128,512,2048) in Table 1. Your attention to this matter is highly appreciated. Thanks a lot!
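
As a sanity check on the "Average (exp of mean NLL)" row: assuming every domain contributes the same number of tokens (an assumption, since the thread does not state the weighting), the average is the exponential of the mean log-perplexity. A minimal sketch using the four per-domain numbers from the table above:

```python
import math

# Per-domain perplexities copied from the table in this issue.
domain_perplexity = {
    "Book3": 10.37,
    "FreeLaw": 6.44,
    "Github": 3.94,
    "Wikipedia": 8.86,
}

# The log of a perplexity is the mean NLL per token for that domain.
mean_nll = sum(math.log(p) for p in domain_perplexity.values()) / len(domain_perplexity)

# Assumes equal token counts per domain; a token-weighted mean could differ.
average_perplexity = math.exp(mean_nll)
print(round(average_perplexity, 2))
```

Under the equal-weight assumption this comes out at roughly 6.95, in line with the average row above, so the gap to the reported 5.93 would come from the per-domain numbers rather than from the averaging.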

Follow-up on Code Release Timeline

I just wanted to follow up on #2. There, @CodeCreator said the code would be released by June 25th at the latest. Is there a new timeline?

Install as python package?

Sorry if this is obvious, but is there any way to run or install this without having to download the git repo? I couldn't find the "auto_compressor" package on pip, which is where I normally install dependencies from.

RuntimeError: FlashAttention only support fp16 and bf16 data type

Hi,
Thanks for the great work! I am facing a bug and hope you can help me :) @CodeCreator
When I ran 'bash/train_llama.sh', I got the same error as #13.
After trying the suggested solution, the code failed with another error:
RuntimeError: FlashAttention only support fp16 and bf16 data type
It seems I should upgrade my transformers version, but then I will hit issue #13 again. How can I solve this?
Alternatively, could you tell me which transformers version is compatible?

BUG REPORT

Hi, I'm trying to reproduce this excellent work. I found a suspected bug in lines 232 and 233 of AutoCompressor.py.
When I print the two variables soft_prompt_length and summary_length, I find that during the forward_segment loop they become 0 at some point before the loop ends, which causes shift_logits to be summary_length longer than labels when the loss is calculated. I suspect this may be caused by SummaryConfig.reset being called too early.
For now, I chose to replace the code in these two places with:

soft_prompt_length = soft_prompt.size(1)
summary_length = placeholder_embeds.size(1)

to avoid errors. Can you guys check the reason for this?
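
The length mismatch described above can be reproduced with a toy example. This is only a sketch under the assumption that the hidden states are laid out as soft-prompt positions, then token positions, then summary positions; the function and variable names are hypothetical and plain lists stand in for tensors.

```python
def split_hidden_states(hidden_states, soft_prompt_length, summary_length):
    # Use an explicit end index so that summary_length == 0 keeps everything
    # (a negative-zero slice such as [:-0] would return an empty list).
    end = len(hidden_states) - summary_length
    token_states = hidden_states[soft_prompt_length:end]
    summary_states = hidden_states[end:]
    return token_states, summary_states


# Assumed layout: 2 soft-prompt positions, 3 tokens, 2 summary positions.
hidden = ["sp0", "sp1", "t0", "t1", "t2", "sum0", "sum1"]
labels = ["t0", "t1", "t2"]

correct_tokens, summaries = split_hidden_states(hidden, 2, 2)
# If summary_length is wrongly 0, the summary positions leak into the logits.
wrong_tokens, _ = split_hidden_states(hidden, 2, 0)

print(len(correct_tokens) - len(labels))
print(len(wrong_tokens) - len(labels))
```

With the correct lengths the token states line up with the labels; with summary_length equal to 0 they are exactly two positions (the summary length) too long, which matches the symptom reported here.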

torchrun error when generating training split

When I try to run run/train.sh for OPT-2.7b, it generates the training split for the first 5813 samples and then exits immediately without any error log.

Generating train split:   7%|▋         | 5813/81380 [00:35<03:31, 357.02 examples/s]E0731 23:14:13.108000 140299780256832 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 488431) of binary: /home/oswaldhe/miniconda3/envs/autocompressor/bin/python
Traceback (most recent call last):
  File "/home/oswaldhe/miniconda3/envs/autocompressor/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I'm running on an NVIDIA A100 40GB PCIe. What could be the possible issue? Thank you.
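
A note on reading the log: a negative exit code from a child process conventionally means it was terminated by the signal with that number, so "exitcode: -9" points to signal 9. The small sketch below decodes this on a POSIX system; the helper name is hypothetical. A common cause of signal 9 during dataset preprocessing is the host running out of RAM, but the log alone does not prove that.

```python
import signal


def describe_exit_code(exit_code):
    # Negative exit codes encode the number of the terminating signal (POSIX).
    if exit_code < 0:
        return "killed by " + signal.Signals(-exit_code).name
    return "exited with status " + str(exit_code)


description = describe_exit_code(-9)
print(description)
```

Since SIGKILL cannot be caught, the process has no chance to print a traceback, which would be consistent with the missing error log.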

Summary Vector Failures and Incomplete Answers with Numerical Contexts

Hello! Thanks for your impressive work.
I have been exploring this work and ran into some issues while using the example provided with the model princeton-nlp/AutoCompressor-2.7b-6k. It seems that the summary vector approach is not functioning as expected, particularly when the context involves numerical information.

Issues Encountered:

Summary vector handling with questions containing numbers: When using the summary vectors, the model fails to generate correct answers for questions that contain numerical information. For example, in the provided cases, the model produces incomplete answers with truncated numbers (e.g., "19" instead of "1973" for the birth year and graduation year of Joe Biden).
Incomplete answers: The model also demonstrates an inability to generate full answers in some cases. Instead, it generates truncated responses, cutting off crucial information (e.g., "the" instead of completing the phrase "the presidency").

Example Cases:

The birth year of Joe Biden is "19 (Expected: "1942")
Joe Biden graduated from the University of Delaware in the year "19 (Expected: "1965")
He earned his law degree from Syracuse University College of Law in "1973 (Expected: "1968")
Joe Biden served as the United States Senator from Delaware for six terms, from "1973 (Expected: "1973 to 2009")
Biden campaigned for the presidency in "the (Expected: "the 2020 elections")
He was inaugurated as the 46th President of the United States on "In (Expected: "In January 20, 2021")

It's crucial to address these issues, as they significantly impact the model's ability to provide accurate and complete answers, particularly in scenarios involving numeric contexts. I would appreciate any insights or guidance on resolving these problems to enhance the model's performance in handling such cases.

Your shared model trained on LLAMA2 is not trained on Lora, It's full-finetuned model.

In your paper, you say that the Llama-2 model was trained with LoRA, with fewer trainable parameters,
but your shared model (princeton-nlp/AutoCompressor-Llama-2-7b-6k) seems to be a fully fine-tuned model.
https://huggingface.co/princeton-nlp/AutoCompressor-Llama-2-7b-6k/tree/main

If the model were LoRA-trained, it would have to be loaded with the PEFT library,
but that model does not use PEFT and is loaded with the transformers library alone.

Can you explain about this?

Question on the preprocessed data

Hi, I found the domain of test samples does not match the name of the dataset.

from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset('awettig/Pile-Github-0.5B-6K-opt')
tok = AutoTokenizer.from_pretrained('facebook/opt-2.7b')

print(tok.decode(ds['test'][-1]['input_ids'])[:100])

# Output:
# </s>Ogōri Station\n\nis a railway station on the Amagi Line located in Ogōri, Fukuoka Prefecture, Japan. It is operated by the Amagi Railway, a third sector public-private partnership corporation.\n\nLines\nThe station is served by the Amagi Railway Amagi Line and is located 3.8\xa0km from the start of the line at. All Amagi Line trains stop at the station.\n\nLayout\nThe station consists of a side platform serving a single track on an embankment. There is no station building but a shelter has been set up on the platform. From the main road, a roofed flight of steps leads up to the platform. A staffed ticket window is located at an intermediate landing halfway up the flight of steps.\n\nPlatforms\n\nAdjacent stations\n\nHistory\nJapanese Government Railways (JGR) opened the station on 28 April 1939 with the name

The output is from the wikipedia page https://en.wikipedia.org/wiki/Og%C5%8Dri_Station rather than Github. I am new to the Pile dataset, so I am curious whether this is an error in your preprocessing scripts, or it is expected.

question about `position_ids`

Hello, I noticed that position_ids is always None in the released code, but the model still works well. Why is that? Thanks.

CUDA out of memory.

Can you help me run the experiments on a single NVIDIA A100 80GB GPU?

I am running into a "CUDA out of memory" error.

My environments:
GPU: NVIDIA A100 80GB GPU cuda11.7
torch version: 1.13.1

I have set the total batch_size to 1, but I still run out of memory.

Is there anything that needs to be modified in your train.sh?

Some issue about ICL Experience

This is impressive work, thank you for your contribution. However, I am confused about the ICL experiments. Did you use the ICL training datasets when fine-tuning the AutoCompressor from Llama-7B? Also, how did you handle texts longer than 150 characters for datasets like BoolQ and MultiRC? Thank you!

Question about the data preprocessing

Hi, thanks again for your great work. I am reading the AutoCompressor source code and have a small question about the preprocessing procedure. According to the code:

from itertools import chain

# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

If I understand correctly, this function first concatenates the texts from different documents and then chunks the concatenated text according to block_size (maybe 6144 for princeton-nlp/AutoCompressor-Llama-2-7b-6k? This argument is not defined in the released code). This yields text of arbitrary length, which is great for learning from long text inputs, but it also ignores the independence of different documents. Does this hurt performance? Thanks.
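
The behaviour asked about can be seen on toy data. In the sketch below, block_size = 4 and the three tiny "documents" are made-up values for illustration only; the function body is the same as the one quoted above.

```python
from itertools import chain

block_size = 4  # toy value for illustration; the real value is not given here


def group_texts(examples):
    # Concatenate all documents into one long token list per key.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three documents of lengths 3, 5 and 2.
docs = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
chunks = group_texts(docs)
print(chunks["input_ids"])
```

The first chunk, [1, 2, 3, 4], mixes tokens from the first and second documents, and the trailing tokens 9 and 10 are dropped, so document boundaries are indeed not respected by this function.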

Dimension of last_hidden_state size

I am trying to understand how summary vectors are generated. From this line, my understanding is that they are part of last_hidden_state tensor. How is it ensured that the dimension of last_hidden_state of all models is at least equal to summary_length?

Held-out perplexity question

Hey, your work is really exciting! Thanks for your contribution. I am confused about one point:

The paper mentions that perplexity is calculated on the held-out last 2048 tokens. However, the evaluation code seems to compute the NLL over the entire sequence rather than only over the last 2048 tokens. I would highly appreciate a reply, thanks!
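
The two evaluation conventions in question can give very different numbers. A minimal sketch, using made-up per-token NLL values and a hypothetical held-out length of 2 tokens in place of 2048:

```python
import math


def perplexity(token_nlls):
    # Perplexity is the exponential of the mean per-token NLL.
    return math.exp(sum(token_nlls) / len(token_nlls))


# Made-up NLLs: early tokens are harder, later tokens benefit from context.
token_nlls = [3.0, 3.0, 1.0, 1.0]
held_out = 2  # stands in for the last 2048 tokens

full_sequence_ppl = perplexity(token_nlls)
held_out_ppl = perplexity(token_nlls[-held_out:])
print(full_sequence_ppl, held_out_ppl)
```

Averaging over the whole sequence includes tokens that see little context, so under these toy numbers it reports a higher perplexity than scoring only the final tokens.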

Install instructions are not clear

Hi,
I find the following missing from the install instructions

1. How do I install the autocompressors package?
2. What should I install to just perform inference (i.e., obtain soft prompts for a given prompt)?
3. How can I run (2) without flash-attention?

Inquiry for the release date of the pre-trained model

Hi, the pre-trained model mentioned in the repository is currently not available. Is there a tentative release date for the pre-trained model? It would be great to have an estimated date.
