
autocompressors's Introduction

Adapting Language Models to Compress Long Contexts (EMNLP'23)

This is the official implementation of the paper Adapting Language Models to Compress Long Contexts, in which we train AutoCompressors: language models with the new capability to (1) compress context information into a small set of summary vectors and (2) reason over these summary vectors, which are passed to the model as soft prompts.

Now supporting .generate() and releasing an AutoCompressor based on Llama-2-7b.



Example

Example use of the API with a pre-trained AutoCompressor model:

import torch
from transformers import AutoTokenizer
from auto_compressor import LlamaAutoCompressorModel, AutoCompressorModel

# Load AutoCompressor trained by compressing 6k tokens in 4 compression steps
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k")
# Need bfloat16 + cuda to run Llama model with flash attention
model = LlamaAutoCompressorModel.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k", torch_dtype=torch.bfloat16).eval().cuda()

prompt = 'The first name of the current US president is "'
prompt_tokens = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids.cuda()

context = """Joe Biden, born in Scranton, Pennsylvania, on November 20, 1942, had a modest upbringing in a middle-class family. He attended the University of Delaware, where he double-majored in history and political science, graduating in 1965. Afterward, he earned his law degree from Syracuse University College of Law in 1968.\nBiden's early political career began in 1970 when he was elected to the New Castle County Council in Delaware. In 1972, tragedy struck when his wife Neilia and 1-year-old daughter Naomi were killed in a car accident, and his two sons, Beau and Hunter, were injured. Despite this devastating loss, Biden chose to honor his commitment and was sworn in as a senator by his sons' hospital bedsides.\nHe went on to serve as the United States Senator from Delaware for six terms, from 1973 to 2009. During his time in the Senate, Biden was involved in various committees and was particularly known for his expertise in foreign affairs, serving as the chairman of the Senate Foreign Relations Committee on multiple occasions.\nIn 2008, Joe Biden was selected as the running mate for Barack Obama, who went on to win the presidential election. As Vice President, Biden played an integral role in the Obama administration, helping to shape policies and handling issues such as economic recovery, foreign relations, and the implementation of the Affordable Care Act (ACA), commonly known as Obamacare.\nAfter completing two terms as Vice President, Joe Biden decided to run for the presidency in 2020. He secured the Democratic nomination and faced the incumbent President Donald Trump in the general election. Biden campaigned on a platform of unity, promising to heal the divisions in the country and tackle pressing issues, including the COVID-19 pandemic, climate change, racial justice, and economic inequality.\nIn the November 2020 election, Biden emerged victorious, and on January 20, 2021, he was inaugurated as the 46th President of the United States. At the age of 78, Biden became the oldest person to assume the presidency in American history.\nAs President, Joe Biden has worked to implement his agenda, focusing on various initiatives, such as infrastructure investment, climate action, immigration reform, and expanding access to healthcare. He has emphasized the importance of diplomacy in international relations and has sought to rebuild alliances with global partners.\nThroughout his long career in public service, Joe Biden has been recognized for his commitment to bipartisanship, empathy, and his dedication to working-class issues. He continues to navigate the challenges facing the nation, striving to bring the country together and create positive change for all Americans."""
context_tokens = tokenizer(context, add_special_tokens=False, return_tensors="pt").input_ids.cuda()

summary_vectors = model(context_tokens, output_softprompt=True).softprompt
print(f"Compressing {context_tokens.size(1)} tokens to {summary_vectors.size(1)} summary vectors")
# >>> Compressing 660 tokens to 50 summary vectors

generation_with_summary_vecs = model.generate(prompt_tokens, do_sample=False, softprompt=summary_vectors, max_new_tokens=12)[0]
print("Generation w/ summary vectors:\n" + tokenizer.decode(generation_with_summary_vecs))
# >>> The first name of the current US president is "Joe" and the last name is "Biden".

next_tokens_without_context = model.generate(prompt_tokens, do_sample=False, max_new_tokens=11)[0]
print("Generation w/o context:\n" + tokenizer.decode(next_tokens_without_context))
# >>> The first name of the current US president is "Donald" and the last name is "Trump".

Install

Set up a new environment, install pytorch version 2.1.0, and then install the following libraries:

pip install packaging
pip install transformers==4.34.0 datasets==2.13.4 accelerate==0.24.1 sentencepiece==0.1.99 flash-attn==2.3.5 wandb
# Flash rotary embeddings (requires setting correct CUDA_HOME variable)
pip install git+https://github.com/Dao-AILab/flash-attention.git#subdirectory=csrc/rotary

Then clone this repo and navigate to the repository root to run scripts or import the libraries.

Training

train.sh is the main entry point for training AutoCompressors and defines the most important hyperparameters for train.py. You may have to adjust some settings, such as the number of GPUs, depending on your system. The script should be easy to get started with, since it uses pre-tokenized datasets from the Hugging Face hub.
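
For a quick look at what such a pre-tokenized dataset contains, here is a minimal sketch; the dataset name below is taken from the issues section further down, and the exact datasets referenced in train.sh may differ.

from datasets import load_dataset
from transformers import AutoTokenizer

# Load one pre-tokenized Pile subset from the hub (name taken from the issues below;
# the datasets actually used by train.sh may differ).
ds = load_dataset("awettig/Pile-Github-0.5B-6K-opt")
tok = AutoTokenizer.from_pretrained("facebook/opt-2.7b")

example = ds["test"][0]
print(len(example["input_ids"]))              # length of one pre-tokenized chunk
print(tok.decode(example["input_ids"][:50]))  # decode the first few tokens back to text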

Notes on Flash Attention

We use Flash Attention, which substantially lowers the memory requirements during training.

Llama architecture: We implement flash attention via the Flash Attention package. These kernels require training and running the model on cuda in mixed or half precision.

OPT architecture: We implement flash attention via torch.nn.functional.scaled_dot_product_attention, which you can enable by adding --fast_attention to train.sh. Note that this is experimental and requires a sufficiently recent version of pytorch. We have encountered some issues with fast attention during evaluation, especially with use_cache=True, so we recommend using it only during training.
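
For reference, here is a standalone sketch of how the underlying PyTorch kernel is called (illustrative shapes; this is not the repository's exact integration into the OPT attention module):

import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim); half precision on CUDA
# lets PyTorch dispatch to a fused, memory-efficient kernel.
q = torch.randn(2, 32, 1024, 80, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal attention in a single fused call, without materializing the full
# (seq_len x seq_len) attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 32, 1024, 80])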

Pre-trained Models

All the fine-tuned models from our paper can be found on Huggingface hub:

| Link | Base model | Fine-tuning seq. length | Fine-tuning data | #Summary vectors | Summary accumulation | Randomized segmenting | Softprompt stop gradient |
|---|---|---|---|---|---|---|---|
| princeton-nlp/AutoCompressor-Llama-2-7b-6k | Llama-2-7b | 6144 tokens in 4 compression steps | 15B tokens from RedPajama | 50 | ✔️ | ✔️ | ✔️ |
| princeton-nlp/FullAttention-Llama-2-7b-6k | Llama-2-7b | 6144 tokens without compression | 15B tokens from RedPajama | - | | | |
| princeton-nlp/AutoCompressor-2.7b-6k | OPT-2.7b | 6144 tokens in 4 compression steps | 2B tokens from the Pile | 50 | ✔️ | ✔️ | ✔️ |
| princeton-nlp/RMT-2.7b-8k | OPT-2.7b | 8192 tokens in 4 compression steps | 2B tokens from the Pile | 50 | | | |
| princeton-nlp/FullAttention-2.7b-4k | OPT-2.7b | 4092 tokens without compression | 2B tokens from the Pile | - | | | |
| princeton-nlp/AutoCompressor-2.7b-30k | OPT-2.7b | 30720 tokens in 20 compression steps | 2B tokens from Books3 from the Pile | 50 | ✔️ | ✔️ | ✔️ |
| princeton-nlp/AutoCompressor-1.3b-30k | OPT-1.3b | 30720 tokens in 20 compression steps | 2B tokens from Books3 from the Pile | 50 | ✔️ | ✔️ | ✔️ |
| princeton-nlp/AutoCompressor-1.3b-30k | OPT-1.3b | 30720 tokens in 15 compression steps | 2B tokens from Books3 from the Pile | 50 | | | |

Loading Models

To load Llama-2-based AutoCompressor models, import LlamaAutoCompressorModel:

from transformers import AutoTokenizer
from auto_compressor import LlamaAutoCompressorModel

tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k")
model = LlamaAutoCompressorModel.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k")

To load OPT-based AutoCompressor models, import OPTAutoCompressorModel:

from transformers import AutoTokenizer
from auto_compressor import OPTAutoCompressorModel

tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/AutoCompressor-2.7b-6k")
model = OPTAutoCompressorModel.from_pretrained("princeton-nlp/AutoCompressor-2.7b-6k")

Summary Vectors

The summary vectors for a given context can be obtained in two ways:

  1. Explicitly: Call the model with out = model(input_ids, attention_mask, ..., output_softprompt=True) and obtain the summary vectors as summary_vectors = out.softprompt, which can be passed to further calls via model(..., softprompt=summary_vectors).
  2. Implicitly: Call the model with out = model(input_ids, segment_lengths=segment_lengths), where segment_lengths is a list of integers that should add up to the overall sequence length input_ids.size(1). After each segment, the model will automatically generate the summary vectors and prepend them to the next segment. This can still be combined with output_softprompt=True to generate the final summary vectors for the entire input, which is convenient for multi-step compression of long inputs that would otherwise exceed the model's maximum position. Both approaches are sketched below.
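
A minimal sketch of both approaches, reusing the tokenizer, model, context_tokens and prompt_tokens from the example at the top of this README (the three-way split of the context is only illustrative):

# 1. Explicit: compress the context once and reuse the summary vectors.
summary_vectors = model(context_tokens, output_softprompt=True).softprompt
out = model(prompt_tokens, softprompt=summary_vectors)

# 2. Implicit: let the model compress segment by segment. segment_lengths must
#    sum to context_tokens.size(1); here the context is split into three segments.
seg = context_tokens.size(1) // 3
segment_lengths = [seg, seg, context_tokens.size(1) - 2 * seg]
out = model(context_tokens, segment_lengths=segment_lengths, output_softprompt=True)
summary_vectors = out.softprompt  # summary vectors for the entire input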

Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Alexis and Alexander ([email protected], [email protected]). If you encounter a problem or bug when using the code, you can open an issue. Please describe the problem in detail so we can help you quickly!

Citation

@inproceedings{chevalier2023adapting,
    title = "Adapting Language Models to Compress Contexts",
    author = "Chevalier, Alexis  and
      Wettig, Alexander  and
      Ajith, Anirudh  and
      Chen, Danqi",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.232",
    doi = "10.18653/v1/2023.emnlp-main.232",
    pages = "3829--3846"
}

autocompressors's People

Contributors

alexchvl, anirudhajith, codecreator, danqi, ewouth, mu-arkhipov


autocompressors's Issues

torchrun error when generating training split

When I try to run run/train.sh for OPT-2.7b, it generates the training split for the first 5813 samples, then exits immediately without any error log.

Generating train split:   7%|▋         | 5813/81380 [00:35<03:31, 357.02 examples/s]E0731 23:14:13.108000 140299780256832 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 488431) of binary: /home/oswaldhe/miniconda3/envs/autocompressor/bin/python
Traceback (most recent call last):
  File "/home/oswaldhe/miniconda3/envs/autocompressor/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I'm running on an NVIDIA A100 40GB PCIe. What could be the possible issue? Thank you.

BUG REPORT

Hi, I'm trying to reproduce this excellent work. I found some suspected bugs in lines 232 and 233 of AutoCompressor.py.
When I print the two variables soft_prompt_length and summary_length, I find that during the forward_segment loop they become 0 at some point before the loop ends, which causes the shift_logits dimension to be summary_length longer than labels when calculating the loss. I suspect this may be caused by calling SummaryConfig.reset too early.
Anyway, I chose to replace the code in these two places with:

soft_prompt_length = soft_prompt.size(1)
summary_length = placeholder_embeds.size(1)

to avoid errors. Can you guys check the reason for this?

RuntimeError: FlashAttention only support fp16 and bf16 data type

Hi,
Thanks for the great work! I am facing a bug and hope you can help me :) @CodeCreator
When I ran 'bash/train_llama.sh', I got the same error as #13.
After trying the solution there, the code got stuck on another error:
RuntimeError: FlashAttention only support fp16 and bf16 data type
It seems I should upgrade my transformers version, but then I will face issue #13 again. How can I solve this?
Or could you give a compatible transformers version?

AttributeError: 'SubstepTrainer' object has no attribute 'do_grad_scaling'

Hi,

I am trying to train an AutoCompressor from scratch but I am encountering the following error when executing bash run/train_llama.sh --fast_attention:

Traceback (most recent call last):
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/train.py", line 285, in <module>
    main()
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/train.py", line 240, in main
    train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
  File "/gscratch/xlab/msclar/anaconda3/envs/embed/lib/python3.9/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/gscratch/xlab/msclar/anaconda3/envs/embed/lib/python3.9/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/substep_trainer.py", line 195, in training_step
    loss, softprompt = self.training_substep(model, input_slice, softprompt, segment_lengths)
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/substep_trainer.py", line 174, in training_substep
    if self.do_grad_scaling:
AttributeError: 'SubstepTrainer' object has no attribute 'do_grad_scaling'

I used to be able to train AutoCompressors a few months ago, so I wonder if the issue is related to updates either in AutoCompressors or in a dependency. Do you happen to know what the issue may be? I am trying to train the model on a single A100.

Also, I believe that to be able to run the current code with the original OPT models, one needs to change line 170 of train.py from LlamaAutoCompressorModel to AutoCompressorModel.

Thanks in advance!

Summary Vector Failures and Incomplete Answers with Numerical Contexts

Hello! Thanks for your impressive work.
I have been exploring this work and ran into some issues while using the provided example with the model princeton-nlp/AutoCompressor-2.7b-6k. It seems that the summary vector approach is not functioning as expected, particularly when the context involves numerical information.

Issues Encountered:

  1. Summary vector handling with questions containing numbers: When using the summary vectors, the model fails to generate correct answers for questions that involve numerical information. For example, in the cases below, the model produces incomplete answers with truncated numbers (e.g., "19" instead of "1942" for the birth year of Joe Biden).

  2. Incomplete answers: The model also demonstrates an inability to generate full answers in some cases. Instead, it generates truncated responses, cutting off crucial information (e.g., "the" instead of completing the phrase "the presidency").

Example Cases:

  • The birth year of Joe Biden is "19 (Expected: "1942")
  • Joe Biden graduated from the University of Delaware in the year "19 (Expected: "1965")
  • He earned his law degree from Syracuse University College of Law in "1973 (Expected: "1968")
  • Joe Biden served as the United States Senator from Delaware for six terms, from "1973 (Expected: "1973 to 2009")
  • Biden campaigned for the presidency in "the (Expected: "the 2020 elections")
  • He was inaugurated as the 46th President of the United States on "In (Expected: "In January 20, 2021")

It's crucial to address these issues, as they significantly impact the model's ability to provide accurate and complete answers, particularly in scenarios involving numeric contexts. I would appreciate any insights or guidance on resolving these problems to enhance the model's performance in handling such cases.

CUDA out of memory.

Can you help me figure out how to conduct experiments using a single NVIDIA A100 80GB GPU?

I am running into a "CUDA out of memory" error.

My environment:
GPU: NVIDIA A100 80GB, CUDA 11.7
torch version: 1.13.1

I have set the total batch_size to 1, but it still runs out of memory.

Is there anything that needs to be modified in your train.sh?

Inquiry about the data in Table 1

Your insightful work on AutoCompressors for compressing sequences offers valuable ideas on processing long context windows. Recently, I've been trying to reproduce some of your results (mainly Table 1, Sec. 4.1 in your paper) and have a few questions:

  • You've kindly provided the 6K/8K split versions of 2B tokens from the Pile for training and evaluation, as well as the checkpoint named AutoCompressor-2.7b-6k. If I understand correctly, this checkpoint is exactly the "AutoCompressor" model in Table 1 and it is trained and evaluated with the 8K split version of the data. Am I right?

  • Given the assumption above, I evaluated the checkpoint on the data of 8K sequences, with the results listed below. I reused your script train.sh and set segments_per_substep=${SEG:-4} and training_substeps=${SUB:-4}. The results I got have a gap from the reported numbers.

| Domain | 6k model, 6k→2k |
|---|---|
| Book3 | 10.37 |
| FreeLaw | 6.44 |
| Github | 3.94 |
| Wikipedia | 8.86 |
| Average (exp of mean NLL) | 6.95 |
| Reported in paper | 5.93 |
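
For reference, "exp of mean NLL" here corresponds to the geometric mean of the per-domain perplexities (assuming each domain is weighted equally), which reproduces the 6.95 above:

import math

domain_ppl = {"Book3": 10.37, "FreeLaw": 6.44, "Github": 3.94, "Wikipedia": 8.86}
mean_nll = sum(math.log(p) for p in domain_ppl.values()) / len(domain_ppl)
print(round(math.exp(mean_nll), 2))  # 6.95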

I'm not sure if I misunderstood some of the evaluation settings, and I'd like to know whether you could share the script for reproducing the results with other context lengths (128, 512, 2048) in Table 1. Your attention to this matter is highly appreciated. Thanks a lot!

Dimension of last_hidden_state

I am trying to understand how summary vectors are generated. From this line, my understanding is that they are part of the last_hidden_state tensor. How is it ensured that the dimension of last_hidden_state is at least equal to summary_length for all models?
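
For illustration, a sketch of how this can be guaranteed, assuming (as described in the paper) that the summary placeholder tokens are appended to each input segment, so the final hidden states always include at least summary_length positions:

import torch

batch, segment_len, summary_length, hidden_dim = 1, 660, 50, 4096  # illustrative sizes

# The summary placeholder embeddings are appended to the segment, so the sequence
# dimension of last_hidden_state is segment_len + summary_length >= summary_length.
last_hidden_state = torch.randn(batch, segment_len + summary_length, hidden_dim)

# The summary vectors are then read off the tail positions.
summary_vectors = last_hidden_state[:, -summary_length:, :]
print(summary_vectors.shape)  # torch.Size([1, 50, 4096])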

Question on the preprocessed data

Hi, I found that the domain of the test samples does not match the name of the dataset.

from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset('awettig/Pile-Github-0.5B-6K-opt')
tok = AutoTokenizer.from_pretrained('facebook/opt-2.7b')

print(tok.decode(ds['test'][-1]['input_ids'])[:100])

# Output:
# </s>Ogōri Station\n\nis a railway station on the Amagi Line located in Ogōri, Fukuoka Prefecture, Japan. It is operated by the Amagi Railway, a third sector public-private partnership corporation.\n\nLines\nThe station is served by the Amagi Railway Amagi Line and is located 3.8\xa0km from the start of the line at. All Amagi Line trains stop at the station.\n\nLayout\nThe station consists of a side platform serving a single track on an embankment. There is no station building but a shelter has been set up on the platform. From the main road, a roofed flight of steps leads up to the platform. A staffed ticket window is located at an intermediate landing halfway up the flight of steps.\n\nPlatforms\n\nAdjacent stations\n\nHistory\nJapanese Government Railways (JGR) opened the station on 28 April 1939 with the name

The output is from the Wikipedia page https://en.wikipedia.org/wiki/Og%C5%8Dri_Station rather than from Github. I am new to the Pile dataset, so I am curious whether this is an error in your preprocessing scripts or whether it is expected.

Reduce the number of summary vectors

Hi, is it possible to reduce the number of summary vectors to fewer than 15? I am just curious whether you have tried numbers of summary vectors in this ballpark and whether they work effectively. I did try using a sub-array of summary vectors as follows:

generation_with_summary_vecs = model.generate(prompt_tokens, do_sample=False, softprompt=summary_vectors[:, :10, :], max_new_tokens=12)[0]

But this gives very random answers. Is there a way to use a reduced number of summary vectors for the same purpose?

Question about the data preprocessing

Hi, thanks again for your great work. I am reading the source code of AutoCompressor and I have a small question about the preprocessing procedure. According to the code:

# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
  # Concatenate all texts.
  concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
  total_length = len(concatenated_examples[list(examples.keys())[0]])
  # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
  # customize this part to your needs.
  if total_length >= block_size:
      total_length = (total_length // block_size) * block_size
  # Split by chunks of max_len.
  result = {
      k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
      for k, t in concatenated_examples.items()
  }
  result["labels"] = result["input_ids"].copy()
  return result

If I understand correctly, this function first concatenates texts from different documents and then chunks the concatenated text according to block_size (maybe 6144 for princeton-nlp/AutoCompressor-Llama-2-7b-6k? This argument is not defined in the released code). This way we can get text of arbitrary length, which is great for obtaining long text input, but it also ignores the independence of different documents. Does this hurt performance? Thanks.

Install as python package?

Sorry if this is obvious, but is there any way to run/install this without having to download the git repo? I couldn't find an "auto_compressor" package on PyPI, which is where I normally install dependencies from.

Finetuning an autocompressor model

Hi,
Currently, is it possible to fine-tune one of the (pre-trained) AutoCompressor models? If yes, could you please point me to sample code or a README?

substep & segment

Hello, thanks for the great work! I wonder why the training procedure is split into several substeps, with each substep containing several segments. From the code, the softprompt is accumulated across the inputs. So why not just divide the inputs into several segments, accumulate the loss and softprompt inside the forward_segment function, and divide by the gradient_accumulate_steps in training_step?

Held-out perplexity question

Hey, this is exciting work! Thanks for your contribution. I have some confusion:

The paper mentions that perplexity is calculated on the held-out last 2048 tokens, but it seems that you calculate the NLL of the entire sequence during the evaluation phase instead of restricting it to the last 2048 tokens. It would be highly appreciated if you could reply. Thanks!
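
As an illustration of the evaluation described here, restricting the loss to the last 2048 tokens can be done by masking all earlier label positions with -100; this is a sketch assuming a standard Hugging Face causal LM interface, not the repository's evaluation script:

import torch

def heldout_perplexity(model, input_ids, held_out=2048):
    """Perplexity over only the last `held_out` tokens of one long sequence.

    Sketch assuming a standard Hugging Face causal LM; not the repo's eval code.
    """
    labels = input_ids.clone()
    labels[:, :-held_out] = -100  # positions set to -100 are ignored by the loss
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over the held-out tokens
    return torch.exp(loss)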

Install instructions are not clear

Hi,
I find the following missing from the install instructions

  1. How do I install autocompressors package?
  2. What should I install to just perform inference (i.e., obtain soft prompts for a given prompt)?
  3. How can I run (2) without flash-attention?

question about `position_ids`

Hello, I noticed the position_ids always is None in the released code, but the model still works well. why? thanks.

Your shared model trained on Llama-2 is not LoRA-trained; it's a fully fine-tuned model

In your paper, you said the Llama-2 model was trained with fewer trainable parameters,
but your shared model (princeton-nlp/AutoCompressor-Llama-2-7b-6k) seems to be just a fully fine-tuned model.
https://huggingface.co/princeton-nlp/AutoCompressor-Llama-2-7b-6k/tree/main

If the model were LoRA-trained, it would have to use the PEFT library,
but that model does not use PEFT; it is loaded with only the transformers library.

Can you explain this?


Some issues about the ICL experiments

This is impressive work, thank you for your contribution. However, I have some confusion about the ICL experiments. Did you use the ICL training dataset when fine-tuning the AutoCompressor from Llama-7B? Also, how did you handle texts longer than 150 characters for datasets like BoolQ and MultiRC? Thank you!
