princeton-nlp / autocompressors
[EMNLP 2023] Adapting Language Models to Compress Long Contexts
Home Page: https://arxiv.org/abs/2305.14788
Hi,
I am trying to train an AutoCompressor from scratch, but I am encountering the following error when executing bash run/train_llama.sh --fast_attention:
Traceback (most recent call last):
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/train.py", line 285, in <module>
    main()
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/train.py", line 240, in main
    train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
  File "/gscratch/xlab/msclar/anaconda3/envs/embed/lib/python3.9/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/gscratch/xlab/msclar/anaconda3/envs/embed/lib/python3.9/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/substep_trainer.py", line 195, in training_step
    loss, softprompt = self.training_substep(model, input_slice, softprompt, segment_lengths)
  File "/mmfs1/gscratch/xlab/msclar/Repositories/AutoCompressors/substep_trainer.py", line 174, in training_substep
    if self.do_grad_scaling:
AttributeError: 'SubstepTrainer' object has no attribute 'do_grad_scaling'
I was able to train AutoCompressors a few months ago, so I wonder whether the issue is related to an update in either AutoCompressors or a dependency. Do you happen to know what the issue may be? I am trying to train the model on a single A100.
Also, I believe that to be able to run the current code with the original OPT models one needs to change line 170 of train.py, changing LlamaAutoCompressorModel to AutoCompressorModel.
Thanks in advance!
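A note on the traceback above: the error says that substep_trainer.py reads self.do_grad_scaling on an object that never had that attribute set, presumably because the installed transformers Trainer no longer defines it. Besides pinning an older transformers release, one possible workaround is to read the attribute defensively. The sketch below uses stand-in classes (BaseTrainer and SubstepTrainerSketch are hypothetical names, not the repository's code) and assumes that falling back to "no gradient scaling" is acceptable for the run:

```python
class BaseTrainer:
    """Hypothetical stand-in for a Trainer base class that does not define do_grad_scaling."""

    def __init__(self):
        self.steps = 0


class SubstepTrainerSketch(BaseTrainer):
    """Hypothetical subclass; only the attribute lookup mirrors the traceback."""

    def uses_grad_scaling(self):
        # getattr with a default avoids the AttributeError when the base class
        # never set the attribute. Defaulting to False is an assumption.
        return getattr(self, "do_grad_scaling", False)


trainer = SubstepTrainerSketch()
print(trainer.uses_grad_scaling())
```

Whether skipping the scaling branch is numerically safe depends on the precision mode in use, so this only illustrates the lookup pattern and is not a verified fix.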
Hello, thanks for the great work! I wonder why the training procedure is split into several substeps, each of which has several segments. From the code, the softprompt is accumulated across each input. So why do we divide the inputs into several segments, accumulate the loss and softprompt in the forward_segment function, and divide by gradient_accumulate_steps in training_step?
The paper says:
"Our code and pre-trained models are publicly available at https://github.com/princeton-nlp/AutoCompressors."
Right now, that doesn't seem to be the case. Is there a timeline for this?
Hi, is it possible to reduce the number of summary vectors to fewer than 15? I am curious whether you have tried a number of summary vectors in this ballpark and whether it works effectively. I tried using a subarray of the summary vectors as follows:
generation_with_summary_vecs = model.generate(prompt_tokens, do_sample=False, softprompt=summary_vectors[:, :10, :], max_new_tokens=12)[0]
But this gives very random answers. Is there a way to use only a reduced set of summary vectors here?
Hi,
Currently, is it possible to finetune one of the (pretrained) autocompressor models? If yes, could you please provide references to sample code/readme?
Your insightful work on AutoCompressors offers valuable ideas for processing long context windows. Recently, I've been trying to reproduce some of your results (mainly Table 1, Sec. 4.1 of your paper) and have a few questions:
You've kindly provided the 6K/8K split versions of 2B tokens from the Pile for training and evaluation, as well as the checkpoint named AutoCompressor-2.7b-6k. If I understand correctly, this checkpoint is exactly the "AutoCompressor" model in Table 1, and it is trained and evaluated on the 8K split data. Am I right?
Given the assumption above, I evaluated the checkpoint on the 8K-sequence data, with the results listed below. I reused your script train.sh and set segments_per_substep=${SEG:-4} and training_substeps=${SUB:-4}. The results I got differ from the reported numbers.
| Domain | 6k model, 6k→2k |
|---|---|
| Books3 | 10.37 |
| FreeLaw | 6.44 |
| Github | 3.94 |
| Wikipedia | 8.86 |
| Average (exp of mean NLL) | 6.95 |
| Reported in paper | 5.93 |
I'm not sure if I misunderstood some of the evaluation settings, and I'd like to know whether you may share the script for reproducing results with other context lengths (128,512,2048) in Table 1. Your attention to this matter is highly appreciated. Thanks a lot!
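For readers checking the arithmetic in the table above: "exp of mean NLL" across domains is the geometric mean of the per-domain perplexities. The sketch below assumes the four per-domain numbers are perplexities (exp of the per-token NLL) and that the average is an unweighted mean over domains; the paper's own averaging may differ, for example by weighting with token counts.

```python
import math

# Per-domain perplexities copied from the table above.
perplexities = {"Books3": 10.37, "FreeLaw": 6.44, "Github": 3.94, "Wikipedia": 8.86}

# Convert each perplexity back to a mean NLL, average over domains, exponentiate.
mean_nll = sum(math.log(p) for p in perplexities.values()) / len(perplexities)
average_ppl = math.exp(mean_nll)

print(round(average_ppl, 2))
```

This reproduces the 6.95 in the "Average" row, so the gap to the reported 5.93 comes from the per-domain values rather than from the averaging step.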
I just wanted to follow up on #2. There, @CodeCreator said the code would be released by June 25th at the latest. Is there a new timeline?
Sorry if this is obvious, but is there any way to run or install this without having to download the git repo? I couldn't find the "auto_compressor" package on pip, which is where I normally install dependencies from.
Hi,
Thanks for the great work! I am facing a bug and hope you can help me :) @CodeCreator
When I ran 'bash/train_llama.sh', I got the same error as #13.
After trying the solution, the code got stuck on another error:
RuntimeError: FlashAttention only support fp16 and bf16 data type
It seems I should upgrade my transformers version, but then I will face issue #13 again. How could I solve this?
Or could you give a compatible transformers version?
Hi, I'm trying to reproduce this excellent work. I found what I suspect are bugs in lines 232 and 233 of AutoCompressor.py.
When I print the two variables soft_prompt_length and summary_length, I find that during the forward_segment loop they become 0 at some point before the loop ends, which causes shift_logits to be summary_length longer than labels when the loss is calculated. I speculate that this may be caused by calling SummaryConfig.reset too early?
Anyway, I chose to replace the code in these two places with:
soft_prompt_length = soft_prompt.size(1)
summary_length = placeholder_embeds.size(1)
to avoid errors. Can you guys check the reason for this?
When I try to run run/train.sh for OPT-2.7b, it generates the training split for the first 5813 samples, then exits immediately without any error log.
Generating train split: 7%|▋ | 5813/81380 [00:35<03:31, 357.02 examples/s]
E0731 23:14:13.108000 140299780256832 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 488431) of binary: /home/oswaldhe/miniconda3/envs/autocompressor/bin/python
Traceback (most recent call last):
  File "/home/oswaldhe/miniconda3/envs/autocompressor/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I'm running on NVIDIA-A100 40GB PCIe. What could be the possible issue? Thank you.
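One clue in the log above is "exitcode: -9". By the usual Python subprocess convention, a negative exit code -N means the child was terminated by signal N, and signal 9 is SIGKILL. The helper below (describe_exitcode is a hypothetical name used only for illustration) decodes such codes with the standard library:

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Translate a subprocess-style exit code into a readable description."""
    if exitcode < 0:
        # A negative code -N means the process was terminated by signal N.
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


print(describe_exitcode(-9))
```

A SIGKILL with no Python traceback is often the Linux out-of-memory killer reacting to host RAM (not GPU memory) running out, which would be consistent with the crash happening while the train split is still being generated. That is only a guess from the log; checking the system log (for example with dmesg) would confirm or rule it out.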
Hello! Thanks for your impressive work.
I have been exploring this work and ran into some issues while using the example provided with the model princeton-nlp/AutoCompressor-2.7b-6k. It seems that the summary vector approach is not functioning as expected, particularly when the context involves numerical information.
Issues Encountered:
Summary vector handling with questions containing numbers: When using the summary vectors, the model fails to generate correct answers for questions that contain numerical information. For example, in the provided cases, the model produces incomplete answers with truncated numbers (e.g., "19" instead of "1973" for the birth year and graduation year of Joe Biden).
Incomplete answers: The model also demonstrates an inability to generate full answers in some cases. Instead, it generates truncated responses, cutting off crucial information (e.g., "the" instead of completing the phrase "the presidency").
Example Cases:
It's crucial to address these issues, as they significantly impact the model's ability to provide accurate and complete answers, particularly in scenarios involving numeric contexts. I would appreciate any insights or guidance on resolving these problems to enhance the model's performance in handling such cases.
In your paper, you said the Llama-2 model was trained with LoRA, i.e. with fewer trainable parameters.
But your shared model (princeton-nlp/AutoCompressor-Llama-2-7b-6k) seems to be a fully fine-tuned model.
https://huggingface.co/princeton-nlp/AutoCompressor-Llama-2-7b-6k/tree/main
If the model were LoRA-trained, it would have to use the PEFT library,
but that model does not use PEFT; it loads with the transformers library alone.
Can you explain this?
Hi, I found the domain of test samples does not match the name of the dataset.
from datasets import load_dataset
from transformers import AutoTokenizer
ds = load_dataset('awettig/Pile-Github-0.5B-6K-opt')
tok = AutoTokenizer.from_pretrained('facebook/opt-2.7b')
print(tok.decode(ds['test'][-1]['input_ids'])[:100])
# Output:
# </s>Ogōri Station\n\nis a railway station on the Amagi Line located in Ogōri, Fukuoka Prefecture, Japan. It is operated by the Amagi Railway, a third sector public-private partnership corporation.\n\nLines\nThe station is served by the Amagi Railway Amagi Line and is located 3.8\xa0km from the start of the line at. All Amagi Line trains stop at the station.\n\nLayout\nThe station consists of a side platform serving a single track on an embankment. There is no station building but a shelter has been set up on the platform. From the main road, a roofed flight of steps leads up to the platform. A staffed ticket window is located at an intermediate landing halfway up the flight of steps.\n\nPlatforms\n\nAdjacent stations\n\nHistory\nJapanese Government Railways (JGR) opened the station on 28 April 1939 with the name
The output is from the Wikipedia page https://en.wikipedia.org/wiki/Og%C5%8Dri_Station rather than from Github. I am new to the Pile dataset, so I am curious whether this is an error in your preprocessing scripts or whether it is expected.
Hello, I noticed that position_ids is always None in the released code, but the model still works well. Why is that? Thanks.
Can you help me run experiments on a single NVIDIA A100 80GB GPU?
I run into a "CUDA out of memory" error.
My environment:
GPU: NVIDIA A100 80GB GPU cuda11.7
torch version: 1.13.1
I have set the total batch_size=1, but it still runs out of memory.
Is there anything that needs to be modified in your train.sh?
This is impressive work, thank you for your contribution. However, I have some confusion about the ICL experiments. Did you use the ICL training dataset when fine-tuning the AutoCompressor from Llama-7B? Also, how did you handle texts longer than 150 characters for datasets like BoolQ and MultiRC? Thank you!
Hi, thanks again for your great work. I am reading the AutoCompressor source code and have a small question about the preprocessing procedure. According to the code:
# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
If I understand correctly, this function first concatenates texts from different documents and then chunks the concatenated text according to block_size (maybe 6144 for princeton-nlp/AutoCompressor-Llama-2-7b-6k? This argument is not defined in the released code). This way we can get text of arbitrary length, which is great for learning from long text inputs, but it also ignores the independence of different documents. Does it hurt performance? Thanks.
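To make the behaviour described above concrete, here is the quoted function run on toy data. The block_size of 4 and the token ids are made up for illustration (ids starting with 1 come from a first "document", ids starting with 2 from a second); nothing here reflects the settings actually used for the released models.

```python
from itertools import chain

block_size = 4  # toy value chosen for illustration only


def group_texts(examples):
    # Concatenate all texts in the batch into one long token list per key.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Two toy "documents" of token ids.
batch = {"input_ids": [[101, 102, 103, 104, 105, 106], [201, 202, 203, 204, 205]]}
chunks = group_texts(batch)["input_ids"]
print(chunks)
```

The second chunk is [105, 106, 201, 202], so it spans the boundary between the two documents, and the trailing tokens 203 to 205 are dropped. This confirms the reading in the question: chunks can mix unrelated documents unless the preprocessing is changed.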
I am trying to understand how summary vectors are generated. From this line, my understanding is that they are part of the last_hidden_state tensor. How is it ensured that the dimension of last_hidden_state of all models is at least equal to summary_length?
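A possible way to read the question above: if, as the paper describes, summary vectors are the hidden states at the summary-token positions appended to each segment, then summary_length counts positions along the sequence axis of last_hidden_state, not entries of the hidden dimension. The toy sketch below uses nested lists instead of tensors, and all shapes and names are illustrative assumptions rather than the repository's code:

```python
batch_size, seq_len, hidden_size = 1, 6, 3
summary_length = 2  # assumed number of summary-token positions at the end of the segment

# Toy last_hidden_state with shape (batch, seq_len, hidden): position i holds
# the vector [i, i, i] so that positions are easy to track.
last_hidden_state = [
    [[float(i)] * hidden_size for i in range(seq_len)] for _ in range(batch_size)
]

# Slice along the sequence axis (the last summary_length positions),
# leaving the hidden axis untouched.
summary_vectors = [sequence[-summary_length:] for sequence in last_hidden_state]

print(len(summary_vectors[0]), len(summary_vectors[0][0]))
```

Under this reading, the sequence always contains at least summary_length positions because the summary tokens are appended to the input, and each summary vector has the model's full hidden size regardless of how large that is.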
Hey, your work is really exciting! Thanks for your contribution. There are a few things I am confused about:
Hi,
I find the following missing from the install instructions
Hi, the pre-trained model mentioned in the repository is currently not available. Is there a tentative release date for the pre-trained model? It would be great to have an estimated date.