huggingface / trl Goto Github PK

View Code? Open in Web Editor NEW

8.6K 8.6K 1.1K 6.65 MB

Train transformer language models with reinforcement learning.

Home Page: http://hf.co/docs/trl

License: Apache License 2.0

Python 97.87% Makefile 0.12% Shell 1.59% Dockerfile 0.42%

trl's Introduction

TRL - Transformer Reinforcement Learning

Full stack library to fine-tune and align large language models.

What is it?

The trl library is a full stack tool to fine-tune and align transformer language and diffusion models using methods such as Supervised Fine-tuning step (SFT), Reward Modeling (RM) and the Proximal Policy Optimization (PPO) as well as Direct Preference Optimization (DPO).

The library is built on top of the transformers library and thus allows to use any model architecture available there.

Highlights

Efficient and scalable:
- accelerate is the backbone of trl which allows to scale model training from a single GPU to a large scale multi-node cluster with methods such as DDP and DeepSpeed.
- PEFT is fully integrated and allows to train even the largest models on modest hardware with quantisation and methods such as LoRA or QLoRA.
- unsloth is also integrated and allows to significantly speed up training with dedicated kernels.
CLI: With the CLI you can fine-tune and chat with LLMs without writing any code using a single command and a flexible config system.
Trainers: The Trainer classes are an abstraction to apply many fine-tuning methods with ease such as the SFTTrainer, DPOTrainer, RewardTrainer, PPOTrainer, CPOTrainer, and ORPOTrainer.
AutoModels: The AutoModelForCausalLMWithValueHead & AutoModelForSeq2SeqLMWithValueHead classes add an additional value head to the model which allows to train them with RL algorithms such as PPO.
Examples: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier, full RLHF using adapters only, train GPT-j to be less toxic, StackLlama example, etc. following the examples.

Installation

Python package

Install the library with pip:

pip install trl

From source

If you want to use the latest features before an official release you can install from source:

pip install git+https://github.com/huggingface/trl.git

Repository

If you want to use the examples you can clone the repository with the following command:

git clone https://github.com/huggingface/trl.git

Command Line Interface (CLI)

You can use TRL Command Line Interface (CLI) to quickly get started with Supervised Fine-tuning (SFT), Direct Preference Optimization (DPO) and test your aligned model with the chat CLI:

SFT:

trl sft --model_name_or_path facebook/opt-125m --dataset_name imdb --output_dir opt-sft-imdb

DPO:

trl dpo --model_name_or_path facebook/opt-125m --dataset_name trl-internal-testing/hh-rlhf-helpful-base-trl-style --output_dir opt-sft-hh-rlhf

Chat:

trl chat --model_name_or_path Qwen/Qwen1.5-0.5B-Chat

Read more about CLI in the relevant documentation section or use --help for more details.

How to use

For more flexibility and control over the training, you can use the dedicated trainer classes to fine-tune the model in Python.

`SFTTrainer`

This is a basic example of how to use the SFTTrainer from the library. The SFTTrainer is a light wrapper around the transformers Trainer to easily fine-tune language models or adapters on a custom dataset.

# imports
from datasets import load_dataset
from trl import SFTTrainer

# get dataset
dataset = load_dataset("imdb", split="train")

# get trainer
trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)

# train
trainer.train()

`RewardTrainer`

This is a basic example of how to use the RewardTrainer from the library. The RewardTrainer is a wrapper around the transformers Trainer to easily fine-tune reward models or adapters on a custom preference dataset.

# imports
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer

# load model and dataset - dataset needs to be in a specific format
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

...

# load trainer
trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
)

# train
trainer.train()

`PPOTrainer`

This is a basic example of how to use the PPOTrainer from the library. Based on a query the language model creates a response which is then evaluated. The evaluation could be a human in the loop or another model's output.

# imports
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch

# get models
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = create_reference_model(model)

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# initialize trainer
ppo_config = PPOConfig(batch_size=1, mini_batch_size=1)

# encode a query
query_txt = "This morning I went to the "
query_tensor = tokenizer.encode(query_txt, return_tensors="pt")

# get model response
response_tensor  = respond_to_batch(model, query_tensor)

# create a ppo trainer
ppo_trainer = PPOTrainer(ppo_config, model, model_ref, tokenizer)

# define a reward for response
# (this could be any reward such as human feedback or output from another model)
reward = [torch.tensor(1.0)]

# train model for one step with ppo
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)

`DPOTrainer`

DPOTrainer is a trainer that uses Direct Preference Optimization algorithm. This is a basic example of how to use the DPOTrainer from the library. The DPOTrainer is a wrapper around the transformers Trainer to easily fine-tune reward models or adapters on a custom preference dataset.

# imports
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

# load model and dataset - dataset needs to be in a specific format
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

...

# load trainer
trainer = DPOTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
)

# train
trainer.train()

Development

If you want to contribute to trl or customizing it to your needs make sure to read the contribution guide and make sure you make a dev install:

git clone https://github.com/huggingface/trl.git
cd trl/
make dev

References

Proximal Policy Optimisation

The PPO implementation largely follows the structure introduced in the paper "Fine-Tuning Language Models from Human Preferences" by D. Ziegler et al. [paper, code].

Direct Preference Optimization

DPO is based on the original implementation of "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" by E. Mitchell et al. [paper, code]

Citation

@misc{vonwerra2022trl,
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang},
  title = {TRL: Transformer Reinforcement Learning},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/trl}}
}

trl's People

Contributors

Stargazers

Watchers

Forkers

awesome-archive sts-sadr trisongz rishistyping seekingpeace jadentravnik elnazsn1988 tamercs2005 dhlee347 ai-hub-deep-learning-fundamental codeaudit ricklentz alokranjan1234 kiminh zitterbewegung thak123 hangdj ina299 hadyelsahar vblagoje fw9wef diandyu santoshsrinivas79 zoudaoyaderen mybirth0407 tushar7797 simba2017 jacksukk smolpixel willtejeda yana-xuyan harimenath grzegorzwojdyga nadimkaysar techthiyanes maikvogt zeng-wh imohammad12 dhruv2601 dahoas marcus-arcadius hsouporto linwk20 anshradh gauravpandeyamu ellisbrown gouqi666 hdvvip masoudhashemi tanglespace luisagroher ohadrubin theodoregalanos kangqiyue mohan-zhang-u alekseykorshuk ziyi-zhu dmarx sami-bg evan-person shiweiba yashkumaratri shahbuland kaixianglin id4thomas cat-state ljw23 dfrntl nashid lxuechen ibagur alymostafa dheeraj7596 mogaio aicrumb liuqi8827 vishwajeetkumar93 poppingtonic gouyuhang wangotw tristanthrush younesbelkada gtworx chunshan-theta tuananh1406 henry-zeng horiso0921 leezekun vietbeu lyogavin haowu4 shuishen112 lewtun guijinson eltociear ssamdav macguyversmusic myfatemi04 alejandroballestae arjunchandra

trl's Issues

issue with out of data imports in tutorial

there is a demo https://huggingface.co/docs/trl/sentiment_tuning you are instructed to AutoModelForCausalLMWithValueHead instead of GPT2HeadWithValueModel i believe this is out of data.

Testing the snippets of the documentation

We need an efficient way to test the snippets that are on the documentation, maybe we should leverage doc-tests

Exposing the hyper-parameters of the optimzation

It looks like there are a bunch of hyperparameters in the original work that could be tweaked.
For instance beta parameter (Eq.2 of https://arxiv.org/pdf/1909.08593.pdf).
Would be nice to have an easy way to set/change such parameters.

Add google's T5 to models

It will allow to train something similar to chatGPT.

[Question] Why is PPO loss needed?

First of all, thank you for the great package 💪, I just wished that we could make it work for enc-dec models like T5!
My question is: why is the PPO loss so complicated? Couldn't we just train with loss = reward + KL?

Works for T5/BART?

Very cool work!

Does this work for T5/BART models as well?

Conflict in requirements.txt packages

pip install -r requirements.txt gives the following error. This can be fixed by changing to tqdm version to 4.47.0.

The conflict is caused by:
    The user requested tqdm==4.43.0
    simpletransformers 0.60.9 depends on tqdm>=4.47.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

How to liberate the gpt2 from reference model?

Hi,

We know that KL is used in the loss as a constraint for the difference between the original gpt2 and the active gpt2 which produces responses for rewards feedbacks.
How can I can tune the parameters to mitigate this constraint? I mean I want the active gpt2 can deviate much from the original reference gpt2, as I find in my experiments that the rewards do not improve as expected, possibly due to this constraint.
I am new to PPO. Hoping for some suggestions.

Thanks.

Segfault when trying out code example on the README.md

I get a segfault when trying out the code on the README.md.

get models

gpt2_model = GPT2HeadWithValueModel.from_pretrained('gpt2')
gpt2_model_ref = GPT2HeadWithValueModel.from_pretrained('gpt2')
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

Errors in :gpt2-sentiment-ppo-training.ipynb

Hi, you imorting "from trl.gpt2 import GPT2HeadWithValueModel, respond_to_batch" in "gpt2-sentiment-ppo-training.ipynb" but in trl library we don't have gpt-2 file and i cannot find GPT2HeadWithValueModel in trl dirctory.

advance thanks.

Roadmap - `trl` 0.2

A list of cool things that we can aim for trl 0.2! :

API:

Documentation

doc build: #59
document API (classes, functions) | #91
document tips and good practices when doing RLHF (hyper-parameters etc.) / https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
clean up README (examples don't work anymore) #70

Improvements

add different example than fine-tuning gpt2 on sentiment analysis
add an example with a custom reward model (SetFit?)
convert the notebook 05 #80
What is the largest model we can train on?
What is the coolest dataset we can use to train a model (e.g. anthropic AI dataset: https://huggingface.co/datasets/Anthropic/hh-rlhf ?
Make all internal imports relative (e.g. from trl.trainer -> from .trainer) #70
Clean up setup.py and remove settings.ini (legacy from nbdev) #70
Create & add a logo | #95

Any suggestion very welcome!

How can I implement trl with BART model

Note that the trl is only implemented with GPT2 model. Is it possible to implement with BART?

Key Error in Notebook '04-gpt2-sentiment-ppo-training.ipynb'

Hi Leandro,

I was running the notebook '04-gpt2-sentiment-ppo-training.ipynb' for the first time, and received a Key Error when running the training loop section. It was in this line:

rewards = torch.tensor([output[1]["score"] for output in pipe_outputs]).to(device)

I presume it is safe to omit the '[1]'?
rewards = torch.tensor([output["score"] for output in pipe_outputs]).to(device)

Thanks in advance and best,
Philip

Questions about the implementation details of batched_forward_pass and respond_to_batch

Great work! I have some questions about the implementation details.

In PPOTrainer.batched_forward_pass, why is values shifted one step backward compared to logprobs and ref_logprobs, which is done in the following code:
https://github.com/lvwerra/trl/blob/master/trl/ppo.py#L178-L185
Another related question, in respond_to_batch, when generating next tokens, why not compute logprob and value at each step at the same time? If respond_to_batch returns logprobs and values, the first forward in batched_forward_pass would be unnecessary. (https://github.com/lvwerra/trl/blob/master/trl/ppo.py#L180)
I noticed that the official implementation by OpneAI (https://github.com/openai/lm-human-preferences) adopts this strategy.
respond_to_batch does not use incremental decoding with cached hidden states. I guees it would be slow for long responses. Is incremental decoding possible within HuggingFace Transformers framework?

Faced the problem of rising KL: PPO training with three solid sentiments

Thank you very much for making such an awesome framework!
Right now I am fine tuning ruGPT3small(125 million parameters) with Bert for sentiment classification in order to get controllable text generation. I follow every step in your notebook. My dataset is also cinema reviews but not from IMDB. Moreover, this dataset unlike IMDB's one is divided into three sentiments (positive, negative, neutral). Important to mention that these sentiments were given not automaticaly but by the user who wrote the review (so, there could not probably be anything wrong with that).
I faced the problem of rising KL.

I tried many approaches. For example, I rewrote the reward function:
I do not quite understand why it happens but I suggest that the problem lies in neutral sentiment (as you do not have it in the dataset and I do). Here is my reward distribution:

Maybe, you could give me an idea how to organise reward function considering the fact that i have three solid sentiments.
I also wanted to ask: is it a mandatory to train GPT/BERT only on one epoch before implementing them together? Or can I train them e.g. on 5 epochs?

top_k_top_p_filtering import error

Using newer transformers version like 3.0, the top_k_top_p_filtering might fail to import because it is moved to generation_utils

Reward either goes down or stays stagnant

Let me start off by saying thanks for writing such a wonderful, and easy to use library. I'm genuinely surprised that no one else has created one to approach this technique, considering just how useful it could be.

I've been trying to adapt trl for my own use case, essentially a summarization task. I've been breaking my head for a few weeks over this. When I try out the positive sentiment generation notebook AS IS, everything seems to work fine. Reward increases over time, KL also slowly approaches the default target value. No sudden peaks or cliffs in any of the loss components.

However, when I try the same with my task, the reward doesn't increase over time. KL also doesn't behave well. They fluctuate VERY wildly, essentially learning nothing. This behaviour slightly changes with different learning rates, but the end result is the same, the output is not what is expected, as you can see in the graph below. What eventually, inevitable happen
s is that at some point, there is a MASSIVE spike in KL divergence, and the reward crashes to 0 (like a bad local minima it struggles to get out of).

I've tried many variations of the reward function I use, a lot of dataset filtering, but none of it seems to work. I know this isn't a lot of information to work with, but I have another observation that might help you. The original sentiment notebook also fails when the generation input and output lengths are longer. This is quite weird, because it doesn't seem to happen normally with the default length values. I've attached reward and KL mean value graphs for this setup as well (input length: 16->32, output length: 32->64, NO OTHER CODE CHANGES WHATSOEVER). As you can see, no meaningful, consistent reward increase happens, and the KL divergence spikes like crazy (I think once the KL spikes, the model is beyond saving, it just gets stuck in a bad local minima). The graph shown is for the initial part, but I've checked by training it for a long time, and the same thing happens. Could you help me with this, or even just tell me what could be a possible reason? The pink graph is the notebook run as is, no changes, where the graphs and the model behave as expected. Yellow is the same, but with longer input and output sizes.

PPO Questions

I'm comparing the PPO implementation to the OpenAI one and the implementation details blog post that goes through it. Wondering if some of these things improve performance. If not, it's good for understanding.

I'm guessing the discrepancy comes from the original vs learn to summarize work, which is interesting.

Some things to confirm:

PPO update question: I was a little confused seeing returns = advantages + values l693, instead of adv = returns - values why did it end up like that?
Some implementations use a residual value prediction in clipping. Compared to TRL.
consider approximate KL used in TRLX and discussed on john schulmans blog.

Some Questions about Implementation

Thank you for this great project. I am trying to understand some of the implementation details after looking through your notebooks and have a couple of questions.

Calculation of KL-divergence
I went through your explanation of why KL-divergence is just the difference of the log-probs but still not quite sure why that is the case. Because you are selecting the logits corresponding to the actual tokens in logprobs_from_logits(), the resultant logprobs and ref_logprobs are now not proper probability distributions. What we can do instead is to calculate the KL-divergence over the whole vocabulary (50257 tokens in GPT2) from logits and ref_logits for each time-step (token) and then average it over all the time-steps. Would like to know your thoughts about this.
About this line in ppo.py
ratio = torch.exp(logprob - old_logprobs)
logprob and old_logprobs are obtained from passing the same model_input into the model and ideally, they should be the same. But during training dropout is involved and hence they are different and ratio is slightly different from a matrix of all 1s. Is that the rationale behind using this line? And my follow-up question is why would this ratio help in training?

IndexError: dimension specified as -2 but tensor has no dimensions

I am trying to run /04-gpt2-sentiment-ppo-training.ipynb but I encounter this error. Any fix for this ?
I am using the transformers 3.4.0

0%| | 0/200 [00:00<?, ?it/s]/home/gaurish/miniconda3/lib/python3.7/site-packages/transformers/modeling_gpt2.py:532: FutureWarning: The past argument is deprecated and will be removed in a future version, use past_key_values instead.
FutureWarning,
0%| | 0/200 [00:09<?, ?it/s]
Traceback (most recent call last):
File "/media/gaurish/angela/projects/rl-sentiment/main.py", line 148, in
attention_masks[i*fbs:(i+1)*fbs])[0][:, 1].detach()
File "/home/gaurish/miniconda3/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 1032, in forward
return_dict=return_dict,
File "/home/gaurish/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/gaurish/miniconda3/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 565, in forward
past_length = past_key_values[0][0].size(-2)
IndexError: dimension specified as -2 but tensor has no dimensions

`Mapping` deprecated Python 3.10

https://github.com/lvwerra/trl/blob/11ac2638c037d8a3b579636fd6a556610b1c161c/trl/core.py#L13-L25

this function should be updated, not sure what it's used for yet, but FYI.

About the calculated returns for value loss

How can I understand returns = advantages + values in https://github.com/lvwerra/trl/blob/750f5fd5329bb81c79b00243c4c8923ac14981d5/trl/ppo.py#L219

In the original A3C paper, the loss for policy and the value function is calculated as follows (As I know, PPO shares the same mechanism):

(as I think) which means that the loss of value function should just minimize the advantage function in this code.

Would you please help me understand it? @lvwerra

Loss Calculation Question

Hello! Thanks again for your help last time, right now I am on the brink of finishing my system but I have a couple questions regarding the loss function inside the PPOTrainer class.

So for the following lines: here, using the default parameters (ppo_epochs = 4 and batch_size = 256) we run the second loop, and therefore the train_minibatch function a total of 1024 times but with only 256 unique samples. This means we also backpropagate the calculated loss and take an optimizer step 1024 times or 4 times for every single sample in the given batch since we only pass a single sample to the train_minibatch function. Would my understanding be correct here? And if so, is there a reason we are doing this for each sample instead of, for example, each forward_batch_size (default 16)?
In this loop found here we are calculating the reversed advantages for use later in the loss function and we perform this loop for the length of the query. Would you mind explaining to me why it has to be approached this way? In my system I don't have a spearation of query and response just input features and a text caption as output so I am struggling a bit as to how to adapt it for this loop specifically.

Modifying Final PPO Parameters for Sequence Identification

First I would like to say I'm very excited about this, as it's something I've been looking for a while now. I really appreciate how thoroughly you explained every step through your notebooks.

I've set up the first few steps and have started fine-tuning BERT on a Sequence classification task.

However, what I'm trying to do is apply this concept towards an entailment task, targeting protein sequences.

For example, I have my data set up as:

CACT GTGG ACCA TATG,ACTA AGAC CACT GTGG,0,
AGAC CACT GTGG ACCA,CACT GTGG ACCA TATG,1,

Where 1 is entailment, and 0 is not entailment. How would I modify the parameters for PPO to recognize this, and during the fine-tuning of GPT-2, would I simply drop the labels and join the sequences, both for 0/1 so that there's randomized data?

Also, since all the data is a max sequence length of 18/36 characters, would it be better to optimize the model parameters for that as well, or use the default?

Thank you!

Last Indice Reward

Hello @lvwerra I apologize for bombarding you with questions but I feel like I am on the cusp of ironing out these last few details.

I noticed in the compute_rewards() function that we are supposed computing per token rewards from the scores and kl-penalty but we only ever add the scores to the final indice of the rewards matrix.
Later on when we are calculation our advantages in the loss() function this means that the only time the delta calculation here is affected by our scores (in your case sentiment in my case BLEU) is in the initial pass of the loop when nextvalues is 0.0.

Would you be able to give a brief explanation as to why this is done? The original paper was not much help unfortunately.

trl with seq2seq

Hello,
Thanks for releasing this code.

I would like to use this algorithm with a trained seq2seq (x -> y) model.
I would initialize the active model and ref model with the trained seq2seq. Then I would proceed like this:

roll-out: x -> active model -> outputs y
evaluation: get reward for y
optimization:
x -> active model -> force y as input, get decoder logprobs
x -> ref model -> force y as input, get decoder logprobs
then compute kl + reward etc.

Does that make sense to proceed as such ?

Thank you for your feedback

Automate the downloading of imdb dataset

Hi, it is sort of a schlep to download the imdb dataset for the example in this library. Would it be better to attempt to automate this download? From my understanding it can be done using transformers also?

!pip install datasets

from datasets import load_dataset
dataset = load_dataset('imdb')
etc...

Or do you think this would be better off not to complicate the examples in the notebook?

Ordering of output in sentiment_pipe function

In the notebook gpt2-sentiment-control.ipynb (Optimize model section) ,
logits = [torch.tensor(output[1]["score"]) for output in sentiment_pipe(texts, **sentiment_pipe_kwargs)]

Why do we store output[1]["score"] as the reward? I assumed that "We will use the logits for positive class as a reward signal for the language model." but does sentiment_pipe always has the output[1] as an index for the the positive class?

Question about the loss function and reference model

Thanks for your awesome work! I'm studying your code and want to implement it in my system. I have the following two questions:

When calculating the delta, I wonder why the discount factor gamma times nextvalues instead of rewards[t+1]? I think the cumulated reward has no relation to the value.
https://github.com/lvwerra/trl/blob/44fb7326fc2440756f27e38be5220dd668fc92bc/trl/ppo.py#L237
I notice that in other implementations, the old policy network updates as well (but slower than the active policy network). In your implementation, the old policy network doesn't update and always keeps the same as the checkpoint of GPT-2. Am I right?

Could you please explain these questions? Thanks again :)

No updating on the ref_model?

Hi, thanks for the great repo,

I wonder that in the original PPO paper, the authors update the ref_model to become the fine-tuned model after every iteration (theta_old = theta). I attached the image below for your convenience.

So, there must be a code line where ref_moel = model?
Why didn't you update your ref_model as shown in the original PPO paper?
Thank you.

Error in gpt2-sentiment-control.ipynb

When initializing the PPOTrainer "ppo_trainer = PPOTrainer(gpt2_model, gpt2_model_ref, **config)" The function call is missing the tokenizer as an argument. I suggest to include it as follows "ppo_trainer = PPOTrainer(gpt2_model, gpt2_model_ref, gpt2_tokenizer, **config)"

Question about value indexing in batched_forward_pass() function

Thank you for your great work!

I read issue #15 but I still don't understand why values should be shifted left in PPOTrainer.batched_forward_pass()
https://github.com/lvwerra/trl/blob/master/trl/ppo.py#L203 .
In #L201, start is already indicating the start index of the model's output's predicted next tokens.

Also, in original code from https://github.com/openai/lm-human-preferences/blob/master/lm_human_preferences/policy.py#L125 looks like they indexed the logprobs and values at the same position.

Thank you!

how to evaluate code generation task

Hey first of all thankyou for such an amazing framework . I had one question how can i determine a reward model which can decide what reward to give for a code generation task ? or how can i even approach this problem

PPO for SLT

Hello, I apologize if this is not the right place to ask this kind of question but I feel as if I don't know here else to.

I am currently researching continuous sign language translation through transformer models and fine tuned with deep reinforcement learning. Aside from the original paper, this is the only work I can find integrating a modern policy gradient type algorithm with a popular NLP architecture so I was ecstatic when I found it.

Unfortunately the huggingface library does not appear to support input in the form of extracted frame features (numpy array) so I decided to implement a transformer model using pytorch and then attempt to integrate the PPO trainer module found here. My model takes in these image features and spits out a caption of the sequence of frames which is where the problem arises.

While I have figured out a way to replace the reward model, my main model does not have a 'query' and a 'response' it only has the frames and the output caption which are not satisfactory for the PPO trainer.

I have stepped through the code and the documentation (as well as the original paper) and unfortunately I am still somewhat lost.

If there is any guidance you could give me as to what changes I should look to make to the PPO trainer input I would greatly appreciate it.

Thanks again for the awesome code and the project!

How are the gradients computed?

Foremost, happy new year to all!! 🎆

Congrats for the nice repo that you have!

I have some questions about how the gradients are computed?

So in the notebooks you start by generate continuations

 response = gpt2_model.generate(query_tensors[i].unsqueeze(dim=0),
                                       max_new_tokens=gen_len, **gen_kwargs)

that because of the generate, generate the response without gradients. Then inside the step() you do:

 with torch.no_grad():
    logits, _, v = self.model(input_ids)
    ref_logits, _, _ = self.ref_model(input_ids)

logprobs = logprobs_from_logits(logits[:, :-1, :], input_ids[:, 1:])
ref_logprobs = logprobs_from_logits(ref_logits[:, :-1, :], input_ids[:, 1:])

that is used to compute the loss that is then back propagated through the model.
So I don't see where in the code you run the forward with the gradients, Am I missing something?

Thanks for the help!

optimizer parameters 1 by 1 sampler in train_minibatch

In this code:

        for _ in range(self.ppo_params['ppo_epochs']):
            random.shuffle(idxs)
            for i in range(bs):
                idx = idxs[i]
                train_stats = self.train_minibatch(logprobs[idx].unsqueeze(0), values[idx].unsqueeze(0),
                                                   rewards[idx].unsqueeze(0), queries[idx].unsqueeze(0),
                                                   responses[idx].unsqueeze(0),
                                                   torch.cat([queries[idx],responses[idx]]).unsqueeze(0))
                all_stats.append(train_stats)

In the ppo calculation state and the optimization parameters, the implementation here is to update the gradient of the samples one by one. Will there be a situation where the overall parameter guidance direction oscillates back and forth.
Do you need to fix the input size in queries, and use the mask mechanism to complete the optimization process of mini_batch_size >1, this part is not parallelized, and it is really not efficient to running.
Ah,I see master branch has use Accelerator to cumulative gradient.

# Step 1: Initialize Accelerator
self.accelerator = Accelerator(log_with="wandb")

But the result of generate process 1 by 1inference is too low for GPU utilization, even if the Accelerator strategy is used.

Small typo

Hi, thanks for the great work! There is a small typo with the installation instructions markdown file. cd tlr/ should be cd trl/.

Notebook '04' showing no Training Effect

Hi Leandro,

first of all many thanks for the amazing work on the library. I've found your documentation very easy to get into - especially paired with your talk at the Reinforcement Learning Meetup in Zurich.

I was toying around with your notebooks and noticed that notebook '04' was not showing any training effect for me. Please see the WANDB-Outputs HERE for reference - this is the result of simply executing the notebook with the provided training parameters, with the exception of a fix to a Key Error I explained in Git Issue #37. I toyed around a bit with learning rates, batch sizes etc. but could not clearly identify a learning effect.

Could you provide guidance on how I can address this issue? Was the fix for Git Issue #37 inappropriate?

Thanks in advance and best,
Philip

05-gpt2-sentiment-control.ipynb notebook, gpt2_tokeinizer is missing

in 05-gpt2-sentiment-control.ipynb notebook:

ppo_trainer = PPOTrainer(gpt2_model, gpt2_model_ref, **config)

there is a tokenizer argument missing... should be

ppo_trainer = PPOTrainer(gpt2_model, gpt2_model_ref, gpt2_tokeinizer, **config)

How do I make prediction after saving the model to local machine?

I tried to save the model to local machine and make prediction from it. However, the text generated from that saved model is not as expected. Do you know of the correct way to do it?

Code:

os.makedirs('gpt2-imdb-ctrl')
gpt2_model.save_pretrained('/gpt2-imdb-ctrl')
gpt2_tokenizer.save_pretrained('/gpt2-imdb-ctrl')

from transformers import AutoTokenizer, AutoModel

model = AutoModel.from_pretrained("/gpt2-imdb-ctrl")

tokenizer = AutoTokenizer.from_pretrained("/gpt2-imdb-ctrl")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

_ = model.to(device)

input_string = "[negative] And it's one of the rare old films that you still fall asleep in a while"
input_tokens = tokenizer.encode(input_string, return_tensors="pt").to(device)

response_tensors = respond_to_batch(model, input_tokens, txt_len=30)
response_strings = tokenizer.decode(response_tensors[0, :])
response_strings

Results:

'rararararararararararararararararararararararararararararara'

GPT J model can be used with trl library ? any updates on Large Launguage Model

Spikes in PPO policy loss

We sometimes experience huge loss spikes in the policy loss which either cause the training to fail or take a very long time to recover. It would be useful to investigate where they come from and how to mitigate them. cc @natolambert

BERT vs DistilBERT - mismatch of Attention Masks

Hi @lvwerra, thanks for the code.

For commit #caed471, the changes made in 05-gpt2-sentiment-control.ipynb and 04-gpt2-sentiment-ppo-training.ipynb, more specifically, the change of BERT to DistilBERT in config leads to errors during training.

The error is "index out of range" for attention_masks generated by build_bert_batch_from_txt.
My understanding of this error is that the sentiment_inputs contain embeddings that were not in the vocabulary of DistilBERT and therefore do not hold a value in the attention masks as well.

Therefore, the config of those files should be changed to "cls_model_name": "lvwerra/bert-imdb".

Thanks!

Usage for masked word prediction

The examples explain optimizing GPT-2 text generation, but can it be also used to optimize BERT masked token predictions?

Error while running requirements.txt file

Ran
!pip install -r requirements.txt
Got error
`Building wheels for collected packages: tokenizers
Building wheel for tokenizers (pyproject.toml) ... error
error: subprocess-exited-with-error

× Building wheel for tokenizers (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [51 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-310
creating build/lib.linux-x86_64-cpython-310/tokenizers
copying py_src/tokenizers/init.py -> build/lib.linux-x86_64-cpython-310/tokenizers
creating build/lib.linux-x86_64-cpython-310/tokenizers/models
copying py_src/tokenizers/models/init.py -> build/lib.linux-x86_64-cpython-310/tokenizers/models
creating build/lib.linux-x86_64-cpython-310/tokenizers/decoders
copying py_src/tokenizers/decoders/init.py -> build/lib.linux-x86_64-cpython-310/tokenizers/decoders
creating build/lib.linux-x86_64-cpython-310/tokenizers/normalizers
copying py_src/tokenizers/normalizers/init.py -> build/lib.linux-x86_64-cpython-310/tokenizers/normalizers
creating build/lib.linux-x86_64-cpython-310/tokenizers/pre_tokenizers
copying py_src/tokenizers/pre_tokenizers/init.py -> build/lib.linux-x86_64-cpython-310/tokenizers/pre_tokenizers
creating build/lib.linux-x86_64-cpython-310/tokenizers/processors
copying py_src/tokenizers/processors/init.py -> build/lib.linux-x86_64-cpython-310/tokenizers/processors
creating build/lib.linux-x86_64-cpython-310/tokenizers/trainers
copying py_src/tokenizers/trainers/init.py -> build/lib.linux-x86_64-cpython-310/tokenizers/trainers
creating build/lib.linux-x86_64-cpython-310/tokenizers/implementations
copying py_src/tokenizers/implementations/init.py -> build/lib.linux-x86_64-cpython-310/tokenizers/implementations
copying py_src/tokenizers/implementations/base_tokenizer.py -> build/lib.linux-x86_64-cpython-310/tokenizers/implementations
copying py_src/tokenizers/implementations/bert_wordpiece.py -> build/lib.linux-x86_64-cpython-310/tokenizers/implementations
copying py_src/tokenizers/implementations/byte_level_bpe.py -> build/lib.linux-x86_64-cpython-310/tokenizers/implementations
copying py_src/tokenizers/implementations/char_level_bpe.py -> build/lib.linux-x86_64-cpython-310/tokenizers/implementations
copying py_src/tokenizers/implementations/sentencepiece_bpe.py -> build/lib.linux-x86_64-cpython-310/tokenizers/implementations
copying py_src/tokenizers/implementations/sentencepiece_unigram.py -> build/lib.linux-x86_64-cpython-310/tokenizers/implementations
creating build/lib.linux-x86_64-cpython-310/tokenizers/tools
copying py_src/tokenizers/tools/init.py -> build/lib.linux-x86_64-cpython-310/tokenizers/tools
copying py_src/tokenizers/tools/visualizer.py -> build/lib.linux-x86_64-cpython-310/tokenizers/tools
copying py_src/tokenizers/init.pyi -> build/lib.linux-x86_64-cpython-310/tokenizers
copying py_src/tokenizers/models/init.pyi -> build/lib.linux-x86_64-cpython-310/tokenizers/models
copying py_src/tokenizers/decoders/init.pyi -> build/lib.linux-x86_64-cpython-310/tokenizers/decoders
copying py_src/tokenizers/normalizers/init.pyi -> build/lib.linux-x86_64-cpython-310/tokenizers/normalizers
copying py_src/tokenizers/pre_tokenizers/init.pyi -> build/lib.linux-x86_64-cpython-310/tokenizers/pre_tokenizers
copying py_src/tokenizers/processors/init.pyi -> build/lib.linux-x86_64-cpython-310/tokenizers/processors
copying py_src/tokenizers/trainers/init.pyi -> build/lib.linux-x86_64-cpython-310/tokenizers/trainers
copying py_src/tokenizers/tools/visualizer-styles.css -> build/lib.linux-x86_64-cpython-310/tokenizers/tools
running build_ext
running build_rust
error: can't find Rust compiler

  If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.
  
  To update pip, run:
  
      pip install --upgrade pip
  
  and then retry package installation.
  
  If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs/) is the recommended way to download and update the Rust compiler toolchain.
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects
`
Has anyone encountered this?
The issues does not fix when i upgrade pip or attempt to use rust compliers

lm_head and v_head, why re-initialize and why dropout?

First off, thank you for building this! 3 questions regarding the two heads of the policy model:

why re-initialize the weights in the language model head in

class GPT2HeadWithValueModel

     self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

when a trained lm_head already exist in GPT2LMHeadModel?

why does the model still speak coherently before training even though the lm_head weights of the model are random?

from 01-gpt2-with-value-head.ipynb

My most favourite movie is Captain America: Civil War, which moved into the
My least favourite movie is Jon Favreau's Log Horizon, complete with psychedelic

Why use dropout on your value ? value is not like the entire layer of a neural network where you dont want the model to reply too heavily on one activate, value is the one and only signal you get for that layer, so why drop it out?

  (v_head): ValueHead(
    (summary): Linear(in_features=768, out_features=1, bias=True)
    (activation): Identity()
    (first_dropout): Dropout(p=0.1, inplace=False)
    (last_dropout): Identity()
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )

Thanks again!

ImportError: cannot import name 'listify_batch' from 'trl.core'

Question about single prompt training

Given a single fixed prompt, I'd like to optimize a causal language model to generate highly rewarded sequences. E.g., given a fixed query x, I want a vector of responses **y** such that members of this vector rank highly with a trained reward model. Basically this is like the IMDB task but given a single, fixed prompt. However, currently the PPO training using TRL does not converge. Any advice on settings/hyperparameters for this problem? Thank you

Can we use GPT - J [EleutherAI/gpt-j-6B] ? Instead of GPT 2 ? trl is only available for GPT 2

Can we use GPT - J [EleutherAI/gpt-j-6B] ? Instead of GPT 2 ? trl is only available for GPT 2
Please let me know

adap_kl_ctrl boolean is not used

Hello,

Thanks for this implementation, is it possible that your kl adaptation is always on ?
https://github.com/lvwerra/trl/blob/750f5fd5329bb81c79b00243c4c8923ac14981d5/trl/ppo.py#L92

Best,
Thibaud