Comments (10)
Hi @jpanaro, so first of all, the reward is only given once the sequence generation is complete, which is why the score/reward is only added to the last token. You are right that it is then discounted and added to the previous tokens as well. You can find the equations in the original PPO paper, equations (11) and (12). To simplify calculations, this is done from back to front, starting with the last token.
It only makes sense to add the reward to the last token, since, for example, the BLEU score is only valid for the complete sequence. This is similar to Atari games, where you only get a reward after you complete a level, and the advantage equation is then used to discount the reward to previous actions. How strong the discount is depends on the value of lambda.
Does that answer your question?
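For illustration, the back-to-front discounting can be sketched roughly as below. This is a minimal standalone version of generalized advantage estimation; the function and variable names are illustrative, not trl's exact internals, and `gamma`/`lam` are assumed defaults.

```python
import torch

def discounted_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation, computed back to front.

    `rewards` is zero everywhere except the final token, which carries
    the sequence-level score (e.g. BLEU); `values` are per-token value
    estimates. Illustrative sketch, not the exact trl implementation.
    """
    T = rewards.size(0)
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t < T - 1 else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae  # discount backwards
        advantages[t] = last_gae
    return advantages

# Sequence of 5 tokens, reward only on the last one:
rewards = torch.tensor([0.0, 0.0, 0.0, 0.0, 1.0])
values = torch.zeros(5)
adv = discounted_advantages(rewards, values)
# earlier tokens receive an exponentially discounted share of the reward
```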
from trl.
That makes a lot more sense, thanks! I am just trying to resolve my negative KL-divergence problem, which is causing my model to slowly diverge and produce garbage.
Currently:
- I have tried performing top-k/top-p filtering on the logits, which has somewhat helped, but I am limited by the fact that my model's decoder produces all outputs and hidden states at once, so I cannot perform the filtering as the sequence unrolls, only afterwards, which I feel limits its effectiveness.
- I have also tried zeroing out all of the logits following the first EOS token, but this led to performance identical to the top-k/top-p filtering. I have also recently discovered that this might be pointless, since when the logprobs are calculated, the indices that used to be all zeros are now filled with nonzero values.
- Lastly, regardless of the actual ground-truth length, my model produces logprobs up to the max length (29), so my final idea is to find the first EOS, add the reward at that index, and then cut all following indices or set them to zero.
Any thoughts on these methods, or any solutions I may have missed? I would greatly appreciate your input!
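For the last idea, a minimal sketch of finding the first EOS and masking everything after it might look like this. The `eos_id`, tensor shapes, and names are assumptions for illustration, not JoeyNMT or trl specifics.

```python
import torch

def mask_after_eos(token_ids, logprobs, eos_id):
    """Zero out logprobs after the first EOS so padding beyond the true
    sequence length cannot influence the PPO update. Illustrative only."""
    # token_ids, logprobs: (batch, seq_len)
    is_eos = token_ids.eq(eos_id)
    # index of the first EOS per sequence; seq_len if none is present
    first_eos = torch.where(
        is_eos.any(dim=1),
        is_eos.int().argmax(dim=1),
        torch.full((token_ids.size(0),), token_ids.size(1), dtype=torch.long),
    )
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)
    keep = positions <= first_eos.unsqueeze(1)  # keep up to and incl. EOS
    return logprobs.masked_fill(~keep, 0.0), first_eos

token_ids = torch.tensor([[5, 7, 2, 9, 9]])  # EOS id 2 appears at index 2
logprobs = torch.ones(1, 5)
masked, first_eos = mask_after_eos(token_ids, logprobs, eos_id=2)
```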
I had issues with negative KL-divergence twice! Both times it was related to the generation function: the model found ways to exploit some of its functionality, such as the padding tokens, or the fact that if min_length is not yet reached, the logprob of the EOS token is set to zero. The model can achieve negative KL-divergence by assigning astronomically small logprobs to the tokens that the generation function sets "manually". I tried to summarize this at the end of this notebook.
My suggestion would be to use greedy decoding and then modify the reward in your code (e.g. adding an extra term to BLEU) if the EOS token appears too early. I hope this helps!
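A toy illustration of the exploit, assuming the per-token KL penalty is computed as `logp_model - logp_ref` on the generated tokens:

```python
import torch

# If the generation code forces a token (e.g. suppresses EOS before
# min_length is reached), the tuned model can assign that forced token an
# astronomically small probability. The per-token KL penalty term
# logp_model(token) - logp_ref(token) then goes very negative, which
# translates into "free" positive reward, even though the true
# KL(model || ref) is always >= 0.
logp_ref = torch.log(torch.tensor(0.30))    # reference model's logprob
logp_model = torch.log(torch.tensor(1e-9))  # tuned model; token forced anyway
kl_term = logp_model - logp_ref
# kl_term is hugely negative for this single forced token
```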
As a remark: if you sample properly from your tuned model, you should never get negative KL-divergence. Negative values indicate that something is wrong in the way you generate the sequences.
Yeah, I saw when I was reading through the notebooks how those issues cropped up. Since my model uses the JoeyNMT library rather than the Hugging Face library, I am sure there are some differences in generation, so I guess I will have to find them myself.
I will give that a go! I think I can penalize the early generation of the EOS token, as well as the secondary problem of the model producing too many periods before the EOS token.
Regarding the last remark: when you say generate sequences, do you mean how the model actually produces the decoder hidden states that compose the logits, or do you mean things like how you produce the logprobs, or how you sample from those logprobs (i.e. greedy vs. categorical vs. multinomial)?
So the output of the model is logits, and with a softmax function you can transform these into probabilities. If you just sample from these probabilities, you should get positive KL-divergence. What the generation function (at least in the transformers library) does is apply extra tricks, like overwriting some probabilities with 0. This leaves the model a backdoor to exploit: by setting the probabilities of these overwritten outputs very low, it achieves negative KL-divergence, which means positive rewards for the model. To avoid this I wrote a custom generation function, respond_to_batch, to have full control.
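A minimal sketch of sampling directly from the softmax output, with no manual overwriting of probabilities (illustrative shapes and names; this is not the actual respond_to_batch implementation):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits):
    """Draw the next token from the model's own distribution.

    Because every sampled token comes from the model's softmax, the
    Monte Carlo KL estimate stays non-negative in expectation.
    """
    probs = F.softmax(logits, dim=-1)               # (batch, vocab)
    return torch.multinomial(probs, num_samples=1)  # (batch, 1)

logits = torch.randn(2, 10)  # dummy logits for a vocab of 10
next_token = sample_next_token(logits)
```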
I managed to fix the negative KL-divergence problem. It turns out it was just a sampling alignment issue stemming from the greedy decoding.
Unfortunately, my new problem is that my reward seems to decrease as the quality of the sentences decreases. This leads to a direct hit to my BLEU score, which means the reward can only go down as the fine-tuning continues.
I think one of the issues is that the model I am starting with already has a majority of its training BLEU-4 scores in the 90-100 range, so improvement on the training set is very difficult, despite the BLEU-4 scores for the validation set being ~19 at most. I tried mapping all my BLEU scores in the 80-100 range onto -4.5 to 4.5, with scores below 80 set to -4.0, to mimic the range of your positive-sentiment scores. I'm fairly certain that using the raw scores in the loss calculation would blow the rewards out of proportion, and since they are naturally non-negative, I didn't think they would penalize the model enough for lower BLEU scores.
I thought this would rectify the decreasing-score issue, but unfortunately the reward_mean still sinks below 0 immediately after the first epoch (if it did not start there) and then bounces between -0.05 and -0.15. I think part of the problem is my spiky KL value, which now ranges from 1.5 to 16, and my lowish average score, which stabilizes at a little under 0.5 after about 10 epochs. Have you dealt with a positive KL but negative rewards like this?
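The score mapping described above can be sketched as follows; the breakpoints are the ones quoted in the comment, and the helper name is hypothetical:

```python
def bleu_to_reward(bleu):
    """Map BLEU in [80, 100] linearly onto [-4.5, 4.5]; clip below 80.

    Sketch of the reward shaping described in the comment; whether this
    particular shaping helps the PPO training is an open question.
    """
    if bleu < 80.0:
        return -4.0
    # linear map: 80 -> -4.5, 100 -> 4.5
    return -4.5 + (bleu - 80.0) * (9.0 / 20.0)
```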
Unfortunately, I have not encountered that specific problem. Within the PPO trainer the advantages are whitened, meaning the mean is set to zero and the standard deviation to one. The difference between training and validation scores seems to indicate that you might be overfitting the training set. Have you tried reducing that? You could also try decreasing the KL factor and see if it improves things. Since you are directly measuring text quality with BLEU, it might not be so important to constrain the language model.
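As a standalone sketch of what whitening means here, assuming the usual zero-mean, unit-standard-deviation normalization within a batch (names illustrative):

```python
import torch

def whiten(x, eps=1e-8):
    """Normalize a batch of values to zero mean and unit std."""
    return (x - x.mean()) / (x.std() + eps)

# Large raw scores come out on the same scale as small ones:
scores = torch.tensor([92.0, 88.0, 100.0, 95.0])
w = whiten(scores)
# w has mean ~0 and std ~1 regardless of the original magnitude
```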
Ah, ok, that makes sense. So when the rewards are whitened, it does not matter how large they are; they will be distributed across the entire advantage equally? I'm just worried the large raw scores (92, 88, 100) will wash everything else out if I don't "dampen" them.
Overfitting seems to be a major problem, possibly the main contributor to the lack of performance gain. I think I will cut the model's initial training to fewer epochs and give PPO a chance to explore more solutions.
This might be worth experimenting with, seeing as I want the model to explore a little more anyway. Thanks for the tips!
It will not be distributed but scaled down, such that the distribution within a batch is normalised. I think this should get rid of the characteristic scale of your scores in the PPO training, but you might want to check whether this is really true. If you use Weights & Biases, you can monitor all scores from the dashboard; you might also be able to see it in the loss scale and distribution.
In any case good luck!