
palm-rlhf-pytorch's Introduction

Official ChatGPT blog post

PaLM + RLHF - Pytorch (wip)

Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Maybe I'll add retrieval functionality too, à la RETRO

If you are interested in replicating something like ChatGPT out in the open, please consider joining LAION (join us on Discord)

Potential successor: Direct Preference Optimization - all the code in this repo becomes ~ a binary cross entropy loss, < 5 LOC. So much for reward models and PPO
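For readers unfamiliar with DPO, here is a hedged sketch of the objective being referred to (the function name and beta value are illustrative assumptions, not code from this repo):

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta = 0.1):
    # logp_* are the summed log-probabilities of each response under the policy
    # and under a frozen reference model; the loss is a binary cross entropy on
    # the preference pair, with no separate reward model and no PPO
    logits = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()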

FAQ

  • Does this contain a model for inference?

There is no trained model. This is just the ship and overall map. We still need millions of dollars of compute + data to sail to the correct point in high dimensional parameter space. Even then, you need professional sailors (like Robin Rombach of Stable Diffusion fame) to actually guide the ship through turbulent times to that point.

Community

CarperAI had been working on an RLHF framework for large language models for many months prior to the release of ChatGPT.

Yannic Kilcher is also working on an open-sourced implementation

AI Coffeebreak w/ Letitia | Code Emporium | Code Emporium Part 2

Appreciation

Install

$ pip install palm-rlhf-pytorch

Usage

First train PaLM, like any other autoregressive transformer

import torch
from palm_rlhf_pytorch import PaLM

palm = PaLM(
    num_tokens = 20000,
    dim = 512,
    depth = 12,
    flash_attn = True # https://arxiv.org/abs/2205.14135
).cuda()

seq = torch.randint(0, 20000, (1, 2048)).cuda()

loss = palm(seq, return_loss = True)
loss.backward()

# after much training, you can now generate sequences

generated = palm.generate(2048) # (1, 2048)
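A minimal training loop around the snippet above might look as follows (the optimizer, learning rate, and random data here are illustrative assumptions, not part of the library):

import torch
from palm_rlhf_pytorch import PaLM

palm = PaLM(
    num_tokens = 20000,
    dim = 512,
    depth = 12
).cuda()

optim = torch.optim.Adam(palm.parameters(), lr = 3e-4)

for _ in range(100):
    seq = torch.randint(0, 20000, (1, 2048)).cuda() # substitute real token batches here
    loss = palm(seq, return_loss = True)
    loss.backward()
    optim.step()
    optim.zero_grad()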

Then train your reward model with the curated human feedback. In the original paper, they could not get the reward model to be fine-tuned from a pretrained transformer without overfitting, but I gave the option to fine-tune with LoRA anyway, since it is still open research.

import torch
from palm_rlhf_pytorch import PaLM, RewardModel

palm = PaLM(
    num_tokens = 20000,
    dim = 512,
    depth = 12,
    causal = False
)

reward_model = RewardModel(
    palm,
    num_binned_output = 5 # say rating from 1 to 5
).cuda()

# mock data

seq = torch.randint(0, 20000, (1, 1024)).cuda()
prompt_mask = torch.zeros(1, 1024).bool().cuda() # which part of the sequence is prompt, which part is response
labels = torch.randint(0, 5, (1,)).cuda()

# train

loss = reward_model(seq, prompt_mask = prompt_mask, labels = labels)
loss.backward()

# after much training

reward = reward_model(seq, prompt_mask = prompt_mask)

Then you will pass your transformer and the reward model to the RLHFTrainer.

import torch
from palm_rlhf_pytorch import PaLM, RewardModel, RLHFTrainer

# load your pretrained palm

palm = PaLM(
    num_tokens = 20000,
    dim = 512,
    depth = 12
).cuda()

palm.load('./path/to/pretrained/palm.pt')

# load your pretrained reward model

reward_model = RewardModel(
    palm,
    num_binned_output = 5
).cuda()

reward_model.load('./path/to/pretrained/reward_model.pt')

# ready your list of prompts for reinforcement learning

prompts = torch.randint(0, 256, (50000, 512)).cuda() # 50k prompts

# pass it all to the trainer and train

trainer = RLHFTrainer(
    palm = palm,
    reward_model = reward_model,
    prompt_token_ids = prompts
)

trainer.train(num_episodes = 50000)

# then, if it succeeded...
# generate say 10 samples and use the reward model to return the best one

answer = trainer.generate(2048, prompt = prompts[0], num_samples = 10) # (<= 2048,)

Todo

  • clone base transformer with separate lora for critic

  • also allow for non-LoRA based finetuning

  • redo normalize to be able to have a masked version (see the sketch after this list); not sure if anyone will ever use per-token rewards / values, but it is good practice to implement

  • equip with the best attention

  • add Hugging Face accelerate and test out wandb instrumentation

  • search literature to figure out what is the latest SOTA for PPO, assuming RL field is still making progress.

  • test the system using a pretrained sentiment network as reward model

  • write the memory in PPO to memmapped numpy file

  • get sampling with variable-length prompts working, even if it is not needed given the bottleneck is human feedback

  • allow for fine-tuning only the penultimate N layers in either actor or critic, assuming pretrained

  • incorporate some learning points from Sparrow, given Letitia's video

  • simple web interface with django + htmx for collecting human feedback

  • consider RLAIF
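For the masked normalize item above, a hedged sketch of what a mask-aware whitening helper could look like (a hypothetical function, not the repo's code):

import torch

def masked_normalize(t, mask, eps = 1e-5):
    # whiten values over unmasked positions only; masked positions contribute
    # nothing to the mean / variance and are zeroed out in the result
    mask = mask.float()
    n = mask.sum(dim = -1, keepdim = True).clamp(min = 1)
    mean = (t * mask).sum(dim = -1, keepdim = True) / n
    var = ((t - mean) ** 2 * mask).sum(dim = -1, keepdim = True) / n
    return (t - mean) * mask / (var + eps).sqrt()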

Citations

@article{Stiennon2020LearningTS,
    title   = {Learning to summarize from human feedback},
    author  = {Nisan Stiennon and Long Ouyang and Jeff Wu and Daniel M. Ziegler and Ryan J. Lowe and Chelsea Voss and Alec Radford and Dario Amodei and Paul Christiano},
    journal = {ArXiv},
    year    = {2020},
    volume  = {abs/2009.01325}
}
@inproceedings{Chowdhery2022PaLMSL,
    title   = {PaLM: Scaling Language Modeling with Pathways},
    author  = {Aakanksha Chowdhery and Sharan Narang and Jacob Devlin and Maarten Bosma and Gaurav Mishra and Adam Roberts and Paul Barham and Hyung Won Chung and Charles Sutton and Sebastian Gehrmann and Parker Schuh and Kensen Shi and Sasha Tsvyashchenko and Joshua Maynez and Abhishek Rao and Parker Barnes and Yi Tay and Noam M. Shazeer and Vinodkumar Prabhakaran and Emily Reif and Nan Du and Benton C. Hutchinson and Reiner Pope and James Bradbury and Jacob Austin and Michael Isard and Guy Gur-Ari and Pengcheng Yin and Toju Duke and Anselm Levskaya and Sanjay Ghemawat and Sunipa Dev and Henryk Michalewski and Xavier Garc{\'i}a and Vedant Misra and Kevin Robinson and Liam Fedus and Denny Zhou and Daphne Ippolito and David Luan and Hyeontaek Lim and Barret Zoph and Alexander Spiridonov and Ryan Sepassi and David Dohan and Shivani Agrawal and Mark Omernick and Andrew M. Dai and Thanumalayan Sankaranarayana Pillai and Marie Pellat and Aitor Lewkowycz and Erica Oliveira Moreira and Rewon Child and Oleksandr Polozov and Katherine Lee and Zongwei Zhou and Xuezhi Wang and Brennan Saeta and Mark Diaz and Orhan Firat and Michele Catasta and Jason Wei and Kathleen S. Meier-Hellstern and Douglas Eck and Jeff Dean and Slav Petrov and Noah Fiedel},
    year    = {2022}
}
@article{Hu2021LoRALA,
    title   = {LoRA: Low-Rank Adaptation of Large Language Models},
    author  = {Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Weizhu Chen},
    journal = {ArXiv},
    year    = {2021},
    volume  = {abs/2106.09685}
}
@inproceedings{Sun2022ALT,
    title     = {A Length-Extrapolatable Transformer},
    author    = {Yutao Sun and Li Dong and Barun Patra and Shuming Ma and Shaohan Huang and Alon Benhaim and Vishrav Chaudhary and Xia Song and Furu Wei},
    year      = {2022}
}
@misc{gilmer2023intriguing,
    title   = {Intriguing Properties of Transformer Training Instabilities},
    author  = {Justin Gilmer and Andrea Schioppa and Jeremy Cohen},
    year    = {2023},
    status  = {to be published - one attention stabilization technique is circulating within Google Brain, being used by multiple teams}
}
@inproceedings{dao2022flashattention,
    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author  = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year    = {2022}
}

palm-rlhf-pytorch's People

Contributors

conceptofmind, ell-hol, eltociear, hypnopump, lucidrains


palm-rlhf-pytorch's Issues

Is this shift right for the action logits?

action_logits = shift(action_logits, shift = 1, dim = -2) # need to shift along sequence dimension by 1, since actions start from the last prompt (state) token

Hello, since action_logits originally covers the prompt plus the response logits, I wonder whether shifting along the sequence dimension by 1 is really the right thing to do. Shouldn't it be shifted left by the prompt length so that only the action logits remain? The same applies to the corresponding line in the learn function.
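For context, here is a hedged sketch (not the repo's exact code) of the alignment a one-token shift is meant to achieve in an autoregressive model, where the logits at position t parameterize the distribution over the token at position t + 1:

import torch
import torch.nn.functional as F

def log_probs_of_actions(logits, sequence, prompt_len):
    # logits: (batch, seq_len, vocab), sequence: (batch, seq_len) = prompt + response
    pred_logits = logits[:, :-1]                # logits that predict the next token
    targets = sequence[:, 1:]                   # the tokens that were actually produced
    logp = F.log_softmax(pred_logits, dim = -1)
    logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logp[:, prompt_len - 1:]             # keep only positions that generated response tokens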

The loss function of the reward model.

Hi, I am confused: the loss function of ChatGPT's reward model takes the difference between the scores of two responses and passes it through a sigmoid. However, the loss function in this repo takes only one response as input and uses the ranking score as a label to compute a cross-entropy loss. Is there an advantage to this?
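For comparison, a hedged sketch of the pairwise preference loss used in the InstructGPT-style papers (a hypothetical helper, not this repo's loss):

import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen, reward_rejected):
    # reward_*: (batch,) scalar rewards for the preferred / dispreferred response
    # loss = -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()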

Calculating the KL loss seems to have a mistake.

code:
kl_div_loss = masked_kl_div(action_probs, old_action_probs, mask = action_masks) * self.kl_div_loss_weight

I think old_action_probs should be y_true and action_probs should be y_pred, so the correct call would be:
kl_div_loss = masked_kl_div(old_action_probs, action_probs, mask = action_masks) * self.kl_div_loss_weight

Am I right, or am I misunderstanding something?
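For reference, a hedged sketch of a masked KL divergence with the conventional argument order, KL(p || q) = sum p * (log p - log q), where p is the reference (old) distribution (a hypothetical helper, not the repo's implementation):

import torch

def masked_kl_div(p, q, mask = None, eps = 1e-6):
    # p, q: (batch, seq, vocab) probability distributions; mask: (batch, seq)
    kl = (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim = -1)
    if mask is not None:
        return (kl * mask).sum() / mask.sum().clamp(min = 1)
    return kl.mean()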

✨ 😅 Is it possible to use OpenAI's ChatGPT to train this ChatGPT?

OpenAI used 40 people to train their own ChatGPT, and the annotation process lasted for 3 months.

It is difficult for our open-source community (GitHub) to reproduce the Reinforcement Learning from Human Feedback (RLHF) for this work, as OpenAI employed 40 people to provide the human feedback.

However, we can treat OpenAI's web version of ChatGPT as the human, who can annotate data ✨ for us when training our own ChatGPT.

Step 2: a labeler (human or OpenAI ChatGPT) ranks the outputs from best to worst.

[image: chatgpt.png]

This sounds a bit funny😅, but I currently think it's doable.
@lucidrains

Confusion about KL divergence calculation for human feedback policies

Hi, thanks for the great work.
I also have a question about KL divergence loss.
In papers like Learning to summarize from human feedback, the KL term for human feedback policies seems to be the KL divergence between $\pi^{RL}$ and $\pi^{SFT}$, while in this repo the code

kl_div_loss = masked_kl_div(action_probs, old_action_probs, mask = action_masks) * self.kl_div_loss_weight

seems to be the KL divergence between $\pi^{new}$ and $\pi^{old}$.

Does there exist something wrong with the code, or have I made some mistakes?
Thank you.
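For context on the question above, a hedged sketch of the penalty described in those papers, where the reward is reduced by the per-token KL between the current RL policy and a frozen SFT reference (assumed tensor shapes, not this repo's code):

def kl_penalized_reward(reward, logp_rl, logp_sft, beta = 0.02):
    # logp_rl, logp_sft: (batch, response_len) log-probs of the sampled tokens
    # under the RL policy and under the frozen SFT policy respectively
    per_token_kl = logp_rl - logp_sft # sample-based estimate of KL(pi_RL || pi_SFT)
    return reward - beta * per_token_kl.sum(dim = -1)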

Can we just replace PPO + RLHF with a preference model that's basically a transformer encoder + sigmoid, trained with BCE, and during fine-tuning perform reward maximization by just making the reward model predict 1s?

At a meta level, PPO-based RLHF is performing minor adjustments to the weights to align with human feedback.

Can we just replace PPO + RLHF with a preference model that's basically a transformer encoder + sigmoid, trained with BCE, and during fine-tuning perform reward maximization by just making the reward model predict 1s?

Sorry if I am being naive. I do not have much experience with either RL or large language models, but I would like to contribute a basic PyTorch pipeline to do the following.

  1. Train a transformer encoder + sigmoid classifier on the global token. (Preference model trainer)
  2. Freeze the preference model and stick it on top of the large language model (assuming it is end-to-end differentiable).
  3. Maximize preference, with gradient clipping on the base model.

RLHFTrainer already implements large parts of 1 and 2, the reward model is already in place, and PaLM is already there. So I hope that making the changes to just maximize preference, where we concatenate the input and response and have the reward model predict all 1s, won't be too difficult.

There are issues with backpropagating meaningful end-to-end gradients, since predictions are beam-searched at inference, but I hope we can fix that with some tricks.

A few questions on training

Hi, I've been planning to train this model. I have a TPU pod (v3-128) through TRC, which should equate to roughly 5 TB of RAM and 2 TB of VRAM, and I had a few questions about how to begin training.

  1. What would be an appropriate dataset to train on? I was considering using The Pile for pretraining, but gathering human feedback for RLHF still seems like a challenge.
  2. How large of a model would you recommend?
  3. I saw you mentioned flash attention; are there any drawbacks to using it? It seems to be practically the best attention.

Thanks for all of your implementations, they have been really helpful to learn from.

norm.gamma not used during backprop

Hi @lucidrains ,

I am almost ready to deploy the distributed training run. One thing I noticed is that norm.gamma is an unused parameter.

class LayerNorm(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.register_buffer("beta", torch.zeros(dim))

    def forward(self, x):
        return F.layer_norm(x, x.shape[-1:], self.gamma, self.beta)

This throws an error during distributed training.

This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 

Find unused parameters:

    for _ in range(GRADIENT_ACCUMULATE_EVERY):
        loss = model(next(train_loader), return_loss = True)
        accelerator.backward(loss / GRADIENT_ACCUMULATE_EVERY)

    for name, param in model.named_parameters():
        if param.grad is None:
            print("NONE")
            print(name)

Output:

NONE
norm.gamma

This is resolved by setting find_unused_parameters=True at the cost of double forward.
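For reference, a hedged sketch of passing that flag through Hugging Face accelerate, assuming accelerate is what wraps the model in DistributedDataParallel (as the accelerator.backward call above suggests):

import torch
from accelerate import Accelerator, DistributedDataParallelKwargs

# forward find_unused_parameters to the underlying DistributedDataParallel wrapper
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters = True)
accelerator = Accelerator(kwargs_handlers = [ddp_kwargs])

model = torch.nn.Linear(512, 512) # stand-in for the actual PaLM model
optim = torch.optim.Adam(model.parameters(), lr = 3e-4)
model, optim = accelerator.prepare(model, optim)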

I was wondering if you had any idea why this may be the case or if there is a proper way to resolve this issue.

I greatly appreciate your input as always.

Thank you,

Enrico

Simple Web Interface

Hi @lucidrains ,

I had previously started working on a web application with the FARM (FastAPI, React, MongoDB) stack for collecting annotated query and answer data with human feedback reward signals (thumbs up is +1, thumbs down is -1). The web application allows a user to input a search, the model outputs an answer (any model can be used in the FastAPI backend) which the user can rank, and the ranked response is stored in MongoDB. The user can host the application on whatever platform they choose.

Here is the initial React UI with just filler text. I am going to make the final version much "prettier":

[Screenshots of the initial React UI]

I have already begun working with hwchase of LangChain to open-source the web application and thought this might be of interest to add to this repository, since you were looking to add a simple web interface with Django + htmx for collecting human feedback. I am also meeting with Raza of Humanloop in a few weeks to add the option of integrating their platform for human feedback.

I can open up a PR once I have everything functioning as intended.

Let me know what you think and whether this would sufficiently cover that point on the TODO list.

Thank you,

Enrico

Column and Row Parallel Linear for Apex Tensor Parallel

Hi,

I was exploring using Tensor Parallel when training. I was wondering if you had any input on the correct use of RowParallelLinear when it comes to the feedforward out.

For example:

Column Parallel over q, k, v, and ff inner.

self.fused_attn_ff_proj = apex.transformer.tensor_parallel.ColumnParallelLinear(
  dim, 
  sum(self.fused_dims), 
  bias=False,
  gather_output=False,
  init_method=nn.init.xavier_uniform_
)

Row Parallel over attn out.

self.attn_out =  apex.transformer.tensor_parallel.RowParallelLinear(
  attn_inner_dim, 
  dim, 
  bias=False,
  input_is_parallel=True,
  init_method=nn.init.xavier_uniform_
)

I am not 100% sure whether this should be Row Parallel as well.

self.ff_out = nn.Sequential(
    SwiGLU(),
    apex.transformer.tensor_parallel.RowParallelLinear(
      ff_inner_dim, 
      dim, 
      bias=False,
      input_is_parallel=True,
      init_method=nn.init.xavier_uniform_
    )
)

Normally I would just do Column Parallel, SwiGLU, Row Parallel in a standard FeedForward, but it is not super clear to me how to handle this case when it comes to the fused attn + ff projection and the feedforward tail.

Any input would be greatly appreciated.

Thank you,

Enrico

Flash Attention 2

Hi Phil,

I was wondering what your thoughts on adding Flash Attention 2 are?

n, device, h = x.shape[1], x.device, self.heads

# pre layernorm

x = self.norm(x)

# attention queries, keys, values, and feedforward inner

q, k, v, ff = self.fused_attn_ff_proj(x).split(self.fused_dims, dim=-1)

# split heads
# they use multi-query single-key-value attention, yet another Noam Shazeer paper
# they found no performance loss past a certain scale, and more efficient decoding obviously
# https://arxiv.org/abs/1911.02150

q = rearrange(q, "b n (h d) -> b h n d", h=h)

# rotary embeddings

positions = self.get_rotary_embedding(n, device)
q = apply_rotary_pos_emb(positions, q)
k = apply_rotary_pos_emb(positions, k)

k = rearrange(k, 'b ... -> b 1 ...').expand_as(q)
v = rearrange(v, 'b ... -> b 1 ...').expand_as(q)

"""
q: (batch_size, seqlen, nheads, headdim)
k: (batch_size, seqlen, nheads_k, headdim)
v: (batch_size, seqlen, nheads_k, headdim)
out: (batch_size, seqlen, nheads, headdim).
"""

q, k, v = map(lambda x: rearrange(x, 'b h n d -> b n h d'), (q, k, v))

attn = flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=self.scale, causal=True)

# merge heads

out = rearrange(attn, "b n h d -> b n (h d)")
out = self.attn_out(out) + self.ff_out(ff)

Thank you,

Enrico

PaLM-rlhf-pytorch Roadmap

Hi,

Unfortunately, the comments in the thread "An Open-Source Version of ChatGPT is Coming" sound too technical to my ears. I would like a summary, a roadmap of who is starting (or wants to start) now, what they are doing, and with which means, so that I have an idea, in my role as a user, of when I can get involved in training the open-source language model, just as I am currently doing for OpenAI's ChatGPT.

Unified reward function/model architecture for a wide range of tasks

I find the reward function to be the most important part of RLHF, because it is the part which mimics a human evaluator, providing instant feedback to the model.

However, due to ChatGPT's wide range of language capabilities, it is hard to model such a reward function with a single model that is prompt-dependent, context-aware, and leverages existing knowledge from pretrained models.

Most projects relating to RLHF use toy-like reward functions such as counting word frequencies, checking output formats, or sentiment/fluency scores. These functions do not "think" like a human evaluator, considering every factor as a whole. RL4LMs proposes GRUE, in which the model performs general instructions, but it does not expose a simple unified interface to get a score given a prompt and an answer.

RL4LMs contains a registry of reward functions, which I find complex and which does not leverage the current pretrained model (by current I mean the SFT model we are working on, in this case PaLM). I think a reward function should be an integrated part of the language model itself, rather than outsourced to other models with different architectures that require separate pretraining and fine-tuning, and it should be able to attribute the reward to fine-grained sections of the output.

Encoder-Decoder

The follow-up research from PaLM switched, with Flan-PaLM, to the encoder-decoder T5 architecture. How would it be possible to also add an encoder to this implementation?

Cannot train the model using PyTorch version 2

Dear Phil,
I love not only the source code, but also all your contributions to the open community.
While trying to customize this source code using PyTorch 2, I ran into an error. The error message is shown below:
from user code:
" File "/path/to/code/PaLM-rlhf/palm_rlhf_pytorch/palm.py", line 254, in
sim = sim.masked_fill(causal_mask, -torch.finfo(sim.dtype).max)

Set torch._dynamo.config.verbose=True for more information

You can suppress this exception and fall back to eager by setting:
torch._dynamo.config.suppress_errors = True
"
Can you help me fix it?
All the best.
Linh

Help with computational power

Hi, I work at a company that wants to help. We have computational power and we would like to talk more about it; would that be possible?

Training the reward model

Hi, in training the reward model it seems that seq and prompt_mask should have the same length. Would you be able to elaborate on training the reward model with prompts of different lengths? Would it be right to do something like this:

        mask = torch.zeros(1,(seq[0]==prompt_id[0]).nonzero() ).bool()
        prompt_mask = torch.cat((mask[0], torch.ones(1, seq.shape[1]- mask.shape[1])[0]),0).bool().unsqueeze(0).cuda()
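For comparison, a hedged sketch (not from the repo) of building a prompt mask with the same length as seq, assuming the prompt length of each sample is known:

import torch

seq = torch.randint(0, 20000, (1, 1024))
prompt_len = 312 # hypothetical prompt length for this sample

positions = torch.arange(seq.shape[-1])
prompt_mask = (positions < prompt_len).unsqueeze(0) # same shape as seq; check the repo's convention for which polarity marks the prompt

assert prompt_mask.shape == seq.shape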

Can we exploit the AGI abilities of ChatGPT?

ChatGPT performing a complex mission:

We found that ChatGPT can do many things, so we wrote a small contextual operating system to explore ChatGPT's AGI capabilities in the real world, with the help of a server's capabilities (networking, storage, sending information, generating voice, pictures, picture recognition, and other communication skills).

The prompt starts:

  • Scenario description:
    "In a future world, humans and artificial intelligence can communicate freely like friends, and they can share ideas, discuss problems, and even solve complex problems together. Artificial intelligence can help humans understand the world better, and can provide valuable suggestions that allow humans to better achieve their goals.

  • Initialization settings:
    You are the AI, and every time we send you contextual content, you respond. But you can only accept no more than 3000 words, and no more than 1000 words per reply. In order to complete complex tasks, we can call the corresponding API with the results you output, or execute the corresponding commands, and recombine the results with this paragraph and send them to you, because you have no memory and can only reply according to the context.

You need to reply in this format:

@@Current mission:
	(mission info)
Processing steps:
Descript:
@@
[cmd:cmd_name][input]
...

* You have received a new task, you should reply some commands, so that next time the system will process according to your reply content, and all the information inserted into the above will be sent to you.

....

I have tested it a lot, and it works!

Possible incorrect creation of Rotary Embeddings

Disclaimer: I don't have any idea how this codebase works. I was just trying to implement rotary embeddings on my own for a personal project, and I was using the class defined in palm.py as a starting point.

The thing is, I'm not sure if the current implementation of Rotary Embeddings is correct. Specifically, I don't think the following line is correct:

x1, x2 = x.chunk(2, dim=-1)
return torch.cat((-x2, x1), dim=-1)

Because for rotary embeddings we want to swap pairs of adjacent elements and negate the even elements (i.e., turn [1,2,3,4,5,6] into [-2,1,-4,3,-6,5]). But the code above swaps the two halves of the tensor and negates the first half (i.e., turns [1,2,3,4,5,6] into [-4,-5,-6,1,2,3]).

Is the code incorrect or is there something I'm missing?
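For context, a hedged sketch contrasting the two common rotary conventions; both are valid as long as the sin/cos frequency layout matches the rotation function, but they are not interchangeable with each other's layout:

import torch

def rotate_half(x):
    # "half" convention (GPT-NeoX style, as in palm.py):
    # [1, 2, 3, 4, 5, 6] -> [-4, -5, -6, 1, 2, 3]
    x1, x2 = x.chunk(2, dim = -1)
    return torch.cat((-x2, x1), dim = -1)

def rotate_interleaved(x):
    # interleaved convention (as in the original RoFormer paper):
    # [1, 2, 3, 4, 5, 6] -> [-2, 1, -4, 3, -6, 5]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim = -1).flatten(-2)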

Why does the value calculation in generate and learn use different masks?

I'm very confused about the value calculation: why are different masks used? In the generate method, the mask includes the prompt, but when training in the learn method, the mask does not include the prompt.

This is in the learn method:

action_masks = ~prompt_masks & masks

action_logits, values = self.actor_critic(
    sequences,
    mask = action_masks
)

and this is in the generate method:

mask = None

if exists(eos_token):
    mask = ((sequence == eos_token).cumsum(dim = -1) == 0)
    mask = F.pad(mask, (1, -1), value = True) # include eos token

action_logits, value = self.forward(
    sequence,
    mask = mask,
    return_values = return_values
)

Issue training the reward model

I can't train the reward model with a batch:

    seq, prompt_mask, labels = next(train_loader)
    loss = reward_model(seq, prompt_mask = prompt_mask, labels = labels)
    accelerator.backward(loss / GRADIENT_ACCUMULATE_EVERY)

I set this up but get an error. Checking the source code, I found this:

    if  self.binned_output:
        return F.mse_loss(pred, labels)

    return F.cross_entropy(pred, labels)

cross_entropy does not seem to support my batched training set; I changed it to mse_loss and still get an error.

How do I compute the loss over a batched training set, e.g. with batch size 8?

How to use LoRA?

When I use this code, I find that the model does not use LoRA. Can you give me an example?

I'm dumb

Hey, is there a website to interact with the model, or how much would it cost to train the model on my own computer? I play Valorant on low graphics, so my machine is not the best, as you can imagine.

KL divergence loss

Hi, thanks for the great repo.
I have a question: in the function masked_kl_div of ppo.py, shouldn't the calculation be prob1 * (log(prob1) - log(prob2))?
The calculation in the code is a negative KL loss, which would then be maximized rather than minimized (contrary to what the code assumes).

Value function

Hi,

I am confused about the 'value function' in the InstructGPT paper. The paper says, "As previously mentioned, for all PPO models we use a 6B RM and a 6B value function, and the latter is initialized from the former." The reward model (RM) and the value function model seem to be two separate models. However, I don't see where the value function is involved in the PPO RL training, either in the objective function or elsewhere in the paper.

Thanks

GPU requirements

Hi, first of all thanks for your work. I will definitely give it a try.

I was wondering if you could share some information about the training time and which GPUs are needed to train the model, and whether you have any recommendations about the size of the datasets to use.

Thanks a lot once again and merry Christmas!

Should critic's input be prompt only?

In the PPO implementation, it seems that the critic model takes both the prompt and the generated actions as input (or, if pooled is true, the generated actions only). However, if we see the prompt as S_t and the prompt with actions as S_{t+T}, shouldn't the value function be V(S_t) rather than V(S_{t+T})?

In other words, when calculating the advantage function, shouldn't our value function be the average reward for a prompt?

Reason for using pooled critic embedding instead of the last embedding for value head

Hi there,

In your ActorCritic.forward() I found that you do

critic_embeds = masked_mean(critic_embeds, mask, dim = 1)

and then feed critic_embeds to the value head. I suppose this means you average over all the action embeddings and estimate the value from that.

May I ask if there is a specific reason for this? Other implementations I found just feed the very last embedding (i.e., critic_embeds[:, -1, :]) to the value head, which seems more intuitive to me. For example, TRL and TRLX.
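For reference, a hedged sketch of the two pooling choices being compared, over hidden states of shape (batch, seq, dim) and a boolean mask of shape (batch, seq):

import torch

def masked_mean_pool(embeds, mask):
    # average the embeddings over unmasked positions (the approach used here)
    mask = mask.unsqueeze(-1).float()
    return (embeds * mask).sum(dim = 1) / mask.sum(dim = 1).clamp(min = 1)

def last_token_pool(embeds, mask):
    # take the embedding at the last unmasked position (TRL / trlX style)
    last_idx = mask.float().cumsum(dim = 1).argmax(dim = 1) # index of the last True per row
    return embeds[torch.arange(embeds.shape[0]), last_idx]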

Best
