carperai / trlx Goto Github PK

View Code? Open in Web Editor NEW

4.3K 49.0 458.0 46.69 MB

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)

License: MIT License

Python 98.48% Dockerfile 0.28% Shell 1.24%

machine-learning pytorch reinforcement-learning

trlx's People

Contributors

Stargazers

Watchers

Forkers

odellus lxuechen techthiyanes mrm8488 dagelf woody0105 brunotech samp830 marcus-arcadius reshinthadithyan shivanshupurohit alaatekleh ssahgal kaushikpatnaik c00renut guttappa1238 codeaudit jon-tow daia99 aogara-ds b1sounours thedch czwin32768 dumpmemory phungvanduy ayulockin shamspias simoninithomas jovsa gabrielhuang dongs0104 ethankim00 dpaleka psodhi cg80499 nashid thomfoster danyang-rainbow stjordanis steliord theexgenesis alekseykorshuk sengxian anubrata kunlun-zhu expert68 maksymdel aicrumb samithaj eublefar artek0chumak leshanbog kit-se stanleyjacob knowledgecluster osanseviero kaixianglin richmix ajunlonglive voxmenthe robertalanm kajfasz codethemath stat-eklee fredpwol alirezabayatmk jovany-wang brentes chorseng cclauss eogns282 hadryan mbrukman tanliboy mistobaan mmizutani rainmaker712 yesthing embeddedsamurai macguyversmusic pgill2 miaaazhang ainirobot hsl89 melvinebenezer hertera1 ggsonic paname75 tanzzilaalam ht-zhou fdoperezi tomjrtsmith doytsujin carlocayos kenhktsui t46 sayanc93 yard1 berenmillidge arjunchandra

trlx's Issues

Support for RLHF tuned Seq2Seq models

🚀 The feature, motivation, and pitch

Trlx repo supports decoder-only models such as GPT2, GPTJ etc. It will be beneficial to implement an Encoder-Decoder model for RLHF finetuning.
Providing an implementation that supports Encoder-Decoder models will be beneficial as they are also widely used in many NLP downstream tasks.
It will be good to have T5 as a base model for the Encoder-Decoder arch.

Alternatives

No response

Additional context

No response

Remove boiler plate between ILQL and PPO

@reciprocated mentioned that there is a lot of duplicated code between ILQL and PPO. We need to resolve this before we begin to add more RL algorithms.

Stable diffusion support

A text-to-image RLHF pipeline and orchestrator is needed.

Examples/simulacra.py doesn't work

🐛 Describe the bug

When running the script, it crashes with a SQL error:

Traceback (most recent call last):
  File "examples/simulacra.py", line 9, in <module>
    conn = sqlite3.connect("data/sac_public_2022_06_29.sqlite")
sqlite3.OperationalError: unable to open database file

This is on a fresh install of trlX on StabilityAI's cluster using standard configuration files.

Which trlX version are you using?

Alpha v0.2

Additional system and package information

No response

Add LORA support to TRLX

🚀 The feature, motivation, and pitch

LORA and other parameter-efficient methods can provide a number of advantages when finetuning large language models. These methods typically only update a small fraction of the model parameters during finetuning. For example, LORA only trains low-rank reparameterizations of weight matrices during training reducing parameter cost up to 10000 times.

Key Advantages:

Reduced storage cost and ability to switch out different adapters for different tasks
Increased training efficiency and reduced memory requirements since fewer optimizer states need to be tracked

Alternatives

No response

Additional context

The OpenDelta library provides support for LORA and other "delta methods"

model =  AutoModelForCausalLM.from_pretrained(model_base)
from opendelta import LoraModel
delta_model = LoraModel(backbone_model=model, modified_modules=['fc2'])
delta_model.freeze_module(exclude=["deltas", "layernorm_embedding"], set_state_dict=True)
# save only  trained parameters
delta_model.save_finetuned(save_path)

RLHF with HH Anthropic data

Do RLHF using Anthropic's HH data, using existing models.
Depends on #25

(cc @haileyschoelkopf)

DDP and hydra model

🐛 Describe the bug

Hydra model doesn't play nicely with ddp

  File "examples/ppo_sentiments.py", line 18, in <module>
    model = trlx.train(
  File "/trlx/trlx/trlx.py", line 92, in train
    model.learn()
  File "/trlx/trlx/model/accelerate_base_model.py", line 209, in learn
    loss, stats = self.loss(batch)
  File "/trlx/trlx/model/accelerate_ppo_model.py", line 112, in loss
    logits, _, vpred = self.model(
  File "/trlx/.env/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/trlx/.env/lib64/python3.8/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by
passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the
return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 31 32 33
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false

accelerate launch --num_processes 2 --num_machines 1 --config_file ddp.yaml examples/ppo_sentiments.py

This is a relevant discussion
pytorch/pytorch#43259

Which trlX version are you using?

stage-api @ 8057d16

Additional system and package information

No response

PPO Implementation Details - Checklist

The 37 Implementation Details of PPO, a blog post published at ICLR, details a number of PPO implementation details to improve both efficiency and model performance. See also: Andrychowicz et al., Engstrom et al.

Some of these optimizations are minor and probably irrelevant, many are already implemented here, and some may provide performance boosts to trlx. This issue documents these details as a checklist, to track the progress of this repository towards the entire list.

Other items in the blog post are environment/network specific to problems trlx does not tackle. Andrychowicz also contains other hyperparameter choices not mentioned here which may be of interest.

Example in read me is now wrong

Since the reward function was moved outside of the orchestrator, the example in the read me is no longer correct and will no longer run. I will update it.

Implement model saving/loading

Learnt Reward Modelling example

Create an example showing reward modeling. This could use a synthetic reward source artificially limited, or the HHH Anthropic data (already on the Stability cluster).
More ideas for tasks: #13 (comment)
(cc @haileyschoelkopf)

Conceptual explanation of hydra models

Could you please give a conceptual explanation of the hydra models? Very interested in how they work! Thank you!

fix type errors and add mypy

We should also add it as a pre-merge hook

Self Play

Self play, and generally multi-LM-agent settings are something we are very interested in exploring. What does it take to support this? Does it already work without big overheads?

Significant performance drop of ILQL when using multi-GPU training

🐛 Describe the bug

I was running experiments using the ILQL sentiment example code. When using a single A100 GPU, I got an evaluation score of 0.9286 after 1k steps of training. However, when I switched to multi-GPU training (2 A100s), after 1000 steps, I got a score of 0.692. I use the Huggingface accelerate. All hyper-parameters are the same. Any idea why this happens?

Multi-GPU training command:
accelerate launch --config_file accelerator_config.yaml examples/ilql_sentiments.py

Which trlX version are you using?

trlX==0.3.0

Additional system and package information

pytorch==1.13.0+cu116

trlx has no attribute 'train'

🐛 Describe the bug

Trying to run a train loop with trlx

Erroring with trlx has no attribute train

Code being ran (from README):

import trlx

# optimize some reward function
model = trlx.train('gpt2', reward_fn=lambda samples: [sample.count('cats') for sample in samples])

# or steer a model with a collection of rated samples
model = trlx.train('EleutherAI/gpt-j-6B', dataset=[('dolphins', 'geese'), (1.0, 100.0)])

# model is a wrapper with some logit preprocessing
model.generate(**tokenizer('Q: Who rules the world? A:', return_tensors='pt'), do_sample=True)

Behavior:

  File "/home/[user]/trlx/main.py", line 4, in <module>
    model = trlx.train('gpt2', reward_fn=lambda samples: [sample.count('cats') for sample in samples])
AttributeError: module 'trlx' has no attribute 'train'

Which trlX version are you using?

master

Additional system and package information

OpenSUSE + Python 3.10.7

Large reward model issue.

If the reward model cannot fit on a single GPU, which will be the case when we are training our instruct GPT model, then the current system fails since you would have to run two accelerate instances at once.

What deepspeed config was this tested on?

📚 The doc issue

There is no information on what config or machines this was tested on, nor what the results actually were. I was unable to get my configuration to work for the example code, but I might be using an untested deepspeed configuration (e.g., stage 3 offloading). I'd like to test with the validated configuration.

Suggest a potential alternative/fix

Could you add the tested configurations and machines? Thanks!

How we can improve the documentation for beginners

📚 The doc issue

Hey there 👋 I've done a first review of the documentation.

For context, I've skimmed the documentation as a beginner to see where are the friction points.

I've some solutions for some, that's why I opened a PR: #64

For others I prefer for now to open an issue to discuss with you on how we can improve it.

So here's the points:

The documentation is clear and very well written. I think though we need to explain more how TRLX works in a nutshell. For instance, Leandro in TRL have a section where he made very good illustration to explain how it works https://github.com/lvwerra/trl/#how-it-works

For the example page I was thinking on a readme explaining each different example (and update the website documentation)
A simple colab as a quick starter can be interesting to have. For instance for SB3 integration we created a small one where in 10min you trained your first agent and load a PPO agent from the Hub to play Space Invaders. Having that kind of quick run colab can help people to rapidly get the big picture of the library.
We miss a Bibtex for citing.

WDYT? 🤔

Suggest a potential alternative/fix

PR: #64

FasterTransformer reward model support

🚀 The feature, motivation, and pitch

We need the ability to use massive reward models, as this will be necessary for our Instruct GPT model. Currently the size of the reward model is greatly limited and using GPU accelerators for them comes with weird sets of limitations.

Alternatives

We could alternatively use a different accelerate script for the reward model, or include the reward model within the student class. Doing the latter would be trivial but result in kind of gross code and not very easily extendible.

Additional context

No response

AttributeError: 'DistributedDataParallel' object has no attribute 'generate'

🐛 Describe the bug

When I ran accelerate launch examples/ppo_sentiments.py, the error below happened. Am I supposed to unwrap the ddp model?

AttributeError: 'DistributedDataParallel' object has no attribute 'generate'
╭──────────────────────────── Traceback (most recent call last) ────────────────────────────╮
│                                                                                           │
│ /home/user/bob_workspace/code/trlx/examples/ppo_sentiments.py:38 in <module
│   35 │   orch: PPOOrchestrator = get_orchestrator(cfg.train.orchestrator)(                │
│   36 │   │   model, pipeline, reward_fn=reward_fn, chunk_size=cfg.method.chunk_size       │
│   37 │   )                                                                                │
│ ❱ 38 │   orch.make_experience(cfg.method.num_rollouts)                                    │
│   39 │   model.learn()                                                                    │
│   40 │                                                                                    │
│   41 │   print("DONE!")                                                                   │
│ /home/user/bob_workspace/code/trlx/trlx/orchestrator/ppo_orchestrator.py:64 in        │
│    63 │   │   │                                                                   [82/2259]
│ ❱  64 │   │   │   query_tensors, response_tensors, response_text = self.rl_model.act(batc │
│    65 │   │   │   texts = [q + r for q, r in zip(batch.text, response_text)]              │
│    66 │   │   │   scores = self.score(texts)                                              │
│    67                                                                                     │
│                                                                                           │
│ /home/user/bob_workspace/code/trlx/trlx/model/accelerate_base_model.py:121 in act     │
│                                                                                           │
│   118 │   │   │   │   self.dummy_input.to(self.accelerator.device)                        │
│   119 │   │   │   )  # Dummy pass to make things play nice with accelerate                │
│   120 │   │   │   # Removed synced gpus                                                   │
│ ❱ 121 │   │   │   response = self.model.generate(                                         │
│   122 │   │   │   │   query_tensors,                                                      │
│   123 │   │   │   │   pad_token_id=self.tokenizer.eos_token_id,                           │
│   124 │   │   │   │   **self.config.method.gen_kwargs                                     │
│                                                                                           │
│ /opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py:1185 in __getattr__     │
│                                                                                           │
│   1182 │   │   │   modules = self.__dict__['_modules']                                    │
│   1183 │   │   │   if name in modules:                                                    │
│   1184 │   │   │   │   return modules[name]                                               │
│ ❱ 1185 │   │   raise AttributeError("'{}' object has no attribute '{}'".format(           │
│   1186 │   │   │   type(self).__name__, name))                                            │
│   1187 │                                                                                  │
│   1188 │   def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:

My accelerate config

- `Accelerate` version: 0.13.2
- Platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
- Python version: 3.9.5
- Numpy version: 1.23.4
- PyTorch version (GPU?): 1.11.0 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - main_process_ip: None
        - main_process_port: None
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - downcast_bf16: no

Which trlX version are you using?

trlx==1.0.0

Additional system and package information

No response

Benchmark suite

We should use RL4LMs benchmark suite, I think it is a strong candidate to show the strengths and weaknesses of TRLX.

Better colab notebook example.

We want an example that prompt engineers a language model to be the critic. Jasper offered during TRLX weekly. Creating this issue for reference.

Ray Tune sweep does not support multi GPU

🐛 Describe the bug

There is no way to use multiple gpu if you're using Ray Tune, apparently we probably need to wrap ray.train.torch.TorchTrainer for it to work.

It appears this is what pytorch lightning does.

Which trlX version are you using?

trlx==0.3.0

Additional system and package information

No response

Add Jax support

🚀 The feature, motivation, and pitch

Add jax support for RLHF on TPUs.

Alternatives

No response

Additional context

No response

Gradient metrics

Measure gradient norms and gradient noise of RL training. This is to address open questions around scaling gradients when mixing RL and non-RL tasks, as well as informing optimal batch sizes

Support direct loading into rollout storage for reward labeled datasets

[bug] Support for Soft Prompts in PPO Model

Towards replication of ELM Stage 3, I'm looking into adding softprompts to train a conditional learnable embedding with PPO for each terrain mentioned in the paper.

Following https://github.com/kipgparker/soft-prompt-tuning.

Outlined code snippet example, and tracebacks for varying number of softprompt tokens. Will come back to this, but let me know if you have any suggestions for modifying the orchestrator.

Using the ppo_sentiments example and soft prompt implementation:

if __name__ == "__main__":
    cfg = TRLConfig.load_yaml("configs/ppo_config.yml")

    sentiment_pipe = pipeline(
        "sentiment-analysis", "lvwerra/distilbert-imdb", device=-1
    )

    def reward_fn(samples: List[str]):
        sent_kwargs = {
            "return_all_scores": True,
            "function_to_apply": None,
            "batch_size": cfg.method.chunk_size,
        }
        pipe_outputs = sentiment_pipe(samples, **sent_kwargs)
        scores = torch.tensor([output[1]["score"] for output in pipe_outputs])
        return scores

    model: AcceleratePPOModel = get_model(cfg.model.model_type)(cfg)

    # setup soft prompt embeddings with 'n' prefix tokens, init from model vocab
    n_tokens = 1
    initialize_from_vocab = True
    
    s_wte = SoftEmbedding(model.model.gpt.get_input_embeddings(), 
                      n_tokens=n_tokens, 
                      initialize_from_vocab=initialize_from_vocab)

    model.model.gpt.set_input_embeddings(s_wte)
    
    pipeline: PPOPipeline = get_pipeline(cfg.train.pipeline)(model.tokenizer, cfg)
    orch: PPOOrchestrator = get_orchestrator(cfg.train.orchestrator)(
        model, pipeline, reward_fn=reward_fn, chunk_size=cfg.method.chunk_size
    )
    orch.make_experience(cfg.method.num_rollouts)
    model.learn()

    print("DONE!")

When n_tokens = 1, the following error occurs:

Traceback (most recent call last):
  File "/home/aleph/adai/trlx/examples/ppo_sentiments.py", line 101, in <module>
    orch.make_experience(cfg.method.num_rollouts)
  File "/home/aleph/adai/trlx/trlx/orchestrator/ppo_orchestrator.py", line 69, in make_experience
    logits, _, v = self.rl_model.model(all_tokens)
ValueError: not enough values to unpack (expected 3, got 2)

For n_tokens = 20:

Traceback (most recent call last):
  File "/home/aleph/adai/trlx/examples/ppo_sentiments.py", line 101, in <module>
    orch.make_experience(cfg.method.num_rollouts)
  File "/home/aleph/adai/trlx/trlx/orchestrator/ppo_orchestrator.py", line 62, in make_experience
    query_tensors, response_tensors, response_text = self.rl_model.act(batch)
  File "/home/aleph/adai/trlx/trlx/model/accelerate_base_model.py", line 104, in act
    _ = self.model(
  File "/home/aleph/anaconda3/envs/trlx/lib/python3.8/site-packages/torch-1.12.1-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aleph/adai/trlx/trlx/model/nn/ppo_models.py", line 76, in forward
    transformer_outputs = self.gpt.transformer(
  File "/home/aleph/anaconda3/envs/trlx/lib/python3.8/site-packages/torch-1.12.1-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aleph/anaconda3/envs/trlx/lib/python3.8/site-packages/transformers-4.22.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 851, in forward
    hidden_states = inputs_embeds + position_embeds
RuntimeError: The size of tensor a (20) must match the size of tensor b (4) at non-singleton dimension 1

Deadlock (nothing happening) in multi-GPU setting

Hey all, I am using this colab notebook as a reference (that I found in the discord server) to train examples/ppo_sentiments.py using HF Accelerate in a GCP VM with 2 K80s.

This is my accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: '[all]'
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false

Unfortunately, nothing happens after this point, as shown in the image:

To avoid getting this warning, huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks..., I set TOKENIZERS_PARALLELISM=false.

I am not an expert in using Accelerate; any help would be appreciated. Also, for context, I am trying to run this example to build W&B sweeps (hyperparameter optimization), mentioned in #12.

cc: @LouisCastricato

Getting Started As a Domain Expert

🚀 The feature, motivation, and pitch

Getting Started Guide for Domain Experts

A step-by-step cookbook that enables people with domain knowledge to rapidly begin contributing (and receiving) value to the open-source RLHF project.

Motivation: I am able to install tlrx and am confident that I will be able to get component scripts running. However, I do not have a clear path to doing something that is quickly useful to me and to others.

Goal:

I would like to be able to pick a particular domain that I am expert in--for example, book publishing, military history, climate change--and provide "human feedback" that visibly makes results better for a) me on a toy setup b) everyoe when deployed

Suggested approach:

Step-by-step guide with sample artifacts.

Alternatives

I might have taken this to Discord, but I hate Discord. It is noisy and chaotic. Github is much better suited for domain expertise projects.

Additional context

Glad to help.

Stale configs for `ppo_gptj.yml`

🐛 Describe the bug

The ppo_gptj.yml config is currently out of date from recent updates.

n_ctx, grad_clip, log_interval, input_size, gen_size, accelerate, accelerate_config_path are not fields of TrainConfig and should be removed.
device is not a field of ModelConfig and should be removed.

Do we want this config or should it be removed as it's mostly a duplicate of ppo_config.yml here with a model path change.

Which trlX version are you using?

trlx==0.2.0

Additional system and package information

No response

Autoformat/Code Style

Lets use black, unless someone knows of a better alternative.
How should we enforce autoformatting? I'm leaning towards a pre-commit hook.

We should do this ASAP otherwise the pre-formatting and post-formatting diffs will be bigger

randomwalks.py doesn't work on v0.3 or current main

🐛 Describe the bug

On both v0.3 and https://github.com/CarperAI/trlx/commit/ff0d0776ce9189c7e0ebc954dd14bbca1136a450, following the instructions from README.md and running

wandb disable && python examples/randomwalks.py

produces the following error:

Traceback (most recent call last):
  File "/home/dpaleka/code/trlx/examples/randomwalks.py", line 103, in <module>
    trlx.train(
  File "/home/dpaleka/code/trlx/trlx/trlx.py", line 95, in train
    model.learn()
  File "/home/dpaleka/code/trlx/trlx/model/accelerate_base_model.py", line 240, in learn
    results = self.evaluate()
  File "/home/dpaleka/code/trlx/trlx/model/accelerate_base_model.py", line 160, in evaluate
    samples = self.generate(prompts)
  File "/home/dpaleka/code/trlx/trlx/model/accelerate_base_model.py", line 133, in generate
    return self.accelerator.unwrap_model(self.model).generate(
  File "/home/dpaleka/code/trlx/trlx/model/nn/ilql_models.py", line 306, in generate
    logits[torch.where(logit_mask[input_ids[:, -1].squeeze()])] = -np.inf
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Which trlX version are you using?

trlx==0.3

Additional system and package information

No response

Add flake8 and isort to pre-commit

This needs the current repo to be flake8 passing, which it doesn't

Amos optimizer support

🚀 The feature, motivation, and pitch

https://arxiv.org/abs/2210.11693

Amos reports better scaling (for multi accelerator) and better performance when compared to AdamW for autoregressive and masked language modeling. We should apply it to trlX and see if it helps speed up RLHF.

Alternatives

We could alternatively just stay with AdamW, which is very tried and tested.

Additional context

We need to seriously consider the wall time constraints of Amos and if it creates any serious optimization bottlenecks for us.

NeMo-Megatron Integration

🚀 The feature, motivation, and pitch

We should try out github.com/nVIDIA/neMo, using NeMo-Megatron. This would involve

Make pytorch-lightning based trainer
Make NeMoILQLModel - an implementation of ILQL training using NeMo on top of a base NeMo GPT
Make NeMoPPOModel
Figure out how to run PyTorch-Lightning on the cluster (which nemo uses)
Benchmark

Given that nemo works with vanilla pytorch modules, we should be able to reuse a lot of code from the current accelerate implementation

randomwalks.py with PPO

🚀 The feature, motivation, and pitch

There should be a working example of randomwalks.py (random walks shortest path from the Decision Transformer paper) that uses the PPO orchestrator.

Alternatives

No response

Additional context

No response

Implement A2C

Implement additional online RL algorithms

NeoX support

We need support for NeoX, EleutherAI's fork of Megatron-Deepspeed. This is already something in active development, just making a GitHub issue for tracking.

How to attribute reward to multiple model runs in the same trajectory with PPO

I want to finetune a base model M to maximize a reward R, when the model is used inside of a more complex system.
Take a simple example of the setting. The trajectory is as follows: sample prompt_1 from a dataset of prompts, then

prompt1 -> M(prompt1)  = out_1
out_1 -> F(out_1) = prompt_2
prompt_2 -> M(prompt_2) = out_2
out_2 -> R(out_2) = reward

where F : str -> str and R : str -> int are some methods defined in my code.
Is there a way to do this in the current TRLX framework, preferably online with PPO?
Alternative suggestions are welcome.

ReadTheDocs

We need documentation. WIP.

Hyper parameter sweeping

We need a way to sweep over a set of hyper parameters. Maybe something like wandb sweep? @reciprocated mentioned this is very important; however, @ShivanshuPurohit has mentioned wandb sweep does not work with NeoX, so I am hesitant.

Ratio != 1 at start of PPO training (during loss function calculation)

🐛 Describe the bug

I've run ppo_sentiments.py, and an older version, and seeing that ratio is != 1 at step 0 (before optimizer step), at this line:
https://github.com/CarperAI/trlx/blob/main/trlx/model/nn/ppo_models.py#L165

https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ - reference regarding ratio = 1 at first epoch/mini-batch update:

Check if ratio=1: Check if the ratio are always 1s during the first epoch and first mini-batch update, when new and old policies are the same and therefore the ratio are 1s and has nothing to clip. If ratio are not 1s, it means there is a bug and the program has not reconstructed the probability distributions used in rollouts.

When making experience, the ratio you'd get here at the start of training (before optimization) is 1: https://github.com/CarperAI/trlx/blob/main/trlx/orchestrator/ppo_orchestrator.py#L130

Seems like a currently unknown cause/bug, which leads to unexpected ratio values.

Which trlX version are you using?

0.3.0

Additional system and package information

Python 3.8

Python 3.7 support (for Colab)

🐛 Describe the bug

In Colab running !pip install git+https://github.com/CarperAI/trlx resulted in

ERROR: No matching distribution found for numpy>=1.23.2

Numpy did drop support in recent releases for Python 3.7 which is what Google Colab is using right now. One might consider downgrade numpy. Per discussions in Discord I put it here so that we don't forget.

Which trlX version are you using?

No response

Additional system and package information

numpy==1.23.2

Create a unified API between PPO and ILQL

Currently ILQL was implemented more or less in a vacuum of our PPO implementation. As such, ILQL has features that our PPO implementation needs. This includes

Removing the dependency on GPT2 for PPO. This is currently being done here, but we can create a more standardized approach
Adding the ability to pass custom reward functions to the orchestrator. This is an absolute must have.
Standardized naming conventions between functions.

Installation error due to multiple top-level packages

Issue

Installing trlx in a fresh environment, following the README.md guide (python setup.py develop), results in the following package error:

error: Multiple top-level packages discovered in a flat-layout: ['trlx', 'configs', 'unittests'].

To avoid accidental inclusion of unwanted files or directories,
setuptools will not proceed with this build.

If you are trying to create a single distribution with multiple packages
on purpose, you should not rely on automatic discovery.
Instead, consider the following options:

1. set up custom discovery (`find` directive with `include` or `exclude`)
2. use a `src-layout`
3. explicitly set `py_modules` or `packages` with a list of names

To find more information, look for "package discovery" on setuptools docs.

Environment

Python version : 3.10.6

Compiler    : Clang 13.1.6 (clang-1316.0.21.2.5)
OS          : Darwin
Release     : 21.5.0
Machine     : arm64
Processor   : arm

How to attribute different rewards to parts of the same rollout with PPO?

This is related to #69 (which is why I phrased it in a similar way), but still feels a bit different.

Let's say the model generates a sequence of three related sentences (or paragraphs or tokens) after being prompted (i.e. the rollout). Is there a way to assign them different rewards individually instead of just one single aggregate reward, say based on different criteria? Perhaps I have a constant mass of reward I want to differentially assign to the several parts, but the sum is always constant. In the limit of generality, this would mean being able to assign specific reward values for each individual token/action in the rollout/trajectory. In this use case, the individual rewards can only be computed after the whole sequence of parts has been generated (i.e. you can't reward step 1 before generating step 3).

Is this possible with trlx? Would it require a custom orchestrator or is there a way to specify individual token rewards right away while keeping the standard structure? Is this even possible with PPO in the first place, or is there a fundamental misunderstanding on my part?

Thanks for building this!

Best practices on repeatedly generating experience and training on it?

All the current (online RL) trlx examples seem to only involve generating experience once and then using the resulting rollout store to update the weights (for potentially more epochs). How should one go about incrementally generating experience, training on it, generating XP w/ updated model, training again...? I thought of just calling orch.make_experience() and model.learn() in a loop multiple times, but that sounds pretty dumb. Is there a better way?

Loosely related, from "Fine-Tuning Language Models from Human Preferences":

If the trained policy π is very different from the zero-shot
policy ρ, the reward model will suffer a large distributional
shift from training on samples from ρ to evaluation on sam-
ples from π. To prevent this, we can collect human data
throughout RL fine-tuning, continuously gathering new data
by sampling from π and retraining the reward model. As
section 3 shows, online data collection was important for
summarization but not for the simpler style tasks.

Question can also be thought of as the one cycle in the computational-ish graph in the section at the bottom from trl:

Example/Test Model Benchmarks (Canonical WandB runs)

🚀 The feature, motivation, and pitch

If we had links to benchmarks for the example (and/or test) models, it would be easier to add new models, and keep track of improvements in method implementations. Additionally, during refactoring, it would allow checking that no performance degrading changes were introduced.

This can be a minimal version of #13

Alternatives

No response

Additional context

No response

carperai / trlx Goto Github PK

trlx's People

Contributors

Stargazers

Watchers

Forkers

trlx's Issues

🚀 The feature, motivation, and pitch

Alternatives

Additional context

🐛 Describe the bug

Which trlX version are you using?

Additional system and package information

🚀 The feature, motivation, and pitch

Alternatives

Additional context

🐛 Describe the bug

Which trlX version are you using?

Additional system and package information

🐛 Describe the bug

Which trlX version are you using?

Additional system and package information

🐛 Describe the bug

Which trlX version are you using?

Additional system and package information

📚 The doc issue

Suggest a potential alternative/fix

📚 The doc issue

Suggest a potential alternative/fix

🚀 The feature, motivation, and pitch

Alternatives

Additional context

🐛 Describe the bug

Which trlX version are you using?

Additional system and package information

🐛 Describe the bug

Which trlX version are you using?

Additional system and package information

🚀 The feature, motivation, and pitch

Alternatives

Additional context

🚀 The feature, motivation, and pitch

Alternatives

Additional context

🐛 Describe the bug

Which trlX version are you using?

Additional system and package information

🐛 Describe the bug

Which trlX version are you using?

Additional system and package information

🚀 The feature, motivation, and pitch

Alternatives

Additional context

🚀 The feature, motivation, and pitch

🚀 The feature, motivation, and pitch

Alternatives

Additional context

🐛 Describe the bug

Which trlX version are you using?

Additional system and package information

🐛 Describe the bug

Which trlX version are you using?

Additional system and package information

Issue

Environment

🚀 The feature, motivation, and pitch

Alternatives

Additional context

Recommend Projects

Recommend Topics

Recommend Org