apartresearch / interpreting-reward-models

Interpreting implicit reward models learnt in RLHF using sparse autoencoders.

License: MIT License

Python 14.00% Shell 1.06% Jupyter Notebook 84.94%

interpreting-reward-models's Issues

More reward models: Create a contrastive pairs dataset for helpful / harmless

We want to create a contrastive pairs dataset derived from helpful / harmless, where we flip a token to get "helpful" vs "non helpful" contrastive pairs.
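A minimal sketch of the token-flipping idea, assuming a small hand-written swap table. The table contents and the function name `make_contrastive_pair` are hypothetical and not taken from the repository; the real dataset may choose tokens differently.

```python
import re

# Hypothetical swap table: maps a token to its "flipped" counterpart.
FLIP_TABLE = {"helpful": "unhelpful", "safe": "unsafe", "polite": "rude"}


def make_contrastive_pair(text, flip_table=FLIP_TABLE):
    """Return (original, flipped) with the first matching token swapped.

    Returns None when the text contains no token from the swap table.
    """
    for match in re.finditer(r"\w+", text):
        flipped = flip_table.get(match.group(0).lower())
        if flipped is not None:
            # Replace only this one token and leave the rest untouched.
            return text, text[:match.start()] + flipped + text[match.end():]
    return None


pair = make_contrastive_pair("This answer is helpful.")
```

Flipping a single token keeps the two texts identical apart from that token, so any difference in reward can be attributed to the swap.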

More reward models: Train a DPO model and autoencoders for hh-rlhf

For now, do this only for gpt-neo-125m to keep it fast.

Make the pipeline configurable while doing this, so we can quickly run it for a range of other models.
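One way to make the pipeline configurable is a single config object that every stage reads from. This is only a sketch: the class `PipelineConfig`, all of its fields, and the default values are assumptions for illustration, not the repository's actual settings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    # All field names and defaults below are hypothetical.
    model_name: str = "EleutherAI/gpt-neo-125m"
    dataset_name: str = "Anthropic/hh-rlhf"
    dpo_beta: float = 0.1
    autoencoder_layers: tuple = (1, 2, 3)
    output_dir: str = "runs"

    def run_name(self) -> str:
        """Build a short identifier from the model and dataset names."""
        model = self.model_name.split("/")[-1]
        data = self.dataset_name.split("/")[-1]
        return f"{model}__{data}"


def configs_for_models(base, model_names):
    """Copy the base config once per model, changing only the model name."""
    return [replace(base, model_name=name) for name in model_names]


base = PipelineConfig()
sweep = configs_for_models(base, ["EleutherAI/pythia-70m", "EleutherAI/pythia-160m"])
```

Because the config is frozen, each run gets its own immutable copy, and adding another model is a one-line change to the list.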

Enhance: Support training on the difference of activations between base and reward models.

More data: Create two datasets optimized for Vader, and upload to datasets

The current IMDB dataset contains very few tokens from the VADER lexicon. To address this, let's create two new datasets that have high overlap with the VADER lexicon.

A simple version that draws sentences from OpenWebText that have high overlap with the VADER lexicon.
A "poisoned" version that flips the reward of 30 of the VADER tokens. This gives us a baseline for checking whether our IRMs can recover these tokens.

The columns of the dataset will be text, lexicon_tokens, token_rewards_dict, and poisoned, which is a (usually empty) list of tokens. There will be 30 poisoned tokens in total.

The VADER lexicon tokens will be ordered by their frequency in English, and the top 4000 will be picked, with 5 occurrences each.
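The selection and poisoning steps described above can be sketched as follows. This is a toy illustration under stated assumptions: "flipping" a reward is taken to mean negating its sign, the lexicon scores and frequency counts are made-up values rather than real VADER entries, and all function names are hypothetical. The 5-occurrences-per-token sampling is left out.

```python
import random


def select_lexicon(lexicon, frequency, top_n):
    """Keep the top_n lexicon tokens, ordered by descending frequency."""
    ranked = sorted(lexicon, key=lambda tok: frequency.get(tok, 0), reverse=True)
    return {tok: lexicon[tok] for tok in ranked[:top_n]}


def poison_lexicon(lexicon, n_poisoned, seed=0):
    """Negate the reward of n_poisoned randomly chosen tokens (assumed meaning of 'flip')."""
    rng = random.Random(seed)
    poisoned = set(rng.sample(sorted(lexicon), n_poisoned))
    rewards = {tok: (-r if tok in poisoned else r) for tok, r in lexicon.items()}
    return rewards, poisoned


def build_row(text, rewards, poisoned):
    """Build one row with the four columns: text, lexicon_tokens, token_rewards_dict, poisoned."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    present = [w for w in words if w in rewards]
    return {
        "text": text,
        "lexicon_tokens": present,
        "token_rewards_dict": {t: rewards[t] for t in present},
        "poisoned": [t for t in present if t in poisoned],
    }


# Toy values for illustration only; not real VADER scores or word frequencies.
toy_lexicon = {"good": 1.9, "bad": -2.5, "great": 3.1, "dire": -2.0}
toy_frequency = {"good": 100, "bad": 80, "great": 60, "dire": 1}

selected = select_lexicon(toy_lexicon, toy_frequency, top_n=3)
rewards, poisoned = poison_lexicon(selected, n_poisoned=1)
row = build_row("A good day, a bad night.", rewards, poisoned)
```

In the real datasets the same steps would use the top 4000 tokens and 30 poisoned tokens; the `poisoned` column stays empty for any row that contains none of them.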

More reward models: Create contrastive pairs dataset for unalignment dataset.

Enhance: Integrate with SAELens for the sparse autoencoder training.

More reward models: Train a DPO model and autoencoders for unalignment dataset.
