Self-Explore

Self-Explore to avoid the pit!
Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards


This is the official GitHub repository for Self-Explore.

Paper Link: https://arxiv.org/abs/2404.10346

Overview:

Overview Image

Setting

Run pip install -r requirements.txt
All experiments were carried out using 4 x NVIDIA A100 80GB, with CUDA version 12.0.

Data

In the data directory, you will find the train and test files for GSM8K and MATH.

Training

Stage 1. Run SFT:

Run SFT (or FT, for short) to get the base generator.
The script for this is scripts/{task}/sft/run_ft.sh. (For data_path, use the train file.)
Fill in the paths to the files and models, then run sh scripts/{task}/sft/run_ft.sh from the main directory.

Stage 2. Get RFT Data:

Next, generate N instances per problem with the base generator.
To do this, go to the gen directory and run sh gen_rft_data.sh.
The script assumes you are using 4 GPUs and generates the predictions in parallel across them.
Once it completes, you will see the RFT and DPO training files.
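The data flow of this stage can be sketched roughly as follows. This is a minimal illustration and not the code in gen_rft_data.sh: it assumes that the RFT file keeps the distinct correct generations and that the DPO file pairs a correct generation with an incorrect one for the same problem. The function names and the record fields (prompt, completion, chosen, rejected) are hypothetical.

```python
from typing import Dict, List, Tuple


def shard(items: List, num_shards: int) -> List[List]:
    """Round-robin split so that each GPU worker gets a near-equal share."""
    return [items[i::num_shards] for i in range(num_shards)]


def build_rft_and_dpo(samples: Dict[str, List[Tuple[str, bool]]]):
    """Turn N labelled generations per problem into RFT and DPO records.

    `samples` maps a problem to a list of (solution, is_correct) pairs.
    The record layout is an assumption made for illustration only.
    """
    rft, dpo = [], []
    for problem, gens in samples.items():
        # dict.fromkeys removes duplicates while keeping generation order.
        correct = list(dict.fromkeys(s for s, ok in gens if ok))
        wrong = list(dict.fromkeys(s for s, ok in gens if not ok))
        rft.extend({"prompt": problem, "completion": s} for s in correct)
        # Pair correct and incorrect solutions one-to-one for preference data.
        for chosen, rejected in zip(correct, wrong):
            dpo.append({"prompt": problem, "chosen": chosen, "rejected": rejected})
    return rft, dpo
```

With 4 GPUs, shard(problems, 4) would give each worker its own slice of the problem list; the per-worker outputs can then be merged before calling build_rft_and_dpo.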

Stage 3. Run RFT:

Run RFT to get the RFT model, which acts as our explorer and as the reference model when training with DPO.
The script for this is scripts/{task}/sft/run_rft.sh.
Fill in the paths to the files and models, then run sh scripts/{task}/sft/run_rft.sh from the main directory.

Stage 4. 🔎 Explore:

To find the first pit, let the RFT model explore from each step within each rejected sample.
You can do this by running gen_step_explore.sh in the gen directory. (For data_path here, use the DPO file generated in Stage 2.)
You will then get a file whose name ends in gpair_{k}.jsonl, which is your fine-grained pairwise training data.
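The exploration loop can be pictured with the sketch below. It is not the logic of gen_step_explore.sh; it assumes that the "first pit" is the earliest step of a rejected sample from which none of k sampled continuations reaches the correct answer. The explore and is_correct callables, the newline step separator, and the default k=4 are hypothetical stand-ins for the RFT model's sampling and the answer checker.

```python
from typing import Callable, List, Optional


def find_first_pit(
    question: str,
    steps: List[str],
    explore: Callable[[str, str, int], List[str]],
    is_correct: Callable[[str], bool],
    k: int = 4,
) -> Optional[int]:
    """Return the index of the first step that no continuation can recover from.

    explore(question, prefix, k) should return k sampled continuations of the
    partial solution `prefix`; is_correct(solution) checks the final answer.
    """
    for i in range(len(steps)):
        # Partial solution up to and including step i of the rejected sample.
        prefix = "\n".join(steps[: i + 1])
        continuations = explore(question, prefix, k)
        # If every continuation fails, step i is treated as the first pit.
        if not any(is_correct(prefix + "\n" + c) for c in continuations):
            return i
    return None
```

A fine-grained pair could then contrast the pit step with a correct continuation sampled from the steps before it, which is the role the gpair file plays in the next stage.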

Stage 5. Train with Preference Learning Objective:

You can apply any preference learning objective; in our work, we chose DPO (Direct Preference Optimization).
To do this, refer to scripts/{task}/dpo/run_dpo.sh.

  • To run with the outcome-supervision labels, set the training data as the DPO file generated in Stage 2.
  • To run with the step-level fine-grained labels (ours), set the training data as the gpair file generated in Stage 4.
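For reference, the standard DPO objective for a single preference pair can be written in a few lines. This is a minimal sketch of the published DPO loss, not the code in run_dpo.sh; the inputs are summed token log-probabilities of the chosen and rejected responses under the policy and under the reference (RFT) model, and beta=0.1 is an assumed default rather than the value used in the experiments.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin)."""
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The same function applies to both settings above; only the pairs change. With outcome-supervision labels the chosen and rejected items are full solutions, while with the gpair file they are the step-level continuations.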

Evaluation

Under the eval/{task} directory, you'll find the script needed for running evaluation.
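As a rough picture of what an accuracy number means for a numeric-answer task such as GSM8K, the sketch below scores predictions by exact match on the final number. It is not the repository's evaluation script: taking the last number in the generated text as the answer is an assumption, and the helper names are hypothetical. MATH answers are often non-numeric and would need a different comparison.

```python
import re
from typing import List, Optional

# Integers or decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last number mentioned in `text`, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def accuracy(predictions: List[str], gold_answers: List[str]) -> float:
    """Percentage of predictions whose final number equals the gold answer."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must have the same length")
    hits = 0
    for pred, gold in zip(predictions, gold_answers):
        value = last_number(pred)
        if value is not None and value == float(gold.replace(",", "")):
            hits += 1
    return 100.0 * hits / len(predictions)
```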

Results

Result Image

Models

We release our best DeepSeek-Math checkpoints, trained on GSM8K and MATH, on Hugging Face.

Model                              Accuracy   Download
DeepSeek_Math_Self_Explore_GSM8K   78.62      🤗 HuggingFace
DeepSeek_Math_Self_Explore_MATH    37.68      🤗 HuggingFace

Acknowledgements

Our evaluation code is borrowed from:

Citation

@misc{hwang2024selfexplore,
      title={Self-Explore to Avoid the Pit: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards}, 
      author={Hyeonbin Hwang and Doyoung Kim and Seungone Kim and Seonghyeon Ye and Minjoon Seo},
      year={2024},
      eprint={2404.10346},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
