This repository contains the GSM8K-AI-SubQ dataset, the scripts used to collect it, and scripts for the baselines.
The dataset was created to support research on distilling LLMs' reasoning abilities, in particular their ability to split problems into simpler sub-problems. We employed ChatGPT to generate the dataset. It is based on the GSM8K dataset and includes examples of ChatGPT's problem decompositions together with its own feedback on the generated sub-questions. Our data also includes ChatGPT's answers to the sub-questions, although we did not run experiments on this part of the reasoning. We hope that the dataset will support further advances in offline RL algorithms for reasoning.
For more details, see our paper "Distilling LLMs' Decomposition Abilities into Compact Language Models".
Each directory contains a README.md with relevant instructions and comments. All requirements can be installed with:
    python3 -m pip install -r requirements.txt
- `baselines` contains the scripts for the baseline algorithms: Behavioral Cloning (BC), Filtered BC, and ILQL.
- `data_generation_and_evaluation` contains the scripts and data used to generate the dataset, as well as scripts for evaluating the results.
- `dataset` contains the GSM8K-AI-SubQ dataset (see the loading sketch after this list).
- `eval_responses` contains test-set sub-questions generated with the different baselines and the answers of different language models to these sub-questions.
- `results_processing` contains scripts for processing the results.
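For a quick look at the data, a minimal sketch along the following lines can be used. Note that the file name (`train.json`) and the record layout are assumptions made for illustration only; the actual format is documented in the `dataset` directory's README.md.

```python
import json
from pathlib import Path

# NOTE: "train.json" is a hypothetical file name used for illustration;
# check the dataset directory for the actual file names and format.
dataset_path = Path("dataset") / "train.json"

with open(dataset_path) as f:
    examples = json.load(f)

# Print the first record to inspect the actual schema
# (problem text, generated sub-questions, feedback, answers).
print(json.dumps(examples[0], indent=2))
```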
Algorithm | DistillGPT | GPT-2 small | GPT-2 medium | Average |
---|---|---|---|---|
BC | 0.476 | 0.508 | 0.538 | 0.507 |
Filtered BC | 0.493 | 0.527 | 0.576 | 0.532 |
ILQL-sparse | 0.474 | 0.513 | 0.531 | 0.506 |
ILQL-full | 0.482 | 0.505 | 0.533 | 0.507 |
ChatGPT | - | - | - | 0.682 |
Algorithm | DistillGPT | GPT-2 small | GPT-2 medium | Average |
---|---|---|---|---|
BC | 0.118 | 0.154 | 0.164 | 0.145 |
Filtered BC | 0.125 | 0.159 | 0.162 | 0.149 |
ILQL-sparse | 0.122 | 0.141 | 0.164 | 0.142 |
ILQL-full | 0.123 | 0.147 | 0.163 | 0.144 |
ChatGPT | - | - | - | 0.234 |
Algorithm | DistillGPT | GPT-2 small | GPT-2 medium | Average |
---|---|---|---|---|
BC | 0.184 | 0.212 | 0.247 | 0.214 |
Filtered BC | 0.194 | 0.230 | 0.245 | 0.223 |
ILQL-sparse | 0.178 | 0.204 | 0.247 | 0.210 |
ILQL-full | 0.183 | 0.205 | 0.247 | 0.212 |
ChatGPT | - | - | - | 0.353 |
Algorithm | DistillGPT | GPT-2 small | GPT-2 medium | Average |
---|---|---|---|---|
BC | 0.240 | 0.264 | 0.290 | 0.265 |
Filtered BC | 0.228 | 0.256 | 0.293 | 0.259 |
ILQL-sparse | 0.223 | 0.253 | 0.288 | 0.255 |
ILQL-full | 0.235 | 0.252 | 0.282 | 0.256 |
ChatGPT | - | - | - | 0.446 |
Algorithm | DistillGPT | GPT-2 small | GPT-2 medium | Average |
---|---|---|---|---|
BC | 0.255 | 0.284 | 0.310 | 0.283 |
Filtered BC | 0.260 | 0.293 | 0.319 | 0.291 |
ILQL-sparse | 0.249 | 0.278 | 0.308 | 0.278 |
ILQL-full | 0.256 | 0.277 | 0.306 | 0.280 |
ChatGPT | - | - | - | 0.429 |
If you use our work in your research, please cite it with the following BibTeX entry:
    @article{tarasov2024distilling,
      title={Distilling LLMs' Decomposition Abilities into Compact Language Models},
      author={Tarasov, Denis and Shridhar, Kumar},
      journal={arXiv preprint arXiv:2402.01812},
      year={2024}
    }