This repository contains the GSM8K-AI-SubQ dataset, the scripts used to collect it, and scripts for the baselines.
The dataset was created to support research on distilling LLMs' reasoning abilities, in particular their ability to split problems into simpler sub-problems. We employed ChatGPT to generate the dataset. It is based on the GSM8K dataset and includes examples of ChatGPT's problem decompositions together with its own feedback on the generated sub-questions. Our data also includes ChatGPT's answers to the sub-questions, although we did not run experiments on this part of the reasoning. We hope that the dataset will support further advances in offline RL algorithms for reasoning.
For more details, see our paper "Distilling LLMs' Decomposition Abilities into Compact Language Models".
Each directory contains a README.md with relevant instructions and comments. All requirements can be installed with:
    python3 -m pip install -r requirements.txt
- `baselines` contains the scripts for the baseline algorithms: Behavioral Cloning (BC), Filtered BC, and ILQL.
- `data_generation_and_evaluation` contains the scripts and data used to generate the dataset, as well as scripts for evaluating the results.
- `dataset` contains the GSM8K-AI-SubQ dataset (see the loading sketch after this list).
- `eval_responses` contains test-set sub-questions generated with the different baselines and the answers of different language models to these sub-questions.
- `results_processing` contains scripts for processing the results.
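For a quick look at the data, a minimal sketch along the following lines can be used. Note that the file name (`train.json`) and the record layout are assumptions made for illustration only; the actual format is documented in the `dataset` directory's README.md.

```python
import json
from pathlib import Path

# NOTE: "train.json" is a hypothetical file name used for illustration;
# check the dataset directory for the actual file names and format.
dataset_path = Path("dataset") / "train.json"

with open(dataset_path) as f:
    examples = json.load(f)

# Print the first record to inspect the actual schema
# (problem text, generated sub-questions, feedback, answers).
print(json.dumps(examples[0], indent=2))
```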
Algorithm | DistillGPT | GPT-2 small | GPT-2 medium | Average |
---|---|---|---|---|
BC | 0.476 | 0.508 | 0.538 | 0.507 |
Filtered BC | 0.493 | 0.527 | 0.576 | 0.532 |
ILQL-sparse | 0.474 | 0.513 | 0.531 | 0.506 |
ILQL-full | 0.482 | 0.505 | 0.533 | 0.507 |
ChatGPT | - | - | - | 0.682 |
Algorithm | DistillGPT | GPT-2 small | GPT-2 medium | Average |
---|---|---|---|---|
BC | 0.118 | 0.154 | 0.164 | 0.145 |
Filtered BC | 0.125 | 0.159 | 0.162 | 0.149 |
ILQL-sparse | 0.122 | 0.141 | 0.164 | 0.142 |
ILQL-full | 0.123 | 0.147 | 0.163 | 0.144 |
ChatGPT | - | - | - | 0.234 |
Algorithm | DistillGPT | GPT-2 small | GPT-2 medium | Average |
---|---|---|---|---|
BC | 0.184 | 0.212 | 0.247 | 0.214 |
Filtered BC | 0.194 | 0.230 | 0.245 | 0.223 |
ILQL-sparse | 0.178 | 0.204 | 0.247 | 0.210 |
ILQL-full | 0.183 | 0.205 | 0.247 | 0.212 |
ChatGPT | - | - | - | 0.353 |
Algorithm | DistillGPT | GPT-2 small | GPT-2 medium | Average |
---|---|---|---|---|
BC | 0.240 | 0.264 | 0.290 | 0.265 |
Filtered BC | 0.228 | 0.256 | 0.293 | 0.259 |
ILQL-sparse | 0.223 | 0.253 | 0.288 | 0.255 |
ILQL-full | 0.235 | 0.252 | 0.282 | 0.256 |
ChatGPT | - | - | - | 0.446 |
Algorithm | DistillGPT | GPT-2 small | GPT-2 medium | Average |
---|---|---|---|---|
BC | 0.255 | 0.284 | 0.310 | 0.283 |
Filtered BC | 0.260 | 0.293 | 0.319 | 0.291 |
ILQL-sparse | 0.249 | 0.278 | 0.308 | 0.278 |
ILQL-full | 0.256 | 0.277 | 0.306 | 0.280 |
ChatGPT | - | - | - | 0.429 |
If you use our work in your research, please cite it with the following BibTeX entry:
    @article{tarasov2024distilling,
      title={Distilling LLMs' Decomposition Abilities into Compact Language Models},
      author={Tarasov, Denis and Shridhar, Kumar},
      journal={arXiv preprint arXiv:2402.01812},
      year={2024}
    }