
Rethinking Optimization and Architecture for Tiny Language Models

This is the official implementation of Rethinking Optimization and Architecture for Tiny Language Models, an empirical investigation into how to construct powerful tiny language models.

Four strategies are proposed to improve performance:

  • 🎯 Compact Tokenizer: efficient coverage of the corpus;
  • 🔍 Architecture Tweak: better depth and width tradeoffs;
  • 🎁 Parameter Inheritance: powerful knowledge from larger LLMs;
  • 🔥 Multiple-Round Training: memory reinforcement of tiny models.

Based on the observations above, PanGu-π-1B Pro and PanGu-π-1.5B Pro are trained on 1.6T tokens of multilingual corpora. Model configurations and benchmark results are reported in the paper.

Training

This repository is modified from the InternEvo training framework.

Here are the steps to set up the code:

  1. Clone the InternEvo repository and configure the runtime environment.
  2. Copy the configuration file configs/LLM1B.py to the InternEvo/configs/ directory.
  3. Copy the start script src/start_finetune.py to the InternEvo root directory.

You can follow the InternEvo usage guide to prepare pretraining data and train models (https://github.com/InternLM/InternEvo/blob/develop/doc/en/usage.md).

The model's depth, width, and expanding rate can be easily adjusted in the config, as sketched below.
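As an illustration, an InternEvo-style config exposes these hyperparameters as plain Python variables. The field names below follow common InternEvo configs and may not match configs/LLM1B.py exactly; all values are placeholders.

```python
# Illustrative excerpt of an InternEvo-style model config
# (field names and values are examples; adjust to match configs/LLM1B.py).
HIDDEN_SIZE = 2048          # width of the transformer
NUM_LAYER = 22              # depth of the transformer
MLP_RATIO = 8 / 3           # expanding rate of the FFN hidden dimension
NUM_ATTENTION_HEAD = 16

model = dict(
    hidden_size=HIDDEN_SIZE,
    num_layers=NUM_LAYER,
    mlp_ratio=MLP_RATIO,
    num_attention_heads=NUM_ATTENTION_HEAD,
    vocab_size=48000,       # size of the compact tokenizer (example value)
)
```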

Compact Tokenizer

The compact tokenizer is constructed by removing low-frequency tokens from the vocabulary. To prune the tokenizer, follow these steps (a sketch of the idea follows the list):

  1. Count the frequency of tokens cached by the original large tokenizer.
    • python src/step1_token_frequency_stat.py --src cached_data_dir --dst tmp_stat_files_dir. The script counts the frequency of all tokens in the cached_data_dir folder and generates a corresponding JSON file in the tmp_stat_files_dir folder.
    • python src/step2_token_frequency_stat_combie.py --src tmp_stat_files_dir --dst total_token_freq.json. This combines all JSON files in the tmp_stat_files_dir folder and writes the token frequencies to total_token_freq.json.
  2. First add the special tokens, then add the tokens with the highest frequency to the new tokenizer.
    • python src/step3_generate_new_tokenizer.py --origin_tokenizer_dir origin_tokenizer --vocab_num compact_tokenizer_size --output new_tokenizer_dir --token_freq_file total_token_freq.json. This script generates a new tokenizer in the new_tokenizer_dir folder with compact_tokenizer_size tokens.
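For reference, the overall pruning idea behind the three scripts can be sketched as follows. This is a self-contained illustration, not the repository's actual implementation; file formats and special-token handling in the scripts may differ, and load_cached_ids is a hypothetical helper.

```python
# Minimal sketch of tokenizer pruning: count token frequencies over the cached
# corpus, keep all special tokens, then fill the vocabulary with the most
# frequent remaining tokens.
from collections import Counter

def count_token_frequencies(token_id_lists):
    """Count how often each token id appears in the cached corpus."""
    freq = Counter()
    for ids in token_id_lists:
        freq.update(ids)
    return freq

def select_compact_vocab(freq, special_token_ids, vocab_num):
    """Keep special tokens first, then the highest-frequency tokens, up to vocab_num."""
    kept = list(special_token_ids)
    for token_id, _ in freq.most_common():
        if len(kept) >= vocab_num:
            break
        if token_id not in special_token_ids:
            kept.append(token_id)
    return kept

# Example (load_cached_ids is hypothetical; vocabulary size is illustrative):
# freq = count_token_frequencies(load_cached_ids("cached_data_dir"))
# kept_ids = select_compact_vocab(freq, special_token_ids={0, 1, 2}, vocab_num=48000)
```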

Parameter Inheritance

To pretrain by inheriting parameters from a large model, you can use the following command:

python start_finetune.py --config ./configs/LLM1B.py

Note that MODEL_ONLY_FOLDER is the checkpoint pruned from a large model (a pruning sketch is given below).

If you want to train from scratch, you need to set load_given_ckpt=False in the config.
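As a rough illustration of how such a pruned checkpoint might be prepared, the sketch below keeps a subset of transformer layers from a larger model's state dict and renumbers them consecutively (the paper also prunes along the width dimension, which is omitted here). The layer-name pattern and selection rule are illustrative, not the repository's exact procedure.

```python
# Sketch: build a smaller state dict by inheriting a subset of transformer layers
# from a larger checkpoint (layer naming follows a common "...layers.<idx>..." pattern).
import re
import torch

def inherit_layers(large_state_dict, kept_layers):
    """Keep only the chosen transformer layers, renumbered consecutively;
    non-layer weights (embeddings, final norm, output head) are copied as-is."""
    layer_pat = re.compile(r"(.*layers\.)(\d+)(\..*)")
    remap = {old: new for new, old in enumerate(sorted(kept_layers))}
    small = {}
    for name, tensor in large_state_dict.items():
        m = layer_pat.match(name)
        if m is None:
            small[name] = tensor                      # embeddings, norms, head, ...
        elif int(m.group(2)) in remap:
            small[f"{m.group(1)}{remap[int(m.group(2))]}{m.group(3)}"] = tensor
    return small

# Example: keep every other layer of a 40-layer checkpoint (paths are placeholders).
# large = torch.load("large_model.pt", map_location="cpu")
# torch.save(inherit_layers(large, kept_layers=range(0, 40, 2)), "pruned_model.pt")
```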

Multiple-Round Training

To extract a certain proportion of challenging examples from the last epoch, you can use the following steps (a sketch follows the list):

  1. Compute the batch-wise losses $L=\{l_1,l_2,\cdots,l_N\}$ using the frozen pre-trained model from the previous epoch, where $N$ is the total number of batches. For instance, a dataset containing 150B tokens yields approximately 75,000 batches with a batch size of 2M tokens.
  2. Calculate the sampling probability $p_i = \exp(l_i) \bigg/ {\sum \limits_{j=1}^N \exp(l_j)}$.
  3. Sample $N_0$ batches out of $N$ according to the sampling probability $\boldsymbol{p}$, i.e., filtered = torch.multinomial(p, N_0, replacement=False)
  4. Concatenate all the filtered batches to create the training dataset for the subsequent epoch.
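A minimal sketch of steps 2–3, assuming the per-batch losses under the frozen previous-round model have already been computed (step 1); the function and its arguments are illustrative.

```python
# Turn per-batch losses into sampling probabilities p_i = exp(l_i) / sum_j exp(l_j)
# and draw the batch indices to reuse in the next training round.
import torch

def sample_next_round(batch_losses, keep_ratio=0.5):
    """Return indices of batches sampled (without replacement) with probability
    proportional to exp(loss)."""
    losses = torch.as_tensor(batch_losses, dtype=torch.float32)
    probs = torch.softmax(losses, dim=0)     # softmax == exp normalized over batches
    n_keep = int(keep_ratio * losses.numel())
    return torch.multinomial(probs, n_keep, replacement=False)

# Example: keep half of ~75,000 batches for the next round.
# hard_batch_ids = sample_next_round(batch_losses, keep_ratio=0.5)
```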

Inference

Convert the model weights to Hugging Face format using the script tools/transformers/convert2hf.py.

python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer tokenizer_path/

The converted model can then be loaded and run with Hugging Face Transformers, as sketched below.
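A minimal inference sketch with the Transformers library, assuming hf_ckpt/ is the output of the conversion step above; trust_remote_code may or may not be required depending on the exported model class, and the prompt is only an example.

```python
# Load the converted checkpoint and generate a short continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf_ckpt/", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("hf_ckpt/", trust_remote_code=True)

inputs = tokenizer("Tiny language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```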

Acknowledgements

This repository is built upon the InternEvo training framework (https://github.com/InternLM/InternEvo).

Citation

@article{tang2024rethinking,
  title={Rethinking Optimization and Architecture for Tiny Language Models},
  author={Tang, Yehui and Liu, Fangcheng and Ni, Yunsheng and Tian, Yuchuan and Bai, Zheyuan and Hu, Yi-Qi and Liu, Sichao and Jui, Shangling and Han, Kai and Wang, Yunhe},
  journal={arXiv preprint arXiv:2402.02791},
  year={2024}
}

@article{wang2023pangu,
  title={PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation},
  author={Wang, Yunhe and Chen, Hanting and Tang, Yehui and Guo, Tianyu and Han, Kai and Nie, Ying and Wang, Xutao and Hu, Hailin and Bai, Zheyuan and Wang, Yun and others},
  journal={arXiv preprint arXiv:2312.17276},
  year={2023}
}
