
Rethinking Optimization and Architecture for Tiny Language Models

This is the official implementation of Rethinking Optimization and Architecture for Tiny Language Models, an empirical investigation into how to construct powerful tiny language models.

Four strategies are proposed to improve performance:

  • 🎯 Compact Tokenizer: efficient coverage of the corpus;
  • 🔍 Architecture Tweak: better depth and width tradeoffs;
  • 🎁 Parameter Inheritance: powerful knowledge from larger LLMs;
  • 🔥 Multiple-Round Training: memory reinforcement of tiny models.

Based on the observations above, PanGu-π-1B Pro and PanGu-π-1.5B Pro are trained on 1.6T tokens of multilingual corpora. Model configurations and benchmark results are reported in the paper.

Training

This repository is modified from the InternEvo training framework.

Here are the steps to set up the code:

  1. Clone the InternEvo repository and configure the runtime environment.
  2. Copy the configuration file configs/LLM1B.py to the InternEvo/configs/ directory.
  3. Copy the start script src/start_finetune.py to the InternEvo root directory.

You can follow the InternEvo usage guide to prepare pretraining data and train models (https://github.com/InternLM/InternEvo/blob/develop/doc/en/usage.md).

The model's depth, width, and expanding rate can be easily adjusted in the config, as sketched below.
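As an illustration, an InternEvo-style config exposes these hyperparameters as plain Python variables. The field names below follow common InternEvo configs and may not match configs/LLM1B.py exactly; all values are placeholders.

```python
# Illustrative excerpt of an InternEvo-style model config
# (field names and values are examples; adjust to match configs/LLM1B.py).
HIDDEN_SIZE = 2048          # width of the transformer
NUM_LAYER = 22              # depth of the transformer
MLP_RATIO = 8 / 3           # expanding rate of the FFN hidden dimension
NUM_ATTENTION_HEAD = 16

model = dict(
    hidden_size=HIDDEN_SIZE,
    num_layers=NUM_LAYER,
    mlp_ratio=MLP_RATIO,
    num_attention_heads=NUM_ATTENTION_HEAD,
    vocab_size=48000,       # size of the compact tokenizer (example value)
)
```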

Compact Tokenizer

The compact tokenizer is constructed by removing low-frequency tokens from the vocabulary. To prune the tokenizer, follow these steps (a sketch of the idea follows the list):

  1. Count the frequency of tokens cached by the original large tokenizer.
    • python src/step1_token_frequency_stat.py --src cached_data_dir --dst tmp_stat_files_dir. The script counts the frequency of all tokens in the cached_data_dir folder and generates a corresponding JSON file in the tmp_stat_files_dir folder.
    • python src/step2_token_frequency_stat_combie.py --src tmp_stat_files_dir --dst total_token_freq.json. This combines all JSON files in the tmp_stat_files_dir folder and writes the token frequencies to total_token_freq.json.
  2. First add the special tokens, then add the tokens with the highest frequency to the new tokenizer.
    • python src/step3_generate_new_tokenizer.py --origin_tokenizer_dir origin_tokenizer --vocab_num compact_tokenizer_size --output new_tokenizer_dir --token_freq_file total_token_freq.json. This script generates a new tokenizer in the new_tokenizer_dir folder with compact_tokenizer_size tokens.
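For reference, the overall pruning idea behind the three scripts can be sketched as follows. This is a self-contained illustration, not the repository's actual implementation; file formats and special-token handling in the scripts may differ, and load_cached_ids is a hypothetical helper.

```python
# Minimal sketch of tokenizer pruning: count token frequencies over the cached
# corpus, keep all special tokens, then fill the vocabulary with the most
# frequent remaining tokens.
from collections import Counter

def count_token_frequencies(token_id_lists):
    """Count how often each token id appears in the cached corpus."""
    freq = Counter()
    for ids in token_id_lists:
        freq.update(ids)
    return freq

def select_compact_vocab(freq, special_token_ids, vocab_num):
    """Keep special tokens first, then the highest-frequency tokens, up to vocab_num."""
    kept = list(special_token_ids)
    for token_id, _ in freq.most_common():
        if len(kept) >= vocab_num:
            break
        if token_id not in special_token_ids:
            kept.append(token_id)
    return kept

# Example (load_cached_ids is hypothetical; vocabulary size is illustrative):
# freq = count_token_frequencies(load_cached_ids("cached_data_dir"))
# kept_ids = select_compact_vocab(freq, special_token_ids={0, 1, 2}, vocab_num=48000)
```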

Parameter Inheritance

To pretrain by inheriting parameters from a large model, you can use the following command:

python start_finetune.py --config ./configs/LLM1B.py

Note that MODEL_ONLY_FOLDER is the checkpoint pruned from a large model (a pruning sketch is given below).

If you want to train from scratch, you need to set load_given_ckpt=False in the config.
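As a rough illustration of how such a pruned checkpoint might be prepared, the sketch below keeps a subset of transformer layers from a larger model's state dict and renumbers them consecutively (the paper also prunes along the width dimension, which is omitted here). The layer-name pattern and selection rule are illustrative, not the repository's exact procedure.

```python
# Sketch: build a smaller state dict by inheriting a subset of transformer layers
# from a larger checkpoint (layer naming follows a common "...layers.<idx>..." pattern).
import re
import torch

def inherit_layers(large_state_dict, kept_layers):
    """Keep only the chosen transformer layers, renumbered consecutively;
    non-layer weights (embeddings, final norm, output head) are copied as-is."""
    layer_pat = re.compile(r"(.*layers\.)(\d+)(\..*)")
    remap = {old: new for new, old in enumerate(sorted(kept_layers))}
    small = {}
    for name, tensor in large_state_dict.items():
        m = layer_pat.match(name)
        if m is None:
            small[name] = tensor                      # embeddings, norms, head, ...
        elif int(m.group(2)) in remap:
            small[f"{m.group(1)}{remap[int(m.group(2))]}{m.group(3)}"] = tensor
    return small

# Example: keep every other layer of a 40-layer checkpoint (paths are placeholders).
# large = torch.load("large_model.pt", map_location="cpu")
# torch.save(inherit_layers(large, kept_layers=range(0, 40, 2)), "pruned_model.pt")
```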

Multiple-Round Training

To extract a certain proportion of challenging examples from the last epoch, you can use the following steps (a sketch follows the list):

  1. Compute the batch-wise losses $L=\{l_1,l_2,\cdots,l_N\}$ using the frozen pre-trained model from the previous epoch, where $N$ is the total number of batches. For instance, a dataset containing 150B tokens yields approximately 75,000 batches with a batch size of 2M tokens.
  2. Calculate the sampling probability $p_i = \exp(l_i) \bigg/ {\sum \limits_{j=1}^N \exp(l_j)}$.
  3. Sample $N_0$ batches out of $N$ according to the sampling probability $\boldsymbol{p}$, i.e., filtered = torch.multinomial(p, N_0, replacement=False)
  4. Concatenate all the filtered batches to create the training dataset for the subsequent epoch.
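A minimal sketch of steps 2–3, assuming the per-batch losses under the frozen previous-round model have already been computed (step 1); the function and its arguments are illustrative.

```python
# Turn per-batch losses into sampling probabilities p_i = exp(l_i) / sum_j exp(l_j)
# and draw the batch indices to reuse in the next training round.
import torch

def sample_next_round(batch_losses, keep_ratio=0.5):
    """Return indices of batches sampled (without replacement) with probability
    proportional to exp(loss)."""
    losses = torch.as_tensor(batch_losses, dtype=torch.float32)
    probs = torch.softmax(losses, dim=0)     # softmax == exp normalized over batches
    n_keep = int(keep_ratio * losses.numel())
    return torch.multinomial(probs, n_keep, replacement=False)

# Example: keep half of ~75,000 batches for the next round.
# hard_batch_ids = sample_next_round(batch_losses, keep_ratio=0.5)
```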

Inference

Convert the model weights to Hugging Face format using the script tools/transformers/convert2hf.py.

python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer tokenizer_path/

The converted model can then be loaded and run with Hugging Face Transformers, as sketched below.
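A minimal inference sketch with the Transformers library, assuming hf_ckpt/ is the output of the conversion step above; trust_remote_code may or may not be required depending on the exported model class, and the prompt is only an example.

```python
# Load the converted checkpoint and generate a short continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf_ckpt/", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("hf_ckpt/", trust_remote_code=True)

inputs = tokenizer("Tiny language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```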

Acknowledgements

This repository is built upon the InternEvo training framework (https://github.com/InternLM/InternEvo).

Citation

@article{tang2024rethinking,
  title={Rethinking Optimization and Architecture for Tiny Language Models},
  author={Tang, Yehui and Liu, Fangcheng and Ni, Yunsheng and Tian, Yuchuan and Bai, Zheyuan and Hu, Yi-Qi and Liu, Sichao and Jui, Shangling and Han, Kai and Wang, Yunhe},
  journal={arXiv preprint arXiv:2402.02791},
  year={2024}
}

@article{wang2023pangu,
  title={PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation},
  author={Wang, Yunhe and Chen, Hanting and Tang, Yehui and Guo, Tianyu and Han, Kai and Nie, Ying and Wang, Xutao and Hu, Hailin and Bai, Zheyuan and Wang, Yun and others},
  journal={arXiv preprint arXiv:2312.17276},
  year={2023}
}
