
how-to-train-tokenizer

How to train an LLM tokenizer from scratch.

Introduction to SentencePiece

  • SentencePiece first converts all input to Unicode characters, so it does not have to treat different languages, scripts, or symbols specially and can process every input the same way;
  • Whitespace is treated as an ordinary symbol: SentencePiece handles it explicitly as a basic token, escaping it with the meta symbol "▁" (U+2581), which makes decoding simple and lossless;
  • SentencePiece can be trained directly on raw text;
  • It supports both BPE and unigram language model training (see the training sketch below).
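
As a minimal sketch of what the training step looks like with the SentencePiece Python API (the flags below are assumptions inferred from this project's file names; see step2_train_tokenzier.py for the actual settings):

```python
import sentencepiece as spm

# Train a tokenizer directly from raw text. The input path and flags are
# assumptions based on the project layout, not the exact script.
spm.SentencePieceTrainer.train(
    input="data/corpus.txt",     # raw training corpus, one sentence per line
    model_prefix="open_llama",   # writes open_llama.model and open_llama.vocab
    vocab_size=50000,            # target vocabulary size
    model_type="bpe",            # "bpe" or "unigram"
    character_coverage=0.9995,   # keep nearly all CJK characters
)
```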

Code overview

├── data
│     └── corpus.txt                # training corpus
├── llama
│     ├── tokenizer_checklist.chk
│     └── tokenizer.model
├── merged_tokenizer_hf             # merged result, HF format
│     ├── special_tokens_map.json
│     ├── tokenizer_config.json
│     └── tokenizer.model
├── merged_tokenizer_sp
│     └── open_llama.model          # merged result, SentencePiece format
├── merge_tokenizer
│     └── tokenizer.model
├── open_llama.model                # trained SentencePiece model
├── open_llama.vocab                # trained SentencePiece vocabulary
├── README.md
├── step0_step0_process_text.py    # prepare the training corpus from multiple datasets
├── step1_make_corpus.py           # prepare the training corpus from Chinese Wikipedia data
├── step2_train_tokenzier.py       # train the tokenizer
├── step3_tokenzier_segment.py     # test the trained model with encoding and decoding examples (sketched below)
└── step4_merge_tokenizers.py      # merge with the original LLaMA tokenizer to obtain an HF-format tokenizer
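
A minimal encode/decode round trip against the trained model, roughly what step3_tokenzier_segment.py exercises (a sketch, not the actual script):

```python
import sentencepiece as spm

# Load the trained model and check that encoding and decoding agree.
sp = spm.SentencePieceProcessor()
sp.load("open_llama.model")

text = "白日依山尽,黄河入海流。"
pieces = sp.encode_as_pieces(text)        # subword pieces, e.g. ['▁白', '日', ...]
ids = sp.encode_as_ids(text)              # the corresponding integer ids
assert sp.decode_pieces(pieces) == text   # whitespace escaping makes decoding lossless
```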

The Chinese Wikipedia data contains 2,521,667 records in total.

Training corpus statistics

  • comments2019: 3,730,782 records
  • news2016zh: 18,032,857 records
  • webText2019zh: 5,705,070 records
  • cls: 396,209 records
  • lcsts: 1,481,435 records
  • wikipedia-zh: 2,521,667 records

After merging with step1_make_corpus.py, there are 9,853,042 records in total.
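
A sketch of how the individual datasets can be concatenated into one training corpus; the source file names are hypothetical, and the actual step0/step1 scripts may clean and filter the text differently:

```python
# Hypothetical per-dataset text dumps; the real preprocessing scripts
# may use different names and apply additional cleaning.
sources = ["comments2019.txt", "news2016zh.txt", "webText2019zh.txt",
           "cls.txt", "lcsts.txt", "wikipedia-zh.txt"]

with open("data/corpus.txt", "w", encoding="utf-8") as out:
    for path in sources:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:                 # drop empty lines
                    out.write(line + "\n")
```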

Test results

32000 50000
['<s>', '</s>', '<unk>']
[1, 2, 0]
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
32000
Before:32000
New model pieces: 77526
Chinese-LLaMA tokenizer has been saved to merged_tokenizer_hf
['<s>', '</s>', '<unk>']
[1, 2, 0]
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
Test text:

  • Example 1

 白日依山尽,黄河入海流。欲穷千里目,更上一层楼。
Tokenized by LLaMA tokenizer:['▁', '白', '日', '<0xE4>', '<0xBE>', '<0x9D>', '山', '<0xE5>', '<0xB0>', '<0xBD>', ',', '黄', '河', '入', '海', '流', '。', '<0xE6>', '<0xAC>', '<0xB2>', '<0xE7>', '<0xA9>', '<0xB7>', '千', '里', '目', ',', '更', '上', '一', '<0xE5>', '<0xB1>', '<0x82>', '<0xE6>', '<0xA5>', '<0xBC>', '。']
Tokenized by GoGPT-LLaMA tokenizer:['▁白', '日', '依', '山', '尽', ',', '黄河', '入海', '流', '。', '欲', '穷', '千里', '目', ',', '更', '上一', '层楼', '。']
  • Example 2
 大模型是指具有非常大的参数数量的人工神经网络模型。 在深度学习领域,大模型通常是指具有数亿到数万亿参数的模型。
Tokenized by LLaMA tokenizer (72 tokens): ['▁', '大', '模', '型', '是', '指', '<0xE5>', '<0x85>', '<0xB7>', '有', '非', '常', '大', '的', '参', '数', '数', '量', '的', '人', '工', '神', '经', '网', '<0xE7>', '<0xBB>', '<0x9C>', '模', '型', '。', '▁', '在', '深', '度', '学', '<0xE4>', '<0xB9>', '<0xA0>', '<0xE9>', '<0xA2>', '<0x86>', '<0xE5>', '<0x9F>', '<0x9F>', ',', '大', '模', '型', '通', '常', '是', '指', '<0xE5>', '<0x85>', '<0xB7>', '有', '数', '<0xE4>', '<0xBA>', '<0xBF>', '到', '数', '万', '<0xE4>', '<0xBA>', '<0xBF>', '参', '数', '的', '模', '型', '。']
Tokenized by GoGPT-LLaMA tokenizer (28 tokens): ['▁大', '模型', '是指', '具有', '非常大的', '参数', '数量的', '人工', '神经网络', '模型', '。', '▁在', '深度学习', '领域', ',', '大', '模型', '通常是', '指', '具有', '数', '亿', '到', '数万', '亿', '参数的', '模型', '。']
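
The merge log above is produced by step4_merge_tokenizers.py. A minimal sketch of the usual Chinese-LLaMA-style merge such a script performs (paths are assumptions; the real script also exports the HF-format tokenizer):

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

# Parse both tokenizers into SentencePiece model protos.
llama_tokenizer = LlamaTokenizer.from_pretrained("llama")
chinese_sp = spm.SentencePieceProcessor()
chinese_sp.load("open_llama.model")

llama_proto = sp_pb2_model.ModelProto()
llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_proto = sp_pb2_model.ModelProto()
chinese_proto.ParseFromString(chinese_sp.serialized_model_proto())

# Append every Chinese piece that LLaMA's 32K vocabulary lacks.
existing = {p.piece for p in llama_proto.pieces}
for p in chinese_proto.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2_model.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0
        llama_proto.pieces.append(new_piece)

print("New model pieces:", len(llama_proto.pieces))

# Serialize the merged proto; loading it back with LlamaTokenizer and
# calling save_pretrained gives the HF-format result in merged_tokenizer_hf.
with open("merged_tokenizer_sp/open_llama.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```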

Why the vocabulary needs to be expanded

The original LLaMA model has a 32K vocabulary trained mainly on English (see the LLaMA paper for details), so its support for other languages is limited; compare the classic multilingual model XLM-R, whose vocabulary size is 250K. A preliminary count shows that the LLaMA vocabulary contains very few Chinese characters, so Chinese text gets split into very fine pieces, with several byte tokens needed to assemble one complete Chinese character, which lowers information density. For example, in the vocabulary-expanded model a single Chinese character tends to become one token, while the original LLaMA may need 2-3 byte tokens to compose one character, significantly reducing encoding and decoding efficiency.

As the Baichuan tokenizer documentation also notes, the publicly released LLaMA models decode Chinese corpora inefficiently; expanding the vocabulary therefore improves both training and inference efficiency.
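
One way to quantify the gap described above is to tokenize the same Chinese sentence with both vocabularies and compare token counts (paths are assumptions matching the layout earlier):

```python
from transformers import LlamaTokenizer

text = "大模型是指具有非常大的参数数量的人工神经网络模型。"

original = LlamaTokenizer.from_pretrained("llama")
expanded = LlamaTokenizer.from_pretrained("merged_tokenizer_hf")

# Fewer tokens per sentence means higher information density per token
# and more usable context; compare 72 vs. 28 tokens in Example 2 above.
print(len(original.tokenize(text)), len(expanded.tokenize(text)))
```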

Issues

Falcon vocabulary expansion

Hi, I trained a Chinese vocabulary with SentencePiece using this project, but it cannot be merged with Falcon's English vocabulary: the Falcon tokenizer loaded with AutoTokenizer has no sp_model attribute. How can I solve this?

dataset?

Hi, thanks for sharing. Could you tell me where you downloaded these datasets? Thanks.
