kojimano / megatron-deepspeed-abci

License: Other

Shell 7.91% Python 83.00% Makefile 0.02% C++ 6.26% C 0.26% Cuda 2.54%

megatron-deepspeed-abci's Issues

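Where to store the data

Target
- Preprocessed data to be transferred to ABCI

Timing
- Before the rehearsal

Decide which data to use for the ABCI Grand Challenge

(Edited by multiple people)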

CommonCrawl-based data
- CACC
- mc4
- OSCAR
- CC100
- OSCAR2301
Newspaper data #3
Wikipedia
- ja
- en
Twitter data
- Reportedly not well suited to pretraining.
MT data
- WMT22
Code
- bigcode: large, so use a subset.
- the stack
Dialogue
- Presto
- Link collection

Other

Questions to ask

- Are we currently tokenizing during training, or tokenizing in advance?
- How much does performance drop when tokenizing during training?

model_tokenizer

Which tokenizer should we use?

ABCI 544 GPU rehearsal

Resume training
Check generation
Optimizer set-up

stats of CACC data

Measure the number of tokens in the CA data.
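
A minimal sketch of how such a token count could be measured, assuming the corpus is stored as jsonl with a "text" field. The `tokenize` argument is a placeholder: whitespace splitting is used here only so the example runs, and the real count depends on whichever tokenizer is eventually chosen.

```python
import io
import json
from typing import Callable, Iterable, List


def count_tokens(lines: Iterable[str],
                 tokenize: Callable[[str], List[str]] = str.split) -> int:
    """Sum token counts over jsonl lines that carry a "text" field."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += len(tokenize(record.get("text", "")))
    return total


# Hypothetical two-document sample standing in for a real jsonl file.
sample = io.StringIO(
    '{"text": "a b c"}\n'
    '{"text": "d e"}\n'
)
print(count_tokens(sample))
```

Passing an open file handle instead of `sample` streams the corpus line by line, so memory use stays flat regardless of corpus size.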

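About instruction tuning

Perform instruction tuning after pretraining. (I recall hearing somewhere that it is more effective than RLHF.)

What data should we use when doing this in Japanese? Also, the evaluation data (tasks) and the instruction-tuning data (tasks) probably need to be kept separate.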

Debug data processing pipelines

The jsonl generated by the Wikipedia preprocessing script has many empty "text" fields
Verify that the Abeja tokenizer works correctly
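
One way to quantify the empty "text" problem is sketched below. It assumes one JSON object per line; the `id` field and the sample records are hypothetical, and "empty" is taken to mean missing, non-string, or whitespace-only.

```python
import json
from typing import Iterable, Tuple


def count_empty_text(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (empty, total) over jsonl records, where "empty" means the
    "text" field is missing, not a string, or whitespace only."""
    empty = 0
    total = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines between records
        record = json.loads(line)
        total += 1
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            empty += 1
    return empty, total


# Hypothetical records standing in for the preprocessing output.
sample = [
    '{"id": 1, "text": "Tokyo is the capital of Japan."}',
    '{"id": 2, "text": ""}',
    '{"id": 3, "text": "   "}',
    '{"id": 4}',
]
empty, total = count_empty_text(sample)
print(f"{empty}/{total} records have an empty text field")
```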

Aozora Bunko Preprocessing / Binarize Data

Timeline

[] (~ 5/10?) SambaNova
[] 5/16 - 5/23 ABCI grand challenge

CyberAgent Preprocessing / Binarize Data

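Prepare the dataset to use on SambaNova

From the 04/27 meeting

Policy

Avoid datasets such as mc4 that are likely to take a long time to preprocess
This is positioned strictly as practice for the ABCI Grand Challenge

Therefore, we will use the following.

Wikipedia (not yet downloaded)
CC100 (downloaded)
- because it is relatively clean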

Preprocessing priorities

Normalization (NFD or NFC)
Deduplication
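
The two steps can be sketched with the standard library as follows. This is only an illustration under stated assumptions: it handles exact duplicates after normalization (near-duplicate detection is not covered), and the function names and sample strings are hypothetical.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(text: str, form: str = "NFC") -> str:
    """Apply Unicode normalization; form is "NFC" or "NFD"."""
    return unicodedata.normalize(form, text)


def dedup_exact(docs: Iterable[str], form: str = "NFC") -> Iterator[str]:
    """Yield normalized documents, dropping exact duplicates."""
    seen = set()
    for doc in docs:
        norm = normalize(doc, form)
        # Store a fixed-size digest instead of the full document text.
        digest = hashlib.sha1(norm.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield norm


# The first two strings are the same kana written as one code point and
# as base character + combining mark; they compare equal only after
# normalization.
docs = ["\u30ac", "\u30ab\u3099", "CC100"]
print(list(dedup_exact(docs)))
```

Normalizing before hashing is the reason the order of the two steps matters: without it, visually identical documents in different Unicode forms would survive deduplication.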

What preprocessing to do, and where

Also related to #9.

ABCI GPU benchmark

Model benchmarking results

Overview

Model hyperparameters

13B OPT-13B
- --num-layers 40
- --hidden-size 5120
- --num-attention-heads 40
10B GLM-10B <== use this one
- --num-layers 48
- --hidden-size 4096
- --num-attention-heads 64
10B Megatron-10B
- --num-layers 50
- --hidden-size 4096
- --num-attention-heads 32
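
As a rough sanity check on the size labels, the widely used approximation of 12 · L · h² parameters for the transformer blocks can be applied to the three configurations. This is an estimate that ignores embeddings, biases, and layer norms, so it is not an exact count for any of these models.

```python
def approx_params(num_layers: int, hidden_size: int) -> int:
    """Approximate transformer-block parameters: 12 * L * h^2.

    Ignores embeddings, biases, and layer norms.
    """
    return 12 * num_layers * hidden_size ** 2


# (num_layers, hidden_size) taken from the list above.
configs = {
    "OPT-13B": (40, 5120),
    "GLM-10B": (48, 4096),
    "Megatron-10B": (50, 4096),
}
for name, (layers, hidden) in configs.items():
    print(f"{name}: {approx_params(layers, hidden) / 1e9:.2f}B")
```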

Notations

MBS = micro batch size
GBS = global batch size
Sec/it = seconds per iteration
Est. Aggr. PetaFLOPs = TFLOPs * Nodes / 1024
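
The last definition written as a small helper. The inputs in the example call are hypothetical and are not measurements from the tables below.

```python
def est_aggr_petaflops(tflops: float, nodes: int) -> float:
    """Est. Aggr. PetaFLOPs = TFLOPs * Nodes / 1024, as defined above."""
    return tflops * nodes / 1024


# Hypothetical inputs, chosen only to show the arithmetic.
print(est_aggr_petaflops(40.0, 128))
```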

Preliminary Experiments

#GPUs	#Layers	DP	MP	PP	MBS	GBS	SL	AC	Max Mem (allocated)	Max Mem (reserved)	Sec/it	TFLOPs	Notes
4	4	1	2	2	1	8	1024	Yes	8584 MiB	9936 MiB	0.5	45.63	4/28
4	4	1	2	2	1	8	1024	No	8585 MiB	10278 MiB	0.44	45.09	4/28
4	2	1	2	2	1	8	2048	Yes	4458 MiB	5336 MiB	0.6	47.8	4/28
4	4	1	2	2	1	8	2048	Yes	8525 MiB	10142 MiB	0.97	51.64	4/28
4	4	1	1	4	1	8	2048	Yes	6057 / 10970 MiB (OOM)	7980 / 13278 MiB (OOM)	-	43.7	4/28
4	2	1	4	1	1	8	2048	Yes	4458 MiB	4458 MiB	0.6	47.5	4/28
4	4	1	4	1	1	8	2048	No	7462 MiB	9236 MiB	0.8	44.6	4/28
4	4	1	4	1	1	8	2048	Yes	7463 MiB	8134 MiB	1.0	47.0	4/28
4	4	1	4	1	2	8	2048	Yes	7462 MiB	8528 MiB	0.8	60.9	4/28
4	4	1	4	1	4	8	2048	Yes	7479 MiB	8890 MiB	0.8	60.9	4/28
4	4	1	4	1	4	8	2048	No	11793 MiB	13516 MiB	0.6	57.9	4/28
4	6	4	1	1	1	8	2048	Yes	10467 MiB (OOM)	11272 MiB (OOM)	-	-	4/28

Memory usage seems to increase after logging?

Experiments-1

#GPUs	Size	DP	MP	PP	MBS	GBS	SL	AC	Zero	Max Mem (allocated)	Max Mem (reserved)	TFLOPs	Sec/it	Est. Aggr. PetaFLOPs	B tokens	Notes
32	10B	1	4	8	1	90	1024	No	1	OOM MiB	OOM MiB	-	-	-	-	4/28
32	10B	1	4	8	1	90	2048	Yes	1	- MiB	- MiB	39.3	12.4	-	152	4/28
32	10B	1	4	8	2	90	2048	Yes	1	7875 MiB	8892 MiB	40.1	12.2	-	155	4/28
32	10B	1	4	8	4	90	2048	Yes	1	- MiB	- MiB	-	-	-	-	4/28
32	13B	1	4	8	1	8	2048	Yes	1	7568 MiB	8586 MiB	23.5	2.3	-	-	4/28
32	13B	1	4	8	1	512	2048	Yes	1	8966 MiB	10100 MiB	42.7	83.5	-	-	4/28
32	13B	1	4	8	1	90	1024	No	1	OOM MiB	OOM MiB	-	-	-	-	4/28
32	13B	1	4	8	1	90	2048	Yes	1	8964 MiB	10124 MiB	40.0	15.4	-	123	4/28
32	13B	1	4	8	2	90	2048	Yes	1	9303 MiB	10648 MiB	48.7	12.8	-	148	4/28
32	13B	1	4	8	4	88	2048	Yes	1	12243 MiB (OOM)	14108 MiB (OOM)	44.2	13.8	-	-	4/28
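
Two consistency checks on rows like these, as a sketch. It assumes the usual Megatron-style constraints: the GPU count equals DP × MP × PP, and GBS must be a multiple of MBS × DP. The second constraint may be why the MBS 4 row with results uses GBS 88 rather than 90, although the issue does not say so. The function names are hypothetical.

```python
def grad_accum_steps(gbs: int, mbs: int, dp: int) -> int:
    """Micro-batches per iteration per data-parallel replica.

    Raises ValueError when GBS is not a multiple of MBS * DP.
    """
    per_step = mbs * dp
    if gbs % per_step != 0:
        raise ValueError(f"GBS {gbs} is not divisible by MBS*DP = {per_step}")
    return gbs // per_step


def world_size(dp: int, mp: int, pp: int) -> int:
    """Number of GPUs implied by the three parallelism degrees."""
    return dp * mp * pp


print(world_size(dp=1, mp=4, pp=8))           # the layout of the 32-GPU rows
print(grad_accum_steps(gbs=88, mbs=4, dp=1))  # 88 is a multiple of 4; 90 is not
```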

Deepspeed (Reduce PP bubble / disable activation checkpoints)

#GPUs	#Layers	DP	MP	PP	MBS	GBS	AC	Zero	Max Mem (allocated)	Max Mem (reserved)	TFLOPs	Sec/it	B tokens	Notes
32	10	4	1	1	1	88	Yes	None	7540 MiB	9116 MiB	43.2	1.2	-	5/2
32	10	4	1	1	1	88	Yes	1	5050 MiB	- MiB	43.1	-	1.2	5/2
32	10	4	1	1	1	88	Yes	2	5490 MiB	- MiB	42.9	1.2	-	5/2

Activation Partitioning and Activation Checkpointing Chunks

#GPUs	Size	DP	MP	PP	MBS	GBS	SL	AC	AC chunk	DAC	Max Mem (allocated)	Max Mem (reserved)	TFLOPs	Sec/it	Notes
4	10B (6 layers)	1	4	1	1	88	2048	No	-	No	7758 MiB	8732 MiB	46.25	8.5	4/28
4	10B (6 layers)	1	4	1	2	88	2048	No	-	No	10614 MiB (OOM)	11858 MiB (OOM)	49.75	7.9	4/28
4	10B (6 layers)	1	4	1	1	88	2048	Yes	1	No	6931 MiB	7162 MiB	46.36	11.3	4/28
4	10B (6 layers)	1	4	1	2	88	2048	Yes	1	No	6931 MiB	7538 MiB	50.33	10.4	4/28
4	10B (6 layers)	1	4	1	2	88	2048	Yes	1	Yes	6979 MiB	7242 MiB	49.9	10.5	4/28
4	10B (6 layers)	1	4	1	4	88	2048	Yes	1	Yes	7027 MiB	8808 MiB	53.05	9.9	4/28
4	10B (6 layers)	1	4	1	8	88	2048	Yes	1	Yes	7124 MiB	10592 MiB	53.26	9.9	4/28
4	10B (6 layers)	1	4	1	2	88	2048	Yes	2	Yes	- MiB	- MiB	-	-	bug did not work ...

Notes

Activation partitioning seems to be a DeepSpeed feature that is used in combination with AC
DAC stands for distribute-checkpointed-activations
Controlling the chunk size of activation checkpointing appears to be buggy

General Preprocessing Pipeline

Integrate preprocessing script into the Megatron script and preprocess into mmap format (HF blog, script)
Design options
- tokenizer (Rinna)
- Normalization
- Deduplication

data_news_articles

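Asahi Shimbun (KS, inquiry in progress)
We were told that emailing this form is sufficient.
Mainichi Shimbun (KS, inquiry in progress)
Hochi Shimbun (cannot be made public in the case of commercial use)
Nikkei (KS, inquiry in progress)
Yomiuri Shimbun (KS, inquiry in progress)
Tokyo Stock Exchange (KS, inquiry in progress)
They sent us materials.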

About the tokenizer

(To be done in the next project or later?)

[Discussion with Kojima-san about training our own tokenizer in the future]

sample from diverse corpora
train sentencepiece
- SentencePiece has a byte_fallback option, so this should be fine. (google/sentencepiece#621 (comment))
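
To illustrate what byte fallback does, here is a toy character-level encoder. It is not SentencePiece itself, which works on learned subword pieces; it only shows the idea that anything outside the vocabulary is emitted as one `<0xNN>` piece per UTF-8 byte instead of an unknown token. The vocabulary and function name are hypothetical.

```python
from typing import List, Set


def encode_with_byte_fallback(text: str, vocab: Set[str]) -> List[str]:
    """Character-level toy encoder: characters in vocab stay as-is, all
    others are emitted as one <0xNN> piece per UTF-8 byte."""
    pieces: List[str] = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


vocab = {"日", "本"}  # hypothetical tiny vocabulary
print(encode_with_byte_fallback("日本語", vocab))
```

Because every string has a UTF-8 byte representation, a tokenizer with this fallback can encode rare kanji or emoji that were never seen when sampling the training corpora.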

Dataset Preprocessing Validation

evaluation_jglue

What to train on the SambaNova server

Add github dataset from redpajama

download script
preprocess
binarize for megatron

Add `\n` replace script

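How to finalize the training data

What format should we save it in?
- Plain text with one sentence per line? jsonl?
- Compress it? (trade-off with disk space)
- Split it? (trade-off with file-loading implementation and memory)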
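
One possible combination of these options is sketched below: jsonl with a "text" field, gzip compression, and fixed-size shards. The shard naming, the shard size, and the function names are assumptions for illustration, not a decision recorded in this issue.

```python
import gzip
import json
import os
import tempfile
from typing import Iterable, List


def write_shards(docs: Iterable[str], out_dir: str,
                 docs_per_shard: int = 2) -> List[str]:
    """Write documents as gzip-compressed jsonl shards; return the paths."""
    paths: List[str] = []
    handle = None
    for i, doc in enumerate(docs):
        if i % docs_per_shard == 0:
            # Start a new shard every docs_per_shard documents.
            if handle is not None:
                handle.close()
            path = os.path.join(out_dir, f"shard_{len(paths):05d}.jsonl.gz")
            handle = gzip.open(path, "wt", encoding="utf-8")
            paths.append(path)
        handle.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")
    if handle is not None:
        handle.close()
    return paths


def read_shard(path: str) -> List[str]:
    """Read one shard back into a list of document strings."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f]


with tempfile.TemporaryDirectory() as tmp:
    shards = write_shards(["文書1", "文書2", "文書3"], tmp)
    print(len(shards), read_shard(shards[0]))
```

Sharding keeps each file small enough to load independently, and jsonl keeps documents with internal newlines intact, which one-sentence-per-line plain text would not.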

Wikipedia Preprocessing / Binarize Data

Which data to use for SambaNova training

Confirm how CACC is tokenized
