ZeRO optimizer state sharding should not affect the results of any experiment: it only changes where the optimizer state lives, not the update that gets computed. However, @ngoyal2707 and I have observed that this isn't the case for Megatron-LM models in fairseq.
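For intuition, here is a minimal single-process sketch (our own simulation, not fairseq's implementation) of what ZeRO-1 style sharding does: each rank keeps Adam state only for its own slice of the parameters and updates that slice. Because Adam is elementwise, concatenating the shard updates is identical to one Adam step over the full parameter vector, so in exact arithmetic sharding cannot change the trajectory:

import torch

torch.manual_seed(0)
param = torch.randn(8)
grad = torch.randn(8)

def adam_step(p, g, exp_avg, exp_avg_sq,
              lr=5e-4, betas=(0.9, 0.98), eps=1e-8, step=1):
    # Plain Adam update (close to, but not byte-for-byte, fairseq's adam).
    exp_avg.mul_(betas[0]).add_(g, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
    bias1 = 1 - betas[0] ** step
    bias2 = 1 - betas[1] ** step
    return p - lr * (exp_avg / bias1) / ((exp_avg_sq / bias2).sqrt() + eps)

# Unsharded: one optimizer holds state for the full parameter vector.
ref = adam_step(param.clone(), grad, torch.zeros(8), torch.zeros(8))

# "Sharded": two simulated ranks, each holding state for half the vector.
shards = [
    adam_step(param[r * 4:(r + 1) * 4].clone(), grad[r * 4:(r + 1) * 4],
              torch.zeros(4), torch.zeros(4))
    for r in range(2)
]

assert torch.allclose(ref, torch.cat(shards))  # identical trajectories

The runs below use --memory-efficient-fp16, so any divergence has to come from how the sharded update is implemented (reduction order, fp16 state handling, and the like), not from the math.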
First, we show that results replicate exactly across model-parallel sizes (--update-freq is scaled with --model-parallel-size so that the effective batch size per update stays the same; note wpb=1024 and bsz=8 in both logs).
python fairseq_train.py --task masked_lm /checkpoint/bioseq_nonsecure/namangoyal/model_parallel_data/small_sample_valid_ur50-bin --dataset-impl fasta --save-dir checkpoints/zero-0-mp-1 --dropout 0.1 --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 --tokens-per-sample 128 --sample-break-mode none --max-tokens 128 --memory-efficient-fp16 --no-progress-bar --log-interval 1 --seed 4 --max-epoch 1 --max-update 50 --encoder-layers 4 --no-save --arch model_parallel_roberta_large --model-parallel-size 2 --update-freq 2 2>&- | grep "ppl"
2020-09-25 13:32:52 | INFO | train_inner | epoch 001: 5 / 8323965 loss=0.966, ppl=1.95, wps=0, ups=0, wpb=1024, bsz=8, num_updates=1, lr=2.24975e-07, gnorm=0.055, loss_scale=8, train_wall=0, wall=203
2020-09-25 13:32:53 | INFO | train_inner | epoch 001: 6 / 8323965 loss=0.99, ppl=1.99, wps=10858.5, ups=10.6, wpb=1024, bsz=8, num_updates=2, lr=3.4995e-07, gnorm=0.07, loss_scale=8, train_wall=0, wall=203
2020-09-25 13:32:53 | INFO | train_inner | epoch 001: 7 / 8323965 loss=0.968, ppl=1.96, wps=14452.4, ups=14.1, wpb=1024, bsz=8, num_updates=3, lr=4.74925e-07, gnorm=0.067, loss_scale=8, train_wall=0, wall=203
2020-09-25 13:32:53 | INFO | train_inner | epoch 001: 8 / 8323965 loss=1.032, ppl=2.04, wps=15902.2, ups=15.51, wpb=1024, bsz=8, num_updates=4, lr=5.999e-07, gnorm=0.073, loss_scale=8, train_wall=0, wall=203
2020-09-25 13:32:53 | INFO | train_inner | epoch 001: 9 / 8323965 loss=1.007, ppl=2.01, wps=14162.8, ups=13.81, wpb=1024, bsz=8, num_updates=5, lr=7.24875e-07, gnorm=0.089, loss_scale=8, train_wall=0, wall=203
2020-09-25 13:32:53 | INFO | train_inner | epoch 001: 10 / 8323965 loss=1.03, ppl=2.04, wps=14513.6, ups=14.15, wpb=1024, bsz=8, num_updates=6, lr=8.4985e-07, gnorm=0.082, loss_scale=8, train_wall=0, wall=204
python fairseq_train.py --task masked_lm /checkpoint/bioseq_nonsecure/namangoyal/model_parallel_data/small_sample_valid_ur50-bin --dataset-impl fasta --save-dir checkpoints/zero-0-mp-1 --dropout 0.1 --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 --tokens-per-sample 128 --sample-break-mode none --max-tokens 128 --memory-efficient-fp16 --no-progress-bar --log-interval 1 --seed 4 --max-epoch 1 --max-update 50 --encoder-layers 4 --no-save --arch model_parallel_roberta_large --model-parallel-size 4 --update-freq 4 2>&- | grep "ppl"
2020-09-25 14:44:30 | INFO | train_inner | epoch 001: 5 / 8323965 loss=0.966, ppl=1.95, wps=0, ups=0, wpb=1024, bsz=8, num_updates=1, lr=2.24975e-07, gnorm=0.055, loss_scale=8, train_wall=0, wall=212
2020-09-25 14:44:30 | INFO | train_inner | epoch 001: 6 / 8323965 loss=0.99, ppl=1.99, wps=3017.4, ups=2.95, wpb=1024, bsz=8, num_updates=2, lr=3.4995e-07, gnorm=0.07, loss_scale=8, train_wall=0, wall=212
2020-09-25 14:44:30 | INFO | train_inner | epoch 001: 7 / 8323965 loss=0.968, ppl=1.96, wps=2964.3, ups=2.89, wpb=1024, bsz=8, num_updates=3, lr=4.74925e-07, gnorm=0.067, loss_scale=8, train_wall=0, wall=212
2020-09-25 14:44:31 | INFO | train_inner | epoch 001: 8 / 8323965 loss=1.032, ppl=2.04, wps=3059.4, ups=2.99, wpb=1024, bsz=8, num_updates=4, lr=5.999e-07, gnorm=0.073, loss_scale=8, train_wall=0, wall=213
2020-09-25 14:44:31 | INFO | train_inner | epoch 001: 9 / 8323965 loss=1.007, ppl=2.01, wps=2869.3, ups=2.8, wpb=1024, bsz=8, num_updates=5, lr=7.24875e-07, gnorm=0.089, loss_scale=8, train_wall=0, wall=213
The per-update losses and gradient norms match exactly between the two model-parallel sizes. However, if we now add optimizer state sharding with --zero-sharding os, the numbers change.
python fairseq_train.py --task masked_lm /checkpoint/bioseq_nonsecure/namangoyal/model_parallel_data/small_sample_valid_ur50-bin --dataset-impl fasta --save-dir checkpoints/zero-0-mp-1 --dropout 0.1 --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 --tokens-per-sample 128 --sample-break-mode none --max-tokens 128 --memory-efficient-fp16 --no-progress-bar --log-interval 1 --seed 4 --max-epoch 1 --max-update 50 --encoder-layers 4 --no-save --arch model_parallel_roberta_large --model-parallel-size 2 --update-freq 2 --zero-sharding os 2>&- | grep "ppl"
2020-09-25 15:15:17 | INFO | train_inner | epoch 001: 5 / 8323965 loss=0.966, ppl=1.95, wps=0, ups=0, wpb=1024, bsz=8, num_updates=1, lr=2.24975e-07, gnorm=0.055, loss_scale=8, train_wall=0, wall=199
2020-09-25 15:15:17 | INFO | train_inner | epoch 001: 6 / 8323965 loss=0.968, ppl=1.96, wps=10795.5, ups=10.54, wpb=1024, bsz=8, num_updates=2, lr=3.4995e-07, gnorm=0.06, loss_scale=8, train_wall=0, wall=199
2020-09-25 15:15:17 | INFO | train_inner | epoch 001: 7 / 8323965 loss=0.953, ppl=1.94, wps=16335, ups=15.94, wpb=1024, bsz=8, num_updates=3, lr=4.74925e-07, gnorm=0.064, loss_scale=8, train_wall=0, wall=199
2020-09-25 15:15:17 | INFO | train_inner | epoch 001: 8 / 8323965 loss=0.994, ppl=1.99, wps=15692, ups=15.31, wpb=1024, bsz=8, num_updates=4, lr=5.999e-07, gnorm=0.065, loss_scale=8, train_wall=0, wall=199
2020-09-25 15:15:18 | INFO | train_inner | epoch 001: 9 / 8323965 loss=0.963, ppl=1.95, wps=17342.3, ups=16.92, wpb=1024, bsz=8, num_updates=5, lr=7.24875e-07, gnorm=0.09, loss_scale=8, train_wall=0, wall=199
2020-09-25 15:15:18 | INFO | train_inner | epoch 001: 10 / 8323965 loss=0.972, ppl=1.96, wps=16860.9, ups=16.45, wpb=1024, bsz=8, num_updates=6, lr=8.4985e-07, gnorm=0.084, loss_scale=8, train_wall=0, wall=199
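Note that the sharded run matches the baselines exactly at num_updates=1 (loss=0.966, gnorm=0.055) and diverges from num_updates=2 onward, which points at the first optimizer step rather than at data loading or the forward pass. A throwaway comparison script like the one below makes this easy to check over longer runs (the file names are ours; it assumes the two commands were re-run with output redirected to log files):

import re

def losses(path):
    # Map num_updates -> loss from fairseq train_inner log lines.
    pat = re.compile(r"loss=([\d.]+).*?num_updates=(\d+)")
    out = {}
    with open(path) as f:
        for line in f:
            m = pat.search(line)
            if m:
                out[int(m.group(2))] = float(m.group(1))
    return out

a = losses("mp2_baseline.log")   # hypothetical file names
b = losses("mp2_zero_os.log")
for step in sorted(set(a) & set(b)):
    mark = "" if a[step] == b[step] else "  <-- diverges"
    print(f"update {step}: {a[step]:.3f} vs {b[step]:.3f}{mark}")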
Collecting environment information...
PyTorch version: 1.5.0a0+4ff3872
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.105
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100
Nvidia driver version: 418.116.00
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] msgpack-numpy==0.4.5
[pip] numpy==1.18.3
[pip] numpydoc==0.9.2
[pip] pytorch-lightning==0.8.1
[pip] pytorch-pretrained-bert==0.6.2
[pip] pytorch-transformers==1.1.0
[pip] torch==1.5.0a0+4ff3872
[conda] blas 1.0 mkl
[conda] libblas 3.8.0 15_mkl conda-forge
[conda] libcblas 3.8.0 15_mkl conda-forge
[conda] liblapack 3.8.0 15_mkl conda-forge
[conda] magma-cuda101 2.5.2 1 pytorch
[conda] mkl 2020.1 217
[conda] mkl-include 2020.0 166
[conda] mkl-service 2.3.0 py36he904b0f_0
[conda] mkl_fft 1.0.15 py36ha843d7b_0
[conda] mkl_random 1.1.0 py36hd6b4f25_0
[conda] pytorch-lightning 0.8.1 <pip>