ZeRO optimizer state sharding should not affect the results of any experiment: it only changes where the optimizer state lives, not the update that gets computed. However, @ngoyal2707 and I have observed that this isn't the case for Megatron-LM models in fairseq.
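For intuition, here is a minimal single-process sketch (our own simulation, not fairseq's implementation) of what ZeRO-1 style sharding does: each rank keeps Adam state only for its own slice of the parameters and updates that slice. Because Adam is elementwise, concatenating the shard updates is identical to one Adam step over the full parameter vector, so in exact arithmetic sharding cannot change the trajectory:

import torch

torch.manual_seed(0)
param = torch.randn(8)
grad = torch.randn(8)

def adam_step(p, g, exp_avg, exp_avg_sq,
              lr=5e-4, betas=(0.9, 0.98), eps=1e-8, step=1):
    # Plain Adam update (close to, but not byte-for-byte, fairseq's adam).
    exp_avg.mul_(betas[0]).add_(g, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
    bias1 = 1 - betas[0] ** step
    bias2 = 1 - betas[1] ** step
    return p - lr * (exp_avg / bias1) / ((exp_avg_sq / bias2).sqrt() + eps)

# Unsharded: one optimizer holds state for the full parameter vector.
ref = adam_step(param.clone(), grad, torch.zeros(8), torch.zeros(8))

# "Sharded": two simulated ranks, each holding state for half the vector.
shards = [
    adam_step(param[r * 4:(r + 1) * 4].clone(), grad[r * 4:(r + 1) * 4],
              torch.zeros(4), torch.zeros(4))
    for r in range(2)
]

assert torch.allclose(ref, torch.cat(shards))  # identical trajectories

The runs below use --memory-efficient-fp16, so any divergence has to come from how the sharded update is implemented (reduction order, fp16 state handling, and the like), not from the math.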
First, we show that results replicate exactly across model-parallel sizes (--update-freq is scaled with --model-parallel-size so that the effective batch size per update stays the same; note wpb=1024 and bsz=8 in both logs).
python fairseq_train.py --task masked_lm /checkpoint/bioseq_nonsecure/namangoyal/model_parallel_data/small_sample_valid_ur50-bin --dataset-impl fasta --save-dir checkpoints/zero-0-mp-1 --dropout 0.1 --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 --tokens-per-sample 128 --sample-break-mode none --max-tokens 128 --memory-efficient-fp16 --no-progress-bar --log-interval 1 --seed 4 --max-epoch 1 --max-update 50 --encoder-layers 4 --no-save --arch model_parallel_roberta_large --model-parallel-size 2 --update-freq 2 2>&- | grep "ppl"
2020-09-25 13:32:52 | INFO | train_inner | epoch 001: 5 / 8323965 loss=0.966, ppl=1.95, wps=0, ups=0, wpb=1024, bsz=8, num_updates=1, lr=2.24975e-07, gnorm=0.055, loss_scale=8, train_wall=0, wall=203
2020-09-25 13:32:53 | INFO | train_inner | epoch 001: 6 / 8323965 loss=0.99, ppl=1.99, wps=10858.5, ups=10.6, wpb=1024, bsz=8, num_updates=2, lr=3.4995e-07, gnorm=0.07, loss_scale=8, train_wall=0, wall=203
2020-09-25 13:32:53 | INFO | train_inner | epoch 001: 7 / 8323965 loss=0.968, ppl=1.96, wps=14452.4, ups=14.1, wpb=1024, bsz=8, num_updates=3, lr=4.74925e-07, gnorm=0.067, loss_scale=8, train_wall=0, wall=203
2020-09-25 13:32:53 | INFO | train_inner | epoch 001: 8 / 8323965 loss=1.032, ppl=2.04, wps=15902.2, ups=15.51, wpb=1024, bsz=8, num_updates=4, lr=5.999e-07, gnorm=0.073, loss_scale=8, train_wall=0, wall=203
2020-09-25 13:32:53 | INFO | train_inner | epoch 001: 9 / 8323965 loss=1.007, ppl=2.01, wps=14162.8, ups=13.81, wpb=1024, bsz=8, num_updates=5, lr=7.24875e-07, gnorm=0.089, loss_scale=8, train_wall=0, wall=203
2020-09-25 13:32:53 | INFO | train_inner | epoch 001: 10 / 8323965 loss=1.03, ppl=2.04, wps=14513.6, ups=14.15, wpb=1024, bsz=8, num_updates=6, lr=8.4985e-07, gnorm=0.082, loss_scale=8, train_wall=0, wall=204
python fairseq_train.py --task masked_lm /checkpoint/bioseq_nonsecure/namangoyal/model_parallel_data/small_sample_valid_ur50-bin --dataset-impl fasta --save-dir checkpoints/zero-0-mp-1 --dropout 0.1 --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 --tokens-per-sample 128 --sample-break-mode none --max-tokens 128 --memory-efficient-fp16 --no-progress-bar --log-interval 1 --seed 4 --max-epoch 1 --max-update 50 --encoder-layers 4 --no-save --arch model_parallel_roberta_large --model-parallel-size 4 --update-freq 4 2>&- | grep "ppl"
2020-09-25 14:44:30 | INFO | train_inner | epoch 001: 5 / 8323965 loss=0.966, ppl=1.95, wps=0, ups=0, wpb=1024, bsz=8, num_updates=1, lr=2.24975e-07, gnorm=0.055, loss_scale=8, train_wall=0, wall=212
2020-09-25 14:44:30 | INFO | train_inner | epoch 001: 6 / 8323965 loss=0.99, ppl=1.99, wps=3017.4, ups=2.95, wpb=1024, bsz=8, num_updates=2, lr=3.4995e-07, gnorm=0.07, loss_scale=8, train_wall=0, wall=212
2020-09-25 14:44:30 | INFO | train_inner | epoch 001: 7 / 8323965 loss=0.968, ppl=1.96, wps=2964.3, ups=2.89, wpb=1024, bsz=8, num_updates=3, lr=4.74925e-07, gnorm=0.067, loss_scale=8, train_wall=0, wall=212
2020-09-25 14:44:31 | INFO | train_inner | epoch 001: 8 / 8323965 loss=1.032, ppl=2.04, wps=3059.4, ups=2.99, wpb=1024, bsz=8, num_updates=4, lr=5.999e-07, gnorm=0.073, loss_scale=8, train_wall=0, wall=213
2020-09-25 14:44:31 | INFO | train_inner | epoch 001: 9 / 8323965 loss=1.007, ppl=2.01, wps=2869.3, ups=2.8, wpb=1024, bsz=8, num_updates=5, lr=7.24875e-07, gnorm=0.089, loss_scale=8, train_wall=0, wall=213
The per-update losses and gradient norms match exactly between the two model-parallel sizes. However, if we now add optimizer state sharding with --zero-sharding os, the numbers change.
python fairseq_train.py --task masked_lm /checkpoint/bioseq_nonsecure/namangoyal/model_parallel_data/small_sample_valid_ur50-bin --dataset-impl fasta --save-dir checkpoints/zero-0-mp-1 --dropout 0.1 --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 --tokens-per-sample 128 --sample-break-mode none --max-tokens 128 --memory-efficient-fp16 --no-progress-bar --log-interval 1 --seed 4 --max-epoch 1 --max-update 50 --encoder-layers 4 --no-save --arch model_parallel_roberta_large --model-parallel-size 2 --update-freq 2 --zero-sharding os 2>&- | grep "ppl"
2020-09-25 15:15:17 | INFO | train_inner | epoch 001: 5 / 8323965 loss=0.966, ppl=1.95, wps=0, ups=0, wpb=1024, bsz=8, num_updates=1, lr=2.24975e-07, gnorm=0.055, loss_scale=8, train_wall=0, wall=199
2020-09-25 15:15:17 | INFO | train_inner | epoch 001: 6 / 8323965 loss=0.968, ppl=1.96, wps=10795.5, ups=10.54, wpb=1024, bsz=8, num_updates=2, lr=3.4995e-07, gnorm=0.06, loss_scale=8, train_wall=0, wall=199
2020-09-25 15:15:17 | INFO | train_inner | epoch 001: 7 / 8323965 loss=0.953, ppl=1.94, wps=16335, ups=15.94, wpb=1024, bsz=8, num_updates=3, lr=4.74925e-07, gnorm=0.064, loss_scale=8, train_wall=0, wall=199
2020-09-25 15:15:17 | INFO | train_inner | epoch 001: 8 / 8323965 loss=0.994, ppl=1.99, wps=15692, ups=15.31, wpb=1024, bsz=8, num_updates=4, lr=5.999e-07, gnorm=0.065, loss_scale=8, train_wall=0, wall=199
2020-09-25 15:15:18 | INFO | train_inner | epoch 001: 9 / 8323965 loss=0.963, ppl=1.95, wps=17342.3, ups=16.92, wpb=1024, bsz=8, num_updates=5, lr=7.24875e-07, gnorm=0.09, loss_scale=8, train_wall=0, wall=199
2020-09-25 15:15:18 | INFO | train_inner | epoch 001: 10 / 8323965 loss=0.972, ppl=1.96, wps=16860.9, ups=16.45, wpb=1024, bsz=8, num_updates=6, lr=8.4985e-07, gnorm=0.084, loss_scale=8, train_wall=0, wall=199
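Note that the sharded run matches the baselines exactly at num_updates=1 (loss=0.966, gnorm=0.055) and diverges from num_updates=2 onward, which points at the first optimizer step rather than at data loading or the forward pass. A throwaway comparison script like the one below makes this easy to check over longer runs (the file names are ours; it assumes the two commands were re-run with output redirected to log files):

import re

def losses(path):
    # Map num_updates -> loss from fairseq train_inner log lines.
    pat = re.compile(r"loss=([\d.]+).*?num_updates=(\d+)")
    out = {}
    with open(path) as f:
        for line in f:
            m = pat.search(line)
            if m:
                out[int(m.group(2))] = float(m.group(1))
    return out

a = losses("mp2_baseline.log")   # hypothetical file names
b = losses("mp2_zero_os.log")
for step in sorted(set(a) & set(b)):
    mark = "" if a[step] == b[step] else "  <-- diverges"
    print(f"update {step}: {a[step]:.3f} vs {b[step]:.3f}{mark}")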
Collecting environment information...
PyTorch version: 1.5.0a0+4ff3872
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.105
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100
Nvidia driver version: 418.116.00
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] msgpack-numpy==0.4.5
[pip] numpy==1.18.3
[pip] numpydoc==0.9.2
[pip] pytorch-lightning==0.8.1
[pip] pytorch-pretrained-bert==0.6.2
[pip] pytorch-transformers==1.1.0
[pip] torch==1.5.0a0+4ff3872
[conda] blas 1.0 mkl
[conda] libblas 3.8.0 15_mkl conda-forge
[conda] libcblas 3.8.0 15_mkl conda-forge
[conda] liblapack 3.8.0 15_mkl conda-forge
[conda] magma-cuda101 2.5.2 1 pytorch
[conda] mkl 2020.1 217
[conda] mkl-include 2020.0 166
[conda] mkl-service 2.3.0 py36he904b0f_0
[conda] mkl_fft 1.0.15 py36ha843d7b_0
[conda] mkl_random 1.1.0 py36hd6b4f25_0
[conda] pytorch-lightning 0.8.1 <pip>