microsoft / msrflute
Federated Learning Utilities and Tools for Experimentation
Home Page: https://aka.ms/flute
License: MIT License
Hi,
I recently installed FLUTE and was trying the example given in this repo's description.
The following is the command I ran:
python -m torch.distributed.run --nproc_per_node=3 e2e_trainer.py -dataPath ./testing/mockup -outputPath scratch -config testing/configs/hello_world_local.yaml -task nlg_gru -backend nccl
The following is the error I received:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Root Cause (first observed failure):
[0]:
time : 2022-10-03_17:01:38
host : linuxsys
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 44679)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Could you please guide me on how to solve this issue?
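For reference, the torchelastic summary above only reports error_file: <N/A>; the PyTorch docs linked in the log suggest decorating the training entry point with the record decorator so the real worker traceback is captured. A minimal sketch, assuming a hypothetical main() entry point (e2e_trainer.py may be structured differently):

```python
# Minimal sketch: capture the real worker traceback instead of "error_file: <N/A>".
# Assumes a hypothetical main() entry point; adapt to how e2e_trainer.py is organized.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing training logic goes here

if __name__ == "__main__":
    main()
```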
There are important files that all Microsoft projects should have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.
Hello,
My research group at the University of Cambridge is looking to benchmark FLUTE on a multi-node setup using our machine cluster.
We have been unable to find an example script for launching multi-node executions; could you please provide one?
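For context, a minimal sketch of the kind of per-node launch we have been attempting, based on the generic torch.distributed.run multi-node flags; the node count, GPU count, master address, and port below are placeholders, and we are not certain this is the launch pattern FLUTE expects:

```python
# Sketch of a per-node launch using torch.distributed.run's multi-node flags.
# NNODES, NPROC_PER_NODE, MASTER_ADDR, and MASTER_PORT are placeholders.
import os
import subprocess
import sys

NNODES = 2                                               # total machines (placeholder)
NPROC_PER_NODE = 4                                       # GPUs per machine (placeholder)
NODE_RANK = int(os.environ.get("NODE_RANK", "0"))        # 0 on the first node
MASTER_ADDR = os.environ.get("MASTER_ADDR", "10.0.0.1")  # placeholder address
MASTER_PORT = os.environ.get("MASTER_PORT", "29500")

cmd = [
    sys.executable, "-m", "torch.distributed.run",
    f"--nnodes={NNODES}",
    f"--node_rank={NODE_RANK}",
    f"--nproc_per_node={NPROC_PER_NODE}",
    f"--master_addr={MASTER_ADDR}",
    f"--master_port={MASTER_PORT}",
    "e2e_trainer.py",
    "-dataPath", "./testing/mockup",
    "-outputPath", "scratch",
    "-config", "testing/configs/hello_world_local.yaml",
    "-task", "nlg_gru",
    "-backend", "nccl",
]
subprocess.run(cmd, check=True)  # run this script once on every node
```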
Hi there, maintainers,
First off, I'm thankful to the devs and the engineering that went into setting up this framework. I tried picking it up and, while trying to simulate GPU parallel computing with NCCL, I ran into some issues.
Here's the error I'm currently trying to fix.
error [1]
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
My system is Zorin OS 16, which is based on Ubuntu 20.04, and I'm trying to use an NVIDIA RTX 3060 GPU.
nvidia-smi returns the following:
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 31% 27C P8 14W / 170W | 1426MiB / 12045MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1484 G /usr/lib/xorg/Xorg 128MiB |
| 0 N/A N/A 1633 G /usr/bin/gnome-shell 89MiB |
| 0 N/A N/A 7155 G ...548701901119532058,131072 28MiB |
| 0 N/A N/A 12476 C ...da3/envs/FLUTE/bin/python 1175MiB |
and nvcc --version returns the following:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Oct_11_21:27:02_PDT_2021
Cuda compilation tools, release 11.4, V11.4.152
Build cuda_11.4.r11.4/compiler.30521435_0
This screenshot shows that my PyTorch environment is almost ready to go.
Now, when trying to install NCCL, I can't find a way to confirm whether the installation was successful, or where the NCCL home is.
Using the command from the README:
python -m torch.distributed.run --nproc_per_node=3 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru -backend nccl
yields the following output, with no models being stored in the scratch folder (this is error [1]'s original stack):
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (packaging 22.0 (/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages), Requirement.parse('packaging<22.0,>=20.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (packaging 22.0 (/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages), Requirement.parse('packaging<22.0,>=20.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (packaging 22.0 (/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages), Requirement.parse('packaging<22.0,>=20.0')).
The data can be found here: The data can be found here: The data can be found here: ./testing ./testing./testing
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'batch_size', 'max_grad_norm'} in [server_config][val][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames', 'max_grad_norm'} in [server_config][test][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'max_grad_norm', 'batch_size'} in [server_config][val][data_config]Mon Feb 13 12:39:20 2023 : Assigning default values for: {'batch_size', 'max_grad_norm'} in [server_config][val][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'max_grad_norm', 'num_frames'} in [server_config][test][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames', 'max_grad_norm'} in [server_config][test][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]
Mon Feb 13 12:39:20 2023 : Backend: nccl
Mon Feb 13 12:39:20 2023 : Backend: nccl
Mon Feb 13 12:39:20 2023 : Backend: nccl
Added key: store_based_barrier_key:1 to store for rank: 0Added key: store_based_barrier_key:1 to store for rank: 2
Added key: store_based_barrier_key:1 to store for rank: 1
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Mon Feb 13 12:39:20 2023 : Assigning worker to GPU 1
Traceback (most recent call last):
File "e2e_trainer.py", line 238, in <module>
run_worker(model_path, config, task, data_path, local_rank, backend)
File "e2e_trainer.py", line 100, in run_worker
torch.cuda.set_device(device)
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/cuda/__init__.py", line 326, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Mon Feb 13 12:39:20 2023 : Assigning worker to GPU 0Mon Feb 13 12:39:20 2023 : Assigning worker to GPU 2
Preparing model .. Initializing
Traceback (most recent call last):
File "e2e_trainer.py", line 238, in <module>
run_worker(model_path, config, task, data_path, local_rank, backend)
File "e2e_trainer.py", line 100, in run_worker
torch.cuda.set_device(device)
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/cuda/__init__.py", line 326, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
GRU(
(embedding): Embedding()
(rnn): GRU2(
(w_ih): Linear(in_features=160, out_features=1536, bias=True)
(w_hh): Linear(in_features=512, out_features=1536, bias=True)
)
(squeeze): Linear(in_features=512, out_features=160, bias=False)
)
Mon Feb 13 12:39:20 2023 : initialize model with default settings
Mon Feb 13 12:39:20 2023 : trying to move the model to GPU
Mon Feb 13 12:39:21 2023 : model: GRU(
(embedding): Embedding()
(rnn): GRU2(
(w_ih): Linear(in_features=160, out_features=1536, bias=True)
(w_hh): Linear(in_features=512, out_features=1536, bias=True)
)
(squeeze): Linear(in_features=512, out_features=160, bias=False)
)
Mon Feb 13 12:39:21 2023 : torch.cuda.memory_allocated(): 10909184
/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/cuda/memory.py:397: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
FutureWarning)
Mon Feb 13 12:39:21 2023 : torch.cuda.memory_cached(): 23068672
Mon Feb 13 12:39:21 2023 : torch.cuda.synchronize(): None
Loading json-file: ./testing/data/nlg_gru/val_data.json
Loading json-file: ./testing/data/nlg_gru/test_data.json
Loading json-file: ./testing/data/nlg_gru/train_data.json
Mon Feb 13 12:39:21 2023 : Server data preparation
Mon Feb 13 12:39:21 2023 : No server training set is defined
Mon Feb 13 12:39:21 2023 : Prepared the dataloaders
Mon Feb 13 12:39:21 2023 : Loading Model from: None
Could not load the run context. Logging offline
Attempted to log scalar metric System memory (GB):
15.414344787597656
Attempted to log scalar metric server_config.num_clients_per_iteration:
10
Attempted to log scalar metric server_config.max_iteration:
3
Attempted to log scalar metric dp_config.eps:
0
Attempted to log scalar metric dp_config.max_weight:
0
Attempted to log scalar metric dp_config.min_weight:
0
Attempted to log scalar metric server_config.optimizer_config.type:
adam
Attempted to log scalar metric server_config.optimizer_config.lr:
0.003
Attempted to log scalar metric server_config.optimizer_config.amsgrad:
True
Attempted to log scalar metric server_config.annealing_config.type:
step_lr
Attempted to log scalar metric server_config.annealing_config.step_interval:
epoch
Attempted to log scalar metric server_config.annealing_config.gamma:
1.0
Attempted to log scalar metric server_config.annealing_config.step_size:
100
Mon Feb 13 12:39:21 2023 : Launching server
Mon Feb 13 12:39:21 2023 : server started
Attempted to log scalar metric Max iterations:
3
Attempted to log scalar metric LR for agg. opt.:
0.003
Mon Feb 13 12:39:21 2023 : Running ['val'] at itr=0
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12703 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 12704) of binary: /home/crns/anaconda3/envs/FLUTE/bin/python
Traceback (most recent call last):
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/run.py", line 766, in <module>
main()
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
)(*cmd_args)
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
e2e_trainer.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-02-13_12:39:24
host : crns-IdeaCentre-Gaming5-14IOB6
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 12705)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-02-13_12:39:24
host : crns-IdeaCentre-Gaming5-14IOB6
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 12704)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Before that, I tried running pytest -v -s in ./testing.
So my guess was that I hadn't set up NCCL properly. I tried to find the legacy build compatible with my setup from https://developer.nvidia.com/nccl/nccl-legacy-downloads and got NCCL 2.11.4 for CUDA 11.4 (September 7, 2021),
and, as instructed, ran "sudo apt install libnccl2=2.11.4-1+cuda11.4 libnccl-dev=2.11.4-1+cuda11.4", which went smoothly, but I still encountered the same stack trace.
Going to NVIDIA's nccl-tests repo, I skipped the installation steps because I have an official release, then ran "make" followed by "./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1" (I tried changing the -g argument to 4, or keeping ngpus) and got the same error either way:
./build/all_reduce_perf: symbol lookup error: ./build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum
That's where I stopped, with those two issues, where I feel solving one would help the other.
Before I got this far, I had to reformat the workstation a couple of times, since NVIDIA fails to keep all the necessary compatibility information in one place, but this post saved me.
In my previous environments, I managed to get FLUTE running on gloo; I still had a similar warning stack trace, but models could be saved.
In this fresh environment I also had trouble importing and using Python's built-in subprocess module, specifically because the "run" method generated errors. I worked around that with https://stackoverflow.com/questions/40590192/getting-an-error-attributeerror-module-object-has-no-attribute-run-while, but even with that solution I was still receiving an error, because "text" couldn't be passed to the Popen class constructor: Failed: TypeError: __init__() got an unexpected keyword argument 'text'.
My investigation led to the fact that the "text" argument was only added in newer Python versions, while your readme.md suggests 3.8, hence the problem. I can understand this if you have been working on the project for a long time, but it could have been a separate issue, because it causes the tests in pytest -v -s to fail (you can label that as an enhancement); still, I felt it could be related to why the processes aren't being assigned to the virtual GPUs properly.
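For reference, a small compatibility sketch for the subprocess issue described above: universal_newlines is the older spelling of text, so it is accepted by interpreters where text is not. The command below is a placeholder, not what the repo's tests actually invoke:

```python
# Sketch: portable subprocess call; universal_newlines=True behaves like
# text=True but is also accepted by older interpreters.
import subprocess

result = subprocess.run(
    ["python", "--version"],          # placeholder command
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    universal_newlines=True,
)
print(result.stdout)
```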
Other honorable mentions include using scikit-learn instead of the deprecated sklearn in requirements.txt, and that using the newest version of PyTorch (1.13, compatible with CUDA 11.7) leaves the speech recognition task with a deprecated torchaudio.
Apologies if I mentioned several irrelevant steps or issues, but I hope I can get an exact answer to error [1]'s stack trace and quickly get back to focusing on the experimentation side of the research. Thanks to the msrflute team, and I hope to hear from you soon.
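For reference, "invalid device ordinal" is the error PyTorch raises when a worker asks for a GPU index that does not exist, and the command above launches three processes while nvidia-smi shows a single RTX 3060. A minimal sketch to confirm how many CUDA devices PyTorch actually sees before choosing --nproc_per_node or falling back to the gloo backend (nothing FLUTE-specific in it):

```python
# Sketch: list the CUDA devices PyTorch can see. A worker requesting an
# index >= this count triggers "invalid device ordinal".
import torch

count = torch.cuda.device_count()
print(f"CUDA available: {torch.cuda.is_available()}, device count: {count}")
for i in range(count):
    print(i, torch.cuda.get_device_name(i))
```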
Hello,
While running a series of benchmarks between FLUTE and other frameworks we have observed a consistently high degree of GPU compute utilisation with low memory utilisation on the part of FLUTE. The backend used was NCCL.
Despite outclassing the other frameworks in compute utilisation, FLUTE underperforms in terms of round duration compared to one of the others by a factor of 2-4x. All experiments were carried out using the same hardware resource with either 2 or 4 GPUs and our results hold for both fast aggregation and normal aggregation.
Could you highlight some potential bottlenecks that FLUTE may encounter in an image task such that the high GPU utilisation does not translate to lower round duration?
We are interested in providing a fair comparison and would like some pointers for potential issues that may suppress the performance of FLUTE.
Hi,
The following is the error I am getting when running the sample code given in the documentation. Could you please look into it and help me resolve the issue?
vision@vision:~/aviral/msrflute$ python -m torch.distributed.run --nproc_per_node=3 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru -backend nccl
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
When profiling is enabled, server.py references self.run_metrics, which should be self.run_stats
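As a purely illustrative sketch of the rename being described (not the actual server.py code), the attribute used on the profiling path has to match the one the server actually defines:

```python
# Illustrative only; not the real server.py. The profiling path should use
# the attribute the server actually defines (self.run_stats per this report).
class Server:
    def __init__(self):
        self.run_stats = {}

    def record_profile(self, name, value):
        # was: self.run_metrics[name] = value  -> AttributeError when profiling
        self.run_stats[name] = value
```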
This issue is to discuss a known limitation: FLUTE expects a minimum of two GPUs for any CUDA-based training. There must always be a GPU for Worker 0 and then at least one more for client training. It would be valuable to be able to specify arbitrary mappings so that, say, Worker 0 and Worker 1 share the same GPU. From a memory standpoint this should be OK because they never need the GPU at the same time. I'm not sure that torch.distributed can support arbitrary mappings (note: CUDA_VISIBLE_DEVICES=0,0 doesn't work as a solution). Alternatively, if we could assign Worker 0 to the CPU and Workers 1+ to GPUs, that might be a reasonable solution; relatively speaking, model aggregation is less expensive and could potentially be done on the CPU.
Thoughts?
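As a purely illustrative sketch of the kind of mapping discussed above (this is not how e2e_trainer.py currently assigns devices), workers could be folded onto the available GPUs with a modulo, or Worker 0 could be pinned to the CPU:

```python
# Illustrative sketch of the mapping idea discussed above; not FLUTE code.
import torch

def pick_device(local_rank: int, aggregator_on_cpu: bool = False) -> torch.device:
    """Map a worker's local rank onto the available hardware."""
    if aggregator_on_cpu and local_rank == 0:
        return torch.device("cpu")           # worker 0 aggregates on the CPU
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return torch.device("cpu")
    # Fold workers onto GPUs so that, e.g., worker 0 and worker 1 share cuda:0.
    return torch.device(f"cuda:{local_rank % n_gpus}")
```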
$ git submodule update --init --recursive
I could not get this to work at all. Finally, I had to go through the commit logs to find the prv_accountant repository:
https://github.com/microsoft/prv_accountant
I cloned it, ran setup.py, set the path variable for the prv_accountant module, and it worked.
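A minimal sketch of the path workaround described above; the clone location is a placeholder, and installing the package with pip from the cloned checkout would likely make the path tweak unnecessary:

```python
# Sketch of the sys.path workaround described above. The clone location is a
# placeholder for wherever microsoft/prv_accountant was checked out.
import sys

sys.path.insert(0, "/path/to/prv_accountant")

import prv_accountant  # should now resolve against the cloned checkout
print(prv_accountant.__file__)
```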
Hello
This is a suggestion, not a bug report.
May we expect to see an Xbox client?
I am ready to help.
Is it possible to disable the annealing LR scheduler? If it is removed from the config.yaml file, the training process will not start.
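For reference, a sketch of why gamma: 1.0 in the step_lr annealing section (the values logged earlier on this page) effectively disables the schedule without removing the block from config.yaml, shown here with the equivalent torch scheduler rather than the exact YAML keys, which may differ:

```python
# Illustrative: with gamma=1.0, StepLR multiplies the learning rate by 1.0
# at every step boundary, so the schedule is effectively a no-op.
import torch

model = torch.nn.Linear(4, 2)                                   # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=0.003)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100, gamma=1.0)

for _ in range(3):
    opt.step()
    sched.step()
    print(opt.param_groups[0]["lr"])                            # stays at 0.003
```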
When enabling the replay server option in the server, it breaks because of the following: