I am running the PowerSGD paper code with:

python3 -m torch.distributed.launch --nnodes=1 --nproc_per_node 2 --master_port=25621 /rscratch/zhendong/mfguo/powersgd/paper-code/resnet50_train.py

and both ranks fail during init_process_group with:

Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:02:00)

Full output:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
File already exists: output.tmp/dist_init
Distributed init: rank 0/2 - output.tmp/dist_init
File already exists: output.tmp/dist_init
Distributed init: rank 1/2 - output.tmp/dist_init
Traceback (most recent call last):
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/resnet50_train.py", line 24, in <module>
    train.main()
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/train_pytorch.py", line 118, in main
    process_group = torch.distributed.init_process_group(
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 255, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:02:00)

Rank 1 raises the identical traceback, ending with:
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:02:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3500208) of binary: /rscratch/zhendong/llmenv/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
However, if I change the init method from file:// to env://, this error disappears; instead I get a timeout error when the code tries to run all_reduce. This is the initialization code:
process_group = torch.distributed.init_process_group(
    backend=config["distributed_backend"],
    init_method="file://" + os.path.abspath(config["distributed_init_file"]),
    # init_method="env://",
    timeout=datetime.timedelta(seconds=120),
    world_size=config["n_workers"],
    rank=config["rank"],
)
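For reference, the env:// variant I tried looks roughly like this (a minimal sketch, not the exact paper code; it assumes the launcher exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK, which torch.distributed.launch / torchrun does):

import datetime
import os

import torch
import torch.distributed as dist

# With env://, rank and world_size are read from the environment variables
# set by the launcher, so they do not need to be passed explicitly.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)  # pin each process to its own GPU before NCCL init

dist.init_process_group(
    backend="nccl",  # hard-coded here for illustration; the real code uses config["distributed_backend"]
    init_method="env://",
    timeout=datetime.timedelta(seconds=120),
)

The output from the run that then times out in all_reduce: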
File already exists: output.tmp/dist_init
Distributed init: rank 1/2 - output.tmp/dist_init
File already exists: output.tmp/dist_init
Distributed init: rank 0/2 - output.tmp/dist_init
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
timer - epoch: 1.000, value: 0.045 (event:batch.forward)
timer - epoch: 1.000, value: 0.143 (event:batch.backward)
timer - epoch: 1.000, value: 0.001 (event:batch.evaluate)
timer - epoch: 0.005, value: 0.207 (event:batch)
timer - epoch: 1.000, value: 0.045 (event:batch.forward)
timer - epoch: 1.000, value: 0.397 (event:batch.backward)
timer - epoch: 1.000, value: 0.001 (event:batch.evaluate)
timer - epoch: 1.000, value: 0.001 (event:batch.evaluate)
timer - epoch: 1.000, value: 0.145 (event:batch.backward)
timer - epoch: 1.000, value: 0.001 (event:batch.evaluate)
timer - epoch: 0.021, value: 0.210 (event:batch)
timer - epoch: 1.000, value: 0.138 (event:batch.backward)
timer - epoch: 1.000, value: 0.139 (event:batch.backward)
timer - epoch: 1.000, value: 0.044 (event:batch.forward)
timer - epoch: 1.000, value: 0.152 (event:batch.backward)
timer - epoch: 1.000, value: 0.001 (event:batch.evaluate)
timer - epoch: 1.000, value: 0.001 (event:batch.evaluate)
timer - epoch: 1.000, value: 0.045 (event:batch.forward)
timer - epoch: 1.000, value: 0.001 (event:batch.evaluate)
timer - epoch: 1.000, value: 0.045 (event:batch.forward)
timer - epoch: 1.000, value: 0.045 (event:batch.forward)
timer - epoch: 0.221, value: 0.199 (event:batch)
timer - epoch: 1.000, value: 0.140 (event:batch.backward)
timer - epoch: 1.000, value: 0.045 (event:batch.forward)
timer - epoch: 1.000, value: 0.044 (event:batch.forward)
Traceback (most recent call last):
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/resnet50_train.py", line 24, in <module>
    train.main()
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/train_pytorch.py", line 279, in main
    epoch_metrics.reduce()
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/mean_accumulator.py", line 27, in reduce
    self.average[key].reduce()
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/mean_accumulator.py", line 34, in reduce
    handle_tc = torch.distributed.all_reduce(total_count, async_op=True)
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/train_pytorch.py", line 233, in all_reduce_with_logging
    ret = all_reduce_orig(*args, **kwargs)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1320, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f06053b8612 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x5f (0x7f06053b4d7f in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x11f (0x7f0639a0507f in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x21 (0x7f0639a06001 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x5b (0x7f0639a0608b in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f06399d7702 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f06399d7702 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f06399d7702 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xb1 (0x7f06067f8421 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x204 (0x7f06067fc8b4 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xf2fe35 (0x7f06067ffe35 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #11: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0xf (0x7f060680111f in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x2ac (0x7f0606806f3c in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #13: <unknown function> + 0x8a06db (0x7f064f1a76db in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x21e8d5 (0x7f064eb258d5 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #15: PyCFunction_Call + 0x59 (0x5f6939 in /rscratch/zhendong/llmenv/bin/python3)
frame #16: _PyObject_MakeTpCall + 0x296 (0x5f7506 in /rscratch/zhendong/llmenv/bin/python3)
frame #17: /rscratch/zhendong/llmenv/bin/python3() [0x50b8d3]
frame #18: _PyEval_EvalFrameDefault + 0x5796 (0x570556 in /rscratch/zhendong/llmenv/bin/python3)
frame #19: _PyEval_EvalCodeWithName + 0x26a (0x5697da in /rscratch/zhendong/llmenv/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x393 (0x5f6ec3 in /rscratch/zhendong/llmenv/bin/python3)
frame #21: PyObject_Call + 0x62 (0x5f60b2 in /rscratch/zhendong/llmenv/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x1f3c (0x56ccfc in /rscratch/zhendong/llmenv/bin/python3)
frame #23: _PyEval_EvalCodeWithName + 0x26a (0x5697da in /rscratch/zhendong/llmenv/bin/python3)
frame #24: _PyFunction_Vectorcall + 0x393 (0x5f6ec3 in /rscratch/zhendong/llmenv/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x1910 (0x56c6d0 in /rscratch/zhendong/llmenv/bin/python3)
frame #26: _PyFunction_Vectorcall + 0x1b6 (0x5f6ce6 in /rscratch/zhendong/llmenv/bin/python3)
frame #27: _PyEval_EvalFrameDefault + 0x859 (0x56b619 in /rscratch/zhendong/llmenv/bin/python3)
frame #28: _PyFunction_Vectorcall + 0x1b6 (0x5f6ce6 in /rscratch/zhendong/llmenv/bin/python3)
frame #29: _PyEval_EvalFrameDefault + 0x859 (0x56b619 in /rscratch/zhendong/llmenv/bin/python3)
frame #30: _PyEval_EvalCodeWithName + 0x26a (0x5697da in /rscratch/zhendong/llmenv/bin/python3)
frame #31: _PyFunction_Vectorcall + 0x393 (0x5f6ec3 in /rscratch/zhendong/llmenv/bin/python3)
frame #32: _PyEval_EvalFrameDefault + 0x5796 (0x570556 in /rscratch/zhendong/llmenv/bin/python3)
frame #33: _PyEval_EvalCodeWithName + 0x26a (0x5697da in /rscratch/zhendong/llmenv/bin/python3)
frame #34: PyEval_EvalCode + 0x27 (0x68e547 in /rscratch/zhendong/llmenv/bin/python3)
frame #35: /rscratch/zhendong/llmenv/bin/python3() [0x67dbf1]
frame #36: /rscratch/zhendong/llmenv/bin/python3() [0x67dc6f]
frame #37: /rscratch/zhendong/llmenv/bin/python3() [0x67dd11]
frame #38: PyRun_SimpleFileExFlags + 0x197 (0x67fe37 in /rscratch/zhendong/llmenv/bin/python3)
frame #39: Py_RunMain + 0x212 (0x6b7c82 in /rscratch/zhendong/llmenv/bin/python3)
frame #40: Py_BytesMain + 0x2d (0x6b800d in /rscratch/zhendong/llmenv/bin/python3)
frame #41: __libc_start_main + 0xf3 (0x7f065a2fd083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x2e (0x5fb85e in /rscratch/zhendong/llmenv/bin/python3)
[E ProcessGroupNCCL.cpp:737] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6201, OpType=ALLREDUCE, Timeout(ms)=120000) ran for 123097 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6201, OpType=ALLREDUCE, Timeout(ms)=120000) ran for 123097 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2863885) of binary: /rscratch/zhendong/llmenv/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
The only changes I made to the codebase were dist._GradBucket to dist.GradBucket and torch.futures.Future to torch.futures.Future[torch.Tensor], to make it compatible with a newer version of PyTorch. Any help would be really appreciated!
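For context, the change only touches the DDP communication-hook signatures, along the lines of this sketch (an illustrative plain allreduce hook, not the actual PowerSGD hook from the paper code):

import torch
import torch.distributed as dist

# Newer PyTorch exposes dist.GradBucket publicly (previously dist._GradBucket)
# and annotates the hook return type as torch.futures.Future[torch.Tensor].
def allreduce_hook(state: object, bucket: dist.GradBucket) -> torch.futures.Future[torch.Tensor]:
    tensor = bucket.buffer()
    fut = dist.all_reduce(tensor, async_op=True).get_future()
    # Average the summed gradients once the asynchronous all_reduce completes.
    return fut.then(lambda f: f.value()[0] / dist.get_world_size())

Such a hook would be registered with ddp_model.register_comm_hook(state=None, hook=allreduce_hook).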