
powersgd's Introduction

PowerSGD

Practical Low-Rank Gradient Compression for Distributed Optimization

Video

Abstract: We study gradient compression methods to alleviate the communication bottleneck in data-parallel distributed optimization. Despite the significant attention received, current compression schemes either do not scale well or fail to achieve the target test accuracy. We propose a new low-rank gradient compressor based on power iteration that can i) compress gradients rapidly, ii) efficiently aggregate the compressed gradients using all-reduce, and iii) achieve test performance on par with SGD. The proposed algorithm is the only method evaluated that achieves consistent wall-clock speedups when benchmarked against regular SGD with an optimized communication backend. We demonstrate reduced training times for convolutional networks as well as LSTMs on common datasets.

Reference implementation

This is a reference implementation for the PowerSGD algorithm.

Installation:

pip install git+https://github.com/epfml/powersgd.git

Usage:

+ from powersgd import PowerSGD, Config, optimizer_step

  model = torchvision.models.resnet50(pretrained=True)
  params = list(model.parameters())
  optimizer = torch.optim.SGD(params, lr=0.1)

+ powersgd = PowerSGD(params, config=Config(
+     rank=1,  # lower rank => more aggressive compression
+     min_compression_rate=10,  # don't compress gradients whose compression rate would be below this
+     num_iters_per_step=2,  # lower number => more aggressive compression
+     start_compressing_after_num_steps=0,
+ ))

  for each batch:
      loss = ...
-     optimizer.zero_grad()
      loss.backward()
-     optimizer.step()
+     optimizer_step(optimizer, powersgd)
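
Note that the diff above drops optimizer.zero_grad() as well as optimizer.step(): optimizer_step(optimizer, powersgd) takes over both roles. Roughly, and only as a simplified sketch (not the library's exact implementation; the aggregate method is assumed from powersgd/powersgd.py), it does something like:

def sketch_optimizer_step(optimizer, powersgd, params):
    # Local gradients from loss.backward() on this worker.
    grads = [p.grad for p in params]

    # Compress, all-reduce across workers, and decompress. The part of the
    # gradient that the low-rank approximation could not capture stays behind
    # locally and feeds into the next step (error feedback) -- presumably why
    # the usual optimizer.zero_grad() is dropped in the diff above.
    averaged = powersgd.aggregate(grads)  # assumed API, see powersgd/powersgd.py

    # Step the optimizer on the aggregated gradients.
    for p, g in zip(params, averaged):
        p.grad = g
    optimizer.step()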

Differences with the paper version

The version in this code base is a slight improvement over the version in the PowerSGD paper. It looks a bit like Algorithm 2 in this follow-up paper.

We found that there are two ways to control the approximation quality in PowerSGD: the first is the rank of the approximation, and the second is the number of power iterations between gradient steps while keeping the rank at 1. Because the cost of orthogonalisation grows as $O(\text{rank}^2)$, increasing the rank can become inefficient, which leaves increasing the number of iterations as the better option.

In the original PowerSGD paper, more iterations only improve the quality of the rank-k approximation, as it converges to the best rank-k approximation. In the follow-up paper, the intermediate results of these rank-1 power iterations are all used and communicated, effectively increasing the rank as the number of iterations grows.

In the original PowerSGD paper, we used two iterations per SGD step (a left and a right iteration). In this setting, there is not much of a difference. The difference appears when you use more power iteration steps per SGD step.
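
To make this concrete, here is a toy single-worker illustration (a deflation-style sketch of the idea, not this repository's actual algorithm): each iteration contributes one rank-1 term, every term is kept rather than only the last one, so the effective rank of the accumulated approximation grows with the number of iterations.

import torch

def accumulated_rank1_approximation(matrix: torch.Tensor, num_iters: int):
    """Sketch: run `num_iters` rank-1 power-iteration steps, keeping every
    intermediate (p, q) pair instead of only the final one."""
    m, n = matrix.shape
    residual = matrix.clone()
    approximation = torch.zeros_like(matrix)
    q = torch.randn(n, 1)
    for _ in range(num_iters):
        p = residual @ q                    # left power-iteration step
        p = p / (p.norm() + 1e-12)          # normalise (stand-in for orthogonalisation)
        q = residual.t() @ p                # right power-iteration step
        rank1 = p @ q.t()                   # intermediate rank-1 term
        approximation += rank1              # keep (i.e. "communicate") every term
        residual -= rank1                   # continue on what is left
    return approximation

# With more iterations, the accumulated approximation captures more of the
# matrix, much like increasing the rank would.
g = torch.randn(256, 128)
for iters in (1, 2, 4):
    err = (g - accumulated_rank1_approximation(g, iters)).norm() / g.norm()
    print(iters, float(err))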

PyTorch implementation

PyTorch features an implementation of PowerSGD as a communication hook for DistributedDataParallel models. Because of the integration with DDP, that code is more involved than the code in this repository.
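
For reference, registering that built-in hook looks roughly like this (API names as documented for recent PyTorch releases; my_model and local_rank are placeholders, and the process group must already be initialised, e.g. via torchrun):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

model = DDP(my_model, device_ids=[local_rank])

state = powerSGD.PowerSGDState(
    process_group=None,            # use the default process group
    matrix_approximation_rank=1,   # analogous to `rank` in this repo's Config
    start_powerSGD_iter=10,        # uncompressed all-reduce for the first steps
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)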

Research code

Research code for the experiments in the PowerSGD paper is located under paper-code.

Selected follow-up work

  • (Cho et al., 2019) concurrently developed an algorithm that is fundamentally very similar to PowerSGD.
  • (Ramesh et al., 2021 - DALL-E) share valuable recommendations in using PowerSGD for large-scale transformer training.
  • (Agarwal et al., 2020) share insights into adaptive compression with PowerSGD.
  • (Vogels et al., 2020) adapt PowerSGD to work in a decentralized setting (with sparse connectivity between workers).
  • (Wang, 2021) introduces a variation of PowerSGD and describes their experience with PowerSGD on large language models.
  • (Please submit a PR if you want your work to be included here.)

Reference

If you use this code, please cite the following paper:

@inproceedings{vkj2019powersgd,
  author = {Vogels, Thijs and Karimireddy, Sai Praneeth and Jaggi, Martin},
  title = "{{PowerSGD}: Practical Low-Rank Gradient Compression for Distributed Optimization}",
  booktitle = {NeurIPS 2019 - Advances in Neural Information Processing Systems},
  year = 2019,
  url = {https://arxiv.org/abs/1905.13727}
}

powersgd's People

Contributors

martinjaggi, saipraneet, tvogels, vineeths96, younik

powersgd's Issues

Request for an Example to Reproduce the Paper Code with Environment and Hardware Details

Hello,

I'm attempting to run and reproduce the results of the code provided in this repository, specifically the implementation related to PowerSGD. To ensure a smooth and accurate reproduction, could you please provide a detailed example or guide that includes the following information?

  1. Hardware Platform Details:

    • Number and type of GPUs used.
    • Any specific hardware requirements or configurations.
  2. Software Environment:

    • The version of PyTorch and CUDA.
    • Any other dependencies or libraries required to run the code.
  3. Execution Instructions:

    • Detailed commands to launch the training process, especially if it involves distributed training using python -m torch.distributed.launch or torchrun.

Additionally, it appears that certain parts of the code might require adjustments to work correctly with the distributed launch utility. Specifically, the references to the rank variable in the following lines seem to be problematic when launching with python -m torch.distributed.launch:

These lines of code seem to directly access a rank variable which may not be properly initialized in a distributed training context initiated by torch.distributed.launch.

Could you please clarify these aspects or suggest any necessary modifications to successfully run the distributed training as intended?

Thank you very much for your assistance and for sharing your work. I'm looking forward to successfully reproducing the results and exploring the capabilities of PowerSGD.

Best regards,
Lichen

Verify error is constant as number of workers varies

To me, the most interesting part of this code is the all-reduce scheme with error correction. From the (brief) description of error correction on the NeurIPS poster, it appears the error will stay constant as the number of workers varies.

all reduce

In the RankKReducer, the all reduce result is not divided by the number of workers. For example,

all_reduce(self.p_memory)

Will this be an issue, or am I missing something in the code?
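
As a generic illustration of how error feedback usually combines with an averaging all-reduce (a sketch with placeholder compress/decompress functions whose reconstruction is linear in the compressed tensor, not the actual RankKReducer):

import torch
import torch.distributed as dist

def compressed_allreduce_step(grad, error_buffer, compress, decompress):
    # Error feedback: fold the previous step's compression error back in.
    corrected = grad + error_buffer

    # Compress locally and remember what the compression lost. The error is
    # computed against the *local* approximation, before any communication.
    compressed = compress(corrected)
    error_buffer.copy_(corrected - decompress(compressed))

    # Sum the compressed representation over all workers and divide by the
    # world size so that the reconstruction is an average, like plain SGD.
    dist.all_reduce(compressed)
    compressed /= dist.get_world_size()
    return decompress(compressed)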

An error-feedback issue.

Hi,

Is "memory_out" for the error-feedback in class RankKReducer of gradient_reducers.py?
def reduce(self, grad_in, grad_out, memory_out):

It seems that the local error is computed here:

with self.timer("reduce.outerprod", verbosity=2):
    for p, q, (tensor, out, mem) in zip(ps, qs, high_rank_tensors):
        # Set the output gradient
        torch.matmul(p, q.t(), out=out.data[:])
        mem.data[:] = tensor - out

However, I think the local compressed gradient, rather than the averaged version, should be subtracted from the local gradient. That is, remove the last line in the code above and change

with self.timer("reduce.compute.q", verbosity=2):
    for p, q, (tensor, _, _) in zip(ps, qs, high_rank_tensors):
    matrix = tensor.view(tensor.shape[0], -1)
    torch.matmul(matrix.t(), p, out=q)

to

with self.timer("reduce.compute.q", verbosity=2):
    for p, q, (tensor, _, mem) in zip(ps, qs, high_rank_tensors):
        matrix = tensor.view(tensor.shape[0], -1)
        torch.matmul(matrix.t(), p, out=q)
        mem.data[:] = tensor - torch.matmul(p, q.t()).view_as(tensor)

Is that correct?

Besides, the input and kernel dimensions of the 4D gradients of the convolutional layers should be flattened. Are there such computations in this repository? I looked but couldn't find them.

Thanks

PowerSGD is similarly efficient to torch.svd_lowrank

Here I wrote a function for power iteration.



import torch
import time

def poweriter(input, p_buffer, q_buffer, iter):
    # Alternate left/right power-iteration steps; orthogonalise only on the last pass.
    for i in range(iter):
        if i == iter - 1:
            p_buffer[0] = torch.linalg.qr(p_buffer[0]).Q
        q_buffer[0] = input @ p_buffer[0]
        if i == iter - 1:
            q_buffer[0] = torch.linalg.qr(q_buffer[0]).Q
        p_buffer[0] = input.permute((0, 1, 3, 2)) @ q_buffer[0]
    return q_buffer[0] @ p_buffer[0].permute((0, 1, 3, 2))


input = torch.rand([64, 32, 112, 112]).requires_grad_()
p_buffer = torch.rand([64, 32, 112, 3])
q_buffer = torch.rand([64, 32, 112, 3])
for i in range(10):
    output = poweriter(input,[p_buffer],[q_buffer],1)
start = time.time()
output = poweriter(input,[p_buffer],[q_buffer],2)
end = time.time()
print("powersvd_time:",end - start,"error:",torch.abs(output - input).mean())
input = input.view(64,32,-1)
start = time.time()
U,S,V = torch.svd_lowrank(input, q = 3)
S = torch.diag_embed(S)
V = V.transpose(-1, -2)
output = torch.matmul(U[..., :, :], S[..., :, :])
output = torch.matmul(output[..., :, :], V[..., :, :])
end = time.time()

print("svdlow_time:",end - start,"error:",torch.abs(output - input).mean())

I only do orthogonalization at the end and only run 2 iterations, but power iteration does not seem to be much faster than svd_lowrank, and it also has a larger error than before (when there is no error feedback).
The result is

powersvd_time: 0.1424109935760498 error: tensor(0.2390)
svdlow_time: 0.16568613052368164 error: tensor(0.2343)

May I assume that I could actually use svd_lowrank during training instead of power iteration and get results similar to the paper?
I have tried using svd_lowrank to compress gradients, and in the same setting it performs slightly better while costing a bit more time (though the difference is relatively small).
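
One thing a single-shot comparison misses is warm starting: in PowerSGD, the Q factor from the previous optimisation step is reused as the starting point for the next one, so consecutive (correlated) gradients need very few iterations, whereas svd_lowrank starts from scratch at every step. A toy sketch of that reuse (illustrative only, not the repository's code):

import torch

def warm_start_rank_r_step(matrix, q):
    """One PowerSGD-style step at rank r = q.shape[1], reusing `q` from the
    previous optimisation step as the starting point (warm start)."""
    p = matrix @ q
    p, _ = torch.linalg.qr(p)          # orthogonalise the left factor
    q = matrix.t() @ p                 # refresh the right factor
    return p @ q.t(), q                # low-rank approximation and new warm start

m, r = torch.randn(512, 256), 4
q = torch.randn(256, r)
for step in range(5):
    m = m + 0.01 * torch.randn_like(m)      # gradients change slowly between steps
    approx, q = warm_start_rank_r_step(m, q)
    print(step, float((m - approx).norm() / m.norm()))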

Problems running with more than 1 worker

Hello,

I am very new to this paradigm of parallel training, so I am probably making some rookie mistake. The issue is that whenever I try to increase the number of workers from 1 to 2, the code hangs at the init_process_group stage.

The system that I am running on has 2 GPUs. First I tried modifying train.py itself as mentioned in the readme. Then, following another issue, I tried running with mpirun. Both approaches get stuck at the stage mentioned above.

With mpirun -np 2 python3 train.py I am getting the following error

Failed to create a completion queue (CQ):

Hostname: compute-0-0
Requested CQE: 16384
Error: Cannot allocate memory

Check the CQE attribute.

Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: compute-0-0

Distributed init: rank 0/2 - ./output.tmp/dist_init
Distributed init: rank 0/2 - ./output.tmp/dist_init

I checked for some limits of memlock as mentioned in some websites, but they are all set to UNLIMITED.

Also, if I can run without mpirun, I would prefer that.

Since I require this for one of my course projects, could you please guide me on how to run this with 2 or more workers?

Any help is appreciated.

Thanks,
Soumya

UPDATE


I was working with PyTorch version 1.7.1, which I updated to 1.10.0. After that, I set the following:

export NCCL_SOCKET_IFNAME=en,eth

Now the error that is coming is as follows:

compute-0-0:188492:188492 [0] bootstrap.cc:40 NCCL WARN Bootstrap : no socket interface found
compute-0-0:188492:188492 [0] NCCL INFO init.cc:98 -> 3
compute-0-0:188492:188492 [0] NCCL INFO init.cc:150 -> 3
compute-0-0:188492:188492 [0] NCCL INFO init.cc:167 -> 3
Traceback (most recent call last):
File "run_new.py", line 44, in
train.main()
File "/home/soumyad/powersgd/train.py", line 177, in main
bits_communicated += reducer.reduce(send_buffers, grads, memories)
File "/home/soumyad/powersgd/gradient_reducers.py", line 753, in reduce
all_reduce(self.p_memory)
File "/home/soumyad/powersgd/gradient_reducers.py", line 1185, in all_reduce
return torch.distributed.all_reduce(*args, **kwargs)
File "/home/soumyad/powersgd/.powersgd/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1285, in all_reduce
work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:891, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption.

Any suggestions as to what I should try?

Information on Timer Class

I have a question regarding the implementation of the Timer class. In the line here, it is said that if the verbosity level is high the time is not measured. But in many places, like here, the time is measured and displayed in the final output.

What am I missing here?

What's the difference between the paper code and the recent code?

Hi,

I'm impressed by the efficiency of PowerSGD.

Could you let me know the difference between the paper code and the recent code, which is located at powersgd/powersgd.py?

Is there any performance difference?

I think the paper code and the recent code are not identical.

Does recent code also converge well?

Can I know the reason?

TopKReducer class

In the TopKReducer class, at this line, we have:

top_size = max(1, int(0.5 * self.rank * tensor.nelement()))

I believe that self.rank should be self.compression. If not, can you explain the intuition behind this line?
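
For context, a generic top-k gradient compressor keeps only the largest-magnitude entries; the factor of 0.5 in the line above presumably accounts for having to send both the values and their indices. A minimal sketch of the compression itself (not this repository's TopKReducer, which also handles the communication):

import torch

def topk_compress(tensor: torch.Tensor, keep_fraction: float) -> torch.Tensor:
    """Zero out all but the `keep_fraction` largest-magnitude entries."""
    flat = tensor.flatten()
    k = max(1, int(keep_fraction * flat.numel()))
    _, indices = torch.topk(flat.abs(), k)
    out = torch.zeros_like(flat)
    out[indices] = flat[indices]
    return out.view_as(tensor)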

torch lr scheduler

Hi, PowerSGD works well in my practice. How can PowerSGD be used together with a torch LR scheduler?
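
Since optimizer_step(optimizer, powersgd) ultimately steps the wrapped optimizer, a standard torch.optim.lr_scheduler attached to that same optimizer should keep working as usual. A minimal sketch following the README usage above (model and loader are placeholders):

import torch
from powersgd import PowerSGD, Config, optimizer_step

params = list(model.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
powersgd = PowerSGD(params, config=Config(
    rank=1,
    min_compression_rate=10,
    num_iters_per_step=2,
    start_compressing_after_num_steps=0,
))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for batch in loader:
        loss = ...                           # forward pass and loss
        loss.backward()
        optimizer_step(optimizer, powersgd)  # replaces optimizer.step()
    scheduler.step()                         # adjust the learning rate as usual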

How to run test script?

import train
train.config["num_epochs"] = 200
train.config["n_workers"] = 8
train.config["rank"] = 0
train.main()

I ran the above script file, but it didn't work.

If I change n_workers to 1, then it works.

I think train.py needs to be launched with torch.multiprocessing (mp.spawn) or via an mpirun script. Is that right?

Could you please tell me how to run the test script?
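
One way this is typically done with a script like the one above is to spawn one process per worker and give each its own rank. A sketch based only on the config fields used above (it assumes train.main() initialises the process group from config["rank"] and config["n_workers"]):

import torch.multiprocessing as mp
import train

def worker(rank, n_workers):
    train.config["num_epochs"] = 200
    train.config["n_workers"] = n_workers
    train.config["rank"] = rank       # each spawned process gets a distinct rank
    train.main()

if __name__ == "__main__":
    n_workers = 8
    # mp.spawn calls worker(i, n_workers) for i in 0..n_workers-1
    mp.spawn(worker, args=(n_workers,), nprocs=n_workers)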

Why is svd_lowrank faster than power iteration on CPU?

from powersgd import PowerSGD, Config, optimizer_step
import torch
import time
# params = torch.rand([64,32,112,112])
params = torch.rand([64,32,112,112])

# print(params)

powersgd = PowerSGD(params, config=Config(
    rank=1,  # lower rank => more aggressive compression
    min_compression_rate=1,  # don't compress gradients with less compression
    num_iters_per_step=2,  #   # lower number => more aggressive compression
    start_compressing_after_num_steps=0,
))
start = time.time()
powersgd._powersgd.aggregate(params)
end = time.time() -start
print(end)
start = time.time()
params = params.view(64,32,-1)
U,S,V = torch.svd_lowrank(params, q = 1)
end = time.time() -start
print(end)

You can see my code above.
Your implementation produces P and Q matrices, while svd_lowrank produces U, S, V matrices; U has the same shape as P and V has the same shape as Q.
The result is

0.16216802597045898
0.046942949295043945
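
One caveat with the comparison above: timing a single call on CPU is noisy, and the first call often pays one-off costs. A small helper that warms up and averages over repeats (e.g. time_fn(lambda: torch.svd_lowrank(params, q=1))) gives more stable numbers:

import time

def time_fn(fn, warmup=3, repeats=10):
    """Average wall-clock time of fn() after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.time()
    for _ in range(repeats):
        fn()
    return (time.time() - start) / repeats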

Missing language_modeling.py?

It seems that there should be a “language_modeling.py” file in the “tasks” folder. But I can only see the “cifar.py”. Is it a missing file? Thanks.

Any result on larger-scale dataset?

Hi, nice work on gradient compression. I have tried the Rank-4 reducer on ImageNet, and the top-1 accuracy drops by about 3.5% (76% -> 72.48%). The hyper-parameters are almost the same as in the ImageNet example, except that the learning rate is increased proportionally to the batch size (8K). Is there any workaround? From my trial, it seems the generalization degrades a lot in a large-scale, large-batch setting (which is exactly the case where communication becomes the bottleneck). Looking forward to hearing your suggestions.

BTW, in the RankKReducer, why is only the Q matrix divided by the world size? Is it a typo, or is it unnecessary because we orthogonalize the P matrix?

Timed out initializing process group in store based barrier

Hi,
when I try to start training using

python3 -m torch.distributed.launch --nnodes=1 --nproc_per_node 2 --master_port=25621 /rscratch/zhendong/mfguo/powersgd/paper-code/resnet50_train.py

I would always get a timed out error

Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:02:00)

The full log is

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
File already exists: output.tmp/dist_init
Distributed init: rank 0/2 - output.tmp/dist_init
File already exists: output.tmp/dist_init
Distributed init: rank 1/2 - output.tmp/dist_init




Traceback (most recent call last):
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/resnet50_train.py", line 24, in <module>
Traceback (most recent call last):
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/resnet50_train.py", line 24, in <module>
    train.main()
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/train_pytorch.py", line 118, in main
    train.main()
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/train_pytorch.py", line 118, in main
    process_group = torch.distributed.init_process_group(
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    process_group = torch.distributed.init_process_group(
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 255, in _store_based_barrier
    _store_based_barrier(rank, store, timeout)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 255, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:02:00)
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:02:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3500208) of binary: /rscratch/zhendong/llmenv/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

But if I change the init method from file to env, this error disappears; instead, I get a timeout error when trying to run all-reduce.

    process_group = torch.distributed.init_process_group(
        backend=config["distributed_backend"],
        init_method="file://" + os.path.abspath(config["distributed_init_file"]),
        # init_method= 'env://',
        timeout=datetime.timedelta(seconds=120),
        world_size=config['n_workers'],
        rank=config["rank"],
    )

Here is the new time out error:

File already exists: output.tmp/dist_init
Distributed init: rank 1/2 - output.tmp/dist_init
File already exists: output.tmp/dist_init
Distributed init: rank 0/2 - output.tmp/dist_init
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
timer                          - epoch:  1.000, value:  0.045 (event:batch.forward)
timer                          - epoch:  1.000, value:  0.143 (event:batch.backward)
timer                          - epoch:  1.000, value:  0.001 (event:batch.evaluate)
timer                          - epoch:  0.005, value:  0.207 (event:batch)
timer                          - epoch:  1.000, value:  0.045 (event:batch.forward)
timer                          - epoch:  1.000, value:  0.397 (event:batch.backward)
timer                          - epoch:  1.000, value:  0.001 (event:batch.evaluate)
timer                          - epoch:  1.000, value:  0.001 (event:batch.evaluate)
timer                          - epoch:  1.000, value:  0.145 (event:batch.backward)
timer                          - epoch:  1.000, value:  0.001 (event:batch.evaluate)
timer                          - epoch:  0.021, value:  0.210 (event:batch)
timer                          - epoch:  1.000, value:  0.138 (event:batch.backward)
timer                          - epoch:  1.000, value:  0.139 (event:batch.backward)
timer                          - epoch:  1.000, value:  0.044 (event:batch.forward)
timer                          - epoch:  1.000, value:  0.152 (event:batch.backward)
timer                          - epoch:  1.000, value:  0.001 (event:batch.evaluate)
timer                          - epoch:  1.000, value:  0.001 (event:batch.evaluate)
timer                          - epoch:  1.000, value:  0.045 (event:batch.forward)
timer                          - epoch:  1.000, value:  0.001 (event:batch.evaluate)
timer                          - epoch:  1.000, value:  0.045 (event:batch.forward)
timer                          - epoch:  1.000, value:  0.045 (event:batch.forward)
timer                          - epoch:  0.221, value:  0.199 (event:batch)
timer                          - epoch:  1.000, value:  0.140 (event:batch.backward)
timer                          - epoch:  1.000, value:  0.045 (event:batch.forward)
timer                          - epoch:  1.000, value:  0.044 (event:batch.forward)
Traceback (most recent call last):
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/resnet50_train.py", line 24, in <module>
    train.main()
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/train_pytorch.py", line 279, in main
    epoch_metrics.reduce()
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/mean_accumulator.py", line 27, in reduce
    self.average[key].reduce()
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/mean_accumulator.py", line 34, in reduce
    handle_tc = torch.distributed.all_reduce(total_count, async_op=True)
  File "/rscratch/zhendong/mfguo/powersgd/paper-code/train_pytorch.py", line 233, in all_reduce_with_logging
    ret = all_reduce_orig(*args, **kwargs)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1320, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f06053b8612 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x5f (0x7f06053b4d7f in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x11f (0x7f0639a0507f in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x21 (0x7f0639a06001 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x5b (0x7f0639a0608b in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f06399d7702 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f06399d7702 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f06399d7702 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xb1 (0x7f06067f8421 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x204 (0x7f06067fc8b4 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xf2fe35 (0x7f06067ffe35 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #11: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0xf (0x7f060680111f in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x2ac (0x7f0606806f3c in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #13: <unknown function> + 0x8a06db (0x7f064f1a76db in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x21e8d5 (0x7f064eb258d5 in /rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #15: PyCFunction_Call + 0x59 (0x5f6939 in /rscratch/zhendong/llmenv/bin/python3)
frame #16: _PyObject_MakeTpCall + 0x296 (0x5f7506 in /rscratch/zhendong/llmenv/bin/python3)
frame #17: /rscratch/zhendong/llmenv/bin/python3() [0x50b8d3]
frame #18: _PyEval_EvalFrameDefault + 0x5796 (0x570556 in /rscratch/zhendong/llmenv/bin/python3)
frame #19: _PyEval_EvalCodeWithName + 0x26a (0x5697da in /rscratch/zhendong/llmenv/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x393 (0x5f6ec3 in /rscratch/zhendong/llmenv/bin/python3)
frame #21: PyObject_Call + 0x62 (0x5f60b2 in /rscratch/zhendong/llmenv/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x1f3c (0x56ccfc in /rscratch/zhendong/llmenv/bin/python3)
frame #23: _PyEval_EvalCodeWithName + 0x26a (0x5697da in /rscratch/zhendong/llmenv/bin/python3)
frame #24: _PyFunction_Vectorcall + 0x393 (0x5f6ec3 in /rscratch/zhendong/llmenv/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x1910 (0x56c6d0 in /rscratch/zhendong/llmenv/bin/python3)
frame #26: _PyFunction_Vectorcall + 0x1b6 (0x5f6ce6 in /rscratch/zhendong/llmenv/bin/python3)
frame #27: _PyEval_EvalFrameDefault + 0x859 (0x56b619 in /rscratch/zhendong/llmenv/bin/python3)
frame #28: _PyFunction_Vectorcall + 0x1b6 (0x5f6ce6 in /rscratch/zhendong/llmenv/bin/python3)
frame #29: _PyEval_EvalFrameDefault + 0x859 (0x56b619 in /rscratch/zhendong/llmenv/bin/python3)
frame #30: _PyEval_EvalCodeWithName + 0x26a (0x5697da in /rscratch/zhendong/llmenv/bin/python3)
frame #31: _PyFunction_Vectorcall + 0x393 (0x5f6ec3 in /rscratch/zhendong/llmenv/bin/python3)
frame #32: _PyEval_EvalFrameDefault + 0x5796 (0x570556 in /rscratch/zhendong/llmenv/bin/python3)
frame #33: _PyEval_EvalCodeWithName + 0x26a (0x5697da in /rscratch/zhendong/llmenv/bin/python3)
frame #34: PyEval_EvalCode + 0x27 (0x68e547 in /rscratch/zhendong/llmenv/bin/python3)
frame #35: /rscratch/zhendong/llmenv/bin/python3() [0x67dbf1]
frame #36: /rscratch/zhendong/llmenv/bin/python3() [0x67dc6f]
frame #37: /rscratch/zhendong/llmenv/bin/python3() [0x67dd11]
frame #38: PyRun_SimpleFileExFlags + 0x197 (0x67fe37 in /rscratch/zhendong/llmenv/bin/python3)
frame #39: Py_RunMain + 0x212 (0x6b7c82 in /rscratch/zhendong/llmenv/bin/python3)
frame #40: Py_BytesMain + 0x2d (0x6b800d in /rscratch/zhendong/llmenv/bin/python3)
frame #41: __libc_start_main + 0xf3 (0x7f065a2fd083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x2e (0x5fb85e in /rscratch/zhendong/llmenv/bin/python3)

[E ProcessGroupNCCL.cpp:737] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6201, OpType=ALLREDUCE, Timeout(ms)=120000) ran for 123097 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6201, OpType=ALLREDUCE, Timeout(ms)=120000) ran for 123097 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2863885) of binary: /rscratch/zhendong/llmenv/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/rscratch/zhendong/llmenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

I only changed dist._GradBucket to dist.GradBucket and torch.futures.Future to torch.futures.Future[torch.Tensor] to make the codebase compatible with a newer version of PyTorch. I'd really appreciate any help!
