Possible NCCL-level deadlock during checkpointing about speechbrain HOT 7 CLOSED

kokamido commented on June 23, 2024

Possible NCCL-level deadlock during checkpointing

from speechbrain.

Comments (7)

Adel-Moumen commented on June 23, 2024

Could you please try with the SpeechBrain version available in the develop branch and get back to me with the results? We fixed several issues with DDP in this new version.

You can install it with the following command:

pip install git+https://github.com/speechbrain/speechbrain.git@develop

pplantinga commented on June 23, 2024

Hi, thanks for your very detailed investigation of this issue. It makes the problem much easier to debug and fix on our side. Here are my responses to the three issues, in order:

1. Yes, this was an issue, and we have fixed it.
2. This approach should be unnecessary. It should "just work", because the default saving function is marked with @main_process_only (see this line). However, I have opened PR #2404 based on this feedback to enable this approach to work, though you would have to use a @main_process_only function rather than if_main_process.
3. I don't think this is the right place to insert the print statement. Instead, try putting it inside the default saving function (same line as above). The issue should no longer occur; if it does, please let us know.
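
To illustrate the decorator pattern mentioned in the second point, here is a minimal sketch. It is not SpeechBrain's implementation: `is_main_process`, `main_only`, and `save_checkpoint` are hypothetical stand-ins, and the sketch assumes the launcher exports the process rank in the `RANK` environment variable. It also omits any synchronization between ranks, which a real DDP setup would likely need.

```python
import functools
import os


def is_main_process() -> bool:
    """Hypothetical helper: rank 0 (or a non-distributed run) counts as main."""
    return int(os.environ.get("RANK", "0")) == 0


def main_only(function):
    """Hypothetical decorator: run `function` on the main process only."""

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        if is_main_process():
            return function(*args, **kwargs)
        return None  # every other rank skips the call entirely

    return wrapper


@main_only
def save_checkpoint(path: str, written: list):
    # Stand-in for a real save: record the path instead of touching the disk.
    written.append(path)
    return path
```

With a decorator like this, the training loop can call `save_checkpoint` unconditionally on every rank, and the rank check lives in one place instead of being repeated at each call site.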

Adel-Moumen commented on June 23, 2024

Hello @kokamido, thanks for opening this issue! Could you please let us know whether your SpeechBrain version is from the main branch or the develop branch? How did you install SpeechBrain: through pip install speechbrain or git clone? Thanks.

I'm pinging again @pplantinga as this is a very important issue.

kokamido commented on June 23, 2024

I installed speechbrain==0.5.16 via pip.
To add the print described in the "Multiple writings of the same checkpoint" section, I modified the file /usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py in the pip-installed speechbrain package.
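
For anyone who wants to reproduce this, the path of the installed file can be looked up rather than guessed. The following is a small sketch using only the standard library; `module_file` is a hypothetical helper name. Note that resolving a dotted name imports its parent package as a side effect.

```python
import importlib.util


def module_file(module_name: str):
    """Return the file a module would be loaded from, or None if it is missing."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        return None  # the parent package is not installed
    return spec.origin if spec is not None else None
```

Calling `module_file("speechbrain.utils.checkpoints")` in the training environment should point at the file that is actually imported, which helps when several installations coexist.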

kokamido commented on June 23, 2024

I tested the develop version of the speechbrain package, installed with pip install git+https://github.com/speechbrain/speechbrain.git@develop

1. Write intra-epoch checkpoints only

Seems fixed. With speechbrain==0.5.16 from pip, it crashes after a few epochs, but with the develop version it ran well for 100 epochs. I think this means the issue is fixed in the develop branch.

2. Write end-of-epoch checkpoints in main thread only.

No changes. Both setups (with and without TORCH_DISTRIBUTED_DEBUG=DETAIL) behave as described in the issue.

3. Write end-of-epoch checkpoints in all threads.

No changes. Both DDP workers write a checkpoint, according to the logs from print(f'{os.environ.get("LOCAL_RANK")}\t{ckpt_dir}/{name}') injected at this line.

100%|██████████| 160/160 [00:01<00:00, 153.53it/s, train_loss=0.68] 
0       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/counter
0       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/brain
1       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/counter
1       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/brain
1       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/optimizer
0       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/optimizer
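
A log like the one above can be checked mechanically. The sketch below assumes each relevant line has the form "rank, whitespace, path", as produced by the injected print; `ranks_per_path` and `duplicated_writes` are hypothetical helper names. Lines that do not start with a number, such as the progress bar, are ignored.

```python
from collections import defaultdict


def ranks_per_path(log_lines):
    """Map each checkpoint path to the sorted ranks that reported writing it."""
    writers = defaultdict(set)
    for line in log_lines:
        parts = line.split(None, 1)  # rank first, then the rest as the path
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip progress bars and any other output
        rank, path = parts
        writers[path.strip()].add(int(rank))
    return {path: sorted(ranks) for path, ranks in writers.items()}


def duplicated_writes(log_lines):
    """Keep only the paths that more than one rank reported writing."""
    return {
        path: ranks
        for path, ranks in ranks_per_path(log_lines).items()
        if len(ranks) > 1
    }
```

If `duplicated_writes` returns a non-empty mapping, more than one worker reached the instrumented line for the same file.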

kokamido commented on June 23, 2024

Thanks for the clarification. Now I understand how the checkpoints should be saved, and I have no more questions.

Adel-Moumen commented on June 23, 2024

Solved in #2404
