Comments (7)
Could you please try with the SpeechBrain version available in the develop branch and get back to me with the results? We fixed several issues with DDP in this new version.
You can install it with the following command:
pip install git+https://github.com/speechbrain/speechbrain.git@develop
from speechbrain.
Hi, thanks for your very detailed investigation of this issue; it makes it much easier to debug and fix on our side. Let me respond to the three issues in turn:
- Yes, this was an issue and we have fixed it.
- This approach should be unnecessary; it should "just work", as the default saving function is marked with @main_process_only (see this line). However, I have opened PR #2404 based on this feedback to enable this approach to work, though you'd have to use a @main_process_only function rather than if_main_process.
- I don't think this is the right place to insert the print statement. Instead, try putting it inside the default saving function (same line as above). The issue should no longer occur; if it does, please let us know.
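To illustrate the difference between guarding a call site and decorating the saving function itself, here is a minimal sketch of a main-process-only decorator. It is not SpeechBrain's implementation: `is_main_process`, `main_process_only_sketch`, and `save_checkpoint` are hypothetical names, and it assumes the launcher (for example torchrun) exports a `RANK` environment variable for each worker.

```python
import functools
import os


def is_main_process() -> bool:
    """Return True for rank 0, or when no RANK is set (single-process run).

    Assumption: the DDP launcher exports RANK for every worker.
    """
    return int(os.environ.get("RANK", "0")) == 0


def main_process_only_sketch(func):
    """Hypothetical decorator: run `func` on the main process, skip it elsewhere."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if is_main_process():
            return func(*args, **kwargs)
        return None  # non-main workers skip the call entirely

    return wrapper


@main_process_only_sketch
def save_checkpoint(path, written):
    # Record the write in a list instead of touching the filesystem,
    # so the sketch stays side-effect free.
    written.append(path)
    return path
```

With the guard attached to the function, every caller gets the same behaviour without repeating the rank check. A real implementation would probably also need to synchronise the workers around the write, which this sketch leaves out.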
Hello @kokamido, thanks for opening this issue! Could you please let us know whether your SpeechBrain version is from the main branch or the develop branch? How did you install SpeechBrain: through pip install speechbrain or git clone? Thanks.
I'm pinging again @pplantinga as this is a very important issue.
I installed speechbrain==0.5.16 via pip.
In order to add the "print" described in the "Multiple writings of the same checkpoint" section, I modified the /usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py file of the speechbrain package installed via pip.
I tested the develop version of the speechbrain package, installed with pip install git+https://github.com/speechbrain/speechbrain.git@develop
1. Write intra-epoch checkpoints only
Seems fixed. With speechbrain==0.5.16 from pip it takes a few epochs to crash, but with the develop version it ran well for 100 epochs. I think this means the issue is fixed in the develop branch.
2. Write end-of-epoch checkpoints in main thread only.
No changes. Both setups (with and without TORCH_DISTRIBUTED_DEBUG=DETAIL) behave as described in the issue.
3. Write end-of-epoch checkpoints in all threads.
No changes. Both DDP workers write a checkpoint, according to the logs from print(f'{os.environ.get("LOCAL_RANK")}\t{ckpt_dir}/{name}') injected at this line.
100%|██████████| 160/160 [00:01<00:00, 153.53it/s, train_loss=0.68]
0 experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/counter
0 experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/brain
1 experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/counter
1 experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/brain
1 experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/optimizer
0 experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/optimizer
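Logs like the ones above can be checked mechanically for duplicate writes. The following is a small sketch, assuming each relevant line has the form `<rank><whitespace><path>` as produced by the injected print; the helper names `ranks_per_path` and `duplicated_writes` are made up for this illustration.

```python
from collections import defaultdict


def ranks_per_path(log_lines):
    """Map each checkpoint path to the set of ranks that reported writing it."""
    writers = defaultdict(set)
    for line in log_lines:
        parts = line.split(None, 1)  # expected shape: "<rank><whitespace><path>"
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip progress bars and any other output
        rank, path = parts
        writers[path.strip()].add(int(rank))
    return dict(writers)


def duplicated_writes(log_lines):
    """Return the paths that more than one rank reported writing, sorted."""
    return sorted(
        path for path, ranks in ranks_per_path(log_lines).items() if len(ranks) > 1
    )
```

If only the main process saved checkpoints, `duplicated_writes` would return an empty list for such a log.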
Thanks for the clarification. Now I understand how the checkpoints should be saved, and I have no more questions.
Solved in #2404