Comments (5)
Good! Thanks for getting back to us @kokamido. We solved some DDP / checkpointing issues in the develop branch, and we are planning to merge it into the main branch very soon. Since this issue is solved, I will close it. Feel free to reopen it if you need more in-depth help.
Thanks again for opening the issue! :)
from speechbrain.
Hey @kokamido, thanks for letting us know! Could you please show us your save directory?
Ping @pplantinga I think this issue is for you ;)
I ran the repro with a clean save directory. After the crash it looks like this:
root@sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5:~/speechbraindebugexample# ls experiments/ddp_crash_repro/save/
CKPT+2024-02-08+12-30-07+01 CKPT+2024-02-08+12-30-08+01 CKPT+2024-02-08+12-30-09+01 CKPT+2024-02-08+12-30-10+01 CKPT+2024-02-08+12-30-11+00
CKPT+2024-02-08+12-30-07+02 CKPT+2024-02-08+12-30-08+02 CKPT+2024-02-08+12-30-09+02 CKPT+2024-02-08+12-30-10+02
And the error message for this run is:
Root Cause (first observed failure):
[0]:
  time      : 2024-02-08_12:30:11
  host      : sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 143902)
  error_file: /tmp/torchelastic_9faem1ym/none_hfidn2p7/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "/root/speechbraindebugexample/repro.py", line 48, in fit
      super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1366, in fit
      self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1212, in _fit_train
      self._save_intra_epoch_ckpt()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1386, in _save_intra_epoch_ckpt
      self.checkpointer.save_and_keep_only(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 685, in save_and_keep_only
      self.delete_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 988, in delete_checkpoints
      self.find_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 825, in find_checkpoints
      ckpts = self.list_checkpoints()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 914, in list_checkpoints
      return self._construct_checkpoint_objects(self._list_checkpoint_dirs())
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 1061, in _construct_checkpoint_objects
      with open(ckpt_dir / METAFNAME) as fi:
  FileNotFoundError: [Errno 2] No such file or directory: 'experiments/ddp_crash_repro/save/CKPT+2024-02-08+12-30-10+00/CKPT.yaml'
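For anyone reading the traceback: the process on rank 1 tried to open `CKPT.yaml` inside `CKPT+2024-02-08+12-30-10+00`, a directory that no longer appears in the `ls` output above. This looks like several processes touching the save directory at the same time, though that is an inference from the log. Below is a minimal sketch of a listing that tolerates a checkpoint directory vanishing or being incomplete. It is not SpeechBrain's code and not the fix that was merged. `list_complete_checkpoints` is a hypothetical helper; only the `CKPT.yaml` file name and the `CKPT+` prefix are taken from the log.

```python
from pathlib import Path

METAFNAME = "CKPT.yaml"  # metadata file name as it appears in the traceback


def list_complete_checkpoints(save_dir):
    """Return checkpoint directories whose metadata file can be opened.

    Hypothetical helper: directories that lack CKPT.yaml, or that disappear
    between listing and opening (for example because another process is
    still writing or already deleting them), are skipped instead of
    raising FileNotFoundError.
    """
    complete = []
    for ckpt_dir in sorted(Path(save_dir).glob("CKPT+*")):
        try:
            with open(ckpt_dir / METAFNAME) as fi:
                fi.read()
        except FileNotFoundError:
            continue  # incomplete or concurrently removed checkpoint
        complete.append(ckpt_dir)
    return complete
```

Skipping unreadable entries only hides the symptom. Restricting checkpoint writes and deletions to a single process is the more robust approach, but how that is done depends on the library version.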
Hey,
could you please fetch the latest SpeechBrain version available through git clone and let us know if the issue is still there? Thanks.
Best,
Adel
It seems to be fixed as of b8a3ee3.