
Comments (13)

MXGray commented on August 15, 2024

@rafaelvalle
Can you share your pretrained Tacotron2 model and hparams that generated the sample audio?


rafaelvalle commented on August 15, 2024

The pre-trained model has been made available on our README page.


rafaelvalle commented on August 15, 2024

Checkpoint files are saved to the output_directory specified when running train.py. The example command in our repo saves to outdir.
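
For reference, a typical invocation (matching the flags used in the resume command later in this thread) looks like the line below; checkpoints such as outdir/checkpoint_32500 then accumulate in that directory as training progresses:

python train.py --output_directory=outdir --log_directory=logdir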


OptimusPrimeCao commented on August 15, 2024

@rafaelvalle
When I run train.py with the default hparams on an 8GB 1080 GPU, I get the error below. I changed the batch size to 24, but the error still occurs.

src/tcmalloc.cc:278] Attempt to free invalid pointer 0x100000009
Traceback (most recent call last):
  File "train.py", line 291, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 209, in train
    for i, batch in enumerate(train_loader):
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 330, in __next__
    idx, batch = self._get_batch()
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 309, in _get_batch
    return self.data_queue.get()
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/multiprocessing/queues.py", line 335, in get
    res = self._reader.recv_bytes()
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 227, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 125263) is killed by signal: Aborted.
Segmentation fault (core dumped)


rafaelvalle commented on August 15, 2024

This is probably related to your CPU not being able to keep up with loading the data.
Try reducing the number of data loader workers or using a smaller batch size.
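
If your copy of train.py accepts comma-separated --hparams overrides (the NVIDIA repo's train.py exposes such a flag), the batch size can be lowered without editing any code, for example:

python train.py --output_directory=outdir --log_directory=logdir --hparams=batch_size=16

The worker count is a DataLoader argument (num_workers) set where the training loader is constructed in train.py; lowering it there trades loading throughput for less host memory pressure.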


rafaelvalle commented on August 15, 2024

@OptimusPrimeCao The model uses approximately 300MB per sample. Try reducing your batch size to 16.
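
Rough arithmetic behind that suggestion: 24 samples × ~300MB ≈ 7.2GB of activations alone, which leaves essentially no headroom on an 8GB card once model weights, optimizer state, and the CUDA context are added, whereas 16 × ~300MB ≈ 4.8GB does.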


rafaelvalle commented on August 15, 2024

@MXGray We can share the hparams.
Please do not post the same message on multiple issues. I deleted your message from the Audio Examples issue.


gsoul commented on August 15, 2024

Please share the hparams.


vijaysumaravi commented on August 15, 2024

Is there a way I can continue training my model from a particular point?

To be specific, my training crashed at checkpoint_32000 because of a memory issue, which I have since fixed. Can I now resume training from that point, or do I have to start again from scratch? If resuming is possible, how do I do it?

I wasn't sure whether this warranted a new issue, hence posting my comment here.

Any help is appreciated. Thanks!


vijaysumaravi commented on August 15, 2024

Never mind, figured it out.

python train.py --output_directory=outdir --log_directory=logdir --checkpoint_path='outdir/checkpoint_32500'
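
For context, resuming amounts to reloading the saved state before continuing the training loop. A minimal, generic PyTorch sketch of such a checkpoint load, assuming the checkpoint dictionary stores keys named 'state_dict', 'optimizer', 'learning_rate', and 'iteration' (not a verbatim copy of the repo's code):

import torch

def load_checkpoint(checkpoint_path, model, optimizer):
    # Restore model weights, optimizer state, and the bookkeeping needed to resume.
    # The dictionary keys below are assumed to match how the checkpoint was saved.
    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    return model, optimizer, checkpoint['learning_rate'], checkpoint['iteration']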


beknazar commented on August 15, 2024

> Never mind, figured it out.
>
> python train.py --output_directory=outdir --log_directory=logdir --checkpoint_path='outdir/checkpoint_32500'

@vijaysumaravi Mine also stopped after 32.5 epochs. Did you figure out the reason?


vijaysumaravi commented on August 15, 2024

Reducing my batch size helped. I was training it on a single GPU.


ErfolgreichCharismatisch commented on August 15, 2024

Tutorial: Training on GPU with Colab, Inference with CPU on Server here.

