I finally got all the errors resolved, but then this new one came up:

CUDA Runtime Error: Out of Memory about tacotron2 HOT 19 CLOSED

nvidia commented on July 16, 2024 1

CUDA Runtime Error: Out of Memory

from tacotron2.

Comments (19)

rafaelvalle commented on July 16, 2024 2

Try decreasing the batch size. A rough guide is 3 samples per GB of GPU memory.
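
The rule of thumb above can be written as a small helper. This is only a rough sketch: the function name `suggest_batch_size` is hypothetical, and the 3-samples-per-GB default comes from the comment, not from the tacotron2 code. The right value also depends on sequence lengths and model settings.

```python
def suggest_batch_size(gpu_memory_gb, samples_per_gb=3):
    """Estimate a starting batch size from total GPU memory in GB.

    samples_per_gb is the rule-of-thumb value quoted in the thread;
    it is an assumption, not a guarantee.
    """
    return max(1, int(gpu_memory_gb * samples_per_gb))


print(suggest_batch_size(4))   # GTX 1050 Ti with 4 GB -> 12
print(suggest_batch_size(11))  # GTX 1080 Ti with 11 GB -> 33
```

Treat the result as an upper bound to start from. Later in this thread, a 4 GB card still ran out of memory at 12 and only worked at 8.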

adjouama commented on July 16, 2024 1

I fixed this by reducing the batch size to 32 in hparams:
batch_size=32

I use an Nvidia GTX 1080 Ti with 11 GiB of memory.

MrBreadWater commented on July 16, 2024

I did, all the way down to 16, but that didn't change anything. I can go lower, but that seems suspiciously low.

On Fri, Jun 8, 2018 at 7:58 PM Rafael Valle wrote: "Try decreasing batch size."

rafaelvalle commented on July 16, 2024

I don't remember exactly, but I think the GTX 1050 Ti has 4 GB of memory, so you should use a batch of 12 samples. Please try batch size 12 and let us know.

tatamyans commented on July 16, 2024

Possibly the same problem: also a GTX 1050 Ti with 4 GB, and no luck with batch size 12.
https://pastebin.com/di1j2jKQ

rafaelvalle commented on July 16, 2024

Try batch size 8.

tatamyans commented on July 16, 2024

The out-of-memory error is gone, but Python crashes after saving the model:
Unhandled exception at 0x0000000076FAA0F2 (ntdll.dll) in python.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000123FF8).

Same thing with Python 3.5 and 3.6.
Maybe unrelated.

rafaelvalle commented on July 16, 2024

@tatamyans can you provide a full error trace?

tatamyans commented on July 16, 2024

Sorry, I can't provide the full trace now; maybe later. Thanks.

gsoul commented on July 16, 2024

/home/soul/projects/nv-tacotron/tacotron2/layers.py:35: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
/home/soul/projects/nv-tacotron/tacotron2/layers.py:15: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  gain=torch.nn.init.calculate_gain(w_init_gain))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-82530a2f6baf> in <module>()
      1 checkpoint_path = "/home/soul/projects/nv-tacotron/tacotron2/outdir/checkpoint_80000"
----> 2 model = load_model(hparams)
      3 try:
      4     model = model.module
      5 except:

/home/soul/projects/nv-tacotron/tacotron2/train.py in load_model(hparams)
     78 
     79 def load_model(hparams):
---> 80     model = Tacotron2(hparams).cuda()
     81     if hparams.fp16_run:
     82         model = batchnorm_to_float(model.half())

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in cuda(self, device)
    247             Module: self
    248         """
--> 249         return self._apply(lambda t: t.cuda(device))
    250 
    251     def cpu(self):

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    174     def _apply(self, fn):
    175         for module in self.children():
--> 176             module._apply(fn)
    177 
    178         for param in self._parameters.values():

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    180                 # Tensors stored in modules are graph leaves, and we don't
    181                 # want to create copy nodes, so we have to unpack the data.
--> 182                 param.data = fn(param.data)
    183                 if param._grad is not None:
    184                     param._grad.data = fn(param._grad.data)

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in <lambda>(t)
    247             Module: self
    248         """
--> 249         return self._apply(lambda t: t.cuda(device))
    250 
    251     def cpu(self):

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCTensorRandom.cu:25

I get this when I try to run inference on a 1080 Ti.

Training works fine on a separate GPU with a batch size of 40; all other settings are default. The dataset is LJSpeech-1.1.

@rafaelvalle could you please advise here?

gsoul commented on July 16, 2024

Hm, perhaps I figured this out: PyTorch demanded GPU 0 for inference, and since training was running on it at the time, it gave an OOM error. I could produce some speech after I stopped training.

rafaelvalle commented on July 16, 2024

@gsoul yes, one can run into OOM if multiple sources are requesting memory from the same GPU. If you want to train and do inference at the same time, you could do inference on the CPU...

gsoul commented on July 16, 2024

No, I have 2 × 1080 Ti in my machine. I used CUDA_VISIBLE_DEVICES to put inference and training on separate GPUs, but I kept getting the error above until I stopped training on GPU 0.

rafaelvalle commented on July 16, 2024

When running inference, can you confirm that the pytorch code only has access to one of the GPUs?

gsoul commented on July 16, 2024

After thinking for some time - perhaps not. I ran:

CUDA_VISIBLE_DEVICE=1 /home/soul/anaconda3/bin/ipython notebook --no-browser --port=8889

But that command only limits the ipython process, not the Python process that communicates with the GPUs...
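
One thing worth checking here: the variable the CUDA runtime reads is `CUDA_VISIBLE_DEVICES` (plural), and a differently spelled name is simply ignored. To rule out how the notebook process inherits its environment, the variable can also be set from inside the Python process, as long as this happens before CUDA is initialized. A minimal sketch, where the helper name `pin_visible_gpus` is hypothetical:

```python
import os


def pin_visible_gpus(device_ids):
    """Restrict this process to the given physical GPU indices.

    Must run before CUDA is initialized, so in a notebook call it in
    the first cell, before importing torch.
    """
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


pin_visible_gpus([1])  # only physical GPU 1 is visible to this process
# import torch  # import afterwards; the visible GPU then appears as device 0
```

With this in place, PyTorch inside the notebook sees a single device, so a call such as `.cuda()` cannot land on the GPU that is busy with training.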

imirzadeh commented on July 16, 2024

I had the same problem and I reduced the batch size to make it work

rafaelvalle commented on July 16, 2024

Closing due to inactivity.

ErfolgreichCharismatisch commented on July 16, 2024

Tutorial: Training on GPU with Colab, Inference with CPU on Server here.

one1ine commented on July 16, 2024

Hello everyone, I am having the same GPU memory issue.

Using an Nvidia A40 with 46 GB of memory
Using a batch size of 8!
Using a custom dataset of 40 hours at a sampling rate of 22050 Hz, which is about 6 GB of data.
Training initially runs fine, but after about 3 epochs the memory error pops up and stops the training.

Since I'm already using a batch size of 8, I don't think lowering it further would be beneficial.
That said, I'm thinking of clearing the cache (torch.cuda.empty_cache()) at the end of every epoch, because I think memory is accumulating and filling up the cache: the memory error consistently appears after 3 epochs.

Will let you know if it works out.

UPDATE (23/04/14):
So I'm running the training on a server using Slurm, and after doing the above the Slurm job automatically gets killed. So clearing the cache after each epoch doesn't seem to work...
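
One way to test the accumulation hypothesis above is to log GPU memory at the end of each epoch. A minimal sketch, assuming PyTorch is installed; the function name `end_of_epoch_report` is hypothetical and not part of tacotron2. `torch.cuda.empty_cache()` only releases cached blocks that no tensor is using, so if the allocated figure keeps growing from epoch to epoch, something is still holding references to tensors and clearing the cache will not help.

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None


def end_of_epoch_report(epoch):
    """Release unused cached blocks, then report memory still in use.

    Returns None when PyTorch or a CUDA device is not available.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    stats = {
        "epoch": epoch,
        # Memory occupied by live tensors; empty_cache() cannot free this.
        "allocated_mib": torch.cuda.memory_allocated() / 2**20,
        # Memory held by the caching allocator (PyTorch 1.4 or newer).
        "reserved_mib": torch.cuda.memory_reserved() / 2**20,
    }
    print(stats)
    return stats
```

Calling `end_of_epoch_report(epoch)` at the end of every epoch gives a simple trend line to compare against the point where the job fails.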
