
Comments (19)

rafaelvalle avatar rafaelvalle commented on July 16, 2024 2

Try decreasing the batch size. As a rough guide, use approximately 3 samples per GB of GPU memory.

from tacotron2.

adjouama avatar adjouama commented on July 16, 2024 1

I fixed this by reducing the batch size to 32 in hparams:
batch_size=32

I use an Nvidia GTX 1080 Ti with 11 GiB of memory.

MrBreadWater avatar MrBreadWater commented on July 16, 2024

rafaelvalle avatar rafaelvalle commented on July 16, 2024

I don't remember exactly, but I think the GTX 1050 Ti has 4 GB of memory, so you should use a batch size of 12. Please try batch size 12 and let us know.
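The rule of thumb from this thread (about 3 samples per GB of GPU memory) can be written down as a small helper. This is only a sketch: the function name `suggested_batch_size` and the `samples_per_gb` parameter are made up for illustration and are not part of the tacotron2 code.

```python
def suggested_batch_size(gpu_memory_gb, samples_per_gb=3):
    """Rule-of-thumb starting batch size: about 3 samples per GB of GPU memory."""
    # Never suggest fewer than one sample, even for very small memory sizes.
    return max(1, int(gpu_memory_gb * samples_per_gb))


print(suggested_batch_size(4))   # 12
print(suggested_batch_size(11))  # 33
```

Treat the result as a starting point rather than a guarantee; if training still runs out of memory, lower the value further.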

tatamyans avatar tatamyans commented on July 16, 2024

Possibly the same problem: also a GTX 1050 Ti with 4 GB, and no luck with batch size 12.
https://pastebin.com/di1j2jKQ

rafaelvalle avatar rafaelvalle commented on July 16, 2024

Try batch size 8.

tatamyans avatar tatamyans commented on July 16, 2024

The out-of-memory error is gone, but Python crashes after saving the model:
Unhandled exception at 0x0000000076FAA0F2 (ntdll.dll) in python.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000123FF8).

The same thing happens with Python 3.5 and 3.6.
Maybe it is unrelated.

rafaelvalle avatar rafaelvalle commented on July 16, 2024

@tatamyans can you provide a full error trace?

tatamyans avatar tatamyans commented on July 16, 2024

Sorry, I can't provide the full trace right now; maybe later. Thanks.

gsoul avatar gsoul commented on July 16, 2024
/home/soul/projects/nv-tacotron/tacotron2/layers.py:35: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
/home/soul/projects/nv-tacotron/tacotron2/layers.py:15: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  gain=torch.nn.init.calculate_gain(w_init_gain))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-82530a2f6baf> in <module>()
      1 checkpoint_path = "/home/soul/projects/nv-tacotron/tacotron2/outdir/checkpoint_80000"
----> 2 model = load_model(hparams)
      3 try:
      4     model = model.module
      5 except:

/home/soul/projects/nv-tacotron/tacotron2/train.py in load_model(hparams)
     78 
     79 def load_model(hparams):
---> 80     model = Tacotron2(hparams).cuda()
     81     if hparams.fp16_run:
     82         model = batchnorm_to_float(model.half())

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in cuda(self, device)
    247             Module: self
    248         """
--> 249         return self._apply(lambda t: t.cuda(device))
    250 
    251     def cpu(self):

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    174     def _apply(self, fn):
    175         for module in self.children():
--> 176             module._apply(fn)
    177 
    178         for param in self._parameters.values():

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    180                 # Tensors stored in modules are graph leaves, and we don't
    181                 # want to create copy nodes, so we have to unpack the data.
--> 182                 param.data = fn(param.data)
    183                 if param._grad is not None:
    184                     param._grad.data = fn(param._grad.data)

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in <lambda>(t)
    247             Module: self
    248         """
--> 249         return self._apply(lambda t: t.cuda(device))
    250 
    251     def cpu(self):

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCTensorRandom.cu:25

I get this when I try to run inference on a 1080 Ti.

Training works fine on a separate GPU with a batch size of 40; all other settings are default. The dataset is LJSpeech-1.1.

@rafaelvalle could you please advise here?

gsoul avatar gsoul commented on July 16, 2024

Hm, perhaps I figured this out: PyTorch demanded GPU 0 for inference, and since training was running on it at the time, it gave the OOM error. I could produce some speech after I stopped training.

rafaelvalle avatar rafaelvalle commented on July 16, 2024

@gsoul yes, one can run into OOM if multiple sources are requesting memory from the same GPU. If you want to train and do inference at the same time, you could do inference on the CPU...
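As a rough sketch of the CPU route, a checkpoint can be loaded with its tensors remapped to CPU memory using the `map_location` argument of `torch.load`. The helper name `load_checkpoint_on_cpu` is hypothetical, and the `"state_dict"` key in the comment is an assumption about how the checkpoint was saved. Note that `load_model` in the traceback above calls `.cuda()`, so the model itself would also have to be built without that call.

```python
def load_checkpoint_on_cpu(checkpoint_path):
    """Load a checkpoint with every tensor remapped to CPU memory."""
    import torch  # imported lazily so the helper can be defined without PyTorch

    # map_location="cpu" stops torch.load from restoring tensors onto the
    # GPU they were saved from, so no GPU memory is requested here.
    return torch.load(checkpoint_path, map_location="cpu")


# Hypothetical usage, assuming the checkpoint is a dict with a "state_dict" key:
#   state = load_checkpoint_on_cpu("outdir/checkpoint_80000")["state_dict"]
#   model.load_state_dict(state)
```

Inference on the CPU will be slower, but it does not compete with a training job for GPU memory.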

gsoul avatar gsoul commented on July 16, 2024

No, I have two 1080 Ti cards in my machine. I used CUDA_VISIBLE_DEVICES to put inference and training on separate GPUs, but I kept getting the error above until I stopped training on GPU 0.

rafaelvalle avatar rafaelvalle commented on July 16, 2024

When running inference, can you confirm that the PyTorch code only has access to one of the GPUs?

gsoul avatar gsoul commented on July 16, 2024

After thinking about it for a while: perhaps not. I ran:

CUDA_VISIBLE_DEVICE=1 /home/soul/anaconda3/bin/ipython notebook --no-browser --port=8889

But such a command limits only the ipython process rather than the Python process that communicates with the GPUs...
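The variable that CUDA reads is `CUDA_VISIBLE_DEVICES` (plural), as spelled in the earlier comment. The singular spelling in the command above is not recognized, so it may have left both GPUs visible to the notebook. One way to rule out any doubt about which process gets the setting is to set the variable inside the notebook itself, before PyTorch touches CUDA. The helper name `restrict_to_gpu` below is made up for illustration; this is a sketch, not something taken from the tacotron2 code.

```python
import os


def restrict_to_gpu(index):
    """Expose only one physical GPU to CUDA libraries used later in this process."""
    # This must run before CUDA is initialised, so call it at the very top of
    # the notebook or script, ideally before `import torch`.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


restrict_to_gpu(1)
```

Once the variable is set, the selected card appears inside the process as device 0, so calls such as `.cuda()` land on it without further changes.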

imirzadeh avatar imirzadeh commented on July 16, 2024

I had the same problem, and reducing the batch size made it work.

rafaelvalle avatar rafaelvalle commented on July 16, 2024

Closing due to inactivity.

ErfolgreichCharismatisch avatar ErfolgreichCharismatisch commented on July 16, 2024

Tutorial: Training on GPU with Colab, Inference with CPU on Server here.

one1ine avatar one1ine commented on July 16, 2024

Hello everyone, I am having the same GPU memory issue.

  1. Using an Nvidia A40 with 46 GB of memory
  2. Using a batch size of 8!
  3. Using a custom dataset of 40 hours at a sampling rate of 22050 Hz, which is about 6 GB of data.

Training initially runs fine, but around the end of the third epoch the memory error pops up and stops the training.

Since I'm already using a batch size of 8, I don't think lowering it further would be beneficial.
That said, I'm thinking of clearing the cache (torch.cuda.empty_cache()) at the end of every epoch, because I think something is accumulating and filling up the cache: the memory error consistently appears after 3 epochs.

Will let you know if it works out.

UPDATE (23/04/14):
I'm running the training on a server using Slurm, and after making the change above, the Slurm job gets killed automatically. So clearing the cache after each epoch doesn't seem to work...
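`torch.cuda.empty_cache()` only releases cached blocks that are no longer in use; it cannot free tensors that are still referenced somewhere. One way to test the accumulation hypothesis is to log allocated and reserved GPU memory at the end of every epoch. The sketch below assumes a PyTorch version that provides `torch.cuda.memory_reserved`; the helper name `gpu_memory_mib` is hypothetical.

```python
try:
    import torch
except ImportError:  # lets the sketch be imported on machines without PyTorch
    torch = None


def gpu_memory_mib(device=0):
    """Return (allocated, reserved) GPU memory in MiB, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    # Memory held by live tensors on this device.
    allocated = torch.cuda.memory_allocated(device) / 2**20
    # Memory held by PyTorch's caching allocator, including unused cached blocks.
    reserved = torch.cuda.memory_reserved(device) / 2**20
    return allocated, reserved


# Hypothetical use at the end of each epoch:
#   print(epoch, gpu_memory_mib())
```

If the allocated figure keeps growing from epoch to epoch, something in the training loop is probably holding on to tensors, and clearing the cache would not be expected to help.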
