Comments (19)
Try decreasing the batch size. A rough guide is about 3 samples per GB of GPU memory.
from tacotron2.
I fixed this by reducing the batch size to 32 in hparams:
batch_size=32
I use an Nvidia GTX 1080 Ti with 11 GiB of memory.
from tacotron2.
I don't remember for certain, but I think the GTX 1050 Ti has 4 GB of memory, so you should use a batch of 12 samples. Please try batch size 12 and let us know.
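
The rule of thumb from earlier in the thread (about 3 samples per GB of GPU memory) can be written as a small calculation. This is only a sketch: `suggested_batch_size` and its parameters are hypothetical names, not part of tacotron2, and the result is a starting point rather than a guarantee, since actual memory use also depends on utterance length and model settings.

```python
def suggested_batch_size(gpu_memory_gb, samples_per_gb=3):
    """Starting batch size from the thread's rule of thumb (hypothetical helper).

    gpu_memory_gb: total GPU memory in GB.
    samples_per_gb: heuristic from this thread, not a measured constant.
    """
    # Never suggest less than one sample per batch.
    return max(1, int(gpu_memory_gb * samples_per_gb))


print(suggested_batch_size(4))   # GTX 1050 Ti, 4 GB   -> 12
print(suggested_batch_size(11))  # GTX 1080 Ti, 11 GB  -> 33
```

If the suggested value still runs out of memory, lower it and retry.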
from tacotron2.
Possibly the same problem here: also a GTX 1050 Ti with 4 GB, and no luck with batch size 12.
https://pastebin.com/di1j2jKQ
from tacotron2.
Try batch size 8.
from tacotron2.
The out-of-memory error is gone, but Python crashes after saving the model:
Unhandled exception at 0x0000000076FAA0F2 (ntdll.dll) in python.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000123FF8).
Same thing with Python 3.5 and 3.6.
Maybe unrelated.
from tacotron2.
@tatamyans can you provide a full error trace?
from tacotron2.
Sorry, I can't provide the full trace now, maybe later. Thanks.
from tacotron2.
/home/soul/projects/nv-tacotron/tacotron2/layers.py:35: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
/home/soul/projects/nv-tacotron/tacotron2/layers.py:15: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
gain=torch.nn.init.calculate_gain(w_init_gain))
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-4-82530a2f6baf> in <module>()
1 checkpoint_path = "/home/soul/projects/nv-tacotron/tacotron2/outdir/checkpoint_80000"
----> 2 model = load_model(hparams)
3 try:
4 model = model.module
5 except:
/home/soul/projects/nv-tacotron/tacotron2/train.py in load_model(hparams)
78
79 def load_model(hparams):
---> 80 model = Tacotron2(hparams).cuda()
81 if hparams.fp16_run:
82 model = batchnorm_to_float(model.half())
/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in cuda(self, device)
247 Module: self
248 """
--> 249 return self._apply(lambda t: t.cuda(device))
250
251 def cpu(self):
/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
174 def _apply(self, fn):
175 for module in self.children():
--> 176 module._apply(fn)
177
178 for param in self._parameters.values():
/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
180 # Tensors stored in modules are graph leaves, and we don't
181 # want to create copy nodes, so we have to unpack the data.
--> 182 param.data = fn(param.data)
183 if param._grad is not None:
184 param._grad.data = fn(param._grad.data)
/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in <lambda>(t)
247 Module: self
248 """
--> 249 return self._apply(lambda t: t.cuda(device))
250
251 def cpu(self):
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCTensorRandom.cu:25
I get this when I try to run inference on a 1080 Ti.
Training works fine on a separate GPU with a batch size of 40; all other settings are default. The dataset is LJSpeech-1.1.
@rafaelvalle could you please advise here?
from tacotron2.
Hm, perhaps I figured this out: PyTorch demanded GPU 0 for inference, and since training was running on it at the time, it gave the OOM error. I could produce some speech after I stopped training.
from tacotron2.
@gsoul yes, one can run into OOM if multiple sources are requesting memory from the same GPU. If you want to train and do inference at the same time, you could do inference on the CPU...
from tacotron2.
No, I have 2 × 1080 Ti in my machine. I used CUDA_VISIBLE_DEVICES to put inference and training on separate GPUs, but kept getting the error above until I stopped training on GPU 0.
from tacotron2.
When running inference, can you confirm that the pytorch code only has access to one of the GPUs?
from tacotron2.
After thinking about it for a while: perhaps not. I ran:
CUDA_VISIBLE_DEVICE=1 /home/soul/anaconda3/bin/ipython notebook --no-browser --port=8889
But such a command limits only the ipython process, not the Python process that communicates with the GPUs...
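
One detail worth checking: the variable CUDA reads is spelled `CUDA_VISIBLE_DEVICES` (plural), and a misspelled variable is silently ignored. Environment variables are also normally inherited by child processes, which the sketch below demonstrates with a plain Python child process. The helper names `env_with_visible_gpus` and `visible_gpus_in_child` are hypothetical and only serve this illustration; the sketch does not touch the GPU itself.

```python
import os
import subprocess
import sys


def env_with_visible_gpus(gpu_ids):
    """Copy the current environment and expose only `gpu_ids` to CUDA."""
    env = dict(os.environ)
    # Note the plural: CUDA_VISIBLE_DEVICES.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def visible_gpus_in_child(gpu_ids):
    """Start a child Python process and report the value it sees."""
    child_code = "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES', ''))"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=env_with_visible_gpus(gpu_ids),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # The child process inherits the restriction set by its parent.
    print(visible_gpus_in_child([1]))
```

Inside a notebook, an alternative is to set `os.environ["CUDA_VISIBLE_DEVICES"] = "1"` in the first cell, before `torch` is imported or CUDA is initialized.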
from tacotron2.
I had the same problem, and reducing the batch size made it work.
from tacotron2.
Closing due to inactivity.
from tacotron2.
Tutorial: Training on GPU with Colab, Inference with CPU on Server here.
from tacotron2.
Hello everyone, I am having the same GPU memory issue.
- Using an Nvidia A40 with 46 GB of memory
- Using a batch size of 8!
- Using a custom dataset of 40 hours at a sampling rate of 22050 Hz, which is about 6 GB of data.
Training initially runs fine, but after about 3 epochs the memory error pops up and stops the training.
Since I'm already using a batch size of 8, I don't think lowering it further would be beneficial.
That said, I'm thinking of clearing the cache (torch.cuda.empty_cache()) at the end of every epoch, because I suspect memory is accumulating and filling up the cache, since the memory error consistently appears after 3 epochs.
Will let you know if it works out.
UPDATE (23/04/14):
I'm running the training on a server using Slurm, and after making the change above the Slurm job gets killed automatically. So clearing the cache after each epoch doesn't seem to work...
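
For anyone debugging a similar pattern, a minimal sketch for logging GPU memory at the end of each epoch follows. It assumes PyTorch 1.4 or later (for `torch.cuda.memory_reserved`); `gpu_memory_mib` and `end_of_epoch` are hypothetical helper names, not part of tacotron2. Note that `torch.cuda.empty_cache()` only releases cached blocks that are no longer in use; it does not free memory held by tensors that are still referenced, so if the allocated figure keeps growing from epoch to epoch, clearing the cache will not help.

```python
try:
    import torch
except ImportError:  # lets the sketch be imported on machines without PyTorch
    torch = None


def gpu_memory_mib(device=0):
    """Return (allocated, reserved) in MiB, or None when CUDA is unavailable."""
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return (
        torch.cuda.memory_allocated(device) / mib,  # held by live tensors
        torch.cuda.memory_reserved(device) / mib,   # held by the caching allocator
    )


def end_of_epoch(epoch, device=0):
    """Log memory before and after clearing the cache; call once per epoch."""
    before = gpu_memory_mib(device)
    if before is None:
        return None
    torch.cuda.empty_cache()
    after = gpu_memory_mib(device)
    print("epoch {}: allocated {:.0f} MiB, reserved {:.0f} -> {:.0f} MiB".format(
        epoch, after[0], before[1], after[1]))
    return before, after
```

Calling `end_of_epoch(epoch)` at the end of every epoch of the training loop gives one line per epoch to compare.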
from tacotron2.