Comments (20)

ibab commented on May 19, 2024

I've fixed the problem in 8add545.
Like atrous_conv2d, I swap the width dimension out into the batch dimension and perform a regular convolution, but without padding in the height dimension.
It's now possible to train large stacks of dilation layers without running out of memory.
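For reference, here is a minimal sketch of that interleaving trick, assuming [batch, time, channels] inputs whose time axis is already a multiple of the dilation (the helper names are illustrative, not necessarily those in the commit):

import tensorflow as tf

def time_to_batch(x, dilation):
    # [batch, time, ch] -> [batch*dilation, time//dilation, ch].
    # Row b*dilation + r holds samples r, r+dilation, r+2*dilation, ...
    # of sequence b, so values `dilation` apart end up side by side.
    s = tf.shape(x)
    y = tf.reshape(x, [s[0], -1, dilation, s[2]])
    y = tf.transpose(y, [0, 2, 1, 3])
    return tf.reshape(y, [s[0] * dilation, -1, s[2]])

def batch_to_time(x, dilation):
    # Inverse of time_to_batch: [batch*dilation, t, ch] -> [batch, t*dilation, ch].
    s = tf.shape(x)
    y = tf.reshape(x, [s[0] // dilation, dilation, -1, s[2]])
    y = tf.transpose(y, [0, 2, 1, 3])
    return tf.reshape(y, [s[0] // dilation, -1, s[2]])

A stride-1 VALID convolution applied to time_to_batch(x, d) acts on each row independently, which is exactly a d-dilated convolution on x.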

Judging from occasional garbage collection log messages, I think the issue mentioned by @jyegerlehner is also valid. It would probably make sense to cut inputs to a fixed size.

adroit91 commented on May 19, 2024

@ibab Ran into the same issue with a Titan X. Now trying out your latest commit. Would you have any numbers to share about the GPU you used (I think you've mentioned a K40c somewhere), the time taken for convergence, and maybe a comment on the quality of the results you've seen?

ibab commented on May 19, 2024

I'm still in the process of finding good hyperparameters, and finding the cause of the generation issue in #13.
After that, I'll generate audio samples and provide statistics on how long it takes to train the model.

genekogan commented on May 19, 2024

I think I ran into this issue; just pasting the stack trace below in case it's useful. It ran fine for a few hours, though, so I wonder why it would break midway. Is it at risk of running out of memory for larger audio files?

W tensorflow/core/common_runtime/bfc_allocator.cc:270] ******************************________*************************************************xxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 1.86GiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[256,256,1,7603]
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[256,1,7603,256]
[[Node: dilated_stack/layer4/conv_g = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](dilated_stack/layer4/conv_g/SpaceToBatch, dilated_stack/layer4/Variable_1/read)]]
[[Node: loss/Mean/_61 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_839_loss/Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Storing checkpoint to ./model.ckpt
Loss: 79.2337722778
Loss: 210.806060791
Loss: 97.7300491333
Loss: 1392.23522949
Loss: 235.05305481
Loss: 31.506072998
Loss: 26.7505722046
Loss: 26.29870224
Loss: 9.93331241608
Loss: 6.45463895798
Loss: 6.28830051422
Loss: 6.08931684494
Loss: 5.92027044296
Loss: 5.75096750259
Loss: 5.67673921585
Loss: 5.60965442657
Loss: 5.58204841614
Loss: 5.57294654846
Loss: 5.57038593292
Loss: 5.55689048767
Loss: 5.56215429306
Loss: 5.55251312256
Loss: 5.56506061554
Loss: 5.55785942078
Loss: 5.54626560211
Traceback (most recent call last):
File "train.py", line 144, in
main()
File "train.py", line 129, in main
summary, loss_value, _ = sess.run([summaries, loss, optim])
File "/mnt/drive1/virtualenvs/caffe/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 382, in run
run_metadata_ptr)
File "/mnt/drive1/virtualenvs/caffe/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 655, in _run
feed_dict_string, options, run_metadata)
File "/mnt/drive1/virtualenvs/caffe/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 723, in _do_run
target_list, options, run_metadata)
File "/mnt/drive1/virtualenvs/caffe/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 743, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[256,1,7603,256]
[[Node: dilated_stack/layer4/conv_g = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](dilated_stack/layer4/conv_g/SpaceToBatch, dilated_stack/layer4/Variable_1/read)]]
[[Node: loss/Mean/_61 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_839_loss/Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'dilated_stack/layer4/conv_g', defined at:
File "train.py", line 144, in
main()
File "train.py", line 91, in main
loss = net.loss(audio_batch)
File "/home/gene/projects/tensorflow-wavenet/wavenet.py", line 102, in loss
raw_output = self._create_network(encoded)
File "/home/gene/projects/tensorflow-wavenet/wavenet.py", line 71, in _create_network
dilation=dilation)
File "/home/gene/projects/tensorflow-wavenet/wavenet.py", line 31, in _create_dilation_layer
name="conv_g")
File "/mnt/drive1/virtualenvs/caffe/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 221, in atrous_conv2d
name=name)
File "/mnt/drive1/virtualenvs/caffe/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
data_format=data_format, name=name)
File "/mnt/drive1/virtualenvs/caffe/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
op_def=op_def)
File "/mnt/drive1/virtualenvs/caffe/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/mnt/drive1/virtualenvs/caffe/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1232, in init
self._traceback = _extract_stack()

jyegerlehner commented on May 19, 2024

@ibab
I don't think there's any reason why the number of channels in the residual blocks needs to equal the number of quantization levels (256). I think 256 resblock channels could be overkill and use too much memory; if n=20 can encode an entire 28x28 MNIST image, I would think n=16 or n=8 could encode a single amplitude "pixel".

If you look at Figure 4 in the paper, they show a single causal convolution outside of the residual blocks. I'm guessing its role is to produce the number of channels that is then carried through the resblocks. The 1x1 convs then project back up from the resblock channels (e.g. n=8) to the quantization levels (256) just before the softmax. This would be a dramatic memory saving. I think I could make this change, except I don't know how to make it causal: I don't understand how your tf.pad call makes the dilated convs causal.
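For concreteness, a rough sketch of the kind of output head being proposed; the channel counts and names here are my own illustration, not the paper's or the repo's code:

import tensorflow as tf

skip_channels = 16          # small channel count inside the residual stack
quantization_levels = 256   # only needed at the output

def output_head(skip_sum):
    # skip_sum: [batch, time, skip_channels] (the summed skip connections).
    h = tf.nn.relu(skip_sum)
    w = tf.Variable(tf.random.normal([1, skip_channels, quantization_levels],
                                     stddev=0.05))
    # A width-1 ("1x1") convolution projects 16 -> 256 channels just
    # before the softmax, so the big tensors never live inside the stack.
    return tf.nn.conv1d(h, w, stride=1, padding='SAME')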

ibab commented on May 19, 2024

Yeah, this makes sense.
I think I was thrown off by the fact that we would have to keep expanding the number of channels back to a fixed size before feeding them to the next layer, so there is no step-by-step reduction as I would expect for a convnet.

I'd happily accept a PR for this.
I think it should be enough to simply change the number of channels; the convolution should stay causal.
The idea behind the tf.pad call is to shift the output to the right by the dilation rate.
Looking at figure 3 in the paper, padding with 1, 2, 4, ... in each layer should add up to the total amount of padding required to right-align the output of the filters with their inputs.
(I've also used VALID padding, which leaves out the last few values so we don't need to remove them before applying the padding).
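As a quick sanity check of that arithmetic (my own numbers, assuming filter width 2):

# Each layer shifts its output right by its dilation, so the total
# left-padding over a stack with dilations 1, 2, 4, ..., 512 is:
dilations = [2 ** i for i in range(10)]
total_shift = sum(dilations)       # 1023 samples of padding in total
receptive_field = total_shift + 1  # the last output sees 1024 input samples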

ibab commented on May 19, 2024

@jyegerlehner: Are you currently working on a fix?
I've implemented the changes locally, and the network is converging to a significantly lower loss πŸ‘
I can push the changes if you're not currently working on something, otherwise I'll wait for your PR.

jyegerlehner commented on May 19, 2024

@ibab, I just got back to work on this; was planning to, but I see you've already done it! πŸ‘

sjain07 commented on May 19, 2024

I changed the quantization levels to 16, running on a g2.2xlarge AWS instance, and I'm getting an OOM exception.
My wavenet_params.json looks like:

{
    "filter_width": 2,
    "quantization_steps": 16,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32],
    "residual_channels": 64,
    "dilation_channels": 32
}

Stacktrace:
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 426.62MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:940] Resource exhausted: OOM when allocating tensor with shape[1024,32,1,3413]
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4096): Total Chunks: 1, Chunks in use: 0 5.5KiB allocated for chunks. 512B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8192): Total Chunks: 1, Chunks in use: 0 8.0KiB allocated for chunks. 8.0KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16384): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (262144): Total Chunks: 1, Chunks in use: 0 410.8KiB allocated for chunks. 212.56MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4194304): Total Chunks: 1, Chunks in use: 0 6.67MiB allocated for chunks. 6.67MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8388608): Total Chunks: 1, Chunks in use: 0 13.33MiB allocated for chunks. 13.33MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16777216): Total Chunks: 2, Chunks in use: 0 53.34MiB allocated for chunks. 53.34MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (33554432): Total Chunks: 1, Chunks in use: 0 40.01MiB allocated for chunks. 26.67MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (134217728): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (268435456): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:656] Bin for 426.62MiB was 256.00MiB, Chunk State:
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a0000 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a0100 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a0200 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a0300 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a0400 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a0500 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a2500 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a6500 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a8500 of size 2048
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a8d00 of size 512
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a8f00 of size 2048
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a9700 of size 2048
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024a9f00 of size 512
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024aa100 of size 512
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024aa300 of size 24576
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024b0300 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024b4300 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024b8300 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024bc300 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024c0300 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024c4300 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024c8300 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024cc300 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024d0300 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024d2300 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024d4300 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024d6300 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024d8300 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024da300 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dc300 of size 2048
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dcb00 of size 512
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dcd00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dce00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dcf00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dd000 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dd100 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dd200 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dd300 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dd400 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dd500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024dd600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024ded00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024e0d00 of size 24576
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024e8d00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024e8e00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024e8f00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024eaf00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024ecf00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024f0f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024f4f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024f8f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7024fcf00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702500f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702504f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702508f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70250cf00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702510f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702514f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702518f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70251cf00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702520f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702524f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702528f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70252cf00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702530f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702534f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702538f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70253cf00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702540f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702544f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702548f00 of size 16384
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70254cf00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70254ef00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702550f00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702552f00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702554f00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702556f00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702558f00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70255af00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70255cf00 of size 8192
........
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 2 Chunks of size 13983744 totalling 26.67MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 13985280 totalling 13.34MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 27966208 totalling 26.67MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 2 Chunks of size 27966464 totalling 53.34MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 55932928 totalling 53.34MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 111865856 totalling 106.68MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 223739904 totalling 213.38MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 447348736 totalling 426.62MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 447479808 totalling 426.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 894959616 totalling 853.50MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 1333799424 totalling 1.24GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 3.55GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats:
Limit: 3928915968
InUse: 3809625088
MaxInUse: 3872537088
NumAllocs: 5933
MaxAllocSize: 1855949824

W tensorflow/core/common_runtime/bfc_allocator.cc:270] *****************************************************************************************xxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 426.62MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:940] Resource exhausted: OOM when allocating tensor with shape[1024,32,1,3413]
Traceback (most recent call last):
File "train.py", line 173, in
main()
File "train.py", line 158, in main
summary, loss_value, _ = sess.run([summaries, loss, optim])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 710, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 908, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 958, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 978, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[1024,32,1,3413]
[[Node: wavenet/dilated_stack/layer5/conv_filter = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](wavenet/dilated_stack/layer5/conv_filter/SpaceToBatch, wavenet/dilated_stack/layer5/Variable/read)]]
[[Node: wavenet/loss/Mean/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_308_wavenet/loss/Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'wavenet/dilated_stack/layer5/conv_filter', defined at:
File "train.py", line 173, in
main()
File "train.py", line 118, in main
loss = net.loss(audio_batch)
File "/home/ubuntu/tensorflow-wavenet/wavenet.py", line 165, in loss
raw_output = self._create_network(encoded)
File "/home/ubuntu/tensorflow-wavenet/wavenet.py", line 112, in _create_network
self.dilation_channels)
File "/home/ubuntu/tensorflow-wavenet/wavenet.py", line 51, in _create_dilation_layer
name="conv_filter")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 221, in atrous_conv2d
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
data_format=data_format, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2317, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1239, in init
self._traceback = _extract_stack()

lemonzi commented on May 19, 2024

Have you modified the batch size?

sjain07 commented on May 19, 2024

@lemonzi No, the batch size is the default set in train.py, i.e. 1. I'm running python train.py with no extra params.

lemonzi commented on May 19, 2024

@ansh7 Never mind then -- I saw a Tensor shape in the logs and had a hunch.

genekogan commented on May 19, 2024

Lowering the sample rate also helps avoid memory issues.

polyrhythmatic commented on May 19, 2024

@ibab what did you originally train this on? I'm using a GTX Titan X and running out of memory. Is anyone having luck with lower quantization levels?

ibab commented on May 19, 2024

There seems to be a weird issue with garbage collection when using the dilated convolutions.
If you set all the dilations to 1 in wavenet_params.json, then it will use regular convolution, which seems to work extremely well (evaluates faster and doesn't run into memory issues as quickly).
I've been able to train models with upwards of 30 convolutional layers on a K40c when setting the dilations to 1.

Theoretically, the dilated convolution should be just as fast, but the TensorFlow version is implemented by combining existing ops and I suspect it's not as efficient as the simple convolution.

lemonzi commented on May 19, 2024

It looks like atrous_conv2d makes a lot of assumptions about the data being 2D. We don't need this padding; maybe we should use space_to_batch and conv2d manually to adapt it to the 1D case.

https://github.com/tensorflow/tensorflow/blob/c856366b739850a9f4b0bf1469de7f052619042b/tensorflow/python/ops/nn_ops.py#L208

What this line does is basically pad the height (which is 1 for audio) up to the dilation rate; the padding is then cropped away after the convolution to match the output padding.
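Back-of-the-envelope numbers (my own, purely illustrative) show why that padding is fatal for audio:

# atrous_conv2d pads the height (1 for audio) up to the dilation rate,
# so the activation tensor grows by roughly a factor of the dilation:
rate = 512
height, width, channels = 1, 100000, 256  # roughly 6 s of 16 kHz audio
padded_height = ((height + rate - 1) // rate) * rate        # -> 512
bytes_per_value = 4                                         # float32
before = height * width * channels * bytes_per_value        # ~100 MB
after = padded_height * width * channels * bytes_per_value  # ~50 GB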

jyegerlehner commented on May 19, 2024

@polyrhythmatic
I'm also using a Titan-X (12GB) and was still getting OOM after recent fixes. So I dropped the number of channels a bit in wavenet_params.json and it ran to completion:

"residual_channels": 32,
"dilation_channels": 16

Edit: when I say "to completion", I mean for the default number of steps == 2000, which I had never been able to do before.

Guesses/Speculation:
TensorFlow appears to grab all the memory and has its own GPU heap allocation algorithm. I suspect it tries to re-use previously allocated chunks until memory is too fragmented. I don't have time to look at the wavenet code at the moment, but I get the impression it's creating input tensors that match the length of the wav file (not sure). If so, each one is a different length, and whenever it finds one that's bigger than the largest previously-allocated chunk, it has to allocate new memory (since all the activation and gradient tensors will have a new size), so the heap becomes progressively more fragmented, until there's not enough left.

So if most of the clips are less than some duration, say 3 seconds (I just made that number up, but whatever a good number would be), and occasionally there are longer ones, I wonder if we couldn't cap the length to that maximum, as sketched below. Then the heap allocator would always have already-allocated chunks available that are big enough to re-use.
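A minimal version of that cap, assuming audio arrives as a [time, channels] tensor; the constant and function name are hypothetical:

import tensorflow as tf

MAX_SAMPLES = 3 * 16000  # hypothetical cap: 3 seconds at 16 kHz

def cap_length(audio):
    # audio: [time, channels]; keep at most MAX_SAMPLES samples so the
    # allocator only ever sees a bounded set of tensor shapes.
    return audio[:MAX_SAMPLES, :]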

ibab commented on May 19, 2024

Just came to the same realization as @lemonzi as to why atrous_conv2d is using so much more memory than conv2d.

It pads the height dimension so that dilation divides it exactly.
When dilation is something like 512, this will pad the tensor with an astronomical number of zeros, because our width is so large (> 10^4).
So we can't use atrous_conv2d.

Actually, we can't use space_to_batch either, because it requires that the dilation rate divides both height and width. The case where either is exactly 1, which can be handled cleanly, has unfortunately not been treated as a special case.
This might be worth a feature request.

What we can do instead is cut away the end of the tensor so that the dilation divides it exactly, reshape, perform the 2D convolution, and reshape again.
If we pad the beginning of the output with zeros, it will be exactly the causal convolution we're looking for.
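Putting that recipe together, reusing the time_to_batch / batch_to_time helpers sketched near the top of the thread (again my own illustration, assuming the time axis has been trimmed to a multiple of the dilation beforehand):

import tensorflow as tf

def causal_dilated_conv1d(x, filters, dilation):
    # x: [batch, time, in_ch], time a multiple of dilation;
    # filters: [width, in_ch, out_ch].
    width = int(filters.shape[0])
    # 1. Interleave so samples `dilation` apart become row neighbours.
    y = time_to_batch(x, dilation)
    # 2. An ordinary VALID convolution on the rows is now dilated w.r.t. x.
    y = tf.nn.conv1d(y, filters, stride=1, padding='VALID')
    # 3. Undo the interleaving.
    y = batch_to_time(y, dilation)
    # 4. VALID dropped (width - 1) * dilation trailing outputs; left-padding
    #    with zeros right-aligns outputs with inputs, i.e. makes it causal.
    return tf.pad(y, [[0, 0], [(width - 1) * dilation, 0], [0, 0]])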

nonstop99 commented on May 19, 2024

Would it be possible to post a link to a pre-trained model we can use, and a link to some example wav output(s)?

ibab commented on May 19, 2024

Fixing the convolution op seems to have resolved the problem of easily running into OOM errors, so I'm closing this issue and opening another one about cropping the samples to a fixed length.
