Comments (30)

dathudeptrai avatar dathudeptrai commented on May 15, 2024

@myagues I know that. BTW, in my case fixed shape is somehow still faster than bucketed sequences. And when I said I used bucket sequence before, I meant that I used tf.data.experimental.bucket_by_sequence_length :D

from tensorflowtts.

dathudeptrai avatar dathudeptrai commented on May 15, 2024

@rgzn-aiyun

"Filling up shuffle buffer (this may take a while)"

This means the data loader is computing and caching the dataset. Once it finishes, training runs without any further preprocessing.

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

Is it normal to load data for half an hour?

dathudeptrai avatar dathudeptrai commented on May 15, 2024

@rgzn-aiyun I will improve it tonight so that it takes around 5 minutes :D

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

Looking forward to the latest, I will continue to test!

dathudeptrai avatar dathudeptrai commented on May 15, 2024

@rgzn-aiyun dathudeptrai@4add642. Please check if it works :)

dathudeptrai avatar dathudeptrai commented on May 15, 2024

@rgzn-aiyun reopen if it doesn't work.

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

It only takes 5 seconds to load data, which is great.

dathudeptrai avatar dathudeptrai commented on May 15, 2024

Please help me check whether the output is the same as with the old code. :)))

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

Layer (type)                   Output Shape   Param #
encoder (TFTacotronEncoder)    multiple       8218624
decoder_cell (TFTacotronDeco   multiple       18246402
post_net (TFTacotronPostnet)   multiple       5460480
residual_projection (Dense)    multiple       41040

Total params: 31,966,546
Trainable params: 31,956,306
Non-trainable params: 10,240


[train]: 0% 0/200000 [00:00<?, ?it/s]2020-06-09 02:13:17.096090: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 2533 of 9500
2020-06-09 02:13:27.096872: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 5071 of 9500
2020-06-09 02:13:37.098078: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 7625 of 9500
2020-06-09 02:13:44.514868: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
2020-06-09 02:13:52.435945: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:563] function_optimizer failed: Invalid argument: Node 'tacotron2/StatefulPartitionedCall/encoder/bilstm/forward_lstm/StatefulPartitionedCall_Func/tacotron2/StatefulPartitionedCall/output/_325': Connecting to invalid output 29 of source node tacotron2/StatefulPartitionedCall/encoder/bilstm/forward_lstm/StatefulPartitionedCall which has 29 outputs.
2020-06-09 02:13:52.674893: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:563] shape_optimizer failed: Out of range: src_output = 29, but num_outputs is only 29
2020-06-09 02:13:52.873193: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:563] layout failed: Out of range: src_output = 29, but num_outputs is only 29
[train]: 0% 100/200000 [15:01<471:01:16, 8.48s/it]2020-06-09 02:28:08,977 (train_tacotron2:253) INFO: (Step: 100) train_stop_token_loss = 0.3202.
2020-06-09 02:28:08,978 (train_tacotron2:253) INFO: (Step: 100) train_mel_loss_before = 0.4082.
2020-06-09 02:28:08,979 (train_tacotron2:253) INFO: (Step: 100) train_mel_loss_after = 0.8663.
2020-06-09 02:28:08,979 (train_tacotron2:253) INFO: (Step: 100) train_guided_attention_loss = 0.0041.
[train]: 0% 196/200000 [28:37<471:17:11, 8.49s/it]

The test looks fine, but training is very slow: fewer than 200 steps took half an hour. Is that expected?

dathudeptrai avatar dathudeptrai commented on May 15, 2024

@rgzn-aiyun Tacotron-2 training is slow because it is a sequence-to-sequence model. My 2080Ti runs at about 4 s/it, and that is normal. What is your machine?

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

Tesla P100. It shouldn't be this slow: less than 400 steps in an hour.

dathudeptrai avatar dathudeptrai commented on May 15, 2024

@rgzn-aiyun What are your max char length and max mel length? See https://github.com/dathudeptrai/TensorflowTTS/issues/19#issuecomment-636548991 for reference.

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

max_char_length: 290
max_mel_length: 1300

dathudeptrai avatar dathudeptrai commented on May 15, 2024

@rgzn-aiyun Those are too long. You may need to drop some samples to get a smaller max_char_length and max_mel_length. On LJSpeech, the max char length is 170 and the max mel length is 800.
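
A minimal sketch of that filtering step, in plain Python. The helper name `filter_by_length`, the sample dictionaries, and their keys are hypothetical and not part of TensorflowTTS; the thresholds 170 and 800 are the LJSpeech figures quoted above.

```python
def filter_by_length(samples, max_char_length, max_mel_length):
    """Split samples into (kept, dropped) using the two length limits."""
    kept, dropped = [], []
    for sample in samples:
        too_long = (
            sample["char_length"] > max_char_length
            or sample["mel_length"] > max_mel_length
        )
        (dropped if too_long else kept).append(sample)
    return kept, dropped


# Hypothetical metadata: one entry per utterance.
samples = [
    {"id": "utt_001", "char_length": 95, "mel_length": 520},
    {"id": "utt_002", "char_length": 290, "mel_length": 1300},
    {"id": "utt_003", "char_length": 168, "mel_length": 790},
]

kept, dropped = filter_by_length(samples, max_char_length=170, max_mel_length=800)
print("kept:", [s["id"] for s in kept])
print("dropped:", [s["id"] for s in dropped])
```

The dropped list does not have to be thrown away; it could be moved to a validation set instead.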

myagues avatar myagues commented on May 15, 2024

You can try to set os.environ["TF_GPU_THREAD_MODE"] = "gpu_private" in the global scope of train_tacotron2.py. Sometimes it helps to speed up the process a bit, and I don't know of any downsides.
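
As a sketch, the top of the script could look like this. Placing the assignment before the TensorFlow import is an assumption made to be safe, so that the variable is already set when the GPU device is initialized.

```python
import os

# Set the thread mode at module level, before TensorFlow is imported.
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"

# import tensorflow as tf  # import TensorFlow only after the variable is set

print(os.environ["TF_GPU_THREAD_MODE"])
```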

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

TF_GPU_THREAD_MODE: Whether and how the GPU device uses its own threadpool. Possible values:

global: GPU uses threads shared with CPU in the main compute thread-pool. This is currently the default.
gpu_private: GPU uses threads dedicated to this device.
gpu_shared: All GPUs share a dedicated thread pool.

I will test. For me, the average speed is about 400 steps per hour, which gave only 8,800 steps in 24 hours. If 200k training steps are required, it will take too long!
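
A quick back-of-the-envelope estimate, assuming the rate stays at roughly 400 steps per hour for the whole run. The helper name `hours_needed` and the smaller step budgets are illustrative only.

```python
def hours_needed(total_steps, steps_per_hour):
    """Wall-clock hours for a run at a constant step rate."""
    return total_steps / steps_per_hour


for total_steps in (200_000, 100_000, 50_000):
    hours = hours_needed(total_steps, steps_per_hour=400)
    print(f"{total_steps} steps: {hours:.0f} h (~{hours / 24:.1f} days)")
```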

dathudeptrai avatar dathudeptrai commented on May 15, 2024

100k steps is enough; if you fine-tune from my pretrained model, I think 50k is OK. You should drop some samples to get a smaller max char length and max mel length, because yours are too long. You can move the long sentences to the valid folder and train on the short ones.

myagues avatar myagues commented on May 15, 2024

I did some further changes to the Tacotron 2 data reading in my fork. It uses tf.data.experimental.bucket_by_sequence_length, which groups mel spectrograms of similar length into the same batches. This means that when using variable shapes in training, less padding will be needed as most mels will have similar length (in my case buckets of 50), and makes training faster.

I don't know if this causes training problems (if you have few mels of a given length, those will be grouped together every epoch) so I did a short run with variable (bucket variable sizes each batch) and fixed shapes (maximum constant length each batch) to see if the differences were meaningful. It seems that variable shape gets higher training loss, but very similar in validation, and is faster than fixed shape.
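
To illustrate why bucketing reduces padding, here is a plain-Python sketch that counts padded frames for fixed-shape batches versus batches grouped by length. It is not the tf.data implementation: the helper `padding_cost`, the uniform bucket width, and the example lengths are all assumptions made for illustration.

```python
from collections import defaultdict


def padding_cost(lengths, batch_size, bucket_width=None, fixed_length=None):
    """Count padded frames for one pass over `lengths`.

    With bucket_width set, samples are grouped by length // bucket_width
    before batching. With fixed_length set, every batch is padded to that
    length; otherwise each batch is padded to its own longest sample.
    """
    if bucket_width is None:
        groups = {0: list(lengths)}
    else:
        groups = defaultdict(list)
        for n in lengths:
            groups[n // bucket_width].append(n)

    padded = 0
    for group in groups.values():
        for start in range(0, len(group), batch_size):
            batch = group[start:start + batch_size]
            target = fixed_length if fixed_length is not None else max(batch)
            padded += sum(target - n for n in batch)
    return padded


# Hypothetical mel lengths, in frames.
mel_lengths = [120, 790, 130, 410, 780, 125, 405, 800]

fixed = padding_cost(mel_lengths, batch_size=2, fixed_length=800)
bucketed = padding_cost(mel_lengths, batch_size=2, bucket_width=50)
print("padded frames, fixed shape:", fixed)
print("padded frames, buckets of 50:", bucketed)
```

Fewer padded frames means less wasted compute per step, although, as this thread shows, variable shapes may lose other optimizations, so the net speed has to be measured.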

dathudeptrai avatar dathudeptrai commented on May 15, 2024

@myagues I used bucket sequence before too. But as you can see in the Tacotron notes, I found that fixed-shape training is 2x faster than dynamic shape, and I can't understand why. So I removed bucket sequence :(. BTW, what is your TensorFlow version? It would be great if you created a pull request :D

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

I will test the dynamic shape, because with the fixed shape I always get an error when the model is saved shortly after training starts.

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

I did some further changes to the Tacotron 2 data reading in my fork. It uses tf.data.experimental.bucket_by_sequence_length, which groups mel spectrograms of similar length into the same batches. This means that when using variable shapes in training, less padding will be needed as most mels will have similar length (in my case buckets of 50), and makes training faster.

I don't know if this causes training problems (if you have few mels of a given length, those will be grouped together every epoch) so I did a short run with variable (bucket variable sizes each batch) and fixed shapes (maximum constant length each batch) to see if the differences were meaningful. It seems that variable shape gets higher training loss, but very similar in validation, and is faster than fixed shape.

File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/executor.py", line 67, in wait
pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.DataLossError: Attempted to pad to a smaller size than the input element.
[train]: 0% 2/200000 [00:50<1389:46:04, 25.02s/it]

Can't seem to run?

myagues avatar myagues commented on May 15, 2024

@rgzn-aiyun are you running with reduction_factor=1 and n_mels=80 in config? It is the only thing that comes to my mind with this error, since n_mels is hard coded and I have not tested a reduction factor other than 1.

@dathudeptrai before sending the pull I want to test that it works correctly with different configurations. In your version there is bucketing too, in tacotron_dataset.py#L121-L137, but if you have is_shuffle=True then it won't have any effect.

myagues avatar myagues commented on May 15, 2024

Oh, ok I thought you meant the groupby.
Well I don't know, I suppose fixed shape runs some optimizations that cannot be used in variable shape. For me fixed shape has a constant step of 3.76s/it (batch_size=16), and variable shape between 2.7 and 4s/it at maximum, but generally closer to 3s/it than 4s/it.

dathudeptrai avatar dathudeptrai commented on May 15, 2024

@myagues What are your versions of tf and tf_addons? I use batch_size 32; I will check batch_size 16. There are some remaining issues with this implementation: I don't know why I can't apply mixed precision to Tacotron :)). I didn't see anything wrong in my implementation :(

myagues avatar myagues commented on May 15, 2024

The same version as in setup.py, TF v2.2.0 and TF-addons v0.9.1

I run with fp32, have not tried fp16 yet, although my card does not have tensor cores, so I don't think it will have much benefit in my case. I will try it later and see if I can find any errors.

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

@dathudeptrai

Calculate the maximum value of char_lengths:

# char_lengths is the list of per-utterance character counts
longest = max(char_lengths)
shortest = min(char_lengths)
print("Maximum:", longest)

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

@rgzn-aiyun are you running with reduction_factor=1 and n_mels=80 in config? It is the only thing that comes to my mind with this error, since n_mels is hard coded and I have not tested a reduction factor other than 1.

@dathudeptrai before sending the pull I want to test that it works correctly with different configurations. In your version there is bucketing too, in tacotron_dataset.py#L121-L137, but if you have is_shuffle=True then it won't have any effect.

Yes, the default configuration file is used.

rgzn-aiyun avatar rgzn-aiyun commented on May 15, 2024

[train]: 0% 0/200000 [00:00<?, ?it/s]2020-06-11 09:50:22.090650: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 3244 of 9500
2020-06-11 09:50:32.089778: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 6514 of 9500
2020-06-11 09:50:41.202193: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
2020-06-11 09:50:46.995056: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:563] function_optimizer failed: Invalid argument: Node 'tacotron2/StatefulPartitionedCall/encoder/bilstm/forward_lstm/StatefulPartitionedCall_Func/tacotron2/StatefulPartitionedCall/output/_325': Connecting to invalid output 29 of source node tacotron2/StatefulPartitionedCall/encoder/bilstm/forward_lstm/StatefulPartitionedCall which has 29 outputs.
2020-06-11 09:50:47.193020: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:563] shape_optimizer failed: Out of range: src_output = 29, but num_outputs is only 29
2020-06-11 09:50:47.336215: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:563] layout failed: Out of range: src_output = 29, but num_outputs is only 29
[train]: 0% 50/200000 [06:13<374:38:00, 6.75s/it]2020-06-11 09:57:20,083 (base_trainer:144) INFO: Successfully saved checkpoint @ 50 steps.
[train]: 0% 86/200000 [11:10<375:00:07, 6.75s/it]Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 1986, in execution_mode
yield
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 655, in _next_internal
output_shapes=self._flat_output_shapes)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2363, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.DataLossError: Attempted to pad to a smaller size than the input element. [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train_tacotron2.py", line 507, in <module>
main()
File "train_tacotron2.py", line 500, in main
resume=args.resume)
File "train_tacotron2.py", line 343, in fit
self.run()
File "/ai/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py", line 72, in run
self._train_epoch()
File "/ai/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py", line 92, in _train_epoch
for train_steps_per_epoch, batch in enumerate(self.train_data_loader, 1):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 631, in __next__
return self.next()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 670, in next
return self._next_internal()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 661, in _next_internal
return structure.from_compatible_tensor_list(self._element_spec, ret)
File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 1989, in execution_mode
executor_new.wait()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/executor.py", line 67, in wait
pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.DataLossError: Attempted to pad to a smaller size than the input element.
[train]: 0% 86/200000 [11:14<435:36:28, 7.84s/it]

This error occurs again. It doesn't seem to be a problem with raw-feats or norm-feats.

myagues avatar myagues commented on May 15, 2024

So the training process begins, but at some point it encounters an item that cannot be padded because it is larger than the padded shape being applied.

I don't really know what could cause this, since all unknown dimensions should be padded to the largest in the batch.
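
One way to narrow this down is to scan the dataset for elements that exceed a fixed pad target. The sketch below is plain Python and assumes that some dimension is padded to a fixed size (for example a hard-coded n_mels or a configured maximum length) rather than to the largest in the batch; the helper `find_oversized` and the example shapes are hypothetical.

```python
def find_oversized(shapes, padded_shape):
    """Return indices of elements larger than the pad target in any dimension.

    A None entry in padded_shape stands for "pad to the largest in the
    batch", which can never be too small, so it is skipped.
    """
    oversized = []
    for index, shape in enumerate(shapes):
        if any(
            limit is not None and size > limit
            for size, limit in zip(shape, padded_shape)
        ):
            oversized.append(index)
    return oversized


# Hypothetical (frames, n_mels) shapes of the mel spectrograms.
mel_shapes = [(512, 80), (1300, 80), (640, 80), (700, 128)]

# Dynamic time axis, fixed n_mels=80: only a feature-size mismatch can fail.
print(find_oversized(mel_shapes, (None, 80)))

# Fixed time axis of 800 frames: long utterances fail as well.
print(find_oversized(mel_shapes, (800, 80)))
```

If the scan returns any indices, those samples are candidates for the element that triggers the padding error.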
