mellotron's Introduction

Mellotron

Rafael Valle*, Jason Li*, Ryan Prenger and Bryan Catanzaro

In our recent paper we propose Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data.

By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice.

Visit our website for audio samples.

Pre-requisites

  1. NVIDIA GPU + CUDA cuDNN

Setup

  1. Clone this repo: git clone https://github.com/NVIDIA/mellotron.git
  2. CD into this repo: cd mellotron
  3. Initialize submodule: git submodule init; git submodule update
  4. Install PyTorch
  5. Install Apex
  6. Install python requirements or build docker image
    • Install python requirements: pip install -r requirements.txt

Training

  1. Update the filelists inside the filelists folder to point to your data
  2. python train.py --output_directory=outdir --log_directory=logdir
  3. (OPTIONAL) tensorboard --logdir=outdir/logdir

Training using a pre-trained model

Training using a pre-trained model can lead to faster convergence.
By default, the speaker embedding layer is ignored (a sketch of this warm-start behavior is shown after the steps below).

  1. Download our published Mellotron model trained on LibriTTS or LJS
  2. python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
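Roughly, a warm start loads all checkpoint weights except the layers marked to be ignored (here the speaker embedding), which keep their fresh initialization. The sketch below illustrates that pattern only; the 'state_dict' checkpoint key and the helper name are assumptions, and the repo's train.py may differ in the details.

import torch

def warm_start_sketch(checkpoint_path, model, ignore_layers):
    """Load checkpoint weights into model, skipping the layer names in ignore_layers."""
    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    state_dict = checkpoint['state_dict']  # assumed checkpoint layout
    if ignore_layers:
        state_dict = {k: v for k, v in state_dict.items() if k not in ignore_layers}
        merged = model.state_dict()
        merged.update(state_dict)          # ignored layers keep their fresh initialization
        state_dict = merged
    model.load_state_dict(state_dict)
    return model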

Multi-GPU (distributed) and Automatic Mixed Precision Training

  1. python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True

Inference demo

  1. jupyter notebook --ip=127.0.0.1 --port=31337
  2. Load inference.ipynb
  3. (optional) Download our published WaveGlow model

Related repos

WaveGlow: a faster-than-real-time flow-based generative network for speech synthesis.

Acknowledgements

This implementation uses code from the following repos: Keith Ito, Prem Seetharaman, Chengqi Deng, Patrice Guyot, as described in our code.

mellotron's People

Contributors

dependabot[bot], hubeibei007, kinglittleq, rafaelvalle, robinsloan, texpomru13, yhgon

mellotron's Issues

How to fix problematic words with MusicXML parser?

I think the MusicXML parser is the most interesting feature of this repo. However, I would never have imagined that a very common word such as "Hello" (or "Hel-lo") could throw an error, especially when "hi"/"baby"/"ba-by"/"shark"/"do" work just fine. I assumed "hello" was a safe word to use for testing, but instead it turned out to be the culprit of a long series of problems I had with some MusicXML files, where I couldn't figure out what could possibly be wrong with them since all the words existed in the ARPAbet dictionary. To narrow down the problem I kept simplifying the lyrics until they contained only the word "hello", and since it still wasn't working I assumed there must have been some hidden problem, because my past experience taught me that even one orphaned note, a typo or an elusive invisible character is enough to throw that same error message.
Could you please explain in more detail what steps are required to fix such buggy words (e.g. "Hello"), so other users can follow your example and narrow down and fix problematic words as they come up? I'd really appreciate that, thanks!
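Not an official answer, but one way to narrow down problematic lyrics is to check every word and hyphen syllable against the ARPAbet dictionary the notebook loads ('data/cmu_dictionary' in inference.ipynb). A rough sketch, assuming CMUDict.lookup returns None for unknown entries:

from text import cmudict

arpabet = cmudict.CMUDict('data/cmu_dictionary')   # path used in inference.ipynb

tokens = ["Hal", "le", "lu", "jah", "Hello", "Hel", "lo"]   # example lyric tokens
missing = [t for t in tokens if arpabet.lookup(t) is None]
print("tokens with no ARPAbet entry:", missing)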

Adaptation

I've been trying to run some adaptation experiments with Mellotron, i.e. using a small amount of data (less than an hour) to shift the acoustics of an existing speaker towards a different speaker. In other words, even if there is not a large amount of data from a male/female singer, it should be possible to shift the acoustics by retraining with a similar speaker's id.

My experiments haven't been successful so far. Interestingly, I found that even the other speakers get affected during adaptation, and the speech quickly becomes unintelligible.

Have you tried something like that? What layers should be ignored for adaptation?

Generic Text-to-Speech Inference

I understand that Mellotron conditions the Tacotron 2-based synthesis on audio or MusicXML and performs style transfer accordingly. If there is no reference file, can I still obtain a plain TTS result? I looked at the relevant part of model.py, but I'm asking because I couldn't tell whether that case is handled.

Need more info for training and inference

Hello, thanks for sharing this amazing repo! Could we have more information on how to process our own data for training and inference, please?
The inference demo works perfectly, but any attempt to use my own MusicXML throws an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-d11c35a85dab> in <module>
----> 1 data = get_data_from_musicxml('data/haendel_hallelujah3.musicxml', 110, convert_stress=True)
      2 panning = {'Soprano': [-60, -30], 'Alto': [-40, -10], 'Tenor': [30, 60], 'Bass': [10, 40]}

C:\mellotron\mellotron_utils.py in get_data_from_musicxml(filepath, bpm, phoneme_durations, convert_stress)
    460             continue
    461 
--> 462         events = track2events(v)
    463         events = adjust_words(events)
    464         events_arpabet = [events2eventsarpabet(e) for e in events]

C:\mellotron\mellotron_utils.py in track2events(track)
    285     events = []
    286     for e in track:
--> 287         events.extend(adjust_event(e))
    288     group_ids = [i for i in range(len(events))
    289                  if events[i][0] in [' '] or events[i][0].isupper()]

C:\mellotron\mellotron_utils.py in adjust_event(event, hop_length, sampling_rate)
    230 
    231 def adjust_event(event, hop_length=256, sampling_rate=22050):
--> 232     tokens, freq, start_time, end_time = event
    233 
    234     if tokens == ' ':

ValueError: not enough values to unpack (expected 4, got 2)

I confirm that even changing a single letter in the "haendel_hallelujah.musicxml" lyrics (e.g. "jah" into "yah") throws an error; if I change it back to "jah" it works again, so I doubt it's my text editor's fault or a wrong MusicXML format (there are tiny differences in how the text is organized depending on which software exported it). I get this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-19-f648a3b7ff04> in <module>
     18         with torch.no_grad():
     19             mel_outputs, mel_outputs_postnet, gate_outputs, alignments_transfer = tacotron.inference_noattention(
---> 20                 (text_encoded, mel, speaker_id, pitch_contour*frequency_scaling, rhythm))
     21 
     22             audio = denoiser(waveglow.infer(mel_outputs_postnet, sigma=0.8), 0.01)[0, 0]

C:\mellotron\model.py in inference_noattention(self, inputs)
    665 
    666         mel_outputs, gate_outputs, alignments = self.decoder.inference_noattention(
--> 667             encoder_outputs, f0s, attention_map)
    668 
    669         mel_outputs_postnet = self.postnet(mel_outputs)

C:\mellotron\model.py in inference_noattention(self, memory, f0s, attention_map)
    523             attention = attention_map[i]
    524             decoder_input = torch.cat((self.prenet(decoder_input), f0), dim=1)
--> 525             mel_output, gate_output, alignment = self.decode(decoder_input, attention)
    526 
    527             mel_outputs += [mel_output.squeeze(1)]

C:\mellotron\model.py in decode(self, decoder_input, attention_weights)
    382         self.attention_context, self.attention_weights = self.attention_layer(
    383             self.attention_hidden, self.memory, self.processed_memory,
--> 384             attention_weights_cat, self.mask, attention_weights)
    385 
    386         self.attention_weights_cum += self.attention_weights

C:\ProgramData\Anaconda3\envs\ptlast37\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

C:\mellotron\model.py in forward(self, attention_hidden_state, memory, processed_memory, attention_weights_cat, mask, attention_weights)
     84 
     85             attention_weights = F.softmax(alignment, dim=1)
---> 86         attention_context = torch.bmm(attention_weights.unsqueeze(1), memory)
     87         attention_context = attention_context.squeeze(1)
     88 

RuntimeError: invalid argument 6: wrong matrix size at C:/w/1/s/tmp_conda_3.7_104508/conda/conda-bld/pytorch_1572950778684/work/aten/src\THC/generic/THCTensorMathBlas.cu:534

I tried training with my own audio data. The files are in WAV format, 22050 Hz, 16-bit, mono, 1 to 4 seconds long. I pointed "ljs_audiopaths_text_sid_train_filelist.txt" and "ljs_audiopaths_text_sid_val_filelist.txt" at my data, with each line formatted like this: data/speaker/audiofile1.wav|hello world|0
I used this command: python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
But it throws this error:

Traceback (most recent call last):
  File "train.py", line 297, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 187, in train
    train_loader, valset, collate_fn, train_sampler = prepare_dataloaders(hparams)
  File "train.py", line 44, in prepare_dataloaders
    trainset = TextMelLoader(hparams.training_files, hparams)
  File "C:\mellotron\data_utils.py", line 45, in __init__
    self.speaker_ids = self.create_speaker_lookup_table(self.audiopaths_and_text)
  File "C:\mellotron\data_utils.py", line 52, in create_speaker_lookup_table
    d = {int(speaker_ids[i]): i for i in range(len(speaker_ids))}
  File "C:\mellotron\data_utils.py", line 52, in <dictcomp>
    d = {int(speaker_ids[i]): i for i in range(len(speaker_ids))}
ValueError: invalid literal for int() with base 10: ''

Any information on how to solve this would be much appreciated, thanks!
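The last ValueError means int() received an empty string for the speaker id, which usually points to a blank or malformed line in the filelist. A quick diagnostic sketch (not part of the repo; adjust the paths to your filelists):

def check_filelist(path):
    """Report lines that do not match the expected 'audio_path|text|speaker_id' format."""
    with open(path, encoding='utf-8') as f:
        for lineno, line in enumerate(f, 1):
            line = line.rstrip('\n')
            if not line.strip():
                print(f"{path}:{lineno}: empty line")
                continue
            fields = line.split('|')
            if len(fields) != 3:
                print(f"{path}:{lineno}: expected 3 '|'-separated fields, got {len(fields)}")
            elif not fields[2].strip().lstrip('-').isdigit():
                print(f"{path}:{lineno}: speaker id {fields[2]!r} is not an integer")

for fl in ("filelists/ljs_audiopaths_text_sid_train_filelist.txt",
           "filelists/ljs_audiopaths_text_sid_val_filelist.txt"):
    check_filelist(fl)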

CUDA out of memory while running inference file

When running the second cell of "Singing Voice from Music Score" in the inference notebook, the error I am getting is "CUDA out of memory. Tried to allocate 378.00 MiB (GPU 0; 6.00 GiB total capacity; 3.46 GiB already allocated; 206.13 MiB free; 4.02 GiB reserved in total by PyTorch)".

How can I reset this and get Mellotron running?
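Not Mellotron-specific advice, but the usual PyTorch mitigations apply: run inference under no_grad (as the notebook already does), delete intermediate tensors you no longer need, and release cached blocks before the next cell. A sketch assuming the variables from inference.ipynb are already defined:

import torch

with torch.no_grad():   # no autograd buffers are kept during inference
    mel_outputs, mel_outputs_postnet, gate_outputs, alignments = mellotron.inference_noattention(
        (text_encoded, mel, speaker_id, pitch_contour, rhythm))

del mel_outputs, gate_outputs, alignments   # keep only what you still need
torch.cuda.empty_cache()                    # return cached blocks to the CUDA driver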

Style tokens as guide rather than 1:1 transfer

Thanks again for the great work on Mellotron.

The usual implementations of Global Style Tokens allow for transfer of style without locking the target inference to a 1:1 rhythm transfer.

For example, using a 1-minute reference audio with Mellotron appears to generate a 1-minute output regardless of the text input, whereas other GST implementations transfer the style without locking in the rhythm/duration of the reference audio 1:1, e.g. inferring a 5-second sentence from a 1-minute reference while still keeping the 'style' of the reference audio.

Is there any change to the model to enable such a scenario?

Synthesized voice does not correspond to the speaker id

I was using the inference notebook with the pre-trained models. I noticed that the synthesized audio does not always correspond to the speaker id: for many male speakers, the audio still sounds like a female speaker. I tried using both the inference and inference_noattention functions. Has anyone else faced this issue?

Inference troubles on Windows

I wanted to try to synthesize a short sample using a model I'm still training, but I think I'm running into some more issues :/

I ran conda install -c conda-forge notebook but then decided on conda install -c conda-forge jupyterlab, since it has both the new lab and notebook. When opening "inference.ipynb",
I started to run the cells one by one

First block gave this error

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-7-20c49233a480> in <module>
     11 import scipy as sp
     12 from scipy.io.wavfile import write
---> 13 import pandas as pd
     14 import librosa
     15 import torch

ModuleNotFoundError: No module named 'pandas'

Just a simple missing dependency, so I ignored it and moved on to see the rest of the code.

  File "<ipython-input-8-a10e6c979de1>", line 2
    angle = np.radians(angle)
    ^
IndentationError: unexpected indent


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-9f5d81c0336e> in <module>
----> 1 hparams = create_hparams()

NameError: name 'create_hparams' is not defined


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-a2ae391905a9> in <module>
----> 1 stft = TacotronSTFT(hparams.filter_length, hparams.hop_length, hparams.win_length,
      2                     hparams.n_mel_channels, hparams.sampling_rate, hparams.mel_fmin,
      3                     hparams.mel_fmax)

NameError: name 'TacotronSTFT' is not defined

And then I stopped at "Load models". Are these errors supposed to show up? It feels like I'm using the wrong project or something.

Yin pitch set to minimum 100 Hz

The Yin pitch algorithm's minimum pitch is set to 100 Hz, so you can't use any voices lower than that threshold. To lower it, you'll also need to make the window size larger to accommodate the lower frequencies.
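As an illustration of the relationship described above: a YIN analysis window has to span at least one full period of the lowest pitch you want to detect (in practice roughly two), so lowering the minimum f0 means enlarging the window. A back-of-the-envelope sketch; the 22050 Hz rate matches the repo's audio, while the factor of two is a rule of thumb, not the repo's setting:

sampling_rate = 22050

def min_window_for(f0_min, periods=2):
    """Smallest analysis window (in samples) covering `periods` cycles of f0_min."""
    return int(periods * sampling_rate / f0_min)

for f0_min in (100, 80, 60):
    print(f"f0_min={f0_min:>3} Hz -> window >= {min_window_for(f0_min)} samples")
# f0_min=100 Hz -> window >= 441 samples
# f0_min= 80 Hz -> window >= 551 samples
# f0_min= 60 Hz -> window >= 735 samples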

RuntimeError: CUDA out of memory

I'm trying to train Mellotron on my dataset (20 speakers, 5 hours each, samples up to 10 seconds, 22 kHz, Russian language)
and I just can't find a workable batch size:
on a V100 the maximum batch size is 11-12, on a K80 6-7. Is this OK, or am I doing something wrong?

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 15; 11.17 GiB total capacity; 10.81 GiB already allocated; 64.00 KiB free; 53.75 MiB cached)

thanks for the great work!

error during inference

I bumped into an error while importing tensorboardX during inference. The error is:

TypeError Traceback (most recent call last)
in
19 from waveglow.denoiser import Denoiser
20 from layers import TacotronSTFT
---> 21 from train import load_model
22 from data_utils import TextMelLoader, TextMelCollate
23 from text import cmudict, text_to_sequence

/app/train.py in
14 from data_utils import TextMelLoader, TextMelCollate
15 from loss_function import Tacotron2Loss
---> 16 from logger import Tacotron2Logger
17 from hparams import create_hparams
18

/app/logger.py in
1 import random
2 import torch
----> 3 from tensorboardX import SummaryWriter
4 from plotting_utils import plot_alignment_to_numpy, plot_spectrogram_to_numpy
5 from plotting_utils import plot_gate_outputs_to_numpy

/usr/local/lib/python3.6/dist-packages/tensorboardX/__init__.py in
2 """
3
----> 4 from .writer import FileWriter, SummaryWriter
5 from .record_writer import RecordWriter

/usr/local/lib/python3.6/dist-packages/tensorboardX/writer.py in
22 import json
23 import os
---> 24 from .src import event_pb2
25 from .src import summary_pb2
26 from .src import graph_pb2

/usr/local/lib/python3.6/dist-packages/tensorboardX/src/event_pb2.py in
14
15
---> 16 from tensorboardX.src import summary_pb2 as tensorboardX_dot_src_dot_summary__pb2
17
18

/usr/local/lib/python3.6/dist-packages/tensorboardX/src/summary_pb2.py in
14
15
---> 16 from tensorboardX.src import tensor_pb2 as tensorboardX_dot_src_dot_tensor__pb2
17
18

/usr/local/lib/python3.6/dist-packages/tensorboardX/src/tensor_pb2.py in
14
15
---> 16 from tensorboardX.src import resource_handle_pb2 as tensorboardX_dot_src_dot_resource__handle__pb2
17 from tensorboardX.src import tensor_shape_pb2 as tensorboardX_dot_src_dot_tensor__shape__pb2
18 from tensorboardX.src import types_pb2 as tensorboardX_dot_src_dot_types__pb2

/usr/local/lib/python3.6/dist-packages/tensorboardX/src/resource_handle_pb2.py in
20 package='tensorboard',
21 syntax='proto3',
---> 22 serialized_pb=_b('\n&tensorboardX/src/resource_handle.proto\x12\x0btensorboard"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tB/\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01\xf8\x01\x01\x62\x06proto3')
23 )
24

/usr/local/lib/python3.6/dist-packages/google/protobuf/descriptor.py in __new__(cls, name, package, options, serialized_options, serialized_pb, dependencies, public_dependencies, syntax, pool)
882 raise RuntimeError('Please link in cpp generated lib for %s' % (name))
883 elif serialized_pb:
--> 884 return _message.default_pool.AddSerializedFile(serialized_pb)
885 else:
886 return super(FileDescriptor, cls).__new__(cls)

TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "tensorboardX/src/resource_handle.proto":
tensorboard.ResourceHandleProto.device: "tensorboard.ResourceHandleProto.device" is already defined in file "tensorboard/compat/proto/resource_handle.proto".
tensorboard.ResourceHandleProto.container: "tensorboard.ResourceHandleProto.container" is already defined in file "tensorboard/compat/proto/resource_handle.proto".
tensorboard.ResourceHandleProto.name: "tensorboard.ResourceHandleProto.name" is already defined in file "tensorboard/compat/proto/resource_handle.proto".
tensorboard.ResourceHandleProto.hash_code: "tensorboard.ResourceHandleProto.hash_code" is already defined in file "tensorboard/compat/proto/resource_handle.proto".
tensorboard.ResourceHandleProto.maybe_type_name: "tensorboard.ResourceHandleProto.maybe_type_name" is already defined in file "tensorboard/compat/proto/resource_handle.proto".
tensorboard.ResourceHandleProto: "tensorboard.ResourceHandleProto" is already defined in file "tensorboard/compat/proto/resource_handle.proto".

Any idea why this is happening? Is it a compatibility issue?

Thank you!

[QUESTION] About lyrics hyphenation

My question is about how the words in the lyrics are "hyphenated" when aligned to the pitches/notes in the XML file. For example, the word Hallelujah is "hyphenated" (probably not the correct term) as

Hal ° le ° lu ° ° jah (where there is an additional note between lu and jah):

     <note default-x="376">
        <pitch>
          <step>D</step>
          <octave>5</octave>
        </pitch>
        <duration>2</duration>
        <voice>1</voice>
        <type>eighth</type>
        <stem default-y="-45">down</stem>
        <lyric default-y="-81" number="1" relative-x="7">
          <syllabic>begin</syllabic>
          <text>Hal</text>
        </lyric>
      </note>
    </measure>
    <!--=======================================================-->
    <measure number="4" width="419">
      <print new-system="yes">
        <system-layout>
          <system-margins>
            <left-margin>2</left-margin>
            <right-margin>0</right-margin>
          </system-margins>
          <system-distance>113</system-distance>
        </system-layout>
      </print>
      <note default-x="90">
        <pitch>
          <step>C</step>
          <alter>1</alter>
          <octave>5</octave>
        </pitch>
        <duration>2</duration>
        <voice>1</voice>
        <type>eighth</type>
        <stem default-y="-50">down</stem>
        <notations>
          <slur number="1" placement="above" type="start"/>
        </notations>
        <lyric default-y="-81" number="1">
          <syllabic>middle</syllabic>
          <text>le</text>
        </lyric>
      </note>
      <note default-x="136">
        <pitch>
          <step>D</step>
          <octave>5</octave>
        </pitch>
        <duration>4</duration>
        <voice>1</voice>
        <type>quarter</type>
        <stem default-y="-45">down</stem>
        <notations>
          <slur number="1" type="stop"/>
        </notations>
      </note>
      <note default-x="226">
        <pitch>
          <step>C</step>
          <alter>1</alter>
          <octave>5</octave>
        </pitch>
        <duration>2</duration>
        <voice>1</voice>
        <type>eighth</type>
        <stem default-y="-50">down</stem>
        <lyric default-y="-81" number="1" relative-x="7">
          <syllabic>middle</syllabic>
          <text>lu</text>
        </lyric>
      </note>
      <note default-x="272">
        <pitch>
          <step>D</step>
          <octave>5</octave>
        </pitch>
        <duration>4</duration>
        <voice>1</voice>
        <type>quarter</type>
        <stem default-y="-45">down</stem>
        <lyric default-y="-81" number="1" relative-x="9">
          <syllabic>end</syllabic>
          <text>jah</text>
        </lyric>
      </note>

The effect of passing in the original MEL seems minimal

Hi,

In studying the way this model works with the pre-trained model, I found that the mel input seemed to have little effect. In fact, in my very small blind poll (my co-worker and our 'sound guy'), the effect was judged detrimental. Curious about your thoughts.

# from code snippet in box 14 of inference.ipynb

with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, _ = mellotron.inference_noattention(
        (text_encoded, mel, speaker_id, pitch_contour, rhythm))

Cannot resume training without quality loss

Due to unfortunate circumstances my training process was terminated abruptly. Any attempt to resume it has resulted in a model that is much worse than before the interruption; even after days of training it doesn't seem to get back to its previous quality, as if it were unrecoverable. I attempted 2 different methods:

  1. I edit this line in train.py with the latest checkpoint number (i.e. "123456" if the checkpoint filename is "checkpoint_123456"). I checked the log with TensorBoard but it doesn't seem to contain any information about the latest epoch, so I leave it at 0 (I guess that is purely cosmetic and only resets the epoch counter, but results shouldn't be affected, right?). I make a backup and run
    python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
    just like when I began training, and it looks like it resumes from the latest iteration I pointed it to, but the results are far worse than they used to be, even after 3 days of training.
  2. I revert train.py to its original state and begin training from scratch, but warm-starting from my latest model (mylatestmodel.pt) instead of the LibriTTS pre-trained model (mellotron_libritts.pt) provided by this repo, so I run
    python train.py --output_directory=outdir --log_directory=logdir -c models/mylatestmodel.pt --warm_start
    but the generated results sound even worse than (1) after 2 days of training.
    Is it actually possible to put an interrupted training back on track? If so, what is the correct method? It can be quite frustrating to lose days of training because of an incident beyond our control.
    I think a console logger could be a useful addition too: if the terminal window gets closed unexpectedly you'd still have a record of it. In my case, it would have been useful to check how many epochs had been reached before training was suddenly interrupted, even if that is purely cosmetic.

Generic Inference

I'm trying to perform inference without a file for copying style from, just using this repo like a multi-speaker Tacotron2 without GSTs.
What code do I need for that?

In Tacotron 2 you only need the text input to use tacotron.inference, but I'm not sure how to do inference here where I just input text and a speaker ID.

Your notebook has 2 examples of "tacotron.inference_noattention" and how to get the inputs for them, but no examples for just "tacotron.inference", and I'm having trouble telling from the source code.

Singing Voice from Music Score

When I synthesize from a music score (MusicXML), I have to pass "mel" as an input parameter.
However, looking at the code in the provided inference.ipynb, the mel input comes from the dataloader, which has nothing to do with Handel's Hallelujah.

Can I use any mel?

Question about custom dataset

Hi everyone!

Firstly, thank you for the great implementation.

I haven't yet understood how I should prepare my data for training, so I'd appreciate it if someone could clarify that for me. My assumptions are below (a small illustrative sketch follows them):

  • If I have data from 10 speakers, I should divide it into 2 files in the "filelist" directory (train and val)
  • Each of those files should contain a representative sample of all speakers
  • The txt file format should be: path_to_audio|transcripts|speaker_id

Are my assumptions correct?
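Those assumptions match the audio_path|transcript|speaker_id format used in the repo's existing filelists. Purely as an illustration, a sketch that writes train/val filelists in that format from an in-memory list (file names and the split ratio are made up):

import random

# (audio_path, transcript, speaker_id) triples from your own dataset -- example values
samples = [
    ("data/speaker0/utt_0001.wav", "hello world", 0),
    ("data/speaker1/utt_0001.wav", "good morning", 1),
]

random.seed(0)
random.shuffle(samples)
split = int(0.95 * len(samples))   # arbitrary 95/5 train/val split

def write_filelist(path, rows):
    with open(path, "w", encoding="utf-8") as f:
        for audio, text, sid in rows:
            f.write(f"{audio}|{text}|{sid}\n")

write_filelist("filelists/my_train_filelist.txt", samples[:split])
write_filelist("filelists/my_val_filelist.txt", samples[split:])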

inference_noattention for new sequences

Can I somehow use inference_noattention for new sequences, and not just for ground-truth ones?
Or is only the "inference" method suitable for that?

If so, how do I obtain the right rhythm for a new sequence in order to copy the style of the selected audio?

confused about source speaker id in style and rhythm transfer

Hi, I'm a little confused about the speaker id of the reference audio and text. When doing style and rhythm transfer, the given reference speaker ids are re-ordered as 0, 1, 2, ... (see data_utils.py and the inference script):

file_idx = 0
print(dataloader.audiopaths_and_text)
audio_path, text, sid = dataloader.audiopaths_and_text[file_idx]

# get audio path, encoded text, pitch contour and mel for gst
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()    
pitch_contour = dataloader[file_idx][3][None].cuda()
mel = load_mel(audio_path)
print(audio_path, text)

# load source data to obtain rhythm using tacotron 2 as a forced aligner
x, y = mellotron.parse_batch(datacollate([dataloader[file_idx]]))

In that case, for the same audio like:
"audio_10|text_10|10"
in 2 different filelists

A.txt
audio_10|text_10|10
audio_0|text_0|0
B.txt
audio_10|text_10|10
audio_20|text_20|20

The reference speaker id (10) will be mapped to mellotron_id=1 and mellotron_id=0 respectively, which would surely cause the attention map (a.k.a. the rhythm in Mellotron) to be different.

with torch.no_grad():
    # get rhythm (alignment map) using tacotron 2
    mel_outputs, mel_outputs_postnet, gate_outputs, rhythm = mellotron.forward(x)
    rhythm = rhythm.permute(1, 0, 2)

Is this expected? Or have I misunderstood something?

Dimensions mismatch when using pretrained model

Hi, thanks for sharing this great project.

I get some errors when trying to fine-tune the pretrained model you provided:

size mismatch for embedding.weight: copying a param with shape torch.Size([148, 512]) from checkpoint, the shape in current model is torch.Size([185, 512]).
size mismatch for decoder.attention_rnn.weight_ih: copying a param with shape torch.Size([4096, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1281]).
size mismatch for decoder.attention_layer.memory_layer.linear_layer.weight: copying a param with shape torch.Size([128, 512]) from checkpoint, the shape in current model is torch.Size([128, 1024]).
size mismatch for decoder.decoder_rnn.weight_ih: copying a param with shape torch.Size([4096, 1536]) from checkpoint, the shape in current model is torch.Size([4096, 2048]).
size mismatch for decoder.linear_projection.linear_layer.weight: copying a param with shape torch.Size([80, 1536]) from checkpoint, the shape in current model is torch.Size([80, 2048]).
size mismatch for decoder.gate_layer.linear_layer.weight: copying a param with shape torch.Size([1, 1536]) from checkpoint, the shape in current model is torch.Size([1, 2048]).

Maybe the pretrained model was trained with different code? For example, n_symbols causes some of the problems; after replacing it with the value from Tacotron 2 that problem is solved. Do you know how I could solve the other problems?

Thanks again :)
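Not an official fix, but these size mismatches are what load_state_dict reports when the current hparams (e.g. n_symbols or the decoder input dimensions) differ from those used for the checkpoint. Besides aligning the hparams, a common workaround is to drop checkpoint tensors whose shapes no longer match before loading; a sketch, assuming the checkpoint keeps its weights under a 'state_dict' key:

import torch

def load_matching_weights(model, checkpoint_path):
    """Load only the checkpoint tensors whose names and shapes match the current model."""
    ckpt = torch.load(checkpoint_path, map_location='cpu')['state_dict']
    own = model.state_dict()
    kept = {k: v for k, v in ckpt.items() if k in own and v.shape == own[k].shape}
    skipped = sorted(set(ckpt) - set(kept))
    own.update(kept)
    model.load_state_dict(own)
    print(f"loaded {len(kept)} tensors, skipped {len(skipped)}: {skipped}")
    return model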

Sampling rate used for pretrained model of LibriTTS

Hello,
I would like to ask about the hparams used for the LibriTTS pretrained model.
The default hparams.py seems to be written for the LJSpeech DB, so it would be appreciated if you could share the hparams you used for the pretrained LibriTTS model.
In particular, the sampling rate is set to 22050 in hparams.py while wavfile.read() returns a sampling rate of 24000, resulting in an assert.
Changing this is no problem, but I would like to know at what sampling rate the pretrained model was trained.
Thanks :D
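For what it's worth, resampling the wavs offline to the rate the checkpoint expects sidesteps the assert. A sketch using librosa and soundfile, assuming a 22050 Hz target (the value in the default hparams.py); the example paths are illustrative:

import librosa
import soundfile as sf

def resample_wav(in_path, out_path, target_sr=22050):
    """Read a wav and write a copy resampled to target_sr."""
    audio, _ = librosa.load(in_path, sr=target_sr)   # librosa resamples on load
    sf.write(out_path, audio, target_sr)

resample_wav("LibriTTS/train-clean-100/40/121026/40_121026_000224_000000.wav",
             "data/40_121026_000224_000000_22k.wav")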

How to fine tune a new voice using pretrained model

Hi, thanks for this amazing project! I wanted to ask a few short questions.

I want to train the model on a new voice. The dataset is similar to the LJ Speech dataset, with short audio clips of a single person (a man in this case) speaking English, each between 1 and 10 seconds long (about 6 hours in total). I plan to use the pretrained LibriTTS model you provide as a starting point.

  1. I'm not sure how I should prepare my dataset. The third column of the TXT file that contains the transcriptions specifies the ID of the speaker; should I assign a new ID to my speaker or reuse the ID of the most similar speaker in the LibriTTS dataset?
  2. Do you have some other suggestions related to the hparams?
  3. I'm using 8 V100 GPUs so I'm able to use a batch size of about 24. Do you know how many iterations are usually needed to get decent results?

Using own text to generate speech using Mellotron

Hi,

I'm trying to generate speech from my own text, with the style (pitch contour, rhythm, f0) transferred from an input wav file. I've been trying to modify the code in model.py but I'm not able to inject my own text.

Could you please give me some guidance?
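For illustration only: the encoding step from inference.ipynb (quoted in an earlier issue above) works for arbitrary text. A sketch assuming hparams, arpabet_dict, the loaded mellotron model and the reference-derived mel, speaker_id, pitch_contour and rhythm are already in scope as in the notebook; note that the rhythm (attention map) was aligned to the reference text, so a new text of a very different length may still need its own alignment:

import torch
from text import text_to_sequence

custom_text = "This is my own sentence."   # any text you want to synthesize
text_encoded = torch.LongTensor(
    text_to_sequence(custom_text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()

with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, _ = mellotron.inference_noattention(
        (text_encoded, mel, speaker_id, pitch_contour, rhythm))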

Windows Anaconda

Has anyone been able to get this working with Anaconda on Windows? I've run into many issues while attempting to install it and Apex with PyTorch 1.3.1 and CUDA 10.1 from the conda repo. I'll link some logs later.

Adding another speaker

I am trying to train the pre-trained LibriTTS model with one more speaker. I've added around 15 minutes of audio from this speaker to the train-clean-100 dataset, added the transcriptions in an 85:15 ratio (train:validation) and increased the number of speakers to 124 in hparams.py. All the audio files were also resampled to 22050 Hz, 16-bit. But when I run inference on the checkpoints I get only noise for all the speakers.

regarding text cleaner

I saw there are some Python files (e.g. acronyms.py, abbreviations.py, datestime.py) in the text folder, but they don't seem to be used by the text cleaner. Is that true?

Installation issues on Ubuntu 18.04

Installation can be tricky on Ubuntu 18.04; here is a possible solution to this error:

    ============================================================================
                            * The following required packages can not be built:
                            * freetype, png * Try installing freetype with `apt-
                            * get install libfreetype6-dev` * Try installing png
                            * with `apt-get install libpng12-dev`

First install the following:

sudo apt-get install --reinstall libpng16-16=1.6.34-1
pip install -r requirements.txt

Then one error is still there:

============================================================================
                            * The following required packages can not be built:
                            * freetype * Try installing freetype with `apt-get
                            * install libfreetype6-dev`

Do the following

sudo apt-get install libfreetype6-dev
pip install -r requirements.txt

Now you may get this error:

    In file included from src/ft2font.cpp:9:0:
    src/mplutils.h:31:10: fatal error: Python.h: No such file or directory
     #include <Python.h>
              ^~~~~~~~~~
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    ----------------------------------------

to fix it install python dev:

sudo apt install libpython3.7-dev
pip install -r requirements.txt

You may find now another (!) error:

      File "scipy/linalg/setup.py", line 19, in configuration
        raise NotFoundError('no lapack/blas resources found')
    numpy.distutils.system_info.NotFoundError: no lapack/blas resources found

To get rid of it try

sudo apt install libatlas-base-dev
sudo apt-get install libblas-dev  liblapack-dev

At this point, unfortunately, I have a scipy compilation error that I cannot get rid of so far; here is the full error log.

The most optimal audio settings for training dataset preparation?

Previously I trained on a small dataset where the speaker was recorded in a single session (so the volume level and quality never changed for the entire duration), and the resulting model sounded promising (it was just for testing). I then trained on a larger dataset where the speaker was recorded in multiple sessions on different places and equipment (so the volume levels and quality varied), and the resulting model was a disaster: it had an insanely high volume, so much that it could ruin your eardrums if you kept your earphones on, with clipping and screeching almost all the time. However, none of the training data sounded too loud or weird at all; the levels sounded normal to the ear and never hit the red bar. So I guess all these audio files need to be normalized/processed first, e.g. brought to the same target dB level, or things can go horribly wrong. What parameters do you suggest for preparing the audio data to get the best out of it? Of course I converted everything to 22050 Hz mono.
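Not an official recommendation, but a common preprocessing step for recordings from mixed sessions is to normalize loudness before training. A rough sketch with librosa and numpy; the -27 dBFS RMS target and the peak limit are arbitrary example values:

import numpy as np
import librosa
import soundfile as sf

def normalize_wav(in_path, out_path, target_dbfs=-27.0, peak_limit=0.95):
    """RMS-normalize a mono wav to target_dbfs, then clamp the peak to avoid clipping."""
    audio, sr = librosa.load(in_path, sr=22050, mono=True)
    rms = np.sqrt(np.mean(audio ** 2)) + 1e-9
    audio = audio * (10 ** (target_dbfs / 20) / rms)
    peak = np.max(np.abs(audio))
    if peak > peak_limit:                    # avoid clipping after the gain change
        audio = audio * (peak_limit / peak)
    sf.write(out_path, audio, sr)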

Question regarding paper

Hello,
First, I apologize if this is not the proper channel to ask about your paper "Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens" (Rafael Valle*, Jason Li*, Ryan Prenger, Bryan Catanzaro, NVIDIA Corporation).

According to Table 1 of the paper, GPE is 0.08% for both the single-speaker and multi-speaker cases.
I tried to replicate this but it didn't go well: VDE and FFE were replicated, but not GPE.
My question is, what did you use as the denominator in your GPE equation?
According to the "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron" paper, GPE uses the number of frames that are voiced in both the generated and reference signals, whereas the other metrics in that paper, such as VDE and FFE, use the total number of frames.

It would make sense to me if you used the total number of frames to calculate GPE in Table 1 of the Mellotron paper.
I am sorry if this question is stupid and I'm just being silly.

Thanks!
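For reference, the definitions from the prosody-transfer paper cited above are usually written as follows (my transcription, with $p_t, \hat{p}_t$ the reference and generated pitch, $v_t, \hat{v}_t$ the voicing decisions, and $T$ the total number of frames):

$$\mathrm{GPE} = \frac{\sum_t \mathbf{1}\!\left[\lvert \hat{p}_t - p_t \rvert > 0.2\, p_t\right] v_t \hat{v}_t}{\sum_t v_t \hat{v}_t},\qquad
\mathrm{VDE} = \frac{\sum_t \mathbf{1}\!\left[v_t \neq \hat{v}_t\right]}{T},\qquad
\mathrm{FFE} = \frac{\sum_t \left(\mathbf{1}\!\left[\lvert \hat{p}_t - p_t \rvert > 0.2\, p_t\right] v_t \hat{v}_t + \mathbf{1}\!\left[v_t \neq \hat{v}_t\right]\right)}{T}$$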

Training DB for Waveglow pretrained model in this repo

Hello,
May I ask what database was used to train the pretrained WaveGlow model linked in the README of this repo?
I searched NVIDIA GPU Cloud to see if I could find a description of this pretrained model, but the pretrained WaveGlow checkpoints there are only for a single speaker (LJSpeech).
I don't think the pretrained WaveGlow referred to in this repo was trained on a single speaker, though.
Thanks!

Information about p_teacher_forcing hyperparameter

I want to know what p_teacher_forcing was set to while training Mellotron. I am using the default value of 1.0 and I am not able to get a proper alignment/attention map even after 100k steps. I was wondering if something else was used when training the LibriTTS model.

Warning Message in yin

For some of my audio files I'm getting a warning message:

yin.py:44: RuntimeWarning: invalid value encountered in true_divide
cmndf = df[1:] * range(1, N) / np.cumsum(df[1:]).astype(float) #scipy method

Any ideas? The problem children are encoded exactly like their peers.

Thanks
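A guess rather than a confirmed diagnosis: that RuntimeWarning comes from the division in the cumulative mean normalized difference, which produces 0/0 when a frame is pure digital silence (or otherwise constant). A sketch to spot such frames in the offending files (the frame length is an arbitrary choice, the path is an example):

import numpy as np
import librosa

def silent_frames(path, frame=1024):
    """Return sample offsets of frames that are exactly zero (pure digital silence)."""
    audio, _ = librosa.load(path, sr=22050, mono=True)
    return [i for i in range(0, len(audio) - frame, frame)
            if np.max(np.abs(audio[i:i + frame])) == 0.0]

print(silent_frames("data/problem_file.wav"))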

Cannot train with multi GPUs

I cloned the repository to my local server, then started training on my own dataset.

I can run with one GPU, and the logs are as follows:

FP16 Run: False
Dynamic Loss Scaling: False
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Epoch: 0
/home/yablon/mellotron/yin.py:44: RuntimeWarning: invalid value encountered in true_divide
  cmndf = df[1:] * range(1, N) / np.cumsum(df[1:]).astype(float) #scipy method
Train loss 0 18.868097 Grad Norm 6.209010 19.63s/it
Validation loss 0: 63.929592
Saving model and optimizer state at iteration 0 to /home/yablon/training/mellotron/output/checkpoint_0
Train loss 1 39.906715 Grad Norm 18.103324 3.63s/it

But when I run with multiple GPUs, life becomes difficult for me.

The first problem is an "apply_gradient_allreduce is not defined" error. OK, that's easy to fix: I just import it from distributed.

The next problem is that training seems to stop at "Done initializing distributed"; no more logs are printed after that.

Can you fix this? Thank you!

hparams training settings

Hi,

referring to your paper:

4.1. Training Setup
"For all the experiments, we trained on LJS, Sally and the train-clean-100 subset of LibriTTS with over 100 speakers and 25 minutes on average per speaker."

I'd be interested in knowing whether you used the same training settings for all 3 (LJS, Sally and the train-clean-100 subset of LibriTTS).

Did you just change the training_files and validation_files paths, or many of the other parameters too?

https://github.com/NVIDIA/mellotron/blob/master/hparams.py#L5

THANKS
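Not an answer, but for anyone else experimenting: train.py passes the --hparams string through to create_hparams, so switching datasets only needs training_files and validation_files overridden. A sketch, assuming create_hparams accepts the same comma-separated override string as the README's distributed example (the filelist names below are made up):

from hparams import create_hparams

hparams = create_hparams(
    "training_files=filelists/sally_train_filelist.txt,"
    "validation_files=filelists/sally_val_filelist.txt")
print(hparams.training_files, hparams.validation_files)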

How to use the Yin algorithm to compute GPE and FFE

I want to evaluate my synthesized audio. The paper says that pitch and voicing metrics are computed using the Yin algorithm. I read the code but I didn't see an evaluation part. Could anyone share code to compute GPE and FFE? Thank you!
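As far as I can tell the repo does not ship an evaluation script, but the metrics are simple to compute once you have framewise f0 and voicing decisions for the reference and synthesized audio (e.g. from the compute_yin function in yin.py, with 0 marking unvoiced frames). A sketch using the standard 20% threshold; this is my implementation, not the authors':

import numpy as np

def pitch_metrics(f0_ref, f0_syn, threshold=0.2):
    """GPE, VDE and FFE for equal-length framewise f0 arrays (0 = unvoiced)."""
    f0_ref, f0_syn = np.asarray(f0_ref, float), np.asarray(f0_syn, float)
    v_ref, v_syn = f0_ref > 0, f0_syn > 0
    both = v_ref & v_syn
    gross = both & (np.abs(f0_syn - f0_ref) > threshold * f0_ref)
    n = len(f0_ref)
    gpe = gross.sum() / max(both.sum(), 1)              # denominator: frames voiced in both
    vde = (v_ref != v_syn).sum() / n
    ffe = (gross.sum() + (v_ref != v_syn).sum()) / n
    return gpe, vde, ffe

print(pitch_metrics([110, 0, 220, 200], [112, 0, 300, 0]))   # (0.5, 0.25, 0.5)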

Pitch contour not being applied

I trained the model with my own dataset, and strangely mel_outputs_postnet is always the same regardless of the pitch contour value, given that all other inputs and the seed are fixed (i.e. mellotron.inference_noattention((text_encoded, mel, speaker_id, pitch_contour_A, rhythm)) == mellotron.inference_noattention((text_encoded, mel, speaker_id, pitch_contour_B, rhythm)) where pitch_contour_A != pitch_contour_B). Other than that everything works flawlessly, so it makes me wonder what could possibly have gone wrong. I first thought the model wasn't able to extract pitch during training, so I retrained it after tweaking harm_thresh, but that doesn't seem to change anything.

UPDATE: When prenet_f0_kernel_size=1, self.prenet_f0(f0s) outputs tensors with negative values, which are then passed to ReLU, resulting in all-zero tensors. I'm currently experimenting with prenet_f0_kernel_size=3; while it learns pitch, the validation loss hovers around 0.5 and there is a significant degradation in pronunciation quality.

UPDATE 2: Changing ConvNorm's weight initialization from xavier_uniform to xavier_normal also seems to solve the issue. Once I'm done with the experiment I'll report the results and close the issue.

UPDATE 3: Changing the seed also works.
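For anyone trying to reproduce UPDATE 2 without editing layers.py: the change amounts to re-initializing the f0 prenet convolution weights with xavier_normal_ instead of xavier_uniform_. A sketch of that idea only; the exact attribute path of the f0 prenet is my assumption, not verified against the code:

import torch
from torch import nn

def reinit_conv_xavier_normal(module):
    """Re-initialize Conv1d weights with xavier_normal_ instead of xavier_uniform_."""
    if isinstance(module, nn.Conv1d):
        nn.init.xavier_normal_(module.weight)

# e.g. applied to the f0 prenet of an already constructed model (attribute name assumed):
# mellotron.decoder.prenet_f0.apply(reinit_conv_xavier_normal)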

The speaker ids are misaligned in inference.ipynb

The mapping of speakers to mellotron_ids in inference.ipynb is incorrect: it uses the evaluation filelist instead of the training filelist. The evaluation filelist does not contain all the speaker ids, which results in the speaker ids not being matched correctly. I've attached a test program that demonstrates this, as well as its output, but here are a few examples:

LibriTTS id (eval)   LibriTTS id (train)   Mellotron id (eval)   Mellotron id (train)   Speaker name (eval)   Speaker name (train)
40                   40                    26                    40                     Vicki Barbour         Vicki Barbour
78                   78                    71                    103                    Hugh McGuire          Hugh McGuire
87                   83                    83                    111                    Rosalind Wills        Catharine Eastman
118                  87                    2                     119                    Alex Buie             Rosalind Wills

test_speaker_ids.out.txt

test_speaker_ids.py.txt
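A workaround consistent with the report above is to build the lookup from the training filelist (the one the checkpoint was trained with) rather than the evaluation filelist, reusing TextMelLoader from data_utils.py. A sketch, assuming hparams.training_files points at that training filelist:

from data_utils import TextMelLoader
from hparams import create_hparams

hparams = create_hparams()
# TextMelLoader builds its speaker lookup in create_speaker_lookup_table (data_utils.py),
# so reading the training filelist reproduces the ids the model was trained with.
train_set = TextMelLoader(hparams.training_files, hparams)
libritts_to_mellotron = train_set.speaker_ids      # dict: dataset speaker id -> mellotron id
print(libritts_to_mellotron.get(40))               # e.g. LibriTTS speaker 40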

Unable to reproduce decent quality generated audio with training data samples

Hi everyone,

Thanks a lot for releasing the code for the Mellotron model and amazing research work!

I was trying to reuse the checkpoints y'all posted and perform voice transfer using a sample from the training data (LibriTTS, the same train-clean-100 subset used for training the model). Specifically, I'm using the inference colab, but I'm trying to run it on audio clips from the training data (specifically 40_121026_000224_000000.wav, with the text So sudden and violent was the fit that the unfortunate prisoner was unable to complete the sentence; a violent convulsion shook his whole frame, his eyes started from their sockets, his mouth was drawn on one side, his cheeks became purple, he struggled, foamed, dashed himself about, and uttered the most dreadful cries, which, however, Dantes prevented from being heard by covering).

  1. LibriTTS speaker id 40 is present in Mellotron with id 26. So, in the input data to the model, I specified the Mellotron id as 26 and tried to transfer the voice to a random target speaker. However, the quality of the output is not as good as the samples on the website. Am I missing something?

Here is the colab: https://colab.research.google.com/drive/1e0GCP0fAFoXLMY7S_CUnJME9e4OCzOyy

  2. I also found that if my text is slightly different from the audio content (e.g. the audio says 'stared' but the corresponding text has 'started'), then the output audio is okay until 'stared' but gets significantly worse after the mismatched word. Is this because the decoder is auto-regressive? Is there a way to fix this issue?

synthesized speaker quality changed

Thanks for sharing the repo. I have trained the model using this repo on LJ Speech, and I am performing inference using only GST. During inference I use an out-of-dataset file as the style file, and the synthesized speaker quality changes a lot: it is decent, but it doesn't sound like the original LJ Speech speaker. How can I fix that? Any help would be appreciated. Thanks.
