comprehensive-e2e-tts's Introduction

Comprehensive-E2E-TTS - PyTorch Implementation

A non-autoregressive end-to-end text-to-speech model (generating a waveform directly from text), supporting a family of SOTA unsupervised duration modeling methods. This project grows with the research community, aiming to achieve the ultimate E2E-TTS. Any suggestions toward the best end-to-end TTS are welcome :)

Architecture Design

Linguistic Encoder

Audio Upsampler

Duration Modeling

Quickstart

In the following documents, DATASET refers to the name of a dataset such as LJSpeech or VCTK.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

A Dockerfile is also provided for Docker users.
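A typical build-and-run flow is sketched below; the image tag and mount point are arbitrary choices, --gpus all assumes the NVIDIA Container Toolkit is installed, and the trailing bash assumes the image provides a shell, so adjust as needed.

docker build -t comprehensive-e2e-tts .
docker run --gpus all -it -v $(pwd):/workspace comprehensive-e2e-tts bash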

Inference

You have to download the pretrained models (will be shared soon) and put them in output/ckpt/DATASET/.

For a single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET

For a multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
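For example, assuming an LJSpeech checkpoint (the text and restore step below are only placeholders), a concrete single-speaker call would look like

python3 synthesize.py --text "Hello world, this is a test." --restore_step 900000 --mode single --dataset LJSpeech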

Batch Inference

Batch inference is also supported; try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20% and decrease the volume by 20% by running

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Add --speaker_id SPEAKER_ID for a multi-speaker TTS.
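Pitch can be adjusted in the same way; assuming the analogous --pitch_control flag (not shown above, so verify it in synthesize.py), raising the pitch by 20% on top of the changes above would look like

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --pitch_control 1.2 --duration_control 0.8 --energy_control 0.8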

Training

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • VCTK: the CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads about 400 sentences, selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive.

Other single-speaker TTS datasets (e.g., Blizzard Challenge 2013) and multi-speaker TTS datasets (e.g., LibriTTS) can be added by following LJSpeech and VCTK, respectively. Moreover, your own language and dataset can be adapted by following the guide here.

Preprocessing
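The individual preprocessing commands are not reproduced here. In this family of FastSpeech2-style repositories the flow is usually an alignment-preparation step followed by feature extraction, roughly as below (the script names are assumed, so check the repository):

python3 prepare_align.py --dataset DATASET   # assumed script name
python3 preprocess.py --dataset DATASET      # assumed script name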

Training

Train your model with

python3 train.py --dataset DATASET

Useful options:

  • The trainer assumes single-node multi-GPU training. To use specific GPUs, prepend CUDA_VISIBLE_DEVICES=<GPU_IDs> to the above command, as shown below.
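For example, to train on GPUs 0 and 1 only:

CUDA_VISIBLE_DEVICES=0,1 python3 train.py --dataset DATASET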

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost.

Notes

  • There are two options for speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle between them in the config (between 'none' and 'DeepSpeaker').
  • DeepSpeaker on the VCTK dataset shows clear identification among speakers. The following figure shows a t-SNE plot of the extracted speaker embeddings.
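As a rough illustration of how such a plot can be produced, here is a minimal sketch (not this repo's code); it assumes you already have an [N, D] array of extracted speaker embeddings and matching speaker labels, and the file names are hypothetical:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# assumed inputs: one embedding row and one integer speaker label per utterance
embeddings = np.load("speaker_embeddings.npy")  # hypothetical file
speaker_ids = np.load("speaker_ids.npy")        # hypothetical file

# project the embeddings to 2D and color each point by its speaker
points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)

plt.figure(figsize=(6, 6))
plt.scatter(points[:, 0], points[:, 1], c=speaker_ids, s=4, cmap="tab20")
plt.title("t-SNE of speaker embeddings")
plt.show()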

Citation

Please cite this repository using the "Cite this repository" button in the About section (top right of the main page).

References

comprehensive-e2e-tts's People

Contributors

keonlee9420, roedoejet


comprehensive-e2e-tts's Issues

Question about Differentiable Duration Modeling

Hello, I'm trying to implement the Differentiable Duration Modeling (DDM) module introduced in
Differentiable Duration Modeling for End-to-End Text-to-Speech.

I opened this issue to get advice on implementing DDM.

My implementation of the Differentiable Alignment Encoder outputs an attention-like matrix from noise input.
But training with DDM is too slow (about 10 s/iter); it seems to hang in the backward pass.

Can anyone give me some advice on speeding up the recursive tensor operations?
Should I use cuda.jit as in Soft-DTW implementations? Or is there something wrong with the approach itself?

The module's output from noise input and the code are shown below.

Thank you.

import torch
import matplotlib.pyplot as plt

dae = DifferentiableAlignmentEncoder()
b = 5
text_max_len = 25
mel_max_len = 85
dim = 256
x_len = torch.randint(1, text_max_len, (b,))
mel_len = torch.randint(2, mel_max_len, (b,))
x = torch.randn(b, int(x_len.max()), dim)  # cast to a Python int for the size argument
s, l, q, dur = dae(x, x_len, mel_len)

# plot L, Q, S and the durations of one sample, each in its own figure
i = 2
plt.figure(); plt.imshow(l[i, :x_len[i], :mel_len[i]].detach().numpy())
plt.figure(); plt.imshow(q[i, :x_len[i], :mel_len[i]].detach().numpy())
plt.figure(); plt.imshow(s[i, :x_len[i], :mel_len[i]].detach().numpy())
plt.figure(); plt.plot(dur[i, :x_len[i]].numpy())
plt.show()

[Figures: the L matrix, Q matrix, S matrix (soft attention), and the predicted duration from noise input]

Code

import torch
import torch.nn as nn

# ConvNorm, LinearNorm, and get_mask_from_lengths below are this repo's own
# building blocks; they are assumed to be imported from its model/utility modules.


class DifferentiableAlignmentEncoder(nn.Module):
    def __init__(
        self,
        hidden_dim=256,
        conv_kernels=3,
        num_layers=3,
        dropout_p=0.2,
        max_mel_len=1150  # max length of mel-spectrogram frames in the training data
    ):
        super().__init__()
        
        self.conv_layer_blocks = nn.ModuleList([
            nn.Sequential(
                ConvNorm(hidden_dim, hidden_dim, conv_kernels, bias=True, transpose=True),
                nn.ReLU(),
                nn.LayerNorm(hidden_dim),
                nn.Dropout(dropout_p)
            )
            for i in range(num_layers)
        ])
        self.dur_prob_proj = LinearNorm(hidden_dim, max_mel_len, bias=False)
        
        self.ddm = DifferentiableDurationModeling()
    
    def forward(self, x, phon_lens, mel_lens, x_masks=None):
        
        """
        x  : Tensor[B, T_phon, C_phone]
        phon_lens : LongTensor[B]
        mel_lens : LongTensor[B]
        s : S Matrix : Tensor[B, T_phon, T_mel]
        dur : Duration Matrix : Tensor[B, T_phon]
        """
        
        max_mel_len = int(torch.max(mel_lens))
        
        for layer in self.conv_layer_blocks:
            if x_masks is not None:
                x = x * (1 - x_masks.float())
            x = layer(x)
        x = self.dur_prob_proj(x)
                
        norm = torch.randn(x.shape).to(x.device)
        x = x + norm
        
        p = torch.sigmoid(x)
        p = p[:, :, :max_mel_len]
        
        s, l, q, dur = self.ddm(p, phon_lens, mel_lens)
        
        dur = dur.detach()
        
        return s, l, q, dur
    
    
class DifferentiableDurationModeling(nn.Module):
    def __init__(self):
        super().__init__()
        
    def _get_attn_mask(self, phon_lens, mel_lens):
        phon_mask = ~get_mask_from_lengths(phon_lens)
        mel_mask = ~get_mask_from_lengths(mel_lens)
        
        return phon_mask.unsqueeze(-1) * mel_mask.unsqueeze(1), phon_mask
    
    def forward(self, p, phon_lens, mel_lens):
        
        attn_mask, phon_mask = self._get_attn_mask(phon_lens, mel_lens)
        
        p = p * attn_mask
        
        l = self._get_l(p, attn_mask)
        
        l = l * attn_mask

        dur = self._get_duration(l)
        
        dur = dur * phon_mask

        q = self._get_q(l)
        
        q = q * attn_mask
        
        s = self._get_s(q, l)
        
        s = s * attn_mask
            
        return s, l, q, dur
    
    def _get_duration(self, l):
        with torch.no_grad():
            m = torch.arange(1, l.shape[-1] + 1)[None, :].expand_as(l).to(l.device)
            dur = torch.sum(m * l, dim=-1)
        return dur
    
    def _get_l(self, p, mask):
        # Computing l directly is numerically unstable for the gradient computation;
        # the paper's authors resolve this by computing the product in log-space.
        _p = torch.log(mask[:, :, 1:].float() - p[:, :, 1:] + 1e-8)
        p = torch.log(p + 1e-8)
        com = torch.cumsum(_p, dim=-1)
        l_0 = com[:, :, -1].unsqueeze(-1)
        l_1 = p[:, :, 1].unsqueeze(-1)
        
        l_m = com[:, :, :-1] + p[:, :, 2:]
                
        l = torch.cat([l_0, l_1, l_m], dim=-1)

        l = torch.exp(l)
        
        return l
    
    def _variable_kernel_size_convolution(self, x, y, length):
        # Discrete convolution of two length-T_mel sequences along the frame axis,
        # computed by summing the (anti-)diagonals of their outer product.
        matrix = torch.flip(x.unsqueeze(1) * y.unsqueeze(-1), dims=[-1])
        output =  torch.flip(
            torch.cat(
                [
                    torch.sum(
                        torch.diagonal(
                            matrix, offset=idx, dim1=-2, dim2=-1
                        ), dim=1
                    ).unsqueeze(1) 
                    for idx in range(length)
                ],
                dim=1
            ),
            dims=[1] 
        )
        return output
    
    def _get_q(self, l):
        # q is built up recursively over phoneme positions; this Python loop
        # (and the similar one in _get_s) is the recursive tensor operation
        # whose backward pass is slow.
        length = l.shape[-1]
        q = [l[:, 0, :]]
        if l.shape[-1] > 1:
            for i in range(1, l.shape[1]):
                q.append(self._variable_kernel_size_convolution(q[i-1], l[:, i], length))
                        
        q = torch.cat([_.unsqueeze(1) for _ in q], dim=1)
        
        return q   

    def _reverse_cumsum(self, x):
        return torch.flip(torch.cumsum(torch.flip(x, dims=[-1]), dim=-1), dims=[-1])
    
    def _get_s(self, q, l):
        length = l.shape[-1]
        l_rev_cumsum = self._reverse_cumsum(l)
        s = [l_rev_cumsum[:, 0, :]]
        
        if l.shape[-1] > 1:
            for i in range(1, q.shape[1]):
                s.append(self._variable_kernel_size_convolution(q[:, i-1], l_rev_cumsum[:, i], length))
        
        s = torch.cat([_.unsqueeze(1) for _ in s], dim=1)
            
        return s

severe metallic sound

Hi, thanks for your nice work. I used your code for my own datasets and the synthesized voices still don't sound normal at 160K steps. Though we can still figure out what is being said, the spectrum is abnormal (especially the high-frequency part, as you can see from the figure below) and there is a severe metallic sound. I have double-checked the feature extraction process and the training process, and both look normal. Do you know any reason for it? BTW, how many steps are required to train the LJSpeech model?
[Figure: spectrogram of a synthesized utterance]

Thanks again.

Variance Loss RuntimeError

Hi there,

I'm trying to train a model with LJ data, but at step 50331 I get:

  File "train.py", line 339, in <module>
    train(0, args, configs, batch_size, num_gpus)
  File "train.py", line 196, in train
    ) = Loss.variance_loss(batch, output, step=step)
  File "/gpfs/fs2c/nrc/ict/portage/u/tts/code/Comprehensive-E2E-TTS-new/model/loss.py", line 206, in variance_loss
    ctc_loss = self.sum_loss(attn_logprob=attn_logprob, in_lens=src_lens, out_lens=mel_lens)
  File "/space/partner/nrc/work/ict/portage/u/tts/opt/miniconda3/envs/jets/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/gpfs/fs2c/nrc/ict/portage/u/tts/code/Comprehensive-E2E-TTS-new/model/loss.py", line 249, in forward
    target_lengths=key_lens[bid : bid + 1],
  File "/space/partner/nrc/work/ict/portage/u/tts/opt/miniconda3/envs/jets/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/space/partner/nrc/work/ict/portage/u/tts/opt/miniconda3/envs/jets/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1502, in forward
    self.zero_infinity)
  File "/space/partner/nrc/work/ict/portage/u/tts/opt/miniconda3/envs/jets/lib/python3.7/site-packages/torch/nn/functional.py", line 2201, in ctc_loss
    zero_infinity)
RuntimeError: Expected input_lengths to have value at most 693, but got value 694 (while checking arguments for ctc_loss_gpu)

I'm using the default configuration (https://github.com/keonlee9420/Comprehensive-E2E-TTS/tree/main/config/LJSpeech) except that I reduced the batch size to 10 to fit my GPU. Is the reason this only showed up now related to var_start_steps? Any advice would be appreciated.
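For context on the error itself: torch.nn.CTCLoss requires every entry of input_lengths to be at most the time dimension of the log-probability tensor, which is exactly what the RuntimeError reports (694 > 693). A minimal sketch of that constraint (not this repo's code, hypothetical shapes):

import torch
import torch.nn as nn

# every entry of input_lengths must be <= log_probs.size(0), the time dimension
T, N, C = 693, 1, 50
log_probs = torch.randn(T, N, C).log_softmax(-1)
targets = torch.randint(1, C, (N, 30))
target_lengths = torch.full((N,), 30)

ctc = nn.CTCLoss()
loss = ctc(log_probs, targets, torch.tensor([T]), target_lengths)      # fine
# ctc(log_probs, targets, torch.tensor([T + 1]), target_lengths)       # raises
# "Expected input_lengths to have value at most 693, but got value 694"
print(loss)

So the message suggests that one of the lengths passed to sum_loss is one frame longer than the corresponding dimension of attn_logprob.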

Multi-GPU training doesn't seem to work

I tested with a single GPU and training works fine. I am now testing with multiple GPUs and I noticed that the outer progress bar (counting the total number of steps) is not updating. After adding some print statements to the code, it seems that the statement in train.py:

for batchs in loader

returns batchs: [], [], [], [] (i.e., empty batches).

Does something go wrong in the data loader?
