
Comments (15)

rafaelvalle commented on May 24, 2024

Give us a few days to put a notebook up replicating some of our experiments.


rafaelvalle commented on May 24, 2024
  1. It is an issue with the attention mechanism. This will not happen with Flowtron Parallel, which we'll release soon.
    https://twitter.com/RafaelValleArt/status/1281268833504751616?s=20
  2. We used multiple files from the same speaker and style as evidence. Try it this way and let us know.


polvanrijn commented on May 24, 2024

Cool, thanks for your fast reply! Regarding 3., could you tell me exactly which utterances you used?


rafaelvalle commented on May 24, 2024

We used all utterances from speaker 24.


DamienToomey commented on May 24, 2024

@polvanrijn did you manage to transfer the 'surprised' style?


polvanrijn commented on May 24, 2024

@rafaelvalle, thanks for your reply. @DamienToomey, yes, I just redid the analysis. See the gist here and listen to the output here. To be honest, I see no resemblance to the fragment on the NVIDIA demo page. @rafaelvalle, do you know what causes such big deviations between the pre-trained models?


rafaelvalle commented on May 24, 2024

Please take a look at https://github.com/NVIDIA/flowtron/blob/master/inference_style_transfer.ipynb


polvanrijn commented on May 24, 2024

Thanks for the notebook! 👍 I also noticed that you use a newer WaveGlow model and removed the fixed seed. Since there is no fixed seed, I could not listen to the exact samples you created in your notebook. What I noticed is that if you rerun your script multiple times, you get very different samples. See the gist here and listen to examples here. This is not only the case for the baseline, but also for the posterior sample. You can hear that the variation is huge. To me the posterior samples do not really resemble the audio clips the posterior was computed from, at least not as strongly as the sample on the NVIDIA demo page (but this is my perception and would need to be quantified).

Observing this variation, I wonder to what extent the z-values capture meaningful properties of surprisal if the deviations are so big when you sample from them. If the variation in the sampling is so large, you will at some point probably also generate a sample that sounds surprised, but this does not mean that you sampled from z values computed on surprised clips. This could also happen if you sample long enough from a normal distribution. So my question is: why is there so much variation in the sampling?


rafaelvalle commented on May 24, 2024

Thank you for these comments and questions, Pol.

You're right that if we sample long enough from a normal distribution, we might eventually get sounds that sound surprised. This happens when the z value comes from a region in z-space that we associate with surprised samples. Now, imagine that instead of sampling from the entire normal distribution, we could sample only from the region that produces surprised samples. This is exactly what we can achieve with posterior sampling.

Take a look at the image below for an illustration. Imagine the blue and red points are z values obtained by running Flowtron forward on existing human data. Imagine that the red samples were labelled as surprised and that the blue samples have other labels. Now, consider that the blue circles represent the pdf of a standard Gaussian with zero mean and unit variance.

By randomly sampling from the standard Gaussian, we will only get samples associated with surprise with low probability. Fortunately, we can sample from that region directly by sampling from a Gaussian centered on it. We can obtain the mean of this Gaussian by computing the centroid of the z values obtained from surprised samples. In addition to the mean, we need to define the variance of the Gaussian, which is the source of variation during sampling. As we increase the variance, we end up sampling from regions in z-space not associated with surprise. The red circles represent this Gaussian.

[image: z-values]
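
For concreteness, here is a minimal sketch of this posterior sampling idea in PyTorch. It assumes z_surprised is a list of z tensors of shape (80, n_frames_i) already obtained by running Flowtron forward on surprised clips; the function name and the tile-and-crop alignment follow the description in this thread, not the exact notebook code.

```python
import torch

def make_posterior(z_surprised, n_frames=300, sigma=0.7):
    # z_surprised: list of tensors of shape (80, n_frames_i), one per surprised clip.
    aligned = []
    for z in z_surprised:
        reps = n_frames // z.shape[1] + 1
        aligned.append(z.repeat(1, reps)[:, :n_frames])  # tile and crop to 80 x n_frames
    # The centroid of the surprised z values becomes the mean of the sampling Gaussian.
    mu_posterior = torch.stack(aligned).mean(dim=0)
    # sigma controls how far samples stray from the centroid; too large and we drift
    # back into regions of z-space not associated with surprise.
    return torch.distributions.Normal(mu_posterior, sigma)

# z = make_posterior(z_surprised).sample()  # decode z with Flowtron's inverse pass
```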

Take a look at the samples here and here, in which we perform an interpolation over time between the random Gaussian and the posterior.

For more details, take a look at the Transflow paper.


DamienToomey commented on May 24, 2024

After reading the Transflow paper, I was wondering why sigma in dist = Normal(mu_posterior.cpu(), sigma) (https://github.com/NVIDIA/flowtron/blob/master/inference_style_transfer.ipynb) is not divided by (ratio + 1) as in equation 8 of the Transflow paper, whereas mu_posterior seems to be computed following equation 7.


polvanrijn commented on May 24, 2024

Thank you for your detailed and clear explanation and the intuitive illustration, Rafael! Also thanks for the interesting paper. There are two things I still wonder about. First, Gambardella et al. (2020) propose to construct the posterior such that both the mean and the standard deviation are computed analytically (p. 3). But aren't we solely estimating the mean, and not the standard deviation, in the current implementation? In the notebook you sent we use a sigma of 1, and in the Flowtron paper you propose a sigma of 0.7. By the way, the samples generated with sigma 0.7 sound much more surprised than the ones sampled with sigma 1. It seems to me that estimating sigma is a necessary step to generate realistic fragments: if it's too small you undershoot, if it's too big you overshoot and end up with fragments that do not sound surprised at all.
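
A quick way to hear the under- and overshooting described above is to sweep sigma while keeping the posterior mean fixed. This is just an illustrative sketch reusing the mu_posterior tensor from the notebook, not code from the repo:

```python
import torch

# Sweep sigma around the values discussed above (0.7 from the Flowtron paper, 1.0 from the notebook).
for sigma in (0.5, 0.7, 1.0):
    dist = torch.distributions.Normal(mu_posterior.cpu(), sigma)
    z = dist.sample()
    # ...run Flowtron's inverse pass on z and listen to how 'surprised' the result sounds
```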

My second question is about the n_frames variable. I played around with it and found that if you make it too small, e.g. 50, you end up with chopped sounds (listen to samples here). We also know the z-values have a different 'length' (i.e. n_frames) for each of the stimuli, which is why we extend or cut the z-values to the same dimensionality (namely 80 X n_frames). Now I wonder: how do you know the 'right', or even optimal, value of n_frames? This question does not only need to be addressed when you draw new samples from a distribution like I did, but also if you want to sample from a space that, for example, resembles surprisal. To do so, you need to force all the varied-length measured z-values into one fixed-size matrix, whose size is determined by n_frames.


rafaelvalle commented on May 24, 2024

In the Transflow paper, lambda is a hyperparameter that controls both the mean and the variance as coupled parameters.
Relatively speaking, larger lambda values will produce a distribution with origin closer to the sample mean and smaller variance, while smaller lambda values will be closer to the zero mean and have larger variance.

This imposes a paradox: if you want more variance, you have to decrease lambda... but by decreasing lambda you move away from the posterior mean, which makes samples from the region in z-space associated with the style less probable than others. We circumvent this by treating the mean and variance as decoupled parameters. As Pol mentioned, the variance has to be adjusted accordingly.
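
For intuition, the standard conjugate-Gaussian update exhibits exactly this coupling. Assuming a standard normal prior on z and n observed z values with sample mean z̄, each weighted by lambda (this is only an illustration of the behaviour described above, not necessarily the literal equations 7 and 8 of the Transflow paper):

```latex
\mu_{\text{post}} = \frac{\lambda n \,\bar{z}}{1 + \lambda n},
\qquad
\sigma^{2}_{\text{post}} = \frac{1}{1 + \lambda n}
% larger lambda: the mean moves toward \bar{z} and the variance shrinks;
% smaller lambda: the mean moves toward 0 and the variance grows.
```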

Regarding choosing n_frames, it should be large enough such that the model has enough frames to produce the text input. Given that 1 second of audio has approximately 86 frames, trying to generate the sequence "Humans are walking on the street" with 40 frames is similar to expecting the model to fit the sentence in half a second, which is very unlikely.


polvanrijn commented on May 24, 2024

Thank you for answering the questions. ☺️

You can also very nicely see that n_frames reaches a threshold at around 210 frames (see the animation below, where z is simply a growing 80 X n_frames matrix filled with zeroes). In other words, 210 frames are enough to produce the text "Humans are walking on the streets".
[animation: n_frames]

If you now take a sliding vertical stride, you can see that changes to frames < 94 do not change the output Mel spectrogram, but changes to later frames (≥ 94) do change the spectrogram:
[animation]

The duration of the created file is 2390 ms; 2.39 X 86 is about 206 frames, which confirms the minimal length of z for this sentence (300 - 94 = 206). What I now wonder is: how can you estimate a suitable value for n_frames without first generating an audio file and looking at its duration? As you show in your paper (figure 3), different values in z lead to different durations (and hence require more n_frames to be produced properly). n_frames thus puts an upper bound on the maximum fragment length. An extreme case is to fill the whole matrix with a high enough number; here I fill the whole 80 X 300 matrix with 2's. The duration then becomes 3480 ms, which is substantially longer than the baseline (2390 ms, 80 X 300 matrix filled with zeroes).
[image: max_300_2]
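
A rough sketch of the probing experiment above (hypothetical, not the actual gist): grow an all-zero z matrix, or fill it with a constant, and watch where the synthesized output stops changing.

```python
import torch

for n_frames in range(50, 301, 10):
    z_baseline = torch.zeros(80, n_frames)       # 80 X n_frames matrix filled with zeroes
    z_extreme = torch.full((80, n_frames), 2.0)  # extreme case: matrix filled with 2's
    # Synthesize each z with Flowtron's inverse pass (the call signature depends on the
    # repo version, so it is omitted here) and compare the output durations.
```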

A question regarding the size of the z-space is how to directly translate prosody across utterances. The emphasis so far has been on creating audio that sounds similar to a bunch of other sounds, but this does not exactly translate prosody:

In the notebook you created, you compute a mean z-space over all utterances (in this case over 8 surprised sounds). As addressed in the notebook, the z-values are of different dimensionality, ranging from 80 X 121 to 80 X 173. We then duplicate each matrix 2 or 3 times and cut it to exactly the same dimension, 80 X 300. We compute a mean over all 8 'aligned' z-matrices (and multiply and divide by the ratio) and finally draw from a normal distribution with a sigma that we need to set manually. If sigma is set properly, we do get samples that sound like the input, e.g. we can observe an increased pitch range. What I still find quite puzzling is that we do not (need to?) take care of the alignment of the different extracted z-matrices; we just duplicate them and chop off the rest. This leaves me with two questions.

My first question is how you deal with different lengths of text that need differently-sized 80 X n_frames matrices. Say text A only needs a length of 150, text B needs a length of 200, and we set n_frames to 300. The same posterior can have a very different effect, right?

Related to the previous question: if different texts need different numbers of n_frames, different parts of the same posterior might be used, which would lead to the same prosodic effect at different parts of the sentence. Now I wonder whether the same prosodic effect at different parts of the sentence might have very different perceptual effects. For example, pitch fluctuation might be prototypical for surprisal, but perhaps more salient at specific parts of the sentence (e.g. the end of the sentence).

To avoid this, it would be interesting to directly translate the prosody from one sentence to another. I am currently working on a translation example where both texts produce a fragment of equal duration. However, I would not know how to apply such direct translation if the texts produce fragments with different durations.


rafaelvalle commented on May 24, 2024

"how to choose n_frames for different text lengths?"
Flowtron has a gating mechanism that will remove extra frames. There are several approaches to dealing with not knowing n_frames in advance.
1) train a simple model that predicts n_frames given some text and a speaker, e.g. different speakers have different speech rates. The simplest model is to take the average n_frames given text and speaker (see the sketch after this list).
2) choose n_frames such that it maxes out the GPU memory and rely on the gate to remove the extra frames.
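
A minimal sketch of approach 1), using the roughly 86 frames per second mentioned earlier; the characters-per-second value is an illustrative assumption, not a measured speech rate:

```python
FRAMES_PER_SECOND = 86  # approximate mel frames per second of audio (from the discussion above)

def estimate_n_frames(text, chars_per_second=15.0, margin=1.2):
    # Crude duration estimate from text length; a real predictor would also condition on the speaker.
    est_seconds = len(text) / chars_per_second
    return int(est_seconds * FRAMES_PER_SECOND * margin)

# estimate_n_frames("Humans are walking on the street")  # roughly 220 frames
```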

"What I still find quite puzzling is that we do not (need to?) take care of the alignment of the different extracted z-matrices"
If we compute a flowtron forward pass to obtain Z on a single sample and do not average it over time, this Z is sentence dependent and each frame is highly associated with the sentence. Now, if we compute Zs on a large number of samples and average over batch, we're averaging out sentence dependent characteristics. With this, we're keeping only characteristics that are common to all sentences and frames.

"It would be interesting to directly translate the prosody from one sentence to another."
Our first approach was to transfer rhythm and pitch contour from one sample to a speaker.
Take a look at Mellotron and the samples on the website.
Mellotron only takes into account non-textual characteristics that are easy to extract, like token duration and pitch.

We're currently working on a model, Flowtron Parallel, that is also able to convert non-textual characteristics that are hard to extract, like breathiness, nasal voice, and whispering. Take a look at this example, in which we perform voice conversion, i.e. we replace the speaker in vc_source.wav with LJSpeech's speaker (ftp_vc_zsourcespeaker_ljs.wav), while keeping the pitch contour, token durations, breathiness, and somber voice from the source.
ftp_vc.zip


polvanrijn commented on May 24, 2024

Thank you for your reply and for your approaches to dealing with not knowing n_frames in advance.

"If we compute a flowtron forward pass to obtain Z on a single sample and do not average it over time, this Z is sentence dependent and each frame is highly associated with the sentence. Now, if we compute Zs on a large number of samples and average over batch, we're averaging out sentence dependent characteristics. With this, we're keeping only characteristics that are common to all sentences and frames."

I understand what you are saying. I think I wasn't clear in my formulation. The averaging makes a lot of sense to me, but I wondered why we don't align the z-spaces (e.g. stretch them to be of the same size). Say, for simplicity, we compute an average over two Z matrices extracted from two sound files, Z1 and Z2 respectively. In the current implementation, we repeat the Z matrices and cut off the part longer than n_frames. Then, for each point in all matrices (in this example only Z1 and Z2), we compute a mean. Since the extracted Z matrices are not of the same size, there will be cases where we compare the start of one sentence with the end of another sentence (see the red area in the figure below). I wonder if such a comparison is meaningful.

[image]
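
One way to 'align' the Z matrices before averaging, instead of tile-and-crop, would be to resample each one along time to a common length. A minimal sketch of that idea (not code from the Flowtron repo):

```python
import torch
import torch.nn.functional as F

def stretch_to(z, n_frames=300):
    # z: (80, n_frames_i) -> (80, n_frames), linearly interpolated over time so that
    # sentence starts and ends line up before averaging.
    return F.interpolate(z.unsqueeze(0), size=n_frames, mode="linear",
                         align_corners=False).squeeze(0)

# aligned = torch.stack([stretch_to(z) for z in (Z1, Z2)])
# mu = aligned.mean(dim=0)
```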

Regarding direct prosody transfer, I was not precise. I did not mean taking an extracted Z matrix of a fragment and directly synthesising a new sentence with it, but rather drawing Z from a normal distribution and applying it to different sentences. This is what I did in this gist. I selected 16 sentences from the Harvard sentences, generated 100 random Z matrices, and synthesised sounds from them. From those 16 X 100 sounds, I computed some simple acoustic measures (e.g. duration, mean pitch, etc.). Then I computed a correlation matrix for each of those acoustic measures separately. Here are the average correlation coefficients (absolute correlations):

duration: 0.20
mean_pitch: 0.44
sd_pitch: 0.21
min_pitch: 0.21
max_pitch: 0.18
range_pitch: 0.17
slope_pitch: 0.20
mean_intensity: 0.24
e_0_500: 0.34
e_0_1000: 0.34
e_500_1000: 0.34
e_1000_2000: 0.35
e_0_2000: 0.35
e_2000_5000: 0.35
e_5000_8000: 0.24

I expected that the same Z matrix would lead to similar changes across sentences, but the average correlations are rather low for some acoustic measures.
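
For reference, a sketch of how such average absolute correlations can be computed (hypothetical array layout, not the actual gist): rows are the 16 sentences, columns are the 100 shared Z matrices, values are one acoustic measure.

```python
import numpy as np

def mean_abs_correlation(measure):
    # measure: (16, 100) array, e.g. mean pitch per sentence (rows) across the same 100 random Zs (columns)
    corr = np.corrcoef(measure)                         # 16 x 16 sentence-by-sentence correlations
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return np.abs(off_diag).mean()
```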

Thanks for mentioning Mellotron, it is next on my todo list. :-) What you describe about Flowtron Parallel looks very promising. The example sounds great. Can't wait until it's released.
