learning2listen's Issues

Restarting VQ-GAN training from a checkpoint breaks the training loss

Hi @evonneng ,
Sometimes the VQ-GAN training stops midway and I have to restart it due to technical issues with our server. When I restart the training from the checkpoint, the training loss goes haywire, as shown by the green training-loss curve in my previous issue.

Have you come across this issue? I was wondering whether one should also save the loss in the checkpoint and load it when starting again. The checkpoint is currently saved as:
    checkpoint = {
        'config': args.config,
        'state_dict': generator.state_dict(),
        'optimizer': {
            'optimizer': g_optimizer._optimizer.state_dict(),
            'n_steps': g_optimizer.n_steps,
        },
        'epoch': epoch,
    }
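For reference, restoring from this dict when resuming would look something like the sketch below. The attribute names mirror the save snippet above; the checkpoint path is a placeholder, and whether n_steps also needs to be restored this way is exactly the part I am unsure about.

    # Sketch of the matching restore step (path is a placeholder). Restoring
    # n_steps as well should keep the warm-up lr schedule from resetting, but
    # that is an assumption, not something confirmed by the repo's loading code.
    import torch

    checkpoint = torch.load('path/to/last_checkpoint.pth', map_location='cpu')
    generator.load_state_dict(checkpoint['state_dict'])
    g_optimizer._optimizer.load_state_dict(checkpoint['optimizer']['optimizer'])
    g_optimizer.n_steps = checkpoint['optimizer']['n_steps']
    start_epoch = checkpoint['epoch'] + 1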

About the vq loss when training the codebook

  1. Can you explain why "it is normal for the loss to increase before decreasing"?
  2. Does it have something to do with the learning rate "2.0" set in l2_32_smoothSS.json? (See also the sketch below.)
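Regarding question 2: the optimizer wrapper in src/utils/optim.py exposes step_and_update_lr and n_steps, which looks like a Transformer-style warm-up schedule; if so, the 2.0 would be a scale factor rather than the literal learning rate. Below is a rough sketch of that kind of schedule (my reading, not confirmed against the repo; d_model and the warm-up length are placeholder values).

    # Rough sketch of a warm-up ("Noam") learning-rate schedule; d_model and
    # n_warmup_steps below are placeholder values, not taken from the config.
    def noam_lr(lr_mul, d_model, n_steps, n_warmup_steps):
        return lr_mul * (d_model ** -0.5) * min(n_steps ** -0.5,
                                                n_steps * n_warmup_steps ** -1.5)

    # With lr_mul = 2.0 the effective lr still starts tiny and ramps up:
    print(noam_lr(2.0, 200, 1, 6000))      # ~3e-07
    print(noam_lr(2.0, 200, 6000, 6000))   # peak, ~1.8e-03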

Usage of List of Files

Hi @evonneng, I was wondering what the p*_speak_files_clean_deca.npy files are used for.

I am creating my own dataset, so I was wondering whether I should generate a similar file for every speaker in my dataset as well.

If I understood correctly, each file contains the file path, speaker location, and number of frames for a particular speaker. Is that correct?
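If that reading is right, building an equivalent file for a custom dataset might look roughly like the sketch below. The column layout is only my guess from the description above, and my_clips is a hypothetical list of per-sequence metadata, not something from the repo.

    # Rough sketch for a custom dataset; the column layout (file path, speaker
    # location, number of frames) is a guess, and `my_clips` is hypothetical.
    import numpy as np

    rows = [[clip['path'], clip['speaker_location'], clip['num_frames']]
            for clip in my_clips]
    np.save('p0_speak_files_clean_deca.npy', np.array(rows, dtype=object))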

Why exactly 4T in extracting Mels?

I am creating my own dataset. The documentation mentions using librosa.feature.melspectrogram(...) to process the speaker's audio into the format (1 x 4T x 128).
I could not find the reason for using exactly 4 times T. Is there a way to calculate this number?

Your dataset contains videos at 30 fps, but I could not find details about the audio sampling frequency.
Did you set hop_length = 22050 (sr) / 30 (fps) * 4? Or did you use the default hop_length in

 librosa.feature.melspectrogram(*, y=None, sr=22050, S=None, n_fft=2048, hop_length=512, win_length=None, window='hann', center=True, pad_mode='constant', power=2.0, **kwargs)

Did you choose 4T so that the frames overlap, in order to reduce the spectral leakage of the trimmed windows?
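For what it is worth, my own back-of-the-envelope arithmetic (not from the repo) suggests that an exact hop length for 4 mel frames per video frame does not exist at common sampling rates, which would be consistent with computing the spectrogram at a standard hop and then resizing it to length 4T, as in the snippet quoted in a later issue below.

    # My own arithmetic: for exactly 4 mel frames per 30 fps video frame, the hop
    # length would have to be sr / (30 * 4), which is not an integer here.
    for sr in (16000, 22050):
        print(sr, sr / (30 * 4))   # 16000 -> 133.33..., 22050 -> 183.75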

Runtime error when training the predictor model

Hello, when I trained the predictor model based on the provided VQ-VAE model, I got the following runtime error:

python -u train_vq_decoder.py --config configs/vq/delta_v6.json

using config configs/vq/delta_v6.json
starting lr 2.0
Let's use 4 GPUs!
changing lr to 4.5e-06
loading checkpoint from... vqgan/models/l2_32_smoothSS_er2er_best.pth
starting lr 0.01
Let's use 4 GPUs!
loading from checkpoint... models/delta_v6_er2er_best.pth
loaded... conan
===> in/out (9922, 64, 56) (9922, 64, 56) (9922, 256, 128)
====> train/test (6945, 64, 56) (2977, 64, 56)
=====> standardization done
epoch 7903 num_epochs 500000
Traceback (most recent call last):
File "train_vq_decoder.py", line 217, in <module>
main(args)
File "train_vq_decoder.py", line 202, in main
patch_size, seq_len)
File "train_vq_decoder.py", line 89, in generator_train_step
g_optimizer.step_and_update_lr()
File "/vc_data/learning2listen-main/src/utils/optim.py", line 25, in step_and_update_lr
self.optimizer.step()
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/adam.py", line 144, in step
eps=group['eps'])
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/functional.py", line 86, in adam
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: output with shape [200] doesn't match the broadcast shape [4, 200]

I want to know how to solve it. Thanks in advance!

How to render the output of this project to DECA

@evonneng
Hi! You indicate that the raw 3D meshes can be rendered using the DECA renderer. Could you tell me how to process your output (the pkl files) so that it can be fed to DECA? It seems that DECA can only take images as input. Also, there are only 3 parameters (exp, pose, prob) in the pkl files; is that enough for DECA to generate output?

Mismatch in face feature size between the paper and the code release

Thank you for the wonderful study and expressive codebase!

However, when I checked the repo and code, the input size for the facial features seems to have changed from $d_\phi + 3 = 50 + 3$ [Evonne, 22] to $d_\phi + d_\alpha + d_{detail} = 50 + 6 + 128$, where $d_\alpha$ is the pose dimension.

I was curious about the reason for this change relative to the paper, because theoretically the latent detail code should capture static, person-specific details that are independent of the listener's expression behaviour. May I ask why the authors did this? It seems a bit redundant to add a feature that is essentially constant over time (I tested this with DECA, and renders of a subject's face with different expressions were nearly identical with or without the detail code).

p0_list_faces_clean_deca.npy - face features (N x 64 x 184) for when p0 is listener
N sequences of length 64. Features of size 184, which includes the deca parameter set of expression (50D), pose (6D), and details (128D).
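For reference, splitting the 184-D vector back into its parts would look something like the sketch below; the ordering expression, then pose, then detail is only my assumption based on the description above.

    import numpy as np

    feats = np.load('p0_list_faces_clean_deca.npy')   # (N, 64, 184)
    # Assumed ordering, inferred from the description above: 50 + 6 + 128 = 184.
    expression = feats[..., :50]       # (N, 64, 50)
    pose       = feats[..., 50:56]     # (N, 64, 6)
    detail     = feats[..., 56:184]    # (N, 64, 128)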

RE. Unable to reproduce audio 128-D mel spectrogram feature from raw video

Problem statement

I am trying to reproduce the audio feature pre-processing for an experiment with a longer time window, but the only detailed instructions available are in #2. However, in that answer the script extracts MFCC features from what appears to be an extracted audio track, which returns an output with a different shape (1 x 4T x 20) than the audio features in the dataset (1 x 4T x 128).

Issue reproduction

My snippet on Google Colab can be found HERE

Yes, we chose 4*T to allow for temporal alignment with the 30 fps frame rate of the videos, just to make it easier to process the audio and the video frames in a unified way. T here refers to the number of frames in the video clip, so for the purposes of this paper, T = 64. The exact code used to calculate the melspecs is as follows:

import numpy as np
import librosa
from PIL import Image

def load_mfcc(audio_path, num_frames):
    waveform, sample_rate = librosa.load('{}'.format(audio_path), sr=16000)
    win_len = int(0.025*sample_rate)
    hop_len = int(0.010*sample_rate)
    fft_len = 2 ** int(np.ceil(np.log(win_len) / np.log(2.0)))
    S_dB = librosa.feature.mfcc(y=waveform, sr=sample_rate, hop_length=hop_len)   # by default this extracts only 20 MFCCs

    ## do some resizing to match frame rate
    im = Image.fromarray(S_dB)
    _, feature_dim = im.size
    scale_four = num_frames*4
    im = im.resize((scale_four, feature_dim), Image.ANTIALIAS)
    S_dB = np.array(im)
    return S_dB

Hope this helps!

I also tried extracting the mel spectrogram directly, and even combined it with librosa's power_to_db, but the scale of my output still did not match the original dataset.

S_dB = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
# optional
# S_dB = librosa.power_to_db(S_dB)
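For completeness, a fuller variant of what I am attempting, mirroring the load_mfcc function quoted above but with a 128-band mel spectrogram swapped in; the parameter choices here are my own guesses, not the authors':

    import numpy as np
    import librosa
    from PIL import Image

    def load_melspec(audio_path, num_frames):
        # 128-band mel spectrogram + power_to_db; parameters are guesses.
        waveform, sample_rate = librosa.load(audio_path, sr=16000)
        hop_len = int(0.010*sample_rate)
        S = librosa.feature.melspectrogram(y=waveform, sr=sample_rate,
                                           hop_length=hop_len, n_mels=128)
        S_dB = librosa.power_to_db(S)

        # resize to 4 audio frames per video frame, as in the quoted snippet
        im = Image.fromarray(S_dB)
        _, feature_dim = im.size
        im = im.resize((num_frames*4, feature_dim), Image.ANTIALIAS)
        return np.array(im)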

Below are the expected output and my outputs from the mel spectrogram function, before and after power_to_db. I extracted them from the same video file, done_conan_videos0/021LukeWilsonIsStartingToLookLikeChrisGainesCONANonTBS, in the raw dataset, based on the metadata provided by *_files_clean*. I assumed that correct preprocessing would produce the same output as your original dataset.

My output

array([3.49407317e-04, 1.72899290e-05, 9.88764441e-06, 9.31489740e-06,
2.19979029e-05, 4.02382248e-05, 5.83300316e-05, 1.78770599e-04,

After power_to_db

array([-34.508053, -46.10779 , -48.621204, -49.872578, -46.910652,
-42.93151 , -41.84772 , -37.57675 , -38.1189 , -38.486935,

Expected values from the dataset

array([[-50.593018, -47.35103 , -45.426086, -41.643738, -42.111137,
-41.75349 , -41.146526, -38.722565, -39.55792 , -39.344612,

My question

May I ask for your advice on how exactly to extract the audio features so as to reproduce the dataset? I believe other readers share the same question; see THIS

Abrupt Jumps in listener expression

I tried generating output from the data provided with the paper and found that there are consistent jumps between frames. Most of the time they occur every 32 frames, but they also occur at intervals of 16 frames. Is this expected behaviour of the model? I am also attaching a video that I generated using the model.

output30_r.mp4

Reconstructing generated output

I would like to use the model from this paper for my experiments, but I am not sure how to reconstruct the output videos from the pkl files. Could you guide me through the steps for reconstructing the videos? Thanks!

Raw audio

Hello, can you provide the raw audio data? Thanks a lot.

Render

Thank you very much for this great work. Would you mind sharing the code for rendering the DECA output to real images?
