evonneng / learning2listen
Official PyTorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)
Hi @evonneng ,
Sometimes the training of the VQ-GAN stops midway and I have to restart it due to technical issues with our server. When I restart training from the checkpoint, the training loss goes haywire, as shown in the green training-loss graph from my previous issue.
Have you come across this issue? I was wondering whether one should also save the loss in the checkpoint and load it when resuming.
checkpoint = {'config': args.config,
              'state_dict': generator.state_dict(),
              'optimizer': {
                  'optimizer': g_optimizer._optimizer.state_dict(),
                  'n_steps': g_optimizer.n_steps,
              },
              'epoch': epoch}
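As a hypothetical sketch (not the repo's actual code), the checkpoint dict above could be extended with the running loss history, so a restarted run resumes its curve instead of starting fresh; `generator`, `optimizer`, and `losses` here are stand-ins for the objects in the training script:

```python
import io
import torch

# Stand-in model/optimizer for illustration only.
generator = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(generator.parameters())
losses = [0.9, 0.7, 0.5]  # per-step training losses logged so far
epoch = 3

checkpoint = {
    'state_dict': generator.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch,
    'losses': losses,  # save the loss curve alongside the weights
}
buffer = io.BytesIO()  # in-memory file; a real run would use a path
torch.save(checkpoint, buffer)

# On restart, restore everything so logging and LR scheduling continue
# from the same point.
buffer.seek(0)
ckpt = torch.load(buffer)
generator.load_state_dict(ckpt['state_dict'])
optimizer.load_state_dict(ckpt['optimizer'])
losses = ckpt['losses']
epoch = ckpt['epoch']
```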
Hi @evonneng, I am training the VQ-GAN with the provided Trevor data and my own data. Is this the usual behaviour of the loss curve?
Hi @evonneng, I was wondering what the p*_speak_files_clean_deca.npy files are used for?
I am creating my own dataset, so I was wondering whether I should generate a similar file for every speaker in my dataset as well.
If I understood correctly, this file contains the file path, speaker location, and number of frames for a particular speaker. Is that correct?
I am creating my own dataset. It is mentioned to use librosa.feature.melspectrogram(...) to process the speaker's audio into this format: (1 x 4T x 128).
I could not find the reason why exactly you use 4 times T. Is there a way to calculate this number?
Your dataset contains videos at 30 fps, but I could not find details about the audio sampling rate.
Did you set hop_length = 22050 (sr) / 30 (fps) * 4? Or did you use the default hop_length in
librosa.feature.melspectrogram(*, y=None, sr=22050, S=None, n_fft=2048, hop_length=512, win_length=None, window='hann', center=True, pad_mode='constant', power=2.0, **kwargs)?
Did you choose 4T so that the frames overlap, to reduce the spectral leakage of trimmed windows?
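As a hedged back-of-the-envelope (not confirmed by the repo, which may resize instead): to land exactly 4 spectrogram frames per 30fps video frame directly, the hop length would need to be sr / (fps * 4) samples:

```python
# Assumed values: librosa's default sample rate and the dataset's frame rate.
sr = 22050                  # audio sample rate
fps = 30                    # video frame rate
frames_per_video_frame = 4  # the "4" in 4T

# Hop length that yields 4 spectrogram hops per video frame.
hop_length = sr // (fps * frames_per_video_frame)
print(hop_length)  # 183 samples per hop
```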
Hello, when I trained the predictor model based on the provided VQ-VAE model, I got the runtime error:
python -u train_vq_decoder.py --config configs/vq/delta_v6.json
using config configs/vq/delta_v6.json
starting lr 2.0
Let's use 4 GPUs!
changing lr to 4.5e-06
loading checkpoint from... vqgan/models/l2_32_smoothSS_er2er_best.pth
starting lr 0.01
Let's use 4 GPUs!
loading from checkpoint... models/delta_v6_er2er_best.pth
loaded... conan
===> in/out (9922, 64, 56) (9922, 64, 56) (9922, 256, 128)
====> train/test (6945, 64, 56) (2977, 64, 56)
=====> standardization done
epoch 7903 num_epochs 500000
Traceback (most recent call last):
File "train_vq_decoder.py", line 217, in
main(args)
File "train_vq_decoder.py", line 202, in main
patch_size, seq_len)
File "train_vq_decoder.py", line 89, in generator_train_step
g_optimizer.step_and_update_lr()
File "/vc_data/learning2listen-main/src/utils/optim.py", line 25, in step_and_update_lr
self.optimizer.step()
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/adam.py", line 144, in step
eps=group['eps'])
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/functional.py", line 86, in adam
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: output with shape [200] doesn't match the broadcast shape [4, 200]
I want to know how to solve it. Thanks in advance!
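A hedged guess at the failure mode, not a repo-confirmed fix: the error says Adam's per-parameter state (exp_avg, shape [200]) no longer matches the gradient it is applied to (shape [4, 200]), which can happen when a checkpointed optimizer state is restored against differently-shaped or differently-wrapped (e.g. DataParallel) parameters. One workaround is to rebuild the optimizer after the model is in its final form, discarding the stale state:

```python
import torch

# Placeholder model; the real script loads its own generator.
model = torch.nn.Linear(200, 200)
optimizer = torch.optim.Adam(model.parameters())

# ... load model weights from the checkpoint here ...

# Rebuild the optimizer instead of loading its saved state, so every
# exp_avg/exp_avg_sq buffer is lazily created with the right shape
# on the first step (at the cost of losing Adam's momentum history).
optimizer = torch.optim.Adam(model.parameters())
print(len(optimizer.state))  # 0: fresh state, no shape mismatch possible
```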
@evonneng
Hi! You indicate that raw 3D meshes can be rendered using the DECA renderer. Could you tell me how to process your output results (pkl files) so they can be used as input to DECA? It seems that DECA can only take images as input. Meanwhile, there are only 3 parameters (exp, pose, prob) in the pkl files; is that enough for DECA to generate output?
Thank you for the wonderful study and expressive codebase!
However, when I checked your repo and code, the author seems to have changed the input size for the facial features.
I was curious about the reason for this change relative to the paper because, theoretically, the latent detail code should encode static, person-specific details
that are independent of the listener's expression behaviour. May I ask why the author did this? It seems a bit redundant to add a temporally correlated feature to the analysis (I tested this on DECA, and the rendered facial profile of a subject with different expressions was nearly identical with or without the detail code).
p0_list_faces_clean_deca.npy - face features (N x 64 x 184) for when p0 is listener
N sequences of length 64. Features of size 184, which includes the deca parameter set of expression (50D), pose (6D), and details (128D).
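Per the listing above, the 184-D vector splits as expression (50) + pose (6) + detail (128); the slice order below follows the listing order and is an assumption to verify against the repo's loading code:

```python
import numpy as np

# Synthetic stand-in for p0_list_faces_clean_deca.npy: N sequences of 64 frames.
N, T = 2, 64
feats = np.zeros((N, T, 184), dtype=np.float32)

exp    = feats[..., :50]      # DECA expression coefficients (50-D)
pose   = feats[..., 50:56]    # jaw/head pose (6-D)
detail = feats[..., 56:]      # person-specific detail code (128-D)

print(exp.shape, pose.shape, detail.shape)
```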
I am trying to reproduce the audio feature pre-processing for a longer time-window experiment, but the only detailed instructions available were in #2. However, the script in that answer seems to extract MFCC features from the extracted audio, which returns an output with a different shape (1 x 4T x 20) compared to the audio features in the dataset (1 x 4T x 128).
My snippet on Google Colab can be found HERE
Yes, we chose 4*T to allow for temporal alignment with the 30fps framerate of the videos just to make it easier to process both the audio and the video frames in a unified way. The T here refers to the number of frames in the video clip. So for the purposes of this paper, T=64. The exact code used to calculate the melspecs is as follows:
def load_mfcc(audio_path, num_frames):
    waveform, sample_rate = librosa.load('{}'.format(audio_path), sr=16000)
    win_len = int(0.025 * sample_rate)
    hop_len = int(0.010 * sample_rate)
    fft_len = 2 ** int(np.ceil(np.log(win_len) / np.log(2.0)))
    # This line by default only extracts 20 MFCCs
    S_dB = librosa.feature.mfcc(y=waveform, sr=sample_rate, hop_length=hop_len)
    ## do some resizing to match frame rate
    im = Image.fromarray(S_dB)
    _, feature_dim = im.size
    scale_four = num_frames * 4
    im = im.resize((scale_four, feature_dim), Image.ANTIALIAS)
    S_dB = np.array(im)
    return S_dB
Hope this helps!
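The resize trick can be sanity-checked on synthetic data: an arbitrary (20 x M) MFCC matrix is stretched along the time axis to 4*num_frames columns. `Image.LANCZOS` stands in here for the deprecated `Image.ANTIALIAS` alias (they are the same filter):

```python
import numpy as np
from PIL import Image

num_frames = 64
S_dB = np.random.rand(20, 100).astype(np.float32)  # 20 MFCCs x 100 hops

im = Image.fromarray(S_dB)           # PIL size is (width, height) = (100, 20)
_, feature_dim = im.size             # feature_dim = 20
im = im.resize((num_frames * 4, feature_dim), Image.LANCZOS)
S_dB = np.array(im)
print(S_dB.shape)  # (20, 256): feature_dim x 4T; transpose for (4T x 20)
```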
I also tried extracting the Mel spectrogram normally, and even combined it with librosa's power_to_db,
but the scale of my output still did not match the original dataset.
S_dB = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
# optional
# S_dB = librosa.power_to_db(S_dB)
Below are the expected output and my outputs from the Mel spectrogram function, before and after power_to_db. I extracted them from the same video file done_conan_videos0/021LukeWilsonIsStartingToLookLikeChrisGainesCONANonTBS
from the raw dataset, based on the metadata provided by *_files_clean*.
I assumed that correct preprocessing would produce the same output as your original dataset.
My output:
array([3.49407317e-04, 1.72899290e-05, 9.88764441e-06, 9.31489740e-06,
       2.19979029e-05, 4.02382248e-05, 5.83300316e-05, 1.78770599e-04,
After power_to_db:
array([-34.508053, -46.10779 , -48.621204, -49.872578, -46.910652,
       -42.93151 , -41.84772 , -37.57675 , -38.1189 , -38.486935,
Here's the dataset:
array([[-50.593018, -47.35103 , -45.426086, -41.643738, -42.111137,
        -41.75349 , -41.146526, -38.722565, -39.55792 , -39.344612,
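One hedged guess at the scale mismatch, not a confirmed diagnosis: librosa.power_to_db defaults to ref=1.0, while many pipelines pass ref=np.max, which shifts the whole curve so the peak maps to 0 dB. The arithmetic below mirrors power_to_db's 10*log10(S/ref) core on toy values:

```python
import numpy as np

S = np.array([1e-2, 1.0, 100.0])  # toy power-spectrogram values

db_ref_one = 10.0 * np.log10(S / 1.0)      # ref=1.0 (librosa's default)
db_ref_max = 10.0 * np.log10(S / S.max())  # ref=np.max shifts peak to 0 dB

print(db_ref_one)  # [-20.   0.  20.]
print(db_ref_max)  # [-40. -20.   0.]
```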
May I ask your advice on how to extract the audio features, in detail, for reproducing the dataset? I believe other readers share the same question; see THIS.
I tried generating output from the data provided with the paper, and I found that there are consistent jumps between frames. Most of the time they occur every 32 frames, but they also occur at intervals of 16 frames. Is this the usual behaviour of the model? I am also attaching a video that I generated using the model.
I would like to use the model from this research paper for my experiments, but I am not sure how to reconstruct the output videos from the pkl files. Please guide me through the steps for reconstructing the videos. Thanks!
Hi @evonneng,
I have reached around 0.016 validation loss after training the model for only 2 hours and 390 steps. Is that usually the case?
Hello author, while reading your paper and code I ran into a question. The flow chart in the paper only shows the process of predicting the listener for frames 32 to 39. How are frames 40 to 64 predicted, and where is this reflected in the code?
Hello, can you provide the raw audio data? Thanks a lot.
Thank you very much for this great work. Would you mind sharing the rendering code from DECA parameters to real pictures?
Hi, @evonneng
There are 3 keys (exp, pose, prob) in the pkl files, but the decode function in DECA needs the keys ['shape', 'tex', 'exp', 'pose', 'cam', 'light']. How can these three keys be mapped onto those?
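One hedged sketch of a possible mapping, not the authors' confirmed pipeline: feed the saved exp/pose codes into a DECA-style codedict and fill the missing codes with neutral defaults (or, better, with shape/tex/cam/light codes estimated once from a real reference image of the subject). The dimensions below follow DECA's defaults (n_shape=100, n_tex=50, n_cam=3, lighting as 9x3 spherical harmonics) and are assumptions:

```python
import torch

exp  = torch.zeros(1, 50)   # expression code, as stored in the pkl file
pose = torch.zeros(1, 6)    # jaw/head pose, as stored in the pkl file

codedict = {
    'exp':   exp,
    'pose':  pose,
    'shape': torch.zeros(1, 100),               # identity geometry (neutral)
    'tex':   torch.zeros(1, 50),                # albedo code (neutral)
    'cam':   torch.tensor([[10.0, 0.0, 0.0]]),  # weak-perspective camera guess
    'light': torch.zeros(1, 9, 3),              # SH lighting coefficients
}
print({k: tuple(v.shape) for k, v in codedict.items()})
```

Keeping shape/tex fixed per subject while varying exp/pose per frame matches the intuition that identity should stay constant across the generated sequence.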