evonneng / learning2listen
Official PyTorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)
Hi @evonneng ,
Sometimes the training of the VQ-GAN stops midway and I have to restart it due to technical issues with our server. When I restart training from the checkpoint, the training loss goes haywire, as shown in the green training-loss graph from my previous issue.
Have you come across this issue? I was wondering whether one should also save the loss in the checkpoint and load it when resuming.
checkpoint = {'config': args.config,
              'state_dict': generator.state_dict(),
              'optimizer': {
                  'optimizer': g_optimizer._optimizer.state_dict(),
                  'n_steps': g_optimizer.n_steps,
              },
              'epoch': epoch}
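As a hypothetical sketch (not the repo's actual code), the checkpoint dict above could be extended with the running loss history, so a restarted run resumes its curve instead of starting fresh; `generator`, `optimizer`, and `losses` here are stand-ins for the objects in the training script:

```python
import io
import torch

# Stand-in model/optimizer for illustration only.
generator = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(generator.parameters())
losses = [0.9, 0.7, 0.5]  # per-step training losses logged so far
epoch = 3

checkpoint = {
    'state_dict': generator.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch,
    'losses': losses,  # save the loss curve alongside the weights
}
buffer = io.BytesIO()  # in-memory file; a real run would use a path
torch.save(checkpoint, buffer)

# On restart, restore everything so logging and LR scheduling continue
# from the same point.
buffer.seek(0)
ckpt = torch.load(buffer)
generator.load_state_dict(ckpt['state_dict'])
optimizer.load_state_dict(ckpt['optimizer'])
losses = ckpt['losses']
epoch = ckpt['epoch']
```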
Hi @evonneng, I am training the VQ-GAN with the provided Trevor data and my own data. Is this the usual behaviour of the loss curve?
Hi @evonneng, I was wondering what the p*_speak_files_clean_deca.npy files are used for?
I am creating my own dataset, so I was wondering whether I should generate a similar file for every speaker in my dataset as well.
If I understood correctly, this file contains the file path, speaker location, and number of frames for a particular speaker. Is that correct?
I am creating my own dataset. It is mentioned to use librosa.feature.melspectrogram(...) to process the speaker's audio into this format: (1 x 4T x 128).
I could not find the reason why exactly you use 4 times T. Is there a way to calculate this number?
Your dataset contains videos at 30 fps, but I could not find details about the audio sampling rate.
Did you set hop_length = 22050 (sr) / 30 (fps) * 4? Or did you use the default hop_length in
librosa.feature.melspectrogram(*, y=None, sr=22050, S=None, n_fft=2048, hop_length=512, win_length=None, window='hann', center=True, pad_mode='constant', power=2.0, **kwargs)?
Did you choose 4T so that the frames overlap, to reduce the spectral leakage of trimmed windows?
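As a hedged back-of-the-envelope (not confirmed by the repo, which may resize instead): to land exactly 4 spectrogram frames per 30fps video frame directly, the hop length would need to be sr / (fps * 4) samples:

```python
# Assumed values: librosa's default sample rate and the dataset's frame rate.
sr = 22050                  # audio sample rate
fps = 30                    # video frame rate
frames_per_video_frame = 4  # the "4" in 4T

# Hop length that yields 4 spectrogram hops per video frame.
hop_length = sr // (fps * frames_per_video_frame)
print(hop_length)  # 183 samples per hop
```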
Hello, when I trained the predictor model based on the provided VQ-VAE model, I got the runtime error:
python -u train_vq_decoder.py --config configs/vq/delta_v6.json
using config configs/vq/delta_v6.json
starting lr 2.0
Let's use 4 GPUs!
changing lr to 4.5e-06
loading checkpoint from... vqgan/models/l2_32_smoothSS_er2er_best.pth
starting lr 0.01
Let's use 4 GPUs!
loading from checkpoint... models/delta_v6_er2er_best.pth
loaded... conan
===> in/out (9922, 64, 56) (9922, 64, 56) (9922, 256, 128)
====> train/test (6945, 64, 56) (2977, 64, 56)
=====> standardization done
epoch 7903 num_epochs 500000
Traceback (most recent call last):
File "train_vq_decoder.py", line 217, in
main(args)
File "train_vq_decoder.py", line 202, in main
patch_size, seq_len)
File "train_vq_decoder.py", line 89, in generator_train_step
g_optimizer.step_and_update_lr()
File "/vc_data/learning2listen-main/src/utils/optim.py", line 25, in step_and_update_lr
self.optimizer.step()
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/adam.py", line 144, in step
eps=group['eps'])
File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/functional.py", line 86, in adam
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: output with shape [200] doesn't match the broadcast shape [4, 200]
I want to know how to solve it. Thanks in advance!
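A hedged guess at the failure mode, not a repo-confirmed fix: the error says Adam's per-parameter state (exp_avg, shape [200]) no longer matches the gradient it is applied to (shape [4, 200]), which can happen when a checkpointed optimizer state is restored against differently-shaped or differently-wrapped (e.g. DataParallel) parameters. One workaround is to rebuild the optimizer after the model is in its final form, discarding the stale state:

```python
import torch

# Placeholder model; the real script loads its own generator.
model = torch.nn.Linear(200, 200)
optimizer = torch.optim.Adam(model.parameters())

# ... load model weights from the checkpoint here ...

# Rebuild the optimizer instead of loading its saved state, so every
# exp_avg/exp_avg_sq buffer is lazily created with the right shape
# on the first step (at the cost of losing Adam's momentum history).
optimizer = torch.optim.Adam(model.parameters())
print(len(optimizer.state))  # 0: fresh state, no shape mismatch possible
```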
@evonneng
Hi! You indicate that raw 3D meshes can be rendered using the DECA renderer. Could you tell me how to process your output results (pkl files) so they can be used as input to DECA? It seems that DECA can only take images as input. Meanwhile, there are only 3 parameters (exp, pose, prob) in the pkl files; is that enough for DECA to generate output?
Thank you for the wonderful study and expressive codebase!
However, when I checked your repo and code, the author seems to have changed the input size for the facial features.
I was curious about the reason for this change relative to the paper because, theoretically, the latent detail code should encode static, person-specific details
that are independent of the listener's expression behaviour. May I ask why the author did this? It seems a bit redundant to add a temporally correlated feature to the analysis (I tested this on DECA, and the rendered facial profile of a subject with different expressions was nearly identical with or without the detail code).
p0_list_faces_clean_deca.npy - face features (N x 64 x 184) for when p0 is listener
N sequences of length 64. Features of size 184, which includes the deca parameter set of expression (50D), pose (6D), and details (128D).
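Per the listing above, the 184-D vector splits as expression (50) + pose (6) + detail (128); the slice order below follows the listing order and is an assumption to verify against the repo's loading code:

```python
import numpy as np

# Synthetic stand-in for p0_list_faces_clean_deca.npy: N sequences of 64 frames.
N, T = 2, 64
feats = np.zeros((N, T, 184), dtype=np.float32)

exp    = feats[..., :50]      # DECA expression coefficients (50-D)
pose   = feats[..., 50:56]    # jaw/head pose (6-D)
detail = feats[..., 56:]      # person-specific detail code (128-D)

print(exp.shape, pose.shape, detail.shape)
```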
I am trying to reproduce the audio feature pre-processing for a longer time-window experiment, but the only detailed instructions available were in #2. However, the script in that answer seems to extract MFCC features from the extracted audio, which returns an output with a different shape (1 x 4T x 20) compared to the audio features in the dataset (1 x 4T x 128).
My snippet on Google Colab can be found HERE
Yes, we chose 4*T to allow for temporal alignment with the 30fps framerate of the videos just to make it easier to process both the audio and the video frames in a unified way. The T here refers to the number of frames in the video clip. So for the purposes of this paper, T=64. The exact code used to calculate the melspecs is as follows:
def load_mfcc(audio_path, num_frames):
    waveform, sample_rate = librosa.load('{}'.format(audio_path), sr=16000)
    win_len = int(0.025 * sample_rate)
    hop_len = int(0.010 * sample_rate)
    fft_len = 2 ** int(np.ceil(np.log(win_len) / np.log(2.0)))
    # This line by default only extracts 20 MFCCs
    S_dB = librosa.feature.mfcc(y=waveform, sr=sample_rate, hop_length=hop_len)
    ## do some resizing to match frame rate
    im = Image.fromarray(S_dB)
    _, feature_dim = im.size
    scale_four = num_frames * 4
    im = im.resize((scale_four, feature_dim), Image.ANTIALIAS)
    S_dB = np.array(im)
    return S_dB
Hope this helps!
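The resize trick can be sanity-checked on synthetic data: an arbitrary (20 x M) MFCC matrix is stretched along the time axis to 4*num_frames columns. `Image.LANCZOS` stands in here for the deprecated `Image.ANTIALIAS` alias (they are the same filter):

```python
import numpy as np
from PIL import Image

num_frames = 64
S_dB = np.random.rand(20, 100).astype(np.float32)  # 20 MFCCs x 100 hops

im = Image.fromarray(S_dB)           # PIL size is (width, height) = (100, 20)
_, feature_dim = im.size             # feature_dim = 20
im = im.resize((num_frames * 4, feature_dim), Image.LANCZOS)
S_dB = np.array(im)
print(S_dB.shape)  # (20, 256): feature_dim x 4T; transpose for (4T x 20)
```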
I also tried extracting the Mel spectrogram normally, and even combined it with librosa's power_to_db,
but the scale of my output still did not match the original dataset.
S_dB = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
# optional
# S_dB = librosa.power_to_db(S_dB)
Below are the expected output and my outputs from the Mel spectrogram function, before and after power_to_db. I extracted them from the same video file done_conan_videos0/021LukeWilsonIsStartingToLookLikeChrisGainesCONANonTBS
from the raw dataset, based on the metadata provided by *_files_clean*.
I assumed that correct preprocessing would produce the same output as your original dataset.
My output:
array([3.49407317e-04, 1.72899290e-05, 9.88764441e-06, 9.31489740e-06,
       2.19979029e-05, 4.02382248e-05, 5.83300316e-05, 1.78770599e-04,
After power_to_db:
array([-34.508053, -46.10779 , -48.621204, -49.872578, -46.910652,
       -42.93151 , -41.84772 , -37.57675 , -38.1189 , -38.486935,
Here's the dataset:
array([[-50.593018, -47.35103 , -45.426086, -41.643738, -42.111137,
        -41.75349 , -41.146526, -38.722565, -39.55792 , -39.344612,
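One hedged guess at the scale mismatch, not a confirmed diagnosis: librosa.power_to_db defaults to ref=1.0, while many pipelines pass ref=np.max, which shifts the whole curve so the peak maps to 0 dB. The arithmetic below mirrors power_to_db's 10*log10(S/ref) core on toy values:

```python
import numpy as np

S = np.array([1e-2, 1.0, 100.0])  # toy power-spectrogram values

db_ref_one = 10.0 * np.log10(S / 1.0)      # ref=1.0 (librosa's default)
db_ref_max = 10.0 * np.log10(S / S.max())  # ref=np.max shifts peak to 0 dB

print(db_ref_one)  # [-20.   0.  20.]
print(db_ref_max)  # [-40. -20.   0.]
```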
May I ask your advice on how to extract the audio features, in detail, for reproducing the dataset? I believe other readers share the same question; see THIS.
I tried generating output from the data provided with the paper, and I found that there are consistent jumps between frames. Most of the time they occur every 32 frames, but they also occur at intervals of 16 frames. Is this the usual behaviour of the model? I am also attaching a video that I generated using the model.
I would like to use the model from this research paper for my experiments, but I am not sure how to reconstruct the output videos from the pkl files. Please guide me through the steps for reconstructing the videos. Thanks!
Hi @evonneng,
I have reached around 0.016 validation loss after training the model for only 2 hours and 390 steps. Is that usually the case?
Hello author, while reading your paper and code I ran into a question. The flow chart in the paper only shows the process of predicting the listener for frames 32 to 39. How are frames 40 to 64 predicted, and where is this reflected in the code?
Hello, can you provide the raw audio data? Thanks a lot.
Thank you very much for this great work. Would you mind sharing the rendering code from DECA parameters to real pictures?
Hi, @evonneng
There are 3 keys (exp, pose, prob) in the pkl files, but the decode function in DECA needs the keys ['shape', 'tex', 'exp', 'pose', 'cam', 'light']. How can these three keys be mapped onto those?
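One hedged sketch of a possible mapping, not the authors' confirmed pipeline: feed the saved exp/pose codes into a DECA-style codedict and fill the missing codes with neutral defaults (or, better, with shape/tex/cam/light codes estimated once from a real reference image of the subject). The dimensions below follow DECA's defaults (n_shape=100, n_tex=50, n_cam=3, lighting as 9x3 spherical harmonics) and are assumptions:

```python
import torch

exp  = torch.zeros(1, 50)   # expression code, as stored in the pkl file
pose = torch.zeros(1, 6)    # jaw/head pose, as stored in the pkl file

codedict = {
    'exp':   exp,
    'pose':  pose,
    'shape': torch.zeros(1, 100),               # identity geometry (neutral)
    'tex':   torch.zeros(1, 50),                # albedo code (neutral)
    'cam':   torch.tensor([[10.0, 0.0, 0.0]]),  # weak-perspective camera guess
    'light': torch.zeros(1, 9, 3),              # SH lighting coefficients
}
print({k: tuple(v.shape) for k, v in codedict.items()})
```

Keeping shape/tex fixed per subject while varying exp/pose per frame matches the intuition that identity should stay constant across the generated sequence.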