sstzal / difftalk

[CVPR2023] The implementation for "DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation"

Shell 0.04% Python 99.96%

difftalk's Issues

about loss function

Thanks to the author for sharing. I have a question: which loss function does the model use? I only found an MSE loss in the code. If the author sees this, could you take a moment to answer? Thank you.

Hey there! You mentioned that 100 videos were randomly selected from HDTF for testing and the remaining ones were used for training. Could you provide the specific filenames so that we can compare our methods with yours?

Thanks a lot~

About DeepSpeechRNN

Your paper mentions using this model, but the project code does not use it. Is this model used in the preprocessing program instead?

Hello, when will the code be released?

A Chinese-language discussion group has been created. If you want to join, please add WeChat xaaheng.

Is this method realtime?

How much time does it take to predict a frame? Can I use it for making a real time chatbot?

Code license

Thank you for your research,
What is the code license?

The usage of RAM is always increasing during one epoch.

After preprocessing the HDTF dataset, I got 415 videos.
249 videos (60%) were randomly selected as the training set, and the others (40%) formed the test set.
The first 1500 frames of each video were extracted for training with a stride of 2.
This gave me 277,117 frames in the training set and 179,711 frames in the test set.

My machine has 4 A100 GPUs with 40GB VRAM, 377GB RAM, and 72GB swap.
For training, the batch size is set to 16.
During the first epoch, RAM usage keeps increasing.
At step 2743, all RAM was occupied (even the swap space) and training stopped.
Thus, 2743 * 16 * 4 = 175,552 is the maximum number of frames that can be used in training on my machine, and the test set was not taken into account.
I tried reducing both the training and test sets to 10,000 frames, and the training process then runs fine.
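The frame count above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the post; it assumes the batch size of 16 is per GPU (which the factor of 4 implies), and the function name is made up for illustration.

```python
def frames_consumed(steps: int, batch_size: int, num_gpus: int) -> int:
    """Frames read so far, assuming `batch_size` is per GPU (an assumption)."""
    return steps * batch_size * num_gpus


# Numbers taken from the post: training stopped at step 2743.
total = frames_consumed(steps=2743, batch_size=16, num_gpus=4)
print(total)  # 175552

# Share of the 277,117 training frames that had been read by then.
train_frames = 277_117
print(f"{total / train_frames:.1%} of the training set")
```

In other words, memory ran out before the first pass over the training set had finished.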

Questions @sstzal :

Did you meet the same problem in your training?
If so, how did you solve the problem?
Is it possible to release the weights of diffusion model?

My guess is that the problem is caused by too much being logged during training.

How to get comparison results like this?

How to obtain the facial landmarks in the project

May I ask how the facial landmarks in the project were obtained? Neither the paper nor the project mentions the dimensions of the facial landmarks or the method used to obtain them.

Do we need to have the same number of images, landmarks and audio features?

Thanks for your great work. I am confused about one thing in the preprocessing stage. When we extract images, landmarks and audio features from a video, do we need the same number of files for each? I got different numbers: for example, 2247 images and 2247 landmarks, but only 937 audio feature files. Could someone please answer this?
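A quick way to see such a mismatch is to count the files per modality. The sketch below is not part of the DiffTalk code: the `images` and `landmarks` folder names follow the directory tree quoted elsewhere in this list, while the audio folder name and the `.npy` suffix are assumptions that should be adapted to your own preprocessing output. Whether the three counts are required to match is exactly the open question above, so the script only reports them.

```python
from pathlib import Path


def count_samples(root, layout):
    """Return {name: file count} for each (subdirectory, suffix) pair in layout."""
    root = Path(root)
    counts = {}
    for name, (subdir, suffix) in layout.items():
        folder = root / subdir
        counts[name] = len(list(folder.glob(f"*{suffix}"))) if folder.is_dir() else 0
    return counts


def is_consistent(counts):
    """True when every modality has the same number of files."""
    return len(set(counts.values())) <= 1


# Hypothetical layout: the audio folder name and suffix are assumptions.
LAYOUT = {
    "images": ("images", ".jpg"),
    "landmarks": ("landmarks", ".lms"),
    "audio": ("audio_smooth", ".npy"),
}

counts = count_samples("./data/HDTF", LAYOUT)
print(counts, "consistent" if is_consistent(counts) else "MISMATCH")
```

With the numbers from the post (2247, 2247, 937), `is_consistent` returns `False`.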

How can the driven-audio feature a and the landmark representation l be used in the cross-attention module?

As far as I understand, the driven-audio feature a and the landmark representation l are each a single vector, not a sequence of vectors. How can they be used as Key and Value in the cross-attention module?
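For reference, standard scaled dot-product cross-attention only needs the conditioning input to be a sequence of length m ≥ 1. The notation below is generic and not taken from the DiffTalk code: Z stands for the n flattened latent features, C for the m conditioning tokens, and W_Q, W_K, W_V for the usual learned projections.

```latex
Z \in \mathbb{R}^{n \times d_z}, \quad C \in \mathbb{R}^{m \times d_c}, \quad
W_Q \in \mathbb{R}^{d_z \times d}, \quad W_K, W_V \in \mathbb{R}^{d_c \times d}

Q = Z W_Q, \qquad K = C W_K, \qquad V = C W_V

\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
  \in \mathbb{R}^{n \times d}

% Special case m = 1: each softmax row has a single entry, which equals 1.
m = 1: \quad
\mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) = \mathbf{1}_{n \times 1}
\;\Longrightarrow\;
\mathrm{Attention}(Q, K, V) = \mathbf{1}_{n \times 1}\, V
```

So a single vector is a valid context, although every query then receives the same value vector. A vector can also be reshaped into several tokens: the snippet quoted in the "channel error" issue in this list reshapes the audio feature with `c.reshape(bs, 8, -1)` into a tensor of shape [20, 8, 8].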

Where can I download the HDTF dataset?

"Please download the HDTF dataset for training and test, and process the dataset as following."

Sorry, I'm a newbie.
Please tell me where I can download the HDTF dataset. If you could give me a URL, that would be great!

How did you split dataset for training and validation?

In the data directory, there are data_test.txt for validation and data_train.txt for training.
How did you split the dataset? By portrait or by video?
By portrait means that persons in the training set do not appear in the validation set.
By video means that all videos are randomly split into training and validation sets.
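The two strategies can be sketched as follows. This only illustrates the distinction drawn above and is not the authors' procedure; the function names and the video-to-person mapping are hypothetical, since the post does not say how identities are recorded.

```python
import random
from collections import defaultdict


def split_by_video(videos, test_fraction, seed=0):
    """Randomly split items into (train, test), ignoring who appears in them."""
    rng = random.Random(seed)
    shuffled = sorted(videos)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_by_portrait(video_to_person, test_fraction, seed=0):
    """Split persons first, so no person appears in both sets."""
    by_person = defaultdict(list)
    for video, person in video_to_person.items():
        by_person[person].append(video)
    train_people, test_people = split_by_video(by_person, test_fraction, seed)
    train = [v for p in train_people for v in by_person[p]]
    test = [v for p in test_people for v in by_person[p]]
    return train, test


# Hypothetical mapping from video name to person identity.
videos = {"a_0": "a", "a_1": "a", "b_0": "b", "b_1": "b", "c_0": "c", "d_0": "d"}
train, test = split_by_portrait(videos, test_fraction=0.5)
print(sorted(train), sorted(test))
```

A split by video can place two clips of the same speaker on both sides, which a split by portrait rules out.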

In train_name.txt, there are 98 videos. However, 99 videos can be found in data_train.txt.
What is the relationship between train_name.txt and data_train.txt?

Could you provide test_name.txt just like train_name.txt to indicate videos used in validation?

Inference question

When running inference, I only get an incomplete image with landmarks and mask. What do I need to do in order to get a clean image?

Is there a comparable Chinese-language dataset?

Can you provide the processed data or the related processing code?

Nice job! Can you release the dataset you used?

Amazing work! When will the code be released? Thank you.

pretrained model

Awesome paper!
We are extremely interested in your work, and we would like to run inference to see the results from the paper ourselves. Could you share the complete pretrained model, perhaps through Google or Baidu online storage? Thanks for your time.

The requirements.txt is so bad

Can anyone share a usable requirements.txt? The current one has many conflicts and errors.

confusion in processing

|——data/HDTF
    |——images
        |——0_0.jpg
        |——0_1.jpg
        |——...
        |——N_M.bin
    |——landmarks
        |——0_0.lmd
        |——0_1.lmd
        |——...
        |——N_M.lms
What is N_M.bin? Should it be N_M.jpg in images, and N_M.lmd in landmarks?

What's the data.txt?

Error: No such file or directory: './data/HDTF/data.txt'
Is this file the concatenation of data_train.txt and data_test.txt?

Leaving my mark here before this gets popular

Can you provide the processed data or the related processing code?

about data_test

What does each line in data_test.txt mean? I guess the part before '_' is the video ID and the part after it is the frame number within that video. But some videos do not have all the frames of the original video listed. So what exactly does each line mean?
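To check that guess against the file, the entries can be grouped per video. The sketch below assumes each line has the form `<video_id>_<frame_index>`, which is the poster's reading and is not confirmed by the authors; the function name is hypothetical.

```python
from collections import defaultdict


def frames_per_video(lines):
    """Group '<video_id>_<frame_index>' entries (assumed format) by video ID."""
    frames = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        video_id, frame_idx = line.split("_", 1)
        frames[video_id].append(int(frame_idx))
    return {vid: sorted(idx) for vid, idx in frames.items()}


sample = ["0_0", "0_1", "0_5", "3_2"]
print(frames_per_video(sample))  # {'0': [0, 1, 5], '3': [2]}

# With the real file (path as used in this thread):
# with open("./data/HDTF/data_test.txt") as f:
#     grouped = frames_per_video(f)
# print(len(grouped), "videos")
```

The number of keys gives the number of distinct videos, and gaps in a video's sorted indices show which frames are missing from the list.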

Difficulties reproducing the paper: the model cannot be loaded and inference cannot run

Hello friends. Has anyone successfully reproduced this paper? I ran into some difficulties while reproducing it, using the model parameters provided by the author directly. When strict is set to True in m, u = model.load_state_dict(sd, strict=True), the model cannot be loaded and the inference process cannot run. I also trained the model myself and found that the saved checkpoint is 8.2G. Does anyone have the same problem? I hope to get your help, thank you.
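In PyTorch, `load_state_dict(..., strict=True)` raises an error when the checkpoint and the model do not have exactly the same parameter names, whereas `strict=False` returns the missing and unexpected keys instead. The sketch below shows that comparison in plain Python so the mismatch can be inspected directly; the helper name and the example keys are made up for illustration.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, from the model's point of view."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # in model, not in checkpoint
    unexpected = sorted(checkpoint_keys - model_keys)   # in checkpoint, not in model
    return missing, unexpected


# Made-up key names, only to show the output format.
model_keys = ["encoder.weight", "encoder.bias", "audio_net.weight"]
checkpoint_keys = ["encoder.weight", "encoder.bias", "old_head.weight"]
missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
print("missing:", missing)        # missing: ['audio_net.weight']
print("unexpected:", unexpected)  # unexpected: ['old_head.weight']
```

With a real model this would be called as `diff_state_dict_keys(model.state_dict().keys(), sd.keys())`, where `model` and `sd` are the objects from the post.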

preprocessing code (tested and working)

https://github.com/yxdydgithub/difftalk_preprocess (tested and working)

ModuleNotFoundError: No module named 'ldm.util'; 'ldm' is not a package

I encountered a problem with the package 'ldm'. My environment:
ldm==0.1.3
python==3.7
pytorch==1.12.1
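The message "'ldm' is not a package" means that the name `ldm` resolved to a plain module rather than a directory with submodules, so `ldm.util` cannot be found inside it. A small standard-library check shows which file Python actually picks up; the helper name is hypothetical, and whether the installed `ldm==0.1.3` is the culprit is something this check can confirm but the post alone cannot.

```python
import importlib.util


def describe_import(name):
    """Report where a top-level name resolves to and whether it is a package."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # nothing importable under this name
    return {
        "origin": spec.origin,
        # Packages have submodule search locations; plain modules do not.
        "is_package": spec.submodule_search_locations is not None,
    }


print(describe_import("ldm"))
```

If `origin` points into `site-packages` instead of the repository's own `ldm` directory, the installed distribution is shadowing the local code.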

How to test with my own ref_image and audio to generate an audio-driven video

channel error

    elif cond_class == "audio":
        if self.cond_stage_forward is None:
            bs = c.shape[0] # 20
            c = c.reshape(-1,16,29) # [20, 16, 29]
            c = self.cond_stage_model_for_audio(c) # [20, 64]
            c = c.reshape(bs, 8, -1) # [20, 8, 8]
            c = self.cond_stage_model_for_audio_smooth(c)

When processing the audio features, the network requires an input of shape (B, 16, 29), and c.reshape(-1,16,29) confirms this expected input shape. My audio input matches it, but the call c = self.cond_stage_model_for_audio_smooth(c) raises: RuntimeError: Given groups=1, weight of size [16, 32, 3], expected input[20, 8, 8] to have 32 channels, but got 8 channels instead
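The error message can be decoded from the shapes alone. A PyTorch `Conv1d` weight has shape (out_channels, in_channels / groups, kernel_size) and its input has shape (batch, channels, length). The sketch below uses plain Python with hypothetical helper names and only reproduces that check; it does not say which side of the mismatch should change.

```python
def conv1d_expected_channels(weight_shape, groups=1):
    """Input channels a Conv1d with this weight shape expects."""
    out_channels, in_per_group, kernel_size = weight_shape
    return in_per_group * groups


def check_conv1d_input(weight_shape, input_shape, groups=1):
    """Return (expected_channels, actual_channels, shapes_match)."""
    expected = conv1d_expected_channels(weight_shape, groups)
    batch, channels, length = input_shape
    return expected, channels, expected == channels


# Shapes copied from the error message above.
print(check_conv1d_input((16, 32, 3), (20, 8, 8)))  # (32, 8, False)
```

The smoothing layer was therefore built for 32 input channels, while `c.reshape(bs, 8, -1)` produces 8 channels from the 64 features per sample.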

Code release date?

Awesome paper!
Do you have a rough estimate on code release date?

deepspeech model version

I use deepspeech==0.9.3; however, it raises an error:
graph_def.ParseFromString(f.read())
google.protobuf.message.DecodeError: Error parsing message with type 'tensorflow.GraphDef'
