sstzal / difftalk
[CVPR2023] The implementation for "DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation"
Thanks to the author for sharing. I have a question: which loss function does the model use? I only found an MSE loss in the code. If the author sees this, could you take a moment to answer? Thank you.
Hey there! You mentioned that 100 videos were randomly selected from HDTF for testing and the rest were used for training. Could you provide the specific filenames so that we can compare our methods with yours?
Thanks a lot~
How long does it take to predict one frame? Can I use it to build a real-time chatbot?
Thank you for your research.
What is the code license?
After preprocessing the HDTF dataset, I got 415 videos.
249 videos (60%) were randomly selected as the training set, and the rest (40%) formed the test set.
The first 1500 frames of each video were extracted for training with stride 2.
So, I got 277,117 frames in the training set and 179,711 frames in the test set.
My machine has 4 A100 GPUs with 40 GB VRAM, 377 GB of RAM, and 72 GB of swap.
In training, the batch size is set to 16.
During the first epoch, RAM usage kept increasing.
At step 2743, all RAM was occupied (including the swap space) and training stopped.
Thus, 2743 * 16 * 4 = 175,552 is the maximum number of frames that can be used for training on my machine, and that does not even take the test set into account.
I tried reducing both the training and test sets to 10,000 frames, and training then ran without problems.
Questions @sstzal :
My guess is that the problem is caused by too much logging during training.
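The frame budget quoted in the post above can be checked with a few lines of Python. This is only a restatement of the poster's arithmetic; reading the batch size of 16 as a per-GPU value is an assumption implied by the multiplication by 4.

```python
# Numbers reported in the post above.
steps_before_oom = 2743
batch_size_per_gpu = 16  # assumption: the batch size of 16 is per GPU
num_gpus = 4
train_frames = 277_117

# Frames consumed before RAM and swap were exhausted.
frames_seen = steps_before_oom * batch_size_per_gpu * num_gpus
print(frames_seen)  # 175552

# Share of the first epoch that completed before the crash.
fraction_of_epoch = frames_seen / train_frames
print(f"{fraction_of_epoch:.1%} of the first epoch")
```

In other words, training stopped roughly two thirds of the way through the first epoch, which is consistent with memory growing steadily per step rather than being exhausted at load time.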
May I ask how the facial landmarks in the project were obtained? Neither the paper nor the project mentions the dimensions of the facial landmarks or the method used to obtain them.
Thanks for your great work. I am confused about one thing in the preprocessing stage. When we extract images, landmarks, and audio features from a video, do we need the same number of files for each? I got different numbers: for example, 2247 images and 2247 landmarks, but only 937 audio feature files. Could someone please answer this?
As far as I understand, the driving-audio feature a and the landmark representation l are each a single vector, not a sequence of vectors, so how can they be used as Key and Value in the cross-attention module?
Sorry, I'm a newbie.
Could you tell me where I can download the HDTF dataset? A URL would be great!
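A quick way to locate such a mismatch is to count the files per video in each folder. The sketch below assumes files are named `<video>_<frame><suffix>` as in the project's directory listing; the folder name `audio_smooth` and the `.npy` suffix are assumptions for illustration, so adjust them to whatever your preprocessing actually writes.

```python
from collections import Counter
from pathlib import Path


def count_by_video(folder, suffix):
    """Count files named <video>_<frame><suffix> per video id."""
    counts = Counter()
    for path in Path(folder).glob(f"*{suffix}"):
        video_id, _, _frame = path.stem.rpartition("_")
        counts[video_id] += 1
    return counts


def report_mismatches(root, layout):
    """Return {video_id: {subfolder: count}} for videos whose counts differ.

    `layout` maps a subfolder name to the file suffix expected inside it.
    """
    per_folder = {
        name: count_by_video(Path(root) / name, suffix)
        for name, suffix in layout.items()
    }
    video_ids = set().union(*per_folder.values())
    mismatches = {}
    for video_id in sorted(video_ids):
        counts = {name: per_folder[name].get(video_id, 0) for name in layout}
        if len(set(counts.values())) > 1:
            mismatches[video_id] = counts
    return mismatches
```

For example, `report_mismatches("./data/HDTF", {"images": ".jpg", "landmarks": ".lms", "audio_smooth": ".npy"})` would list every video whose three counts disagree. Whether the counts are required to match is a question for the author; this only shows where they differ.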
In the data directory, there are data_test.txt for validation and data_train.txt for training.
How did you split the dataset: by portrait or by video?
By portrait means that persons in the training set do not appear in the validation set.
By video means that all videos are randomly split into training and validation sets.
In train_name.txt, there are 98 videos. However, 99 videos can be found in data_train.txt.
What is the relationship between train_name.txt and data_train.txt?
Could you provide a test_name.txt, like train_name.txt, listing the videos used for validation?
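To make the two splitting strategies in this question concrete, here is a minimal sketch of both. It is not the author's procedure. The `person_of` callback is a hypothetical helper that maps a video name to a person id; how to derive that id from HDTF filenames is an assumption left to the caller.

```python
import random


def split_by_video(videos, test_ratio, seed=0):
    """Randomly split videos; the same person may land in both sets."""
    rng = random.Random(seed)
    shuffled = sorted(videos)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def split_by_portrait(videos, person_of, test_ratio, seed=0):
    """Split persons first, so no person appears in both sets."""
    rng = random.Random(seed)
    persons = sorted({person_of(v) for v in videos})
    rng.shuffle(persons)
    n_test = round(len(persons) * test_ratio)
    test_persons = set(persons[:n_test])
    train = [v for v in videos if person_of(v) not in test_persons]
    test = [v for v in videos if person_of(v) in test_persons]
    return train, test
```

With a by-portrait split the ratio applies to persons rather than videos, so the resulting video counts can deviate from the requested ratio when some persons have more videos than others.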
May I ask whether a similar Chinese dataset exists?
Nice job! Can you release the dataset you used?
Awesome paper!
We are extremely interested in your work and would like to run inference to see the awesome results from the paper. Could you share the complete pretrained model, perhaps via Google or Baidu cloud storage? Thanks for your time.
Can anyone share a usable requirements.txt? The current one has many conflicts and errors.
|——data/HDTF
    |——images
        |——0_0.jpg
        |——0_1.jpg
        |——...
        |——N_M.bin
    |——landmarks
        |——0_0.lmd
        |——0_1.lmd
        |——...
        |——N_M.lms
What is N_M.bin? Should it be N_M.jpg in images, and N_M.lmd in landmarks?
Error: No such file or directory: './data/HDTF/data.txt'
Is this file the concatenation of data_train.txt and data_test.txt?
Leaving a comment here before this gets popular.
What does each line in data_test.txt mean? I guess the part before the '_' is the video id and the part after it is the frame number within that video. But some videos do not list all frames of the original video. So what exactly does each line mean?
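Under the poster's guess that each line is `<video id>_<frame number>`, the file can be inspected with a small script like the one below. The line format is an unconfirmed assumption, and the function names are made up for illustration.

```python
from collections import defaultdict


def group_frames(lines):
    """Group '<video>_<frame>' lines into {video_id: sorted frame numbers}."""
    frames = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        video_id, _, frame = line.rpartition("_")
        frames[video_id].append(int(frame))
    return {video_id: sorted(idx) for video_id, idx in frames.items()}


def missing_frames(indices):
    """Frame numbers absent between the smallest and largest listed index."""
    present = set(indices)
    return [i for i in range(min(indices), max(indices) + 1) if i not in present]
```

Calling `group_frames(open("./data/HDTF/data_test.txt"))` and then `missing_frames` on each video's list shows which frames are skipped, which may help tell a deliberate subsampling pattern from frames dropped during preprocessing.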
Hello friends. Has anyone successfully reproduced this paper? I ran into some difficulties while reproducing it using the model parameters provided by the author. When strict is set to True in m, u = model.load_state_dict(sd, strict=True), the model cannot be loaded and inference does not run. I also trained the model myself and found that the saved checkpoint is 8.2 GB. Does anyone have the same problem? I would appreciate your help. Thank you.
How can I test with my own ref_image and audio to generate an audio-driven video?
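When `load_state_dict` fails with `strict=True`, it usually helps to see which parameter names differ between the model and the checkpoint. In PyTorch, calling it with `strict=False` returns the `missing_keys` and `unexpected_keys`, although tensors with mismatched shapes still raise an error. The sketch below does the same key comparison in plain Python so it can be run on any two collections of names; the helper names are made up for illustration.

```python
from collections import Counter


def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names, like strict=False reports."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # model expects, checkpoint lacks
    unexpected = sorted(checkpoint_keys - model_keys)   # checkpoint has, model lacks
    return missing, unexpected


def summarize_by_prefix(keys):
    """Count keys per top-level module name to spot whole missing submodules."""
    return Counter(key.split(".", 1)[0] for key in keys)
```

With a real model this would be called as `diff_state_dict_keys(model.state_dict().keys(), sd.keys())`. If all missing keys share one prefix, the checkpoint probably lacks that submodule, or the config used to build the model differs from the one used in training.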
elif cond_class == "audio":
    if self.cond_stage_forward is None:
        bs = c.shape[0]  # 20
        c = c.reshape(-1, 16, 29)  # [20, 16, 29]
        c = self.cond_stage_model_for_audio(c)  # [20, 64]
        c = c.reshape(bs, 8, -1)  # [20, 8, 8]
        c = self.cond_stage_model_for_audio_smooth(c)
When processing the audio features, the network requires an input of shape (B, 16, 29), and c.reshape(-1,16,29) confirms this. My audio input matches that shape, but the call c = self.cond_stage_model_for_audio_smooth(c) fails with RuntimeError: Given groups=1, weight of size [16, 32, 3], expected input[20, 8, 8] to have 32 channels, but got 8 channels instead
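The error message can be decoded from the shapes alone. A 1D convolution weight has shape [out_channels, in_channels / groups, kernel_size], so a weight of size [16, 32, 3] with groups=1 expects an input of shape [B, 32, L], while the reshape above produced [20, 8, 8]. The sketch below reproduces that check in plain Python. It is a hypothetical helper, not part of the project, and it does not tell you which side (the reshape or the layer config) is the one to change.

```python
def conv1d_output_shape(input_shape, weight_shape, groups=1,
                        stride=1, padding=0, dilation=1):
    """Check a Conv1d call by shape only and return the output shape.

    input_shape:  (batch, in_channels, length)
    weight_shape: (out_channels, in_channels // groups, kernel_size)
    """
    batch, in_channels, length = input_shape
    out_channels, in_per_group, kernel = weight_shape
    expected_in = in_per_group * groups
    if in_channels != expected_in:
        raise ValueError(
            f"expected {expected_in} input channels, got {in_channels}"
        )
    out_len = (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1
    return (batch, out_channels, out_len)


try:
    conv1d_output_shape((20, 8, 8), (16, 32, 3))
except ValueError as exc:
    print(exc)  # expected 32 input channels, got 8
```

An input of shape (20, 32, 8) would pass this check, which suggests comparing the number of audio windows fed per sample with the channel count the smoothing layer was configured for.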
Awesome paper!
Do you have a rough estimate of the code release date?
I am using deepspeech==0.9.3, but it raises an error:
graph_def.ParseFromString(f.read())
google.protobuf.message.DecodeError: Error parsing message with type 'tensorflow.GraphDef'