mrzzm / DINet
The source code of "DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video."
Hello,
many thanks for this amazing work. I was just wondering if there was any specific reason for using this version of DeepSpeech?
I was thinking of retraining the model with different audio processing, for example a newer version of DeepSpeech or mel-spectrograms. However, I want to know whether you have already tried this and whether there are any objections.
Thanks
"Hello, may I ask if anyone has encountered issues with the pre-trained syncnet provided by dinet author or if it is extremely sensitive to the dataset? I trained it on my own downloaded hdtf dataset and found that the syncloss kept oscillating on the ground truth data."
In wav2lip, these two modules directly output a single number as the result, whereas in DINet the output is a feature map shaped roughly like (1,1,2,2), which is compared against an expanded all-ones matrix when computing the loss. Although this is probably equivalent in essence to adding an avg-pooling layer, it still feels a bit awkward to me.
As the title suggests, I would like to get rid of the TF dependency and try to convert the full model to ONNX. DeepSpeech is the first challenge. It is released as a black box and doesn't seem consistent with the latest official DeepSpeech releases.
Where does it come from exactly?
I want to speed up inference. I thought generating frames batch by batch could speed it up. How can I implement this?
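One generic approach (a minimal sketch under my own assumptions; `net_g`, `per_frame_inputs`, and the generator signature are placeholders, not DINet's actual API) is to stack per-frame inputs and call the generator once per batch:

```python
import torch

# Hypothetical batched inference loop: collect per-frame tensors and run the
# generator once per batch instead of once per frame.
batch_frames, batch_audio, outputs = [], [], []
with torch.no_grad():
    for frame_t, audio_t in per_frame_inputs:  # placeholder iterable
        batch_frames.append(frame_t)
        batch_audio.append(audio_t)
        if len(batch_frames) == 16:  # arbitrary batch size
            outputs.append(net_g(torch.stack(batch_frames).cuda(),
                                 torch.stack(batch_audio).cuda()).cpu())
            batch_frames, batch_audio = [], []
    if batch_frames:  # flush the remainder
        outputs.append(net_g(torch.stack(batch_frames).cuda(),
                             torch.stack(batch_audio).cuda()).cpu())
```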
I got this error: 'our method can not handle videos with large change of facial size!!'
However, my video does not have a huge change in facial size. Could the OpenFace landmarks be at fault? I debugged the error; it reaches the stage below and fails the inference:
elif max(radius_clip) > min(radius_clip) * 1.5:
return False, None
I also checked that OpenFace has multiple landmark settings:
CLM CLNF CECLM
Do I have to choose one of these?
I also attached my csv file.
new_light_English.csv
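For anyone hitting this, here is a rough diagnostic sketch. It assumes the CSV contains OpenFace's 2D landmark columns x_0..x_67 / y_0..y_67 (written with -2Dfp), and it only loosely mirrors the 1.5x face-size ratio test in inference.py:

```python
import pandas as pd

# Rough diagnostic, assuming OpenFace 2D landmark columns (x_0..x_67, y_0..y_67).
df = pd.read_csv('new_light_English.csv')
df.columns = [c.strip() for c in df.columns]  # OpenFace pads headers with spaces
xs = df[[f'x_{i}' for i in range(68)]]
ys = df[[f'y_{i}' for i in range(68)]]
# Per-frame face-size proxy: width + height of the landmark bounding box.
size = (xs.max(axis=1) - xs.min(axis=1)) + (ys.max(axis=1) - ys.min(axis=1))
print('max/min face-size ratio:', float(size.max() / size.min()))  # > 1.5 would trip the check
```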
Hello, thanks for your nice work! My GPU (2080 Ti) has 12 GB of memory. When I run inference.py, it runs out of GPU memory. How much GPU memory do you use?
There is a black box around the mouth! 😂
No matter what I try, I can't fix it. It is especially noticeable on women with fair skin!
I generated videos using driving_audio_x.wav from the official ./asserts/examples and they work fine; the mouth movements are smooth. However, when I synthesize with audio generated by TTS models such as bark or tortoise, the mouth in the generated talking head barely moves. Notably, when I load and re-save driving_audio_x.wav through torchaudio, its bit rate doubles, and after synthesizing a video with it the mouth barely moves either. Does anyone have any ideas?
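One thing worth trying (my guess, not confirmed by the authors): normalize the TTS output to 16 kHz mono before feeding it to the DeepSpeech feature extractor, for example:

```python
import librosa
import soundfile as sf

# Resample a TTS wav to 16 kHz mono before inference; 'tts_output.wav' is a
# placeholder name, and 16 kHz is my assumption about what DeepSpeech expects.
audio, sr = librosa.load('tts_output.wav', sr=16000, mono=True)
sf.write('tts_output_16k.wav', audio, 16000)
```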
I see speech-to-text is not mentioned in the paper. Is it just used for training, or for inference too? And what purpose does it serve?
Thanks to the author for this excellent work. In practice, I found that lip-sync quality for Chinese speech is relatively poor. Could you consider upgrading the speech model to the latest DeepSpeech version or another version that supports Chinese? Thank you. My WeChat: 13718542435; I hope to discuss with the author.
Where can I find the supplementary materials from the paper?
Amazing! Thanks for your contribution.
To simplify the network structure, can we:
First Colab release on the whole web, one-click run script:
Baidu Pan link: https://pan.baidu.com/s/13DbElzZjAigtkwsGaHVfgA?pwd=1234
Extraction code: 1234
I reproduced the syncnet training. What syncnet loss is low enough to move on to clip training?
Currently, mine is around 0.21-0.25.
So far, not a single custom video has worked for me.
If you have previously created the landmark *.csv with OpenFace, can you run inference on a custom video with the pretrain.pth model?
Or do you have to add the custom video to HDTF's training set beforehand in order to use it for inference?
Hi,
I installed OpenFace and tried these two commands on Ubuntu:
./build/bin/FeatureExtraction -f video.mp4
./build/bin/FaceLandmarkVid -f video.mp4
I tried both output CSVs. In both cases I ran the following command:
python inference.py --mouth_region_size=256 --source_video_path=/home/pc/video.mp4 --source_openface_landmark_path=/home/pc/output.csv --driving_audio_path=/home/pc/audio.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
Unfortunately, it is not working as expected. The lips are not moving.
The output video_synthetic_face.mp4 is blurry; there is no face. So I assume I am not running the correct feature extraction, or maybe there are parameters I don't know I need to pass.
What is the right command to extract face landmarks compatible with DINet?
Thank you!
Hi, thanks for the amazing work!
When I tried to unzip asserts.zip, it reported that the output_graph.pb file in the zip package was damaged. Could you please check the zip package and repair the corresponding file? Thank you so much!
Was the output_graph.pb model retrained by you? Was it trained on your own data based on this project, https://github.com/mozilla/DeepSpeech? Could you briefly describe the overall process?
If possible, can you help me find another way to get the same features, such as MediaPipe or dlib? I'm getting an error that a fast change of facial size can't be handled.
If you can release the version used to extract the feature CSV file,
that will help me a lot.
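As a sketch of the dlib route (my own assumption: DINet's loader wants per-frame x_0..x_67 / y_0..y_67 columns like OpenFace writes with -2Dfp; the exact column set and order may also matter, so this is untested):

```python
import csv
import cv2
import dlib

# Hypothetical dlib-based 68-point landmark extraction to an OpenFace-like CSV.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')
cap = cv2.VideoCapture('video.mp4')
with open('landmarks.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([f'x_{i}' for i in range(68)] + [f'y_{i}' for i in range(68)])
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        faces = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 1)
        shape = predictor(frame, faces[0])  # assumes exactly one face per frame
        writer.writerow([shape.part(i).x for i in range(68)] +
                        [shape.part(i).y for i in range(68)])
cap.release()
```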
I get this error when running on Google Colab. I can't install tensorflow-gpu==1.15.0 on Colab, because Colab doesn't support Python 3.7.
Are there any other ways to solve this?
Hello author, why did running with my own audio produce an error? I later replaced audio_sample_rate, audio = wavfile.read(audio_path) with audio, audio_sample_rate = librosa.load(audio_path); it then ran, but the mouth doesn't move.
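A possible explanation (my reading, not the author's): scipy's wavfile.read returns raw integer samples at the file's native rate, while librosa.load returns float32 in [-1, 1] resampled to 22050 Hz by default, so the downstream features may see near-zero audio. A hedged fix sketch:

```python
import numpy as np
import librosa

# Sketch: make librosa.load output resemble scipy.io.wavfile.read output.
# librosa returns float32 in [-1, 1] at 22050 Hz by default; wavfile.read
# returns raw int16 samples, which is presumably what the pipeline expects.
audio, audio_sample_rate = librosa.load(audio_path, sr=16000)  # 16 kHz is an assumption
audio = (audio * 32768.0).clip(-32768, 32767).astype(np.int16)
```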
It works perfectly on GPU; is it possible to run it on CPU? If yes, could you please add some examples?
Thanks!
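Untested, but the usual PyTorch recipe is to force map_location when loading the checkpoint and remove the .cuda() calls, for example:

```python
import torch

# Minimal CPU-loading sketch: force map_location so CUDA-saved weights
# deserialize on a CPU-only machine; the .cuda() calls in inference.py would
# also have to be removed (my assumption about what else is needed).
checkpoint = torch.load('./asserts/clip_training_DINet_256mouth.pth',
                        map_location=torch.device('cpu'))
```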
How was syncnet_256mouth.pth trained? I couldn't find the code for it; could the author provide it? Thanks.
Hi,
thanks for this amazing work. I have worked a bit on this project to remove the DeepSpeech dependency, besides some other optimization efforts. You can find the optimized version here:
https://github.com/Elsaam2y/DINet_optimized
This version improves inference latency by 50-60%. I can also open a PR in this repo if you are willing to accept external PRs.
Thanks
I'm training, but it's difficult to understand the loss convergence, so I keep failing to get good results. If you are training, how are you figuring it out?
May I ask if multi-stage training is necessary for DINet, or is it possible to train only the final stage to save training time? I understand that multi-stage training is primarily used to improve the initialization, so in theory it should be possible to train only the final stage.
Hi,
great project, thanks for sharing. I wanted to use OpenFace on Linux to extract the landmarks and create a CSV for inference on a new custom video.
I tried to figure out the parameters I need, and extracted them via the command line:
`build/bin/FaceLandmarkVidMulti -f mj2.mp4.m4v -2Dfp -tracked`
The CSV is written, and to me it looks like the other examples:
mj2.csv
When I execute inference:
python3 inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/short1.mp4 --source_openface_landmark_path=./asserts/examples/mj2.csv --driving_audio_path=./asserts/examples/mj_sound1.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
I received:
Traceback (most recent call last):
File "inference.py", line 54, in
video_landmark_data = load_landmark_openface(opt.source_openface_landmark_path).astype(np.int)
AttributeError: 'NoneType' object has no attribute 'astype'
Am I missing a parameter in the CSV or in the OpenFace command?
Thanks @MRzzm
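If it helps, a quick check (my assumption: the loader returns None when the expected 2D landmark columns are missing from the CSV):

```python
import pandas as pd

# Verify the OpenFace CSV has the x_0..x_67 / y_0..y_67 columns (written by -2Dfp).
cols = [c.strip() for c in pd.read_csv('mj2.csv', nrows=0).columns]
needed = [f'x_{i}' for i in range(68)] + [f'y_{i}' for i in range(68)]
missing = [c for c in needed if c not in cols]
print('missing landmark columns:', missing or 'none')
```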
Hello!
I am testing with my own video data; at inference time the number of video frames does not match the number of frames in the CSV generated by OpenFace. How can I solve this? Is there a problem with the video source? It happens with several different videos:
The video_landmark_data.shape: 249
aligning frames with driving audio
len_video_frames: 250
Traceback (most recent call last):
File "inference.py", line 67, in
raise ('video frames are misaligned with detected landmarks')
TypeError: exceptions must derive from BaseException
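Side note: the TypeError itself is a secondary bug. inference.py raises a plain string, which Python 3 rejects, hiding the real message. Raising a proper exception class would surface it:

```python
# Instead of: raise ('video frames are misaligned with detected landmarks')
raise ValueError('video frames are misaligned with detected landmarks')
```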
May I ask if there is any code that uses the GPU for inference?
If I want to fine-tune the mouth shape for a specific person, which layers are best to freeze? The generator or the discriminator?
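One common recipe (my sketch, not the author's recommendation) is to freeze most of the generator and fine-tune only a chosen submodule; `net_g` and the submodule name below are hypothetical:

```python
import torch.optim as optim

# Freeze everything, then unfreeze one part of the generator for fine-tuning.
for p in net_g.parameters():
    p.requires_grad = False
for p in net_g.out_conv.parameters():  # hypothetical submodule name
    p.requires_grad = True
optimizer_g = optim.Adam((p for p in net_g.parameters() if p.requires_grad), lr=1e-5)
```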
Hi, this is a very good project, thanks for making it open source. I would like to know what changes we need to make to generalize the clip network, as I can see it was trained on only some 400 videos.
Hi,
First of all, congratulations on this project. I really like it. It is not easy to find a good project like this one. (I spent a lot of hours looking for something like this: quality, easy to use, easy to train.)
Do you know of another technology similar to OpenFace? I know there are a lot of face landmark detectors, but I am asking because maybe you found a better solution than OpenFace in 2023.
Thank you,
David Martin
Running the code:
python inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/test4.mp4 --source_openface_landmark_path=./asserts/examples/test4.csv --driving_audio_path=./asserts/examples/driving_audio_4.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
this problem appears:
Traceback (most recent call last):
File "inference.py", line 88, in
raise ('our method can not handle videos with large change of facial size!!')
Running the code:
python inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/test24.mp4 --source_openface_landmark_path=./asserts/examples/test24.csv --driving_audio_path=./asserts/examples/driving_audio_2.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
produces:
loading facial landmarks from : ./asserts/examples/test24.csv
aligning frames with driving audio
Traceback (most recent call last):
File "inference.py", line 59, in
raise ('video frames are misaligned with detected landmarks')
TypeError: exceptions must derive from BaseException
How can this be solved?
extracting: input/888.zip
Traceback (most recent call last):
File "inference.py", line 33, in
raise ('wrong video path : {}'.format(opt.source_video_path))
TypeError: exceptions must derive from BaseException
Could it be because the crop size falls outside the image, since I had cropped it beforehand?
Will you release the syncnet training code as well?
The parameter settings of the loss function in the paper are different from those in the open-source code. Is it really necessary to set the sync loss weight so low (0.1)? Can it still be effective?
Is it possible to add a feature or change some code so we can resume training?
Currently, if training crashes or is stopped, we can't continue and have to retrain from the start of that step.
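A resume sketch that matches the checkpoint dict saved by the syncnet training script below ('epoch', 'state_dict' -> 'net', 'optimizer' -> 'net'); the path is a placeholder:

```python
import torch

# Restore model/optimizer state and continue from the saved epoch.
checkpoint = torch.load('netS_model_epoch_100.pth')  # placeholder path
net_lipsync.load_state_dict(checkpoint['state_dict']['net'])
optimizer_s.load_state_dict(checkpoint['optimizer']['net'])
opt.start_epoch = checkpoint['epoch']
```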
Does the code implement multi-GPU training?
from models.Syncnet import SyncNetPerception, SyncNet
from config.config import DINetTrainingOptions
from sync_batchnorm import convert_model
from torch.utils.data import DataLoader
from dataset.dataset_DINet_syncnet import DINetDataset
from utils.training_utils import get_scheduler, update_learning_rate, GANLoss
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import os
import torch.nn.functional as F

if __name__ == "__main__":
    # load config
    opt = DINetTrainingOptions().parse_args()
    random.seed(opt.seed)
    np.random.seed(opt.seed)
    torch.cuda.manual_seed(opt.seed)
    # init network
    net_lipsync = SyncNet(15, 29, 128).cuda()
    criterionMSE = nn.BCELoss().cuda()  # note: despite the name, this is a BCE loss
    # set label of syncnet perception loss
    real_tensor = torch.tensor(1.0).cuda()
    # setup optimizer
    # optimizer_s = optim.Adam(net_lipsync.parameters(), lr=opt.lr_g)
    optimizer_s = optim.Adamax(net_lipsync.parameters(), lr=opt.lr_g)
    # set scheduler
    net_s_scheduler = get_scheduler(optimizer_s, opt.non_decay, opt.decay)
    # load training data
    train_data = DINetDataset(opt.train_data, opt.augment_num, opt.mouth_region_size)
    training_data_loader = DataLoader(dataset=train_data, batch_size=opt.batch_size,
                                      shuffle=True, drop_last=True, num_workers=12)
    train_data_length = len(training_data_loader)
    # load testing data
    test_data = DINetDataset(opt.test_data, opt.augment_num, opt.mouth_region_size)
    test_data_loader = DataLoader(dataset=test_data, batch_size=1, shuffle=True,
                                  drop_last=True, num_workers=12)
    test_data_length = len(test_data_loader)
    min_loss = 100
    # start training
    for epoch in range(opt.start_epoch, opt.non_decay + opt.decay + 1):
        net_lipsync.train()
        for iteration, data in enumerate(training_data_loader):
            # forward
            optimizer_s.zero_grad()
            source_clip, deep_speech_full, y = data
            source_clip = torch.cat(torch.split(source_clip, 1, dim=1), 0).squeeze(1).float().cuda()
            source_clip = torch.cat(torch.split(source_clip, opt.batch_size, dim=0), 1).cuda()
            deep_speech_full = deep_speech_full.float().cuda()
            y = y.cuda()
            ## sync perception loss
            source_clip_mouth = source_clip[:, :, train_data.radius:train_data.radius + train_data.mouth_region_size,
                                train_data.radius_1_4:train_data.radius_1_4 + train_data.mouth_region_size]
            sync_score = net_lipsync(source_clip_mouth, deep_speech_full)
            loss_sync = criterionMSE(sync_score.unsqueeze(1), y)
            loss_sync.backward()
            optimizer_s.step()
            print(
                "===> Epoch[{}]({}/{}): Loss_Sync: {:.4f} lr_g = {:.7f} ".format(
                    epoch, iteration, len(training_data_loader), float(loss_sync),
                    optimizer_s.param_groups[0]['lr']))
        update_learning_rate(net_s_scheduler, optimizer_s)
        # checkpoint
        if epoch % opt.checkpoint == 0:
            if not os.path.exists(opt.result_path):
                os.makedirs(opt.result_path)
            model_out_path = os.path.join(opt.result_path, 'netS_model_epoch_{}.pth'.format(epoch))
            states = {
                'epoch': epoch + 1,
                'state_dict': {'net': net_lipsync.state_dict()},
                'optimizer': {'net': optimizer_s.state_dict()}
            }
            torch.save(states, model_out_path)
            print("Checkpoint saved to {}".format(model_out_path))
        if epoch % opt.stop_checkpoint == 0:
            break
Hello, I have a question. The dubbed image generated each time is a single frame, but the driving audio is the whole sequence. Which position in the audio does this single-frame dubbed image correspond to?
My training got this error when using one 4090 GPU:
train_DINet_frame64 ===> Epoch90: Loss_DI: 0.2199 Loss_GI: 0.3163 Loss_perception: 2.4996 Loss_g: 2.8160 lr_g = 0.0001000
Traceback (most recent call last):
File "train_DINet_frame.py", line 121, in
loss_g.backward()
File "/root/miniconda3/envs/dinet/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/miniconda3/envs/dinet/lib/python3.7/site-packages/torch/autograd/init.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: d.is_cuda() INTERNAL ASSERT FAILED at "../c10/cuda/impl/CUDAGuardImpl.h":30, please report a bug to PyTorch.
This error also randomly occurs in the forward pass.
It seems related to nn.DataParallel, even though I use a single GPU.
Any ideas on this would be much appreciated.
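If the frame training script wraps the networks in nn.DataParallel unconditionally, one workaround to try (an assumption, untested) is to guard the wrapper:

```python
import torch
import torch.nn as nn

# Only wrap in DataParallel when more than one GPU is visible.
if torch.cuda.device_count() > 1:
    net_g = nn.DataParallel(net_g)  # net_g as defined in train_DINet_frame.py
net_g = net_g.cuda()
```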