mrzzm / DINet
The source code of "DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video."
Hello,
many thanks for this amazing work. I was just wondering if there was any specific reason for using this version of DeepSpeech?
I was thinking of retraining the model with different audio processing, for example a newer version of DeepSpeech or mel-spectrograms. However, I want to know whether you have already tried this and whether there are any objections.
Thanks
"Hello, may I ask if anyone has encountered issues with the pre-trained syncnet provided by dinet author or if it is extremely sensitive to the dataset? I trained it on my own downloaded hdtf dataset and found that the syncloss kept oscillating on the ground truth data."
In wav2lip, these two modules directly output a single number as the result, whereas in DINet the output is a feature map shaped roughly like (1,1,2,2), which is compared against an expanded all-ones matrix when computing the loss. Although this is probably equivalent in essence to adding an avg-pooling layer, it still feels a bit awkward to me.
As the title suggests, I would like to get rid of the TF dependency and try to convert the full model to ONNX. DeepSpeech is the first challenge. It is released as a black box and doesn't seem consistent with the latest official DeepSpeech releases.
Where does it come from exactly?
I want to speed up inference. I thought generating frames batch by batch could speed it up. How can I implement this?
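One generic approach (a minimal sketch under my own assumptions; `net_g`, `per_frame_inputs`, and the generator signature are placeholders, not DINet's actual API) is to stack per-frame inputs and call the generator once per batch:

```python
import torch

# Hypothetical batched inference loop: collect per-frame tensors and run the
# generator once per batch instead of once per frame.
batch_frames, batch_audio, outputs = [], [], []
with torch.no_grad():
    for frame_t, audio_t in per_frame_inputs:  # placeholder iterable
        batch_frames.append(frame_t)
        batch_audio.append(audio_t)
        if len(batch_frames) == 16:  # arbitrary batch size
            outputs.append(net_g(torch.stack(batch_frames).cuda(),
                                 torch.stack(batch_audio).cuda()).cpu())
            batch_frames, batch_audio = [], []
    if batch_frames:  # flush the remainder
        outputs.append(net_g(torch.stack(batch_frames).cuda(),
                             torch.stack(batch_audio).cuda()).cpu())
```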
I got this error: 'our method can not handle videos with large change of facial size!!'
However, my video does not have a huge change in facial size. Could the OpenFace landmarks be at fault? I debugged the error; it reaches the stage below and fails the inference:
elif max(radius_clip) > min(radius_clip) * 1.5:
return False, None
I also checked that OpenFace has multiple landmark settings:
CLM CLNF CECLM
Do I have to choose one of these?
I also attached my csv file.
new_light_English.csv
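For anyone hitting this, here is a rough diagnostic sketch. It assumes the CSV contains OpenFace's 2D landmark columns x_0..x_67 / y_0..y_67 (written with -2Dfp), and it only loosely mirrors the 1.5x face-size ratio test in inference.py:

```python
import pandas as pd

# Rough diagnostic, assuming OpenFace 2D landmark columns (x_0..x_67, y_0..y_67).
df = pd.read_csv('new_light_English.csv')
df.columns = [c.strip() for c in df.columns]  # OpenFace pads headers with spaces
xs = df[[f'x_{i}' for i in range(68)]]
ys = df[[f'y_{i}' for i in range(68)]]
# Per-frame face-size proxy: width + height of the landmark bounding box.
size = (xs.max(axis=1) - xs.min(axis=1)) + (ys.max(axis=1) - ys.min(axis=1))
print('max/min face-size ratio:', float(size.max() / size.min()))  # > 1.5 would trip the check
```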
Hello, thanks for your nice work! My GPU (2080 Ti) has 12 GB of memory. When I run inference.py, it runs out of GPU memory. How much GPU memory do you use?
There is a black box around the mouth! 😂
No matter what I try, I can't fix it. It is especially noticeable on women with fair skin!
I generated videos using driving_audio_x.wav from the official ./asserts/examples and they work fine; the mouth movements are smooth. However, when I synthesize with audio generated by TTS models such as bark or tortoise, the mouth in the generated talking head barely moves. Notably, when I load and re-save driving_audio_x.wav through torchaudio, its bit rate doubles, and after synthesizing a video with it the mouth barely moves either. Does anyone have any ideas?
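One thing worth trying (my guess, not confirmed by the authors): normalize the TTS output to 16 kHz mono before feeding it to the DeepSpeech feature extractor, for example:

```python
import librosa
import soundfile as sf

# Resample a TTS wav to 16 kHz mono before inference; 'tts_output.wav' is a
# placeholder name, and 16 kHz is my assumption about what DeepSpeech expects.
audio, sr = librosa.load('tts_output.wav', sr=16000, mono=True)
sf.write('tts_output_16k.wav', audio, 16000)
```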
I see speech-to-text is not mentioned in the paper. Is it just used for training, or for inference too? And what purpose does it serve?
Thanks to the author for this excellent work. In practice, I found that lip-sync quality for Chinese speech is relatively poor. Could you consider upgrading the speech model to the latest DeepSpeech version or another version that supports Chinese? Thank you. My WeChat: 13718542435; I hope to discuss with the author.
Where can I find the supplementary materials from the paper?
Amazing! Thanks for your contribution.
To simplify the network structure, can we:
First Colab release on the whole web, one-click run script:
Baidu Pan link: https://pan.baidu.com/s/13DbElzZjAigtkwsGaHVfgA?pwd=1234
Extraction code: 1234
I reproduced the syncnet training. What syncnet loss is low enough to move on to clip training?
Currently, mine is around 0.21-0.25.
So far, not a single custom video has worked for me.
If you have previously created the landmark *.csv with OpenFace, can you run inference on a custom video with the pretrain.pth model?
Or do you have to add the custom video to HDTF's training set beforehand in order to use it for inference?
Hi,
I installed OpenFace and tried these two commands on Ubuntu:
./build/bin/FeatureExtraction -f video.mp4
./build/bin/FaceLandmarkVid -f video.mp4
I tried both output CSVs. In both cases I ran the following command:
python inference.py --mouth_region_size=256 --source_video_path=/home/pc/video.mp4 --source_openface_landmark_path=/home/pc/output.csv --driving_audio_path=/home/pc/audio.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
Unfortunately, it is not working as expected. The lips are not moving.
The output video_synthetic_face.mp4 is blurry; there is no face. So I assume I am not running the correct feature extraction, or maybe there are parameters I don't know I need to pass.
What is the right command to extract face landmarks compatible with DINet?
Thank you!
Hi, thanks for the amazing work!
When I tried to unzip asserts.zip, it reported that the output_graph.pb file in the zip package was damaged. Could you please check the zip package and repair the corresponding file? Thank you so much!
Was the output_graph.pb model retrained by you? Was it trained on your own data based on this project, https://github.com/mozilla/DeepSpeech? Could you briefly describe the overall process?
If possible, can you help me find another way to get the same features, such as MediaPipe or dlib? I'm getting an error that a fast change of facial size can't be handled.
If you can release the version used to extract the feature CSV file,
that will help me a lot.
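As a sketch of the dlib route (my own assumption: DINet's loader wants per-frame x_0..x_67 / y_0..y_67 columns like OpenFace writes with -2Dfp; the exact column set and order may also matter, so this is untested):

```python
import csv
import cv2
import dlib

# Hypothetical dlib-based 68-point landmark extraction to an OpenFace-like CSV.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')
cap = cv2.VideoCapture('video.mp4')
with open('landmarks.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([f'x_{i}' for i in range(68)] + [f'y_{i}' for i in range(68)])
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        faces = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 1)
        shape = predictor(frame, faces[0])  # assumes exactly one face per frame
        writer.writerow([shape.part(i).x for i in range(68)] +
                        [shape.part(i).y for i in range(68)])
cap.release()
```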
I get this error when running on Google Colab. I can't install tensorflow-gpu==1.15.0 on Colab, because Colab doesn't support Python 3.7.
Are there any other ways to solve this?
Hello author, why did running with my own audio produce an error? I later replaced audio_sample_rate, audio = wavfile.read(audio_path) with audio, audio_sample_rate = librosa.load(audio_path); it then ran, but the mouth doesn't move.
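A possible explanation (my reading, not the author's): scipy's wavfile.read returns raw integer samples at the file's native rate, while librosa.load returns float32 in [-1, 1] resampled to 22050 Hz by default, so the downstream features may see near-zero audio. A hedged fix sketch:

```python
import numpy as np
import librosa

# Sketch: make librosa.load output resemble scipy.io.wavfile.read output.
# librosa returns float32 in [-1, 1] at 22050 Hz by default; wavfile.read
# returns raw int16 samples, which is presumably what the pipeline expects.
audio, audio_sample_rate = librosa.load(audio_path, sr=16000)  # 16 kHz is an assumption
audio = (audio * 32768.0).clip(-32768, 32767).astype(np.int16)
```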
It works perfectly on GPU; is it possible to run it on CPU? If yes, could you please add some examples?
Thanks!
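Untested, but the usual PyTorch recipe is to force map_location when loading the checkpoint and remove the .cuda() calls, for example:

```python
import torch

# Minimal CPU-loading sketch: force map_location so CUDA-saved weights
# deserialize on a CPU-only machine; the .cuda() calls in inference.py would
# also have to be removed (my assumption about what else is needed).
checkpoint = torch.load('./asserts/clip_training_DINet_256mouth.pth',
                        map_location=torch.device('cpu'))
```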
How was syncnet_256mouth.pth trained? I couldn't find the code for it; could the author provide it? Thanks.
Hi,
thanks for this amazing work. I have worked a bit on this project to remove the DeepSpeech dependency, besides some other optimization efforts. You can find the optimized version here:
https://github.com/Elsaam2y/DINet_optimized
This version improves inference latency by 50-60%. I can also open a PR in this repo if you are willing to accept external PRs.
Thanks
I'm training, but it's difficult to understand the loss convergence, so I keep failing to get good results. If you are training, how are you figuring it out?
May I ask if multi-stage training is necessary for DINet, or is it possible to train only the final stage to save training time? I understand that multi-stage training is primarily used to improve the initialization, so in theory it should be possible to train only the final stage.
Hi,
great project, thanks for sharing. I wanted to use OpenFace on Linux to extract the landmarks and create a CSV for inference on a new custom video.
I tried to figure out the parameters I need, and extracted them via the command line:
`build/bin/FaceLandmarkVidMulti -f mj2.mp4.m4v -2Dfp -tracked`
The CSV is written, and to me it looks like the other examples:
mj2.csv
When I execute inference:
python3 inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/short1.mp4 --source_openface_landmark_path=./asserts/examples/mj2.csv --driving_audio_path=./asserts/examples/mj_sound1.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
I received:
Traceback (most recent call last):
File "inference.py", line 54, in
video_landmark_data = load_landmark_openface(opt.source_openface_landmark_path).astype(np.int)
AttributeError: 'NoneType' object has no attribute 'astype'
Am I missing a parameter in the CSV or in the OpenFace command?
Thanks @MRzzm
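If it helps, a quick check (my assumption: the loader returns None when the expected 2D landmark columns are missing from the CSV):

```python
import pandas as pd

# Verify the OpenFace CSV has the x_0..x_67 / y_0..y_67 columns (written by -2Dfp).
cols = [c.strip() for c in pd.read_csv('mj2.csv', nrows=0).columns]
needed = [f'x_{i}' for i in range(68)] + [f'y_{i}' for i in range(68)]
missing = [c for c in needed if c not in cols]
print('missing landmark columns:', missing or 'none')
```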
Hello!
I am testing with my own video data; at inference time the number of video frames does not match the number of frames in the CSV generated by OpenFace. How can I solve this? Is there a problem with the video source? It happens with several different videos:
The video_landmark_data.shape: 249
aligning frames with driving audio
len_video_frames: 250
Traceback (most recent call last):
File "inference.py", line 67, in
raise ('video frames are misaligned with detected landmarks')
TypeError: exceptions must derive from BaseException
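Side note: the TypeError itself is a secondary bug. inference.py raises a plain string, which Python 3 rejects, hiding the real message. Raising a proper exception class would surface it:

```python
# Instead of: raise ('video frames are misaligned with detected landmarks')
raise ValueError('video frames are misaligned with detected landmarks')
```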
May I ask if there is any code that uses the GPU for inference?
If I want to fine-tune the mouth shape for a specific person, which layers are best to freeze? The generator or the discriminator?
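One common recipe (my sketch, not the author's recommendation) is to freeze most of the generator and fine-tune only a chosen submodule; `net_g` and the submodule name below are hypothetical:

```python
import torch.optim as optim

# Freeze everything, then unfreeze one part of the generator for fine-tuning.
for p in net_g.parameters():
    p.requires_grad = False
for p in net_g.out_conv.parameters():  # hypothetical submodule name
    p.requires_grad = True
optimizer_g = optim.Adam((p for p in net_g.parameters() if p.requires_grad), lr=1e-5)
```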
Hi, this is a very good project, thanks for making it open source. I would like to know what changes we need to make to generalize the clip network, as I can see it was trained on only some 400 videos.
Hi,
First of all, congratulations on this project. I really like it. It is not easy to find a good project like this one. (I spent a lot of hours looking for something like this: quality, easy to use, easy to train.)
Do you know of another technology similar to OpenFace? I know there are a lot of face landmark detectors, but I am asking because maybe you found a better solution than OpenFace in 2023.
Thank you,
David Martin
Running the code:
python inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/test4.mp4 --source_openface_landmark_path=./asserts/examples/test4.csv --driving_audio_path=./asserts/examples/driving_audio_4.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
this problem appears:
Traceback (most recent call last):
File "inference.py", line 88, in
raise ('our method can not handle videos with large change of facial size!!')
Running the code:
python inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/test24.mp4 --source_openface_landmark_path=./asserts/examples/test24.csv --driving_audio_path=./asserts/examples/driving_audio_2.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
produces:
loading facial landmarks from : ./asserts/examples/test24.csv
aligning frames with driving audio
Traceback (most recent call last):
File "inference.py", line 59, in
raise ('video frames are misaligned with detected landmarks')
TypeError: exceptions must derive from BaseException
How can this be solved?
extracting: input/888.zip
Traceback (most recent call last):
File "inference.py", line 33, in
raise ('wrong video path : {}'.format(opt.source_video_path))
TypeError: exceptions must derive from BaseException
Could it be because the crop size falls outside the image, since I had cropped it beforehand?
Will you release the syncnet training code as well?
The parameter settings of the loss function in the paper are different from those in the open-source code. Is it really necessary to set the sync loss weight so low (0.1)? Can it still be effective?
Is it possible to add a feature or change some code so we can resume training?
Currently, if training crashes or is stopped, we can't continue and have to retrain from the start of that step.
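A resume sketch that matches the checkpoint dict saved by the syncnet training script below ('epoch', 'state_dict' -> 'net', 'optimizer' -> 'net'); the path is a placeholder:

```python
import torch

# Restore model/optimizer state and continue from the saved epoch.
checkpoint = torch.load('netS_model_epoch_100.pth')  # placeholder path
net_lipsync.load_state_dict(checkpoint['state_dict']['net'])
optimizer_s.load_state_dict(checkpoint['optimizer']['net'])
opt.start_epoch = checkpoint['epoch']
```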
Does the code implement multi-GPU training?
from models.Syncnet import SyncNetPerception, SyncNet
from config.config import DINetTrainingOptions
from sync_batchnorm import convert_model
from torch.utils.data import DataLoader
from dataset.dataset_DINet_syncnet import DINetDataset
from utils.training_utils import get_scheduler, update_learning_rate, GANLoss
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import os
import torch.nn.functional as F

if __name__ == "__main__":
    # load config
    opt = DINetTrainingOptions().parse_args()
    random.seed(opt.seed)
    np.random.seed(opt.seed)
    torch.cuda.manual_seed(opt.seed)
    # init network
    net_lipsync = SyncNet(15, 29, 128).cuda()
    criterionMSE = nn.BCELoss().cuda()  # note: despite the name, this is a BCE loss
    # set label of syncnet perception loss
    real_tensor = torch.tensor(1.0).cuda()
    # setup optimizer
    # optimizer_s = optim.Adam(net_lipsync.parameters(), lr=opt.lr_g)
    optimizer_s = optim.Adamax(net_lipsync.parameters(), lr=opt.lr_g)
    # set scheduler
    net_s_scheduler = get_scheduler(optimizer_s, opt.non_decay, opt.decay)
    # load training data
    train_data = DINetDataset(opt.train_data, opt.augment_num, opt.mouth_region_size)
    training_data_loader = DataLoader(dataset=train_data, batch_size=opt.batch_size,
                                      shuffle=True, drop_last=True, num_workers=12)
    train_data_length = len(training_data_loader)
    # load testing data
    test_data = DINetDataset(opt.test_data, opt.augment_num, opt.mouth_region_size)
    test_data_loader = DataLoader(dataset=test_data, batch_size=1, shuffle=True,
                                  drop_last=True, num_workers=12)
    test_data_length = len(test_data_loader)
    min_loss = 100
    # start training
    for epoch in range(opt.start_epoch, opt.non_decay + opt.decay + 1):
        net_lipsync.train()
        for iteration, data in enumerate(training_data_loader):
            # forward
            optimizer_s.zero_grad()
            source_clip, deep_speech_full, y = data
            source_clip = torch.cat(torch.split(source_clip, 1, dim=1), 0).squeeze(1).float().cuda()
            source_clip = torch.cat(torch.split(source_clip, opt.batch_size, dim=0), 1).cuda()
            deep_speech_full = deep_speech_full.float().cuda()
            y = y.cuda()
            ## sync perception loss
            source_clip_mouth = source_clip[:, :, train_data.radius:train_data.radius + train_data.mouth_region_size,
                                train_data.radius_1_4:train_data.radius_1_4 + train_data.mouth_region_size]
            sync_score = net_lipsync(source_clip_mouth, deep_speech_full)
            loss_sync = criterionMSE(sync_score.unsqueeze(1), y)
            loss_sync.backward()
            optimizer_s.step()
            print(
                "===> Epoch[{}]({}/{}): Loss_Sync: {:.4f} lr_g = {:.7f} ".format(
                    epoch, iteration, len(training_data_loader), float(loss_sync),
                    optimizer_s.param_groups[0]['lr']))
        update_learning_rate(net_s_scheduler, optimizer_s)
        # checkpoint
        if epoch % opt.checkpoint == 0:
            if not os.path.exists(opt.result_path):
                os.makedirs(opt.result_path)
            model_out_path = os.path.join(opt.result_path, 'netS_model_epoch_{}.pth'.format(epoch))
            states = {
                'epoch': epoch + 1,
                'state_dict': {'net': net_lipsync.state_dict()},
                'optimizer': {'net': optimizer_s.state_dict()}
            }
            torch.save(states, model_out_path)
            print("Checkpoint saved to {}".format(model_out_path))
        if epoch % opt.stop_checkpoint == 0:
            break
Hello, I have a question. The dubbed image generated each time is a single frame, but the driving audio is the whole sequence. Which position in the audio does this single-frame dubbed image correspond to?
My training got this error when using one 4090 GPU:
train_DINet_frame64 ===> Epoch90: Loss_DI: 0.2199 Loss_GI: 0.3163 Loss_perception: 2.4996 Loss_g: 2.8160 lr_g = 0.0001000
Traceback (most recent call last):
File "train_DINet_frame.py", line 121, in
loss_g.backward()
File "/root/miniconda3/envs/dinet/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/miniconda3/envs/dinet/lib/python3.7/site-packages/torch/autograd/init.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: d.is_cuda() INTERNAL ASSERT FAILED at "../c10/cuda/impl/CUDAGuardImpl.h":30, please report a bug to PyTorch.
This error also randomly occurs in the forward pass.
It seems related to nn.DataParallel, even though I use a single GPU.
Any ideas on this would be much appreciated.
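If the frame training script wraps the networks in nn.DataParallel unconditionally, one workaround to try (an assumption, untested) is to guard the wrapper:

```python
import torch
import torch.nn as nn

# Only wrap in DataParallel when more than one GPU is visible.
if torch.cuda.device_count() > 1:
    net_g = nn.DataParallel(net_g)  # net_g as defined in train_DINet_frame.py
net_g = net_g.cuda()
```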