
Comments (12)

OValery16 commented on May 19, 2024

I am the original author of this tutorial, but the technique used is not mine; it was developed by other researchers.

Your question is very interesting and deserves a detailed explanation. Sorry (in advance) if I am a bit wordy.

  1. To make sure you understand it clearly: the initial face-extraction process (Training data generation.ipynb) outputs an actor's face (e.g. Daniel Craig) as a 256x256 image with 3 RGB channels. Most of the picture is the actor's face plus background and a bit of clothing (with the face in the center of the image). But for the face swapping, we only need the face itself, which is much smaller: 64x64x3. When the training script reads the dataset, we first apply a random transformation to the image and then extract the face itself. You can find this part of the code in training_data.py.

  2. An autoencoder tries to learn a function f(x) ≈ x, where x is your image (here, the size is arbitrarily chosen to be 64x64 with 3 RGB channels). For that, we have an encoder, whose goal is to encode your image into a smaller representation (here, we choose 8x8 with 512 channels), and a decoder, whose goal is to use this representation to get back the original image.

  3. Try to see it as using WinZip. Imagine you have a folder that you want to compress into a zip. First, WinZip parses your folder and finds patterns (some data is more useful than other data), then it takes advantage of this information to encode the data in fewer bits. Finally, you end up with a zip file containing the same information, just encoded differently. Later, if you give this file to a friend, they need to decode the zip file in order to read its content. We call that lossless compression (you don't lose information). An autoencoder, by contrast, does a kind of lossy compression (the data you get back at the end is an approximation of the initial data).

  4. The goal of an autoencoder is to find the function f(x) = decoder(encoder(x)) that best approximates x. For that (in contrast to WinZip or WinRAR), it takes advantage of the nature of x: (1) x is an image; (2) x shows the same face but with different angles, lighting conditions, and so on. (1) means that you will probably want to use convolution operations. (2) implies that you need to decompose the face into its atomic components, such as the shape of the nose, the shape of the smile, the wrinkles...

  5. You should see the output of the encoder as 512 different 8x8 images. Each of these images gives some insight about the nose shape, the ear shape, and so on. Together these features represent the person's face and will be used by the decoder to reconstruct the image.

  6. Now that you clearly understand what we are doing, let's come back to your initial question: how to efficiently train this algorithm on a bigger image size, and which parameters to change. I would say that the only way to know for sure is to try (actually this is the job of a data scientist, and it is way beyond the level of a beginner ^^). But if you are interested, I recommend you follow these steps:

  • Take your training data and split it into 3 parts: the training set, the validation set and the test set (80%/10%/10%). Very important: they must come from the same distribution.

  • Modify the encoder (IMAGE_SHAPE), and don't forget to modify the decoder accordingly (otherwise the output image won't have the right dimensions). Hint: add upscale layers. A sketch of the model shapes follows this list.

  • First, retrain the model from scratch with a bigger face image size and see what you get (keep in mind that bigger training inputs also imply a longer training time).

  • Analyze your results. Run your model on some images from the training set and the validation set. If the images produced from the training data look very bad, keep training until the loss stops going down. If the loss doesn't go down anymore but the result is still bad, it means your model cannot fit even the training set: it is not complex or deep enough. In that case, start increasing the number of neurons in the fully connected (dense) layer of the encoder (ENCODER_DIM), and retrain your model.

  • Once it performs well on the training set, run it on the validation set. If it performs poorly there, you have 2 options: (1) get more training data and restart training (generally speaking, a lot of data helps the model generalize well); (2) use a regularization technique, such as dropout.
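As a rough illustration of the shapes discussed above, here is a minimal Keras sketch of such an encoder/decoder pair. IMAGE_SHAPE and ENCODER_DIM mirror the variable names mentioned in this thread, but the exact layer stack (including the use of UpSampling2D for upscaling) is my own simplification, not the project's actual model.

```python
# Minimal sketch of the encoder/decoder shapes discussed above (illustrative,
# not the repository's exact model).
from keras.models import Model
from keras.layers import (Input, Conv2D, LeakyReLU, Flatten, Dense,
                          Reshape, UpSampling2D)

IMAGE_SHAPE = (64, 64, 3)   # cropped face fed to the network
ENCODER_DIM = 1024          # size of the fully connected bottleneck

def conv(filters):
    def block(x):
        x = Conv2D(filters, kernel_size=5, strides=2, padding='same')(x)
        return LeakyReLU(0.1)(x)
    return block

def upscale(filters):
    def block(x):
        x = UpSampling2D()(x)   # double the spatial size
        x = Conv2D(filters, kernel_size=3, padding='same')(x)
        return LeakyReLU(0.1)(x)
    return block

def build_encoder():
    inp = Input(shape=IMAGE_SHAPE)
    x = conv(128)(inp)                    # 32x32
    x = conv(256)(x)                      # 16x16
    x = conv(512)(x)                      # 8x8
    x = conv(1024)(x)                     # 4x4
    x = Dense(ENCODER_DIM)(Flatten()(x))  # compress to the bottleneck
    x = Dense(4 * 4 * 1024)(x)            # expand back to a conv volume
    x = Reshape((4, 4, 1024))(x)
    x = upscale(512)(x)                   # 8x8x512: the embedding discussed above
    return Model(inp, x)

def build_decoder():
    inp = Input(shape=(8, 8, 512))
    x = upscale(256)(inp)                 # 16x16
    x = upscale(128)(x)                   # 32x32
    x = upscale(64)(x)                    # 64x64
    out = Conv2D(3, kernel_size=5, padding='same', activation='sigmoid')(x)
    return Model(inp, out)
```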

Keep in mind that this process is long and requires a lot of tuning effort (even with a good GPU, it can take up to a month to explore the hyperparameter space). If you are interested in this problem, you should take a look at 'pix2pix' or 'cycle-gan' (a kind of Generative Adversarial Network). Autoencoders are a great tool for producing images that respect the probability distribution of the original ones. However, I believe that Generative Adversarial Networks (such as CycleGAN) remain a great way to train autoencoders, and they surpass the generation capability of a standard autoencoder.

If you like this project, feel free to leave a star. (It is my only reward ^^)


shaoanlu commented on May 19, 2024

Great write-up with cool results; I like the gifs!

Just want to add some comments about the images shown in the readme. Siraj, in his video, did not correctly interpret the flowchart I made.

  1. In the original deepfakes architecture, there is no mask segmentation. The only output is the reconstructed image.
  2. The interesting (and smart) parts of the deepfakes algorithm are the use of warped images as input and the shared-weights encoder. The encoder gets update information (backpropagated gradients) from both decoder_A and decoder_B. On the other hand, decoder_A is never trained to reconstruct face B (and vice versa for decoder_B). In other words, the encoder is able to encode both face A and face B into good embeddings, but decoder_A/decoder_B can only generate face A/face B respectively from a given embedding.
  3. To swap person B's face with person A's at test time, we feed face B into the encoder to get the embedding, and then feed the embedding into decoder_A (not decoder_B) to get an output that is a face-A look-alike. The reason such a method works is that autoencoder_A (i.e., encoder + decoder_A) treats face B as if it were a warped face A, so it "reconstructs" face B into a face-A look-alike. If we did not train the autoencoders using warped images, they might not be able to treat face B as if it were a warped face A. A sketch of this wiring follows below.
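To make these points concrete, here is a minimal sketch of that wiring, reusing the hypothetical build_encoder()/build_decoder() helpers and IMAGE_SHAPE from the sketch earlier in this thread. It is illustrative, not the repository's actual training code.

```python
# One shared encoder, two person-specific decoders (illustrative sketch).
from keras.models import Model
from keras.layers import Input

encoder   = build_encoder()
decoder_A = build_decoder()
decoder_B = build_decoder()

x = Input(shape=IMAGE_SHAPE)
autoencoder_A = Model(x, decoder_A(encoder(x)))   # only ever trained on face A
autoencoder_B = Model(x, decoder_B(encoder(x)))   # only ever trained on face B
autoencoder_A.compile(optimizer='adam', loss='mean_absolute_error')
autoencoder_B.compile(optimizer='adam', loss='mean_absolute_error')

# Training: warped crops as input, undistorted crops as targets. Gradients
# from both losses update the shared encoder; each decoder sees one identity.
#   autoencoder_A.train_on_batch(warped_A, target_A)
#   autoencoder_B.train_on_batch(warped_B, target_B)

# Swapping at test time: encode face B, decode with decoder_A, which
# "reconstructs" it as a face-A look-alike.
#   swapped = autoencoder_A.predict(face_B_batch)
```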


OValery16 commented on May 19, 2024

Thank you shaoanlu for your explanation. Can I reuse your image (the flowchart) and add your comment to the main page of this project? Your explanation would be very useful for beginners who want to understand how the deepfakes algorithm works.

If you like this project, feel free to leave a star. (It is my only reward ^^)


subzerofun commented on May 19, 2024

@OValery16 Completely missed your answer – damn GitHub notifications... A big thank you for your effort to explain everything – it really helps me to visualize what's going on inside the model!

Before seeing your answer, I had already done a lot of the steps you describe: changing the model and dataset, waiting for the loss to go down; changing it again, using different data, waiting forever for the loss values to reach an acceptable state. Rinse and repeat till my GPU starts to melt 😀.

So if I understood input_ = Input(shape=(8, 8, 512)) correctly, increasing the 512 would allow more representations/features to be stored? I played around with increasing it to 8, 8, 1024 but accidentally deleted the weight files, so I don't know if the loss values were much better than with the original model. I forgot to mention this was all for 128x128px input images.

I initially also tried changing the shape to 16, 16, 512 (and also added a layer to the decoder), and IIRC the quality of the output images was much better: sharper edges and more detail in general. But this model needed a lot of GPU memory, and the training also took quite some time. Using a GTX 1080 Ti with 11 GB fortunately allows me to play around with unnecessarily large models :-). I think I will return to the 16, 16, 512 (maybe 1024) shape and look at the output more carefully.

I will have to get better at documenting which model structure produces good results. Do you develop an intuition over time for how to increase the efficiency of your model? For a beginner it all seems so overwhelming – at first it looks simple, written down as code – but as soon as you want to know what is going on in each layer, it gets hard to understand quite fast...

I was looking into some tools to help visualize layer features, but it's not that easy to implement (at least for me). My idea was to develop a way to show activations and features changing while I train the model. I know there are some projects on GitHub doing this, like https://github.com/philipperemy/keras-visualize-activations, but I think I need a lot more time and experience to apply that to the faceswap autoencoder model.

Hopefully someday a tool like this: https://www.youtube.com/watch?v=N9q9qacAKoM will help people better understand the inner workings of neural nets, so they can adapt a model to a specific task faster without waiting hours for training to finish.

And thanks for the suggestion of CycleGAN and pix2pix (awesome what they can produce!) – hopefully someday I'll be able to apply ideas like that to a project like faceswap/deepface. I can already see the limitations of the standard autoencoder... simply increasing layer dimensions is easy to do, but most definitely not the best way to get better output images.

Thanks again for your explanation, and I hope you can answer whether the 16, 16, 512 shape would even be a good idea to try. And if 8, 8, 512 is efficient enough, could you please elaborate why?


OValery16 commented on May 19, 2024

@subzerofun Your question is again very interesting. Let's talk a bit about how knowledge is represented. Sorry (in advance) if I am a bit wordy.

When you are working with deep learning, it is very important to understand that the machine learns the key concepts by itself (a machine, like a person, has its own way of learning and representing its knowledge). I will take a concrete example (I hope it will be clearer for you after that). Currently I live in Taiwan, but I am French. I can tell that Taiwanese people, who belong to another culture, learn and represent their knowledge in a different way than French people do. Neither of these representations is better than the other. They're just representations; they serve to store accumulated knowledge. To convince yourself: in French, the sentence "j'aime les fleurs" (I like flowers) becomes 我喜歡花. This representation is more compact because it makes sense for Chinese speakers to store the knowledge that way. Do you see my point?

For a machine, it is the same. Deep learning belongs to a class of machine learning algorithms called "representation learning" (https://en.wikipedia.org/wiki/Feature_learning), where the machine tries to learn the most efficient representation to fulfill a specific task. This representation may not always make sense to us humans. It is optimized for a specific purpose, such as extracting the features of a specific face. In contrast, human representations are "optimized" to be interpretable by humans.

To come back to your question, I will try to give an intuition. The input size you choose for your decoder (the output size of the encoder) needs to be compact enough for the machine to easily retrieve the information necessary to reconstruct the original images (little or no redundancy). If you increase the size, two things happen: (1) your training time becomes much longer (and it may not lead to an accuracy gain); (2) your decoder may have a hard time retrieving the important information. My advice is (as in my previous comment): always start small, and if you can't fit the training set, try to increase the size or go deeper. But always keep in mind that the machine's representation may not be interpretable, and in most cases you only have an intuition about why one approach is better than another.
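To make the size trade-off concrete, here is some back-of-the-envelope arithmetic (my own illustration, assuming a model shaped like the sketch earlier in this thread: four stride-2 convolutions followed by a dense bottleneck, biases ignored):

```python
# Rough weight count for the dense bottleneck as the face size grows
# (illustrative arithmetic only, biases ignored).
ENCODER_DIM = 1024

def bottleneck_weights(face_size, channels=1024):
    side = face_size // 16            # four stride-2 convs halve the size 4 times
    flat = side * side * channels     # flattened conv output
    return flat * ENCODER_DIM * 2     # Dense down to ENCODER_DIM and back up

for size in (64, 128, 256):
    print(size, f"{bottleneck_weights(size):,}")
# 64 -> 33,554,432   128 -> 134,217,728   256 -> 536,870,912
# The bottleneck grows with the square of the face size, which is why
# bigger inputs cost so much more memory and training time.
```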

Researchers who invent new network architectures, such as AlexNet, Inception, and so on, often try many models until they find the right one that converges. The main indicator remains the loss function, which you can see as a compass for exploring the large hyperparameter space.

Another tool you can play with is the amount of training data (as mentioned by @shaoanlu). The face-swap algorithm generates extra data by distorting the input image.
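For intuition, a simplified version of that distortion step might look like the following. The project's training_data.py does something along these lines; the exact ranges below are illustrative, and the additional grid warp it applies is not reproduced here.

```python
# Simplified random distortion for data augmentation: each epoch the network
# sees a slightly different rotation/zoom/shift/flip of the same face.
import numpy as np
import cv2

def random_transform(image, rot=10, zoom=0.05, shift=0.05, flip_prob=0.5):
    h, w = image.shape[:2]
    angle = np.random.uniform(-rot, rot)
    scale = np.random.uniform(1 - zoom, 1 + zoom)
    # affine matrix: rotate + zoom around the image center...
    m = cv2.getRotationMatrix2D((w // 2, h // 2), angle, scale)
    # ...plus a small random translation
    m[:, 2] += np.random.uniform(-shift, shift, 2) * (w, h)
    warped = cv2.warpAffine(image, m, (w, h), borderMode=cv2.BORDER_REPLICATE)
    if np.random.random() < flip_prob:
        warped = warped[:, ::-1]   # random horizontal flip
    return warped
```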

I hope my answer makes sense. If you have any other questions, feel free to contact me.


ebillerey commented on May 19, 2024

Hi, all my converted pictures seem to have a stretched, oriented 64x64 rectangle replacing the face (and sometimes it is a little misplaced/stretched). So I did some research and testing and ran into this post. Thanks for the explanations; they are helpful for understanding more of this code (I have a lot more to understand :p). I will do some more tests, but I have a question: if I use a different Model class for conversion than the one used for training, can I hardcode a better function in this model to output the mask, or will the future mask always look like a stretched 64x64 mask because of the format of the encoded, stored data?


OValery16 commented on May 19, 2024

The difficult part with deep learning is that you need to define the right loss function to force your model to converge to the solution you are looking for. Finding the optimal loss function is an open problem (the research community has published a lot of papers about it).

For this GitHub project, we train 2 autoencoders (A and B), and each of them uses the mean absolute error loss function. The advantage of mean absolute error is that it is easy to compute, and it converges well in most cases. For a discriminative model, this is not a problem, because the loss compares an output label against the true label (the goal is to converge to a solution that outputs the right label with good confidence). For a generative model, however, the situation is a bit different. The mean absolute error loss compares each pixel against the ground-truth pixel; it doesn't consider the realism of the overall picture. The consequence is that an image can minimize the loss function without being realistic from a human point of view. The difference is very important. Ian Goodfellow proposed to solve this problem with a model that can "learn" a good loss function. This technique is called a "Generative Adversarial Network". The drawback of this category of networks is that they tend to be much harder to train.
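A tiny toy example of that pixel-wise blindness (my own illustration, not project code): two predictions with the same total error budget get exactly the same mean absolute error, even though a human would judge them very differently.

```python
# MAE only averages per-pixel differences, so it cannot tell a mild global
# error apart from the same error budget concentrated on one region
# (e.g. right on an eye). Toy illustration only.
import numpy as np

def mae(pred, target):
    return float(np.mean(np.abs(pred - target)))

rng = np.random.default_rng(0)
target = rng.random((64, 64, 3))

err = 0.02
spread = target + err                            # +0.02 everywhere: mild noise
block = target.copy()
block[:16, :16] += err * (64 * 64) / (16 * 16)   # same total error, one corner

print(mae(spread, target))   # 0.02
print(mae(block, target))    # 0.02 -- identical loss, very different realism
```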

If you want to really improve your model, you should take a look at these methods.

I hope my answer makes sense. If you have any other questions, feel free to contact me.

If you like this project, feel free to leave a star. (It is my only reward ^^)


ebillerey commented on May 19, 2024

Hi,

My bad, an example will be much clearer.

  • After training, in a picture with a well-oriented 64x64 face, the face is replaced by a 64x64 generated mask.
  • After training, in a picture with a well-oriented 1024x1024 face, the face is replaced by a 64x64 generated mask that is stretched (so there are 16x16 squares of the same color).

The 64x64 format is a byproduct of the encode/decode process (the 256x256 training data is reduced to 64x64 during training), and that's why the preview is a grid of 64x64 pictures.
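That stretching is easy to reproduce: upscaling a 64x64 output onto a 1024x1024 face region turns each source pixel into a 16x16 block (1024 / 64 = 16). A toy illustration, not project code (nearest-neighbour interpolation used for clarity):

```python
# Why the stretched mask shows 16x16 flat squares.
import numpy as np
import cv2

mask = np.random.random((64, 64, 3)).astype(np.float32)  # stand-in 64x64 output
big = cv2.resize(mask, (1024, 1024), interpolation=cv2.INTER_NEAREST)
# every original pixel is now a 16x16 square of a single color,
# which is exactly the blockiness described above
```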

The results look good after some training. So for me it works, but it is limited by the always-stretched 64x64 mask when converting a picture's face.

So, after my previous question, I did some tests using a model I had already trained and a new Model/Convert class. As a first step, I tried to use the model to get a 128x128 output mask, but I always get an error about the shape. So I guess I am missing something.

My question remains: will the trained model always output a stretched 64x64 mask no matter what I try with modified code? Or do I need to redo the training process with code that doesn't reduce faces to 64x64 but to, say, 128x128, so as to get a 128x128 mask in the future?

Thanks


OValery16 commented on May 19, 2024

"My question remain, the trained model will always output 64x64 stretched mask how mather i try with modified code ?"

If you modify the code, you can change the model to scale up (see the sketch after the list below). The question is: should you do it? I advise you to consider 3 aspects.

  1. The training data

How much data do you have to train your model?

If you scale up the model, keep in mind that you'll need more data to train it. This face-swapping algorithm is "simple" enough to converge with a small amount of data; if you scale up the model, you'll need more.

  2. The computing power

Do you have a GPU to train your model?

If you scale up, you'll need more computing power in order to train the additional layers. Additionally, adding extra convolutional layers requires significantly more GPU memory.

  3. The complexity

Scaling up often implies a more complex model with a different loss function.
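To make the "modify the code to scale up" option concrete, here is a sketch of an extended decoder, reusing the hypothetical upscale() helper from the sketch earlier in this thread. It is illustrative only: the training pipeline would also have to produce 128x128 target crops, and the whole model must be retrained from scratch (a model trained end-to-end at 64x64 cannot simply be asked for 128x128 output, which is why the shape errors appear).

```python
# Sketch of a decoder extended to 128x128 output: one extra upscale block
# doubles the final resolution (illustrative, not the repository's code).
from keras.models import Model
from keras.layers import Input, Conv2D

def build_decoder_128():
    inp = Input(shape=(8, 8, 512))
    x = upscale(256)(inp)    # 16x16
    x = upscale(128)(x)      # 32x32
    x = upscale(64)(x)       # 64x64
    x = upscale(32)(x)       # 128x128 -- the added layer
    out = Conv2D(3, kernel_size=5, padding='same', activation='sigmoid')(x)
    return Model(inp, out)
```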


ebillerey commented on May 19, 2024

Thanks. So to get better output, I need a model with more layers, in order to have input/output at a better resolution.

In that case, it will really take more time, but this is not just a proof of concept; I want a usable piece of work as a side project to improve my knowledge. I will probably try to make improvements after getting an intermediate model (I think I'll make a model for 256x256 output to work on the other functionality and see exactly the effect of each change).

Maybe 256x256 will be enough after further research and improvement. If not, I will try 512x512 (probably weeks of training and a small batch size :s).

A lot of work for a family-gift idea :-)

Training data is hard to gather, but I actually managed to get 10k images for the target and fewer for the other person. I think it's probably enough to begin with.

Question: is there a way to use (GPU + GPU memory) together with (RAM or virtual memory)? This would be a useful feature if someone has a clue how to do it. Actually, I think the GPU's computing power is not fully used when GPU memory is full.


EXJUSTICE commented on May 19, 2024

Hi ebillerey, what kind of results are you observing? I'm seeing a lot of blur around the eyes, especially without a GAN component.


CaffreyR commented on May 19, 2024

Hi, did you successfully run train.ipynb? @subzerofun I encountered this problem! Did you? Thanks!
[screenshot of the error]

