[CVPR 2022] StyleSwin: Transformer-based GAN for High-resolution Image Generation

Home Page: https://arxiv.org/abs/2112.10762

License: MIT License

Python 86.20% C++ 1.62% Cuda 12.18%

computer-vision deep-learning deep-neural-networks pytorch generative-adversarial-network gans image-generation image-synthesis styleswin transformer

styleswin's Introduction

StyleSwin

This repo is the official implementation of "StyleSwin: Transformer-based GAN for High-resolution Image Generation" (CVPR 2022).

By Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang and Baining Guo.

Abstract

Despite the tantalizing success in a broad range of vision tasks, transformers have not yet demonstrated ability on par with ConvNets in high-resolution image generative modeling. In this paper, we seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis. To this end, we believe that local attention is crucial to strike the balance between computational efficiency and modeling capacity. Hence, the proposed generator adopts Swin transformer in a style-based architecture. To achieve a larger receptive field, we propose double attention which simultaneously leverages the context of the local and the shifted windows, leading to improved generation quality. Moreover, we show that offering the knowledge of the absolute position that has been lost in window-based transformers greatly benefits the generation quality. The proposed StyleSwin is scalable to high resolutions, with both the coarse geometry and fine structures benefiting from the strong expressivity of transformers. However, blocking artifacts occur during high-resolution synthesis because performing the local attention in a block-wise manner may break the spatial coherency. To solve this, we empirically investigate various solutions, among which we find that employing a wavelet discriminator to examine the spectral discrepancy effectively suppresses the artifacts. Extensive experiments show the superiority over prior transformer-based GANs, especially on high resolutions, e.g., 1024x1024. The StyleSwin, without complex training strategies, excels over StyleGAN on CelebA-HQ 1024x1024, and achieves on-par performance on FFHQ 1024x1024, proving the promise of using transformers for high-resolution image generation.

Quantitative Results

Dataset	Resolution	FID	Pretrained Model
FFHQ	256x256	2.81	Google Drive/Azure Storage
LSUN Church	256x256	2.95	Google Drive/Azure Storage
CelebA-HQ	256x256	3.25	Google Drive/Azure Storage
FFHQ	1024x1024	5.07	Google Drive/Azure Storage
CelebA-HQ	1024x1024	4.43	Google Drive/Azure Storage

Requirements

To install the dependencies:

python -m pip install -r requirements.txt

Generating image samples with pretrained model

To generate 50k image samples at resolution 1024 and evaluate the FID score:

python -m torch.distributed.launch --nproc_per_node=1 train_styleswin.py --sample_path /path_to_save_generated_samples --size 1024 --ckpt /path/to/checkpoint --eval --val_num_batches 12500 --val_batch_size 4 --eval_gt_path /path_to_real_images_50k

To generate 50k image samples at resolution 256 and evaluate the FID score:

python -m torch.distributed.launch --nproc_per_node=1 train_styleswin.py --sample_path /path_to_save_generated_samples --size 256 --G_channel_multiplier 2 --ckpt /path/to/checkpoint --eval --val_num_batches 12500 --val_batch_size 4 --eval_gt_path /path_to_real_images_50k
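The 50k sample count in both commands is the product of the two validation flags: 12500 batches of 4 images each. The following is a minimal sketch of that arithmetic, which may help when changing `--val_batch_size` to fit a different GPU. The helper name `val_num_batches_for` is hypothetical and is not part of the repository.

```python
import math


def val_num_batches_for(total_samples: int, val_batch_size: int) -> int:
    """Number of batches needed to generate at least `total_samples` images."""
    if total_samples <= 0 or val_batch_size <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(total_samples / val_batch_size)


# The README's settings: 12500 batches x 4 images = 50000 samples.
print(val_num_batches_for(50_000, 4))  # 12500
# Doubling the batch size halves the number of batches.
print(val_num_batches_for(50_000, 8))  # 6250
```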

Training

Data preparation

When training FFHQ and CelebA-HQ, we use ImageFolder datasets. The data structure is like this:

FFHQ
├── images
│  ├── 000001.png
│  ├── ...
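ImageFolder-style loaders expect the image files one level below the dataset root, inside a subfolder, which is what the `images` directory above provides. As a rough sanity check before training, the layout can be inspected with the standard library alone. This is only a sketch: the function name `count_images_per_subfolder` and the accepted file suffixes are assumptions, not something the repository defines.

```python
from pathlib import Path

# Assumed set of image suffixes; adjust to match the dataset.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images_per_subfolder(root: str) -> dict:
    """Return {subfolder name: number of image files} for an ImageFolder-style root."""
    counts = {}
    for sub in sorted(Path(root).iterdir()):
        if sub.is_dir():
            counts[sub.name] = sum(
                1
                for f in sub.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

For the layout shown above, `count_images_per_subfolder("/path_to_ffhq_1024")` would be expected to return a single entry keyed by `images`. An empty result suggests the images sit directly in the root rather than in a subfolder.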

When training LSUN Church, please follow stylegan2-pytorch to create an LMDB dataset first. After this, the data structure is like this:

LSUN Church
├── data.mdb
└── lock.mdb

FFHQ-1024

To train a new model of FFHQ-1024 from scratch:

python -m torch.distributed.launch --nproc_per_node=8 train_styleswin.py --batch 2 --path /path_to_ffhq_1024 --checkpoint_path /tmp --sample_path /tmp --size 1024 --D_lr 0.0002 --D_sn --ttur --eval_gt_path /path_to_ffhq_real_images_50k --lr_decay --lr_decay_start_steps 600000

CelebA-HQ 1024

To train a new model of CelebA-HQ 1024 from scratch:

python -m torch.distributed.launch --nproc_per_node=8 train_styleswin.py --batch 2 --path /path_to_celebahq_1024 --checkpoint_path /tmp --sample_path /tmp --size 1024 --D_lr 0.0002 --D_sn --ttur --eval_gt_path /path_to_celebahq_real_images_50k

FFHQ-256

To train a new model of FFHQ-256 from scratch:

python -m torch.distributed.launch --nproc_per_node=8 train_styleswin.py --batch 4 --path /path_to_ffhq_256 --checkpoint_path /tmp --sample_path /tmp --size 256 --G_channel_multiplier 2 --bcr --D_lr 0.0002 --D_sn --ttur --eval_gt_path /path_to_ffhq_real_images_50k --lr_decay --lr_decay_start_steps 775000 --iter 1000000

CelebA-HQ 256

To train a new model of CelebA-HQ 256 from scratch:

python -m torch.distributed.launch --nproc_per_node=8 train_styleswin.py --batch 4 --path /path_to_celebahq_256 --checkpoint_path /tmp --sample_path /tmp --size 256 --G_channel_multiplier 2 --bcr --r1 5 --D_lr 0.0002 --D_sn --ttur --eval_gt_path /path_to_celebahq_real_images_50k --lr_decay --lr_decay_start_steps 500000

LSUN Church 256

To train a new model of LSUN Church 256 from scratch:

python -m torch.distributed.launch --nproc_per_node=8 train_styleswin.py --batch 4 --path /path_to_lsun_church_256 --checkpoint_path /tmp --sample_path /tmp --size 256 --G_channel_multiplier 2 --use_flip --r1 5 --lmdb --D_lr 0.0002 --D_sn --ttur --eval_gt_path /path_to_lsun_church_real_images_50k --lr_decay --lr_decay_start_steps 1300000 --iter 1500000

Notice: When training on 16 GB GPUs, you could add --use_checkpoint to save GPU memory.
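When adapting the commands above to a different number of GPUs, it can help to keep track of the global batch size. The sketch below assumes that `--batch` is the per-process batch size (as in stylegan2-pytorch, from which this code borrows), so that the global batch is `nproc_per_node * batch`. That reading is an assumption, and the function names are hypothetical.

```python
def global_batch_size(nproc_per_node: int, batch_per_gpu: int) -> int:
    """Images processed per training iteration across all processes (assumed per-GPU --batch)."""
    return nproc_per_node * batch_per_gpu


def images_seen(iterations: int, nproc_per_node: int, batch_per_gpu: int) -> int:
    """Total images shown to the model after `iterations` steps."""
    return iterations * global_batch_size(nproc_per_node, batch_per_gpu)


# 1024-resolution commands: 8 GPUs x --batch 2
print(global_batch_size(8, 2))          # 16
# 256-resolution commands: 8 GPUs x --batch 4
print(global_batch_size(8, 4))          # 32
# FFHQ-256 with --iter 1000000
print(images_seen(1_000_000, 8, 4))     # 32000000
```

Under the same assumption, running the FFHQ-256 command on 4 GPUs without changing `--batch` would halve the global batch to 16, so results may not be directly comparable with the reported ones.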

Qualitative Results

Image samples of FFHQ-1024 generated by StyleSwin:

Image samples of CelebA-HQ 1024 generated by StyleSwin:

Latent code interpolation examples of FFHQ-1024 between the left-most and the right-most images:

Citing StyleSwin

@misc{zhang2021styleswin,
      title={StyleSwin: Transformer-based GAN for High-resolution Image Generation}, 
      author={Bowen Zhang and Shuyang Gu and Bo Zhang and Jianmin Bao and Dong Chen and Fang Wen and Yong Wang and Baining Guo},
      year={2021},
      eprint={2112.10762},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Responsible AI Considerations

Our work does not directly modify existing images in ways that may alter the identity or expression of the people depicted. We discourage the use of our work in such applications, as it is not designed for them. We have quantitatively verified that the proposed method does not show evident disparity across gender and age, as the model mostly follows the dataset distribution; however, we encourage additional care if you intend to use the system on certain demographic groups. We also encourage the use of fair and representative data when training on customized data. We caution that the high-resolution images produced by our model may potentially be misused for impersonating humans; viable solutions to avoid this include adding tags or watermarks when distributing the generated photos.

Acknowledgements

This code borrows heavily from stylegan2-pytorch and Swin-Transformer. We also thank the contributors of code Positional Encoding in GANs, DiffAug, StudioGAN and GIQA.

Maintenance

This is the codebase for our research work. Please open a GitHub issue for any help. If you have any questions regarding the technical details, feel free to contact [email protected] or [email protected].

License

The codes and the pretrained model in this repository are under the MIT license as specified by the LICENSE file. We use our labeled dataset to train the scratch detection model.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

styleswin's Issues

Source of FFHQ-256

Which source did you get the FFHQ-256 dataset from? The downsampling method can be a critical factor in FID evaluation. Thanks!

Size mismatch

I am using the generator with image size: 512, size: 256, style_dim: 512, channel_multiplier: 2, and I am getting this error:

generator.py", line 252, in forward
out = gamma * out + beta
RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 2

Can you let me know what the size config should be for a network with image size 512 x 512?

Can styleSwin train like pix2pix?

Hi there,
I was wondering, can styleswin train like pix2pix? I wanna use vit to do img2img task, Thanks

is that able to run on cpu?

Could you tell us how long it took you to train the models, please?

How long did it take to train the 256x256 and 1024x1024 CelebA models? How many GPUs were used, and which GPU model?

Model Training

Is it possible to train a new model on the FFHQ dataset with an image size of 512x512 from scratch?

How to finetune?

Sorry if this is a silly question, but I wanted to ask if you could provide an example of how to fine-tune one of your existing models on a new dataset?

An example from StyleGan3
# Fine-tune StyleGAN3-R for MetFaces-U using 1 GPU, starting from the pre-trained FFHQ-U pickle.
python train.py --outdir=~/training-runs --cfg=stylegan3-r --data=~/datasets/metfacesu-1024x1024.zip \
    --gpus=8 --batch=32 --gamma=6.6 --mirror=1 --kimg=5000 --snap=5 \
    --resume=https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-ffhqu-1024x1024.pkl

Size mismatch

/home/jirib/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

warnings.warn(
/home/jirib/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272068185/work/aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
load model: samples000001.pt
Traceback (most recent call last):
File "/home/jirib/Desktop/StyleSwin-main/train_styleswin.py", line 410, in
generator.load_state_dict(ckpt["g"], strict=False)
File "/home/jirib/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Generator:
size mismatch for to_rgbs.5.conv.weight: copying a param with shape torch.Size([3, 64, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 32, 1, 1]).
Hello, I modified the generator and discriminator and then trained the model (I was trying something out). When I then go on to load the model, this error message appears.
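The traceback shows that `strict=False` does not help here: it tolerates missing or unexpected keys, but a parameter present on both sides with a different shape still raises. One way to see every such conflict at once is to compare shapes before calling `load_state_dict`. The sketch below works on plain name-to-shape mappings so it runs without PyTorch; the helper name `shape_mismatches` is hypothetical. The factor of two in the channel dimension would be consistent with a different channel multiplier between training and loading, though that is only a guess from the shapes.

```python
def shape_mismatches(ckpt_shapes: dict, model_shapes: dict) -> list:
    """List (name, checkpoint_shape, model_shape) for parameters whose shapes differ."""
    shared = sorted(ckpt_shapes.keys() & model_shapes.keys())
    return [
        (name, ckpt_shapes[name], model_shapes[name])
        for name in shared
        if tuple(ckpt_shapes[name]) != tuple(model_shapes[name])
    ]


# With PyTorch, the two mappings could be built as:
#   {k: tuple(v.shape) for k, v in ckpt["g"].items()}
#   {k: tuple(v.shape) for k, v in generator.state_dict().items()}
ckpt = {"to_rgbs.5.conv.weight": (3, 64, 1, 1)}
model = {"to_rgbs.5.conv.weight": (3, 32, 1, 1)}
for name, saved, current in shape_mismatches(ckpt, model):
    print(f"{name}: checkpoint {saved} vs model {current}")
```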

FID Curve

Great work! However, I used 4 x 3090 GPUs to train StyleSwin on the FFHQ-256 dataset and evaluated FID on the same dataset. After 4 days, I got the following FID curve for iterations 0-500k.

Is this normal? At this rate it seems unlikely that the FID will drop below 10 by 1000k iterations. Could you please share your FID curve for this dataset?

Can I get ONNX format for the pre-trained models?

Hello:
I have had all kinds of issues using any of the models in Python to make predictions, but I can run ONNX-format models in C#.
Is it possible to convert the pre-trained models to ONNX format and publish them here?
Thanks,

About FLOPs

When I print the FLOPs of the generator, there is an error. I found that 'self.attn' on line 356 is a list; in fact there are 2 attns to be calculated. How can I fix it? When I change 'self.attn' to 'self.attn[0]', the FLOPs come out as 68109933824.0 (68B), which is larger than the 50.9B reported in the paper.

abnormal image generation

I don't know if this problem occurred in your build process? Is this due to lack of tRGB module or something else?

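GPU memory question

Can this code run on a GPU with 12 GB of memory if I use the model as a generator, like StyleGAN? Also, can the pretrained model be used for face editing? Looking forward to your reply.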

Training Artifacts

I am training with my own custom faces dataset but I am getting some artifacts during training, is this normal and eventually disappear with longer training, or have I done something wrong?

Query: 64x64 Datasets

Thank you authors for sharing the interesting work you have done in deep generative modeling. I have a rather small doubt regarding changing the architecture for training smaller 64x64 or 32x32 datasets. It would be great if you could guide me.

Doubts about the networks parameters and FLOPs

Hi, Fancy! Thanks for your excellent work.

StyleSwin synthesizes a 1024x1024 image with 40.86M params and 50.90B FLOPs, as shown in Table 6 of the paper.

But I reproduced the results by running:

from thop import profile
flops, params = profile(generator, (noise,))             # noise: torch.Size([1, 512])
print('flops: ', flops / 1000000000, 'params: ', params / 1000000)
flops, params = profile(discriminator, (real_img,))  # real_img: torch.Size([1, 3, 1024, 1024])
print('flops: ', flops / 1000000000, 'params: ', params / 1000000)

The generator params are 28.28M with 47.36B FLOPs.
The discriminator params are 27.73M with 50.19B FLOPs.

I don't know where the problem is. Looking forward to your reply!

latent code interpolation

Hi Authors,

Will you be releasing code for latent code interpolation?

Thanks !
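For readers with the same question, the basic idea can be sketched independently of the repository. The README only shows interpolation results, so the code below is not the authors' implementation: it assumes plain linear interpolation between two latent codes, represented here as Python lists of floats, and the name `lerp_latents` is hypothetical. Each interpolated code would then be fed to the generator in place of a randomly sampled one.

```python
def lerp_latents(z_start, z_end, num_steps):
    """Linearly interpolate between two latent codes, endpoints included."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent codes must have the same length")
    frames = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # t runs from 0.0 to 1.0
        frames.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return frames


# Three codes: the start, the midpoint, and the end.
for code in lerp_latents([0.0, 2.0], [1.0, 4.0], 3):
    print(code)
```

Whether interpolation is better done on the input noise or on the output of the mapping network is a design choice this sketch does not settle.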

colab

please add a google colab for inference

suppress artifacts

Hi, Fancy! Thanks for your excellent work.
I would like to ask which part of the code uses Haar wavelet to suppress artifacts?

cuda memory

I train StyleSwin for FFHQ 1024 resolution. But I got this error:

RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 3; 23.65 GiB total capacity; 19.49 GiB already allocated; 474.00 MiB free; 22.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I'm using 4 RTX 24G GPUs, and there are no other programs running. The batch is set to 2. Why is this not enough to train StyleSwin?

Why do you replace noise with SPE?

Compared with StyleGAN2, I notice that you replace noise with SPE at the same place. What are the differences between SPE and noise? Can SPE achieve the effect of noise? It seems like SPE is a fixed vector?

Thanks.

Training log of losses

Hi,
Thank you for your awesome research!
I am training the model with my own dataset, and I would like to see the training logs of the losses, if possible.
The discriminator loss seems to converge very fast (close to 0). Is that right?

Best regards,
Hankyu Jang

Query: How to save the generated samples

Thank you authors for sharing the interesting work you have done in deep generative modeling. I have a rather small doubt, this is regarding the command to generate samples.

python -m torch.distributed.launch --nproc_per_node=1 train_styleswin.py --sample_path /path_to_save_generated_samples --size 256 --G_channel_multiplier 2 --ckpt /path/to/checkpoint --eval --val_num_batches 12500 --val_batch_size 4 --eval_gt_path /path_to_real_images_50k

Here, I wanted to know what /path_to_save_generated_samples should be.

I am getting the following error message:

Traceback (most recent call last):
  File "train_styleswin.py", line 382, in <module>
    os.mkdir(args.sample_path)
FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Repositories/StyleSwin/StyleSwin_generated_samples/samples'

I made a directory in the cloned repository called StyleSwin_generated_samples for saving the samples.
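The traceback is what `os.mkdir` produces when a parent of the target directory does not exist: it creates only the last path component. Whether the parent was really missing in this case (for example, because the path under `/content/drive` differs from the cloned location) is an assumption. The behaviour itself can be reproduced in a temporary directory, along with the `os.makedirs` call that creates missing parents as well.

```python
import os
import tempfile

base = tempfile.mkdtemp()
# Mirrors the reported path shape: <parent>/samples, where <parent> is missing.
target = os.path.join(base, "StyleSwin_generated_samples", "samples")

try:
    os.mkdir(target)  # fails: the parent directory does not exist yet
except FileNotFoundError as err:
    print("os.mkdir failed:", err.strerror)

os.makedirs(target, exist_ok=True)  # creates the missing parent too
print(os.path.isdir(target))
```

Creating the full directory passed to `--sample_path` ahead of time, or checking that it resolves to the expected location, may avoid the error without touching the script.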

question strategy in resolution progression

Congratulations on this great job. I would like to ask you if your training strategy is similar to the StyleGAN resolution progression (e.g. 64x64, then 128x128) Thanks!

cvpr2022 call for demos

Hi, there is a call for demos this year for cvpr 2022

https://cvpr2022.thecvf.com/call-demos

where a demo can be added to the Hugging Face organization here: https://huggingface.co/cvpr

would you be interested in submitting a demo for this?

nvcc fatal: Unknown option '-generate-dependencies-with-compile'

Hi, I was following your GitHub repo and trying to run StyleSwin, and I encountered the following problem:
nvcc fatal : Unknown option '-generate-dependencies-with-compile'. I am not sure what the problem is.

About Automatic Mixed Precision

Thank you for awesome research and code release!
Is there any reason that you don't use automatic mixed precision package of pytorch?
Did it lower the performance of model when you use it?

Error using ckpt when resuming

Thanks for sharing, I am having this error:

Traceback (most recent call last):
File "train_styleswin.py", line 409, in
generator.load_state_dict(ckpt["g"])
File "/mnt/anaconda3/envs/StyleSwin/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1668, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Generator:
Unexpected key(s) in state_dict: "layers.4.blocks.0.attn_mask2", "layers.4.blocks.0.norm1.style.weight", "layers.4.blocks.0.norm1.style.bias", "layers.4.blocks.0.qkv.weight", "layers.4.blocks.0.qkv.bias", "layers.4.blocks.0.proj.weight", "layers.4 and so on

this is the command I am running:

python -m torch.distributed.launch --nproc_per_node=2 train_styleswin.py --batch 4 --path /mnt/DATASETS/FFHQ --checkpoint_path /mnt/PROCESSEDdata/StyleSwin/Train --sample_path /mnt/PROCESSEDdata/StyleSwin/Train --size 32 --G_channel_multiplier 2 --bcr --D_lr 0.0002 --D_sn --ttur --eval_gt_path /mnt/DATASETS/FFHQ --lr_decay --lr_decay_start_steps 775000 --iter 1000000 --ckpt /mnt/PROCESSEDdata/StyleSwin/FFHQ_1024.pt --use_checkpoint

I tried with and without the use_checkpoint flag and also the 256 version giving back the same error.

Best

Missing attribution for `Inceptionv3`

https://github.com/microsoft/StyleSwin/blob/main/utils/inception.py seems to be an exact copy, without any attribution, of https://github.com/mseitzer/pytorch-fid/blob/master/src/pytorch_fid/inception.py
with the exception of adding a copyright text @ Microsoft

Continue training

Hello, after I had trained for 400000 iterations, there was a power failure. How can I continue training?

When will the training code be released?

Train and Validation Split

What was the train and validation split used? I'm using the checkpoint provided and testing with a validation set of the top 10k, similar to Co-Mod-GAN's split (section 5.1). Using this split I am getting a FID of 4.26.

Effect of equalized learning rate in generator architecture

Hi, thanks for this great work!

In generator code, mapping network and AdaIN uses EqualLinear from StyleGAN2, and transformer block uses nn.Linear.

I know this configuration may follow the original implementation of mapping network and attention block,
but I wonder if this component affects the image generation performance.

E.g. FID when using EqualLinear in qkv of attention block

Do you have any idea of the effects of equalized learning rate in transformer block?

Thanks,

Training time

How long should i expect it to train on 256x256 resolution? I only have 1 GPU, if that helps

Query: Samples in grid

Thank you for the project. I wanted to know if one has saved the generated images in a directory, how to have them in a grid N*M dimensions for better analysis.
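One way to approach this is to compute where each image goes on a single canvas and then paste the images there. The sketch below only does the layout arithmetic, using the standard library. The function name `grid_layout` and its parameters are hypothetical, and it assumes all images share the same width and height.

```python
def grid_layout(num_images, cols, tile_w, tile_h, padding=0):
    """Return ((canvas_w, canvas_h), offsets) for tiling images row by row."""
    if num_images <= 0 or cols <= 0:
        raise ValueError("num_images and cols must be positive")
    rows = -(-num_images // cols)  # ceiling division
    canvas_w = cols * tile_w + (cols + 1) * padding
    canvas_h = rows * tile_h + (rows + 1) * padding
    offsets = []
    for i in range(num_images):
        row, col = divmod(i, cols)
        left = padding + col * (tile_w + padding)
        top = padding + row * (tile_h + padding)
        offsets.append((left, top))
    return (canvas_w, canvas_h), offsets


# 5 images of 256x256 in 3 columns with 2 px padding.
size, offsets = grid_layout(5, 3, 256, 256, padding=2)
print(size)
print(offsets)
```

With Pillow installed, the canvas could be created with `Image.new("RGB", size)` and each image placed with `canvas.paste(img, offset)`. For image tensors, `torchvision.utils.make_grid` is an alternative.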

Hello

Hello, after installing the environment with the requirements command under Python 3.7, I keep getting an error at arch_list[-1] += '+PTX'. What is the cause of this?

Betas

Hello, I forgot to ask in the previous issue, but what is the intuition behind beta1=0.0 and beta2=0.99? I've seen it in a couple more projects (such as CIPS), and I have always wondered how they came up with these values (usually, GANs have beta1=0.5 and beta2=0.999). Is there some property of these values that helps training? Or are these just the betas that seemed to work best?

pre_training and fine-tune

Hi, thank you so much for sharing your code！
I have a question. I'd like to transfer the model to medical images.
But I don't know whether I should retrain directly or load the pre-trained model and fine-tune it.
What do you think about it? Looking forward to your reply!