compvis / taming-transformers Goto Github PK

View Code? Open in Web Editor NEW

5.6K 77.0 1.1K 409.77 MB

Taming Transformers for High-Resolution Image Synthesis

Home Page: https://arxiv.org/abs/2012.09841

License: MIT License

Python 1.73% Jupyter Notebook 98.27%

taming-transformers's People

Stargazers

Watchers

Forkers

zebrajack ak9250 peterouzh anthonyyuan samuelpietri rromb canbuoy ink1 cv-ip yearofthewhopper tamwaiban tubbz-alt jndhlovu aihill 41xu rogalag shadowkun janfschr zymale griseljimenez sunny11286 rockjicks eteminya zeta1999 amwons erinqi peterzhousz chaoso styler00dollar tawawhite obake2ai peternara stjordanis futureprecd lsflyt justinpinkney ashbt rhinojosa crazy-jack lockejiang daijuting adak32 c3suryansu rajesh16702 suraj-rajan ml-and-ai-repo ericyu97 karthikpoluri jawad1347 jbdatascience lklacar b1sounours longjohncoder knut0815 happypennygames guru4741 misery0424 stel-nik lebon369 skyler14 nimpnaw owalnuto dnzengou chithangduong davidsoong rancherzhang ramyh lzhbrian sailfish009 williamberrios codeaudit luigivendetta nakada0000007 guome marcofernandez007 simenglv stephennfernandes peterzs ftgoktas stevenhan1991 leevisin luckyiitr mdorkenw mtlong guthman umairahmd jinshiyin flock1 smallflyingpig vismay93 towardsautonomy fiveoceans-dev theodoregalanos eleanor91888388 bensantos nora919530829 fedral kyeongbokong remikaz jerryrelmore

taming-transformers's Issues

Is it possible to share the config used for CIFAR-10?

New conditional model training

So the paper is fantastic and had lot of fun playing with the pretrained models :)
However I'm slightly confused about training a new model with COCO dataset . What I understood is the following:

Create a new training config file which creates image, conditional image pairs as part of the training and validation dataloaders.
Train the VQGAN with COCO dataset
Train the VQGAN with conditional image (Is this needed? Because with DRIN there were two training steps but with coco segmentation only one VQGAN training step is required. )
Train the Transformer with the above trained VQGAN checkpoints.

Can someone confirm if my understanding is correct?

How would I do RGB 2 RGB image2image translation with this repo?

I have 512x512 pixel images I would like to do image2image translation on.

where is the key code of the tansformer?

Straight-through estimator

Thank you for sharing your amazing work.

Could you tell me where is the implementation of the "straight-through estimator" for training the encoder?

I just wanted to personally thank everyone involved in this effort. Training is now far more accessible on DALLE-pytorch using the pretrained VAE you provided. Compute and memory costs are substantially lower and it's even possible for people to train a relatively large transformer under 16 GiB of VRAM.

It's early days, and no one has trained a "full DALL-E" yet, but this helps with that plenty and momentum is already picking up on the repo.

So thanks and great work everyone. You're awesome.

error on train resume

I am getting the error below on train resume. Say,
python main.py --base configs/pic.yaml -t True --gpus 0, --max_epochs 2 --resume logs/2021-05-01T01-34-57_pic
Oddly enough, it seems to work fine when I retry.

Traceback (most recent call last):
File "main.py", line 562, in
trainer.fit(model, data)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 445, in fit
results = self.accelerator_backend.train()
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 279, in ddp_train
self.trainer.train_loop.setup_training(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 174, in setup_training
self.trainer.checkpoint_connector.restore_weights(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 75, in restore_weights
self.restore(self.trainer.resume_from_checkpoint, on_gpu=self.trainer.on_gpu)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 107, in restore
self.restore_model_state(model, checkpoint)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 128, in restore_model_state
model.load_state_dict(checkpoint['state_dict'])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
self.class.name, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Net2NetTransformer:
Unexpected key(s) in state_dict: "cond_stage_model.colorize".

How to generate unconditional image?

Hi,
How to generate unconditional image?
Thank you!

colab

Hi, can you please add a google colab for inference thanks

waiting

[Question] How to detect and grab a grid from an image

Hi, I'm new to this field, so as part of my studies I'm trying to detect and return a Sudoku grid from and image. I know I can use the Hough lines transform but it gets messy:

It possible to detect grid more cleanly and efficiently?

multi-gpu training hangs

Hmm, training with --gpus 0, works fine but training with --gpus 0,1 hangs right at initializing ddp ...

Request:Test the model without downloading the whole dataset.

Is there a easier way to test out D-RIN and FacesHQ models?The Imagenet is just too big for trying out the result to see how it works.Wish has a way to test 10 example like the S-FLCKR model.Thanks!

The S-FLCKR model is so mind blowing incredible! Great work!

The file hosting server heibox.uni-heidelberg.de is down

https://heibox.uni-heidelberg.de/seafhttp/files/0cc07b02-72f5-4615-a2ac-ace188cf0ed0/last.ckpt

remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 671 (delta 4), reused 7 (delta 3), pack-reused 658
Receiving objects: 100% (671/671), 116.29 MiB | 24.59 MiB/s, done.
Resolving deltas: 100% (139/139), done.
/content/taming-transformers
--2021-03-24 20:53:31--  https://heibox.uni-heidelberg.de/f/140747ba53464f49b476/?dl=1
Resolving heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)... 129.206.7.113
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://heibox.uni-heidelberg.de/seafhttp/files/0cc07b02-72f5-4615-a2ac-ace188cf0ed0/last.ckpt [following]
--2021-03-24 20:53:31--  https://heibox.uni-heidelberg.de/seafhttp/files/0cc07b02-72f5-4615-a2ac-ace188cf0ed0/last.ckpt
Reusing existing connection to heibox.uni-heidelberg.de:443.
HTTP request sent, awaiting response... 200 OK
Length: 957954257 (914M) [application/octet-stream]
Saving to: ‘logs/vqgan_imagenet_f16_1024/checkpoints/last.ckpt’

logs/vqgan_imagenet   0%[                    ]       0  --.-KB/s    in 29s     

2021-03-24 20:54:01 (0.00 B/s) - Connection closed at byte 0. Retrying.

--2021-03-24 20:54:02--  (try: 2)  https://heibox.uni-heidelberg.de/seafhttp/files/0cc07b02-72f5-4615-a2ac-ace188cf0ed0/last.ckpt
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 502 Proxy Error
2021-03-24 20:54:32 ERROR 502: Proxy Error.

--2021-03-24 20:54:32--  https://heibox.uni-heidelberg.de/f/6ecf2af6c658432c8298/?dl=1
Resolving heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)... 129.206.7.113
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://heibox.uni-heidelberg.de/seafhttp/files/3dbcbfc9-5824-4909-8237-df3035a8d83b/model.yaml [following]
--2021-03-24 20:54:32--  https://heibox.uni-heidelberg.de/seafhttp/files/3dbcbfc9-5824-4909-8237-df3035a8d83b/model.yaml
Reusing existing connection to heibox.uni-heidelberg.de:443.
HTTP request sent, awaiting response... 502 Proxy Error
2021-03-24 20:55:02 ERROR 502: Proxy Error.

--2021-03-24 20:55:03--  https://heibox.uni-heidelberg.de/f/867b05fc8c4841768640/?dl=1
Resolving heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)... 129.206.7.113
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://heibox.uni-heidelberg.de/seafhttp/files/5baa72fd-1411-420c-b711-69aeacccbf8d/last.ckpt [following]
--2021-03-24 20:55:03--  https://heibox.uni-heidelberg.de/seafhttp/files/5baa72fd-1411-420c-b711-69aeacccbf8d/last.ckpt
Reusing existing connection to heibox.uni-heidelberg.de:443.
HTTP request sent, awaiting response... ```

can't get or access the checkpoints and can no longer use them for our implementation. Please host them on a different server please!

CelebA-HQ

Regarding CelebA-HQ - in the link you shared there are instructions on how to create tfrecords for CelebA-HQ but not the npy which are required in your code. Can you please provide some guidance on this?
Thank you

Gradient flow

Hi guys, first of all, impressive work you have donehere.

Skimming through the repo I noticed that the critic/discriminator receives gradients through both losses on account of it not having its gradients frozen, when the autoencoder part is optimized. Do I see that correctly? and if so, why did you choose to do that?

Starting point for training sflckr

Hey guys, great work!
I'm trying to run training on a dataset similar to your sflckr. However I'm hitting this error immediately after validation or training starts, right after "Summoning checkpoint.":
assert t <= self.block_size, "Cannot forward, model block size is exhausted." AssertionError: Cannot forward, model block size is exhausted.
Assuming this was GPU memory related I reduced the model size but the error persisted. So I started to think that perhaps this has something to do with the configuration. My starting point is your sflckr.yaml
python main.py --base configs/sflckr.yaml -t True --gpus 0,
Any hints are highly appreciated. Thanks!

Overfitting problem when training transformer

I train the transformer but find it overfits after 30-40 epochs, with the validation loss goes high and the training loss is very small.
If you meet this problem in model training. Now I try to use the pkeep=0.9 in the cond_transformer.py to avoid overfitting.

decode_code calls undefined embed_code function

In VQModel decode_code it calls embed_code on the quantizer, which doesn't appear to be defined.

Rebuilding Pytorch Lighting Training Loop

Hello,
I'm currently trying to rebuild the model for a different use, or pose to image(which is covered in the website, but is not mentioned here). If I already have input images that are presegmented(eg by Openpose), how would I get this work?
The diagrams seems to indicate the input image is downsample, or encoded, passed through transformers in the patch fashion, then upsampled again, but I'm struggling to see how the code allows it(the model definition just seems to be instantiate from config), but I would appreciate any help in helping me hopefully pick apart this code, and reassemble it.

beta term does not correctly apply to the loss term in VQ-VAE

Is the beta supposed to be applied to the commit loss where z_q is frozen rather than z is frozen?

taming-transformers/taming/modules/vqvae/quantize.py

Line 69 in 1bbc027

loss = torch.mean((z_q.detach()-z)**2) + self.beta * \

Could you please give me some clue about how to do super-resolution task?

Hi there,
I cannot find a pretrained model to have a quick try. Could somebody give me some advice?

index 183 is out of bounds for axis 0 with size 182 np.eye

on this line
segmentation = np.eye(182)[segmentation]
this was with the default image in segmentation norway

pip installable package

Hi! Thank you for the great paper :)

I am the owner of https://github.com/lucidrains/DALLE-pytorch and was thinking of offering the users a way to train DALL-E using your pretrained VQ-GAN, specifically the one with a codebook of size 1024. lucidrains/DALLE-pytorch#75 I was wondering if you would be open to making your repository a pip installable package, with all the necessary dependencies (omegaconf and pytorch-lightning), so that it could be installed with

$ pip install taming-transformers

followed by

from taming.model.vqgan import VQModel

inference time

are there ways to reduce inference time? Currently takes about 13 minutes on a k80 for the norway example at 432x288

Unable to download directory "2020-11-09T13-31-51_sflckr": size is too large.

Anyone facing this same problem?

Generating higher resolution images for FacesHQ data

Hello authors,

Thank you for the amazing work! I am trying to generate a face image of higher resolution (512x512). My strategy is to initiate the z_q as a random vector of integers between (0, 1023) of dimension 1x1024, reshape this to 32x32 and then use model.decode_to_img to make an image of 512x512. To make a sensible image of faces I autoregressively generate the next codebook token using row-major sequence on the 32x32 matrix, similar mechanism as given in the notebook here . Unfortunately, the final image I get is something like this

It is something like the repetition of a pattern of faces. Could you please guide me on this?

Thanks!

Are the grid patterns normal?

I am trying to train VQ-GAN on the COCO dataset, but I got reconstructed images with grid patterns on them, and the reconstructed image doesn't look like the input. Is this normal at the beginning of the training? Am I doing anything wrong?

License

Hi, please add a license file thanks

Skip steps in Inferencing to reduce runtime?

I've been looking at inferencing scripts, mostly the taming-transformers.ipynb file, and I can't figure about how to get the transformer to process every other step, every 3 steps, every n steps, etc. How would I modify the script to skip steps, at the expense of quality?

Reconstruction loss

In the paper,
the authors describe that the recon loss is replaced with perceptual loss.

However, in the code, the actual recon loss is L1 (not L2) + perceptual loss.

https://github.com/LeeDoYup/taming-transformers/blob/1bbc027acb6a47e4eb348d611f9af53f1038ffee/taming/modules/losses/vqperceptual.py#L78-L87

Segmentation map palette

Hi, thank you for sharing your amazing work. I want to play with it a bit and especially with my own segmentation maps, where I can find which color represents what material in the landscape model?

Here is an example of a segmentation map:

Discriminator loss remains unchanged during training

Great work!

I am working on my own dataset recently. During training, I found two odd things about the loss. I really appreciate the guidance of you if you've had the same problems before.

a. I fit in my own dataset, the whole process runs well except that the D loss during training remains 1! I followed the same procedure where Discriminator started after several epochs. Seems that D losses its ability to distinguish the real and fake. I decrease the number of pre-running epochs but ended in the same result.

b. I tried to exclude the D loss and keep the perceptual loss. The reconstructed results seem fine, except that within the complex-pattern area there exists some blocking effect noise. I wonder whether you guys ran into the same odd before.

All in all, I really think this piece of work is a big step toward better text-image generation.

When to use sliding attention window?

Hi, I find the code of sliding attention window in sample_conditional.py, but I cannot find where the sliding attention window is used in the training stage. Is this technique only used when sampling? Or both sampling and training? Thanks.

image harmonization

how to harmonize image?

How to run experiments on DeepFashion?

Setting for the reported experimental results

Hi @rromb , some quantitative results of FID on CelebA-HQ, ADE20k, e.g., are reported in this repo. But the model setting is not clear, such as if the model includes a conditional input (e.g., semantic map) or is unconditioned. Can you add the model setting in the table? Thanks.

please share sflckr vqgan for segmentation maps

@rromb @pesser would much appreciate if you could share vqgan checkpoint train on sflckr segmentation maps

Has this approach been extended to use BERT based transformer instead of GPT?

Unable to download checkpoints from the links provided in the README

The page gives an error: Unable to download directory "2020-11-09T13-31-51_sflckr": size is too large. I have to go into the checkpoints directory and download the file by itself, which is not that big of a deal but...

where is the coco segmentation vqgan pre-train model

when generation segmentation to image i can't find this file "ckpt_path: logs/2020-11-07T00-08-54_cocostuffthings_vqvae_segmentation_bce/checkpoints/last.ckpt"

Possible to get feedback on this paper summary

Hi Folks,

Congrats on this amazing paper. I really enjoyed reading the paper. I would love to get your feedback on my paper summary for this paper.

Regards.

How to decide the training epochs or early stop condition?

I really like your paper, thanks for your open source!
It seems that you did not use early stop in the ModelCheckpoint. Could you please tell me how many epochs you trained the VQGAN and transformer? Or do you have suggestions about the training epochs on new datasets?

How much VRAM does it take to train COCO-Stuff/ADE20K transformer models?

Thank you for sharing this great work!

Could you give more information on training COCO-Stuff/ADE20K transformer models?
I got OOM even with bach-size of 1 when training these transformer models with the hyperparameters specified in the appendix on a GPU with 11GB VRAM. Is this expected? If so, what's the minimum amount of VRAM per GPU to train these conditional models?

Pose to Image, Segmentation Mask Question

I was hoping to reimplement the pose to image portion of the paper with a couple modifications, does anyone have any information for the range of possible values allowed in segmentation, and also what color they correspond to?
Also, can you share the config and how you trained this model?

Has anyone succeeded in reproducing the results?

I am still struggling with training VQ-GAN in the first stage, not even the conditional transformer which is a second stage.
The result looks fine before the discriminator loss is injected. BUT using the discriminator loss suddenly ruins the reconstructed images. disc_loss remains 1.0 during the training. Why??

Load an example segmentation and visualize

I get this issue when I use my own image in the Load an example segmentation and visualize section

How can I fix this?? Thanks.

IndexError                                Traceback (most recent call last)
<ipython-input-46-1334a87733d0> in <module>()
      4 segmentation = Image.open(segmentation_path)
      5 segmentation = np.array(segmentation)
----> 6 segmentation = np.eye(182)[segmentation]
      7 segmentation = torch.tensor(segmentation.transpose(2,0,1)[None]).to(dtype=torch.float32, device=model.device)

IndexError: index 255 is out of bounds for axis 0 with size 182

How to inpaint images?

Thanks. In the paper, top row of Figure 4 shows two image inpainting results. I wonder how to do that. Can I replace the image coordinates with masked images?

Use text as Condition?

config files for conditional training on segmentation maps

Great paper! I am trying to retrain this model on an image dataset where I'm able to generate the segmentation masks using DeepLab v2. However, I don't have a config yaml file for training transformer as for faceHQ or D-RIN. Could you please provide a sample yaml file training with segmentation masks? Many Thanks

compvis / taming-transformers Goto Github PK

taming-transformers's People

Stargazers

Watchers

Forkers

taming-transformers's Issues

Recommend Projects

Recommend Topics

Recommend Org