
stablecascade's Introduction

Stable Cascade

This is the official codebase for Stable Cascade. We provide training & inference scripts, as well as a variety of different models you can use.

This model is built upon the Würstchen architecture, and its main difference from other models, like Stable Diffusion, is that it works in a much smaller latent space. Why is this important? The smaller the latent space, the faster you can run inference and the cheaper the training becomes. How small is the latent space? Stable Diffusion uses a compression factor of 8, so a 1024x1024 image is encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a 1024x1024 image to 24x24 while maintaining crisp reconstructions. The text-conditional model is then trained in this highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable Diffusion 1.5.
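To put those numbers side by side, here is the arithmetic they imply for a 1024x1024 image (plain arithmetic for illustration, not code from this repository):

image_size = 1024

sd_latent = image_size // 8    # Stable Diffusion, compression factor 8   -> 128 x 128
sc_latent = 24                 # Stable Cascade, compression factor ~42.7 -> 24 x 24

print(sd_latent, sc_latent)                    # 128 24
print((sd_latent ** 2) / (sc_latent ** 2))     # ~28.4x fewer latent positions per image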

Therefore, this kind of model is well suited for use cases where efficiency is important. Furthermore, all known extensions like finetuning, LoRA, ControlNet, IP-Adapter, LCM etc. are possible with this method as well. A few of these (finetuning, ControlNet, LoRA) are already provided in the training and inference sections.

Moreover, Stable Cascade achieves impressive results, both visually and in evaluations. According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all comparisons. The above picture shows the results of a human evaluation using a mix of parti-prompts (link) and aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).

Stable Cascade's focus on efficiency is evident in its architecture and its more highly compressed latent space. Despite the largest model containing 1.4 billion parameters more than Stable Diffusion XL, it still features faster inference times, as can be seen in the figure below.


Model Overview

Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade for generating images, hence the name "Stable Cascade". Stage A & B are used to compress images, similar to the role of the VAE in Stable Diffusion. However, as mentioned before, this setup achieves a much higher compression of images. Stage C is then responsible for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually. Note that Stage A is a VAE and both Stage B & C are diffusion models.

For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes in a 1 billion and a 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was put into its finetuning. The two versions of Stage B have 700 million and 1.5 billion parameters. Both achieve great results; however, the 1.5 billion version excels at reconstructing small, fine details. Therefore, you will achieve the best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is fixed due to its small size.

Getting Started

This section will briefly outline how you can get started with Stable Cascade.

Inference

Running the model can be done through the notebooks provided in the inference section. There you will find more details on downloading the models and compute requirements, as well as tutorials on how to use the models. Specifically, four notebooks are provided for the following use cases:

Text-to-Image

A compact notebook that provides you with basic functionality for text-to-image, image-variation and image-to-image.

  • Text-to-Image

Cinematic photo of an anthropomorphic penguin sitting in a cafe reading a book and having a coffee.

  • Image Variation

The model can also understand image embeddings, which makes it possible to generate variations of a given image (left). There was no prompt given here.

  • Image-to-Image

This works just as usual, by noising an image up to a specific point and then letting the model generate from that starting point. Here the left image is noised to 80% and the caption is: A person riding a rodent.
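For intuition, "noising to 80%" can be pictured with the standard diffusion forward process; the sketch below is a generic illustration and not this repository's gdf implementation:

import torch

def noise_to(x0, alpha_bar_t):
    # Standard forward process: mix the clean latent with Gaussian noise.
    # alpha_bar_t is whatever the noise schedule assigns to t = 0.8 here.
    noise = torch.randn_like(x0)
    return alpha_bar_t ** 0.5 * x0 + (1 - alpha_bar_t) ** 0.5 * noise

Sampling then starts from this partially noised latent instead of from pure noise.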

Furthermore, the model is also accessible in the diffusers 🤗 library. You can find the documentation and usage here.
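For reference, a minimal sketch along the lines of the diffusers snippet quoted in the issues further down this page (the exact API may differ between diffusers versions, so treat this as illustrative):

import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

device = "cuda"
prompt = "Cinematic photo of an anthropomorphic penguin sitting in a cafe reading a book and having a coffee."

# Stage C (prior) produces image embeddings; Stage B & A (decoder) turn them into pixels.
prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16).to(device)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", torch_dtype=torch.float16).to(device)

prior_output = prior(prompt=prompt, height=1024, width=1024, guidance_scale=4.0, num_inference_steps=20)
images = decoder(
    image_embeddings=prior_output.image_embeddings.half(),
    prompt=prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10,
).images
images[0].save("penguin.png")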

ControlNet

This notebook shows how to use ControlNets that were trained by us or how to use one that you trained yourself for Stable Cascade. With this release, we provide the following ControlNets:

  • Inpainting / Outpainting

  • Face Identity

Note: The Face Identity ControlNet will be released at a later point.

  • Canny

  • Super Resolution

These can all be used through the same notebook and only require changing the config for each ControlNet. More information is provided in the inference guide.

LoRA

We also provide our own implementation for training and using LoRAs with Stable Cascade, which can be used to finetune the text-conditional model (Stage C). Specifically, you can add and learn new tokens and add LoRA layers to the model. This notebook shows how you can use a trained LoRA. For example, training a LoRA on my dog with the following kind of training images:

Lets me generate the following images of my dog given the prompt: Cinematic photo of a dog [fernando] wearing a space suit.
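For reference, the LoRA-specific part of such a training config looks roughly like the excerpt below; the keys are taken from the configs quoted in the issues further down this page, so treat it as an illustrative excerpt rather than a complete, verified config:

experiment_id: stage_c_3b_lora
model_version: 3.6B
generator_checkpoint_path: models/stage_c_bf16.safetensors
module_filters:
- .attn            # which modules receive LoRA layers
rank: 4            # LoRA rank
train_tokens:
- - '[fernando]'   # new token to add and train
  - ^dog           # existing-token pattern it is paired with
webdataset_path: file:data/fernando.tar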

Image Reconstruction

Lastly, one thing that might be very interesting, especially if you want to train your own text-conditional model from scratch, maybe even with a completely different architecture than our Stage C, is to use the (diffusion) autoencoder that Stable Cascade uses in order to work in the highly compressed space. Just like people use Stable Diffusion's VAE to train their own models (e.g. DALL·E 3), you could use Stage A & B in the same way, while benefiting from a much higher compression, allowing you to train and run models faster.
The notebook shows how to encode and decode images and what specific benefits you get. For example, say you have the following batch of images of dimension 4 x 3 x 1024 x 1024:

You can encode these images to a compressed size of 4 x 16 x 24 x 24, giving you a spatial compression factor of 1024 / 24 = 42.67. Afterwards you can use Stage A & B to decode the images back to 4 x 3 x 1024 x 1024, giving you the following output:

As you can see, the reconstructions are surprisingly close, even for small details. Such reconstructions are not possible with a standard VAE at a comparable level of compression. The notebook gives you more information and easy code to try it out.
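Purely as bookkeeping for the shapes quoted above (illustrative arithmetic, not repository code):

import math

image_shape  = (4, 3, 1024, 1024)   # the example batch
latent_shape = (4, 16, 24, 24)      # what Stage A & B compress it to

print(1024 / 24)                                          # ~42.67 spatial compression
print(math.prod(image_shape) / math.prod(latent_shape))   # ~341x fewer values overall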

Training

We provide code for training Stable Cascade from scratch, finetuning, ControlNet and LoRA. You can find a comprehensive explanation for how to do so in the training folder.

Remarks

The codebase is in early development. You might encounter unexpected errors or not perfectly optimized training and inference code. We apologize for that in advance. If there is interest, we will continue releasing updates to it, aiming to bring in the latest improvements and optimizations. Moreover, we would be more than happy to receive ideas, feedback or even updates from people that would like to contribute. Cheers.

Gradio App

First install gradio and diffusers by running:

pip3 install gradio
pip3 install accelerate # optionally
pip3 install git+https://github.com/kashif/diffusers.git@wuerstchen-v3

Then from the root of the project run this command:

PYTHONPATH=./ python3 gradio_app/app.py

Citation

@misc{pernias2023wuerstchen,
      title={Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models}, 
      author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher J. Pal and Marc Aubreville},
      year={2023},
      eprint={2306.00637},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

LICENSE

All the code in this repo is released under an MIT LICENSE.
The model weights, which you can get from Hugging Face following these instructions, are released under a STABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE.

stablecascade's People

Contributors

dome272, lxe, pabloppp


stablecascade's Issues

Can't train LoRA with single gpu

root@neej0vsc8w:/tmp/StableCascade# python3 train/train_c_lora.py configs/training/finetune_c_lora_mbl.yaml
Launching Script
['experiment_id', 'checkpoint_path', 'output_path', 'image_size', 'webdataset_path', 'grad_accum_steps', 'batch_size', 'updates', 'backup_every', 'save_every', 'lr', 'warmup_updates', 'model_version', 'effnet_checkpoint_path', 'previewer_checkpoint_path', 'module_filters', 'rank', 'train_tokens']
Traceback (most recent call last):
  File "/tmp/StableCascade/train/train_c_lora.py", line 330, in <module>
    warpcore()
  File "/tmp/StableCascade/core/__init__.py", line 292, in __call__
    self.setup_ddp(self.config.experiment_id, single_gpu=single_gpu)  # this will change the device to the CUDA rank
  File "/tmp/StableCascade/core/__init__.py", line 146, in setup_ddp
    process_id = int(os.environ.get("SLURM_PROCID"))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
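A possible workaround, which is an assumption on my part rather than an official fix: the setup code reads SLURM environment variables even outside a SLURM cluster, so providing them for a single-process run (for example near the top of train/train_c_lora.py, or exported in the shell) may get past this particular check; later steps may still expect a proper distributed setup.

# Hypothetical single-GPU workaround: pretend to be SLURM rank 0.
import os

os.environ.setdefault("SLURM_PROCID", "0")    # read in core/__init__.py:setup_ddp (traceback above)
os.environ.setdefault("SLURM_LOCALID", "0")   # read in train_c_lora.py (see the related report below)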

How much VRAM is needed at a minimum to fine-tune the 3.6B parameter model C?

Thank you for releasing the new model so promptly; I’m very excited about fine-tuning it.
Could you please tell me how much VRAM is needed at a minimum to fine-tune the 3.6B parameter model C? Even when I use a local 48GB of VRAM for fine-tuning at a resolution of 768, I run into out-of-memory issues.
When I train a LoRA with the 3.6B model C, with a batch size of 4 and a resolution of 768, VRAM usage is about 45GB.
These results are with FSDP and EMA turned off. Is this level of VRAM usage normal, or is it because optimizations like xformers have not yet been implemented?
Additionally, what is the optimal resolution for training? The default appears to be 768, but is it not recommended to train at a resolution of 1024, as is done with SDXL?

Super Resolution notebook: Error(s) in loading state_dict for ControlNet

The notebook fails for me when trying to load the state_dict due to missing and unexpected keys. Upon further inspection, it seems that the correct controlnet_bottleneck_mode should be "effnet" instead of "large". However, even in that case, the number of channels in the SR checkpoint (3) is not the same as what the model expects (16). Changing "controlnet_filter" allows me to load the model, but just leads to another error further down the line. Could it be that you simply uploaded the wrong checkpoint?

Inference not working

I installed the diffusers package as specified on the HF page, and I use this code for simple inference:

import os
import sys
import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

torch.cuda.device_count()
device = 'cpu'
num_images_per_prompt = 2

prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16).to(device)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade",  torch_dtype=torch.float16).to(device)

It's the exact same code as on the HF page. It breaks with this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 4
      1 prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16,
      2                                                    low_cpu_mem_usage=False, ignore_mismatched_sizes=True
      3                                                    ).to(device)
----> 4 decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade",  torch_dtype=torch.float16,
      5                                                        low_cpu_mem_usage=False, ignore_mismatched_sizes=True
      6                                                        ).to(device)

File c:\Python310\lib\site-packages\huggingface_hub\utils\_validators.py:118, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
    115 if check_use_auth_token:
    116     kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
--> 118 return fn(*args, **kwargs)

File c:\Python310\lib\site-packages\diffusers\pipelines\pipeline_utils.py:1263, in DiffusionPipeline.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
   1260     loaded_sub_model = passed_class_obj[name]
   1261 else:
   1262     # load sub model
-> 1263     loaded_sub_model = load_sub_model(
   1264         library_name=library_name,
   1265         class_name=class_name,
   1266         importable_classes=importable_classes,
   1267         pipelines=pipelines,
   1268         is_pipeline_module=is_pipeline_module,
...
    846     )

RuntimeError: Error(s) in loading state_dict for StableCascadeUnet:
	size mismatch for embedding.1.weight: copying a param with shape torch.Size([320, 16, 1, 1]) from checkpoint, the shape in current model is torch.Size([320, 64, 1, 1]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

Nothing helps, including the suggestions in the error message.

RuntimeError: Error(s) in loading state_dict for StableCascadeUnet:

RuntimeError: Error(s) in loading state_dict for StableCascadeUnet:
size mismatch for embedding.1.weight: copying a param with shape torch.Size([320, 16, 1, 1]) from checkpoint, the shape in current model is torch.Size([320, 64, 1, 1]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.

In text_to_image.ipynb: 'NoneType' object has no attribute 'items'

Win11 in jupyter notebook, I'm using the python packages as installed from requirements.txt
I have no errors in the prior cells, but get this in the Load Extras & Models cell.

AttributeError                            Traceback (most recent call last)
Cell In[4], line 3
      1 # SETUP MODELS & DATA
      2 extras = core.setup_extras_pre()
----> 3 models = core.setup_models(extras)
      4 models.generator.eval().requires_grad_(False)
      5 print("STAGE C READY")

File h:\ai\stablecascade\train\train_c.py:163, in WurstCore.setup_models(self, extras)
    161         generator.load_state_dict(load_or_fail(self.config.generator_checkpoint_path))
    162     else:
--> 163         for param_name, param in load_or_fail(self.config.generator_checkpoint_path).items():
    164             set_module_tensor_to_device(generator, param_name, "cpu", value=param)
    165 generator = generator.to(dtype).to(self.device)

AttributeError: 'NoneType' object has no attribute 'items'

This Workflow Requires PEFT Backend?

When using this workflow (Win 10, 8GB VRAM RTX 2070, 64GB RAM):

workflow2

Error occurred when executing StreamDiffusionCreateStream:

PEFT backend is required for this method.

File "B:\ComfyUI\execution.py", line 152, in recursive_execute
output_data, output_ui = get_output_data(obj, input_data_all)
File "B:\ComfyUI\execution.py", line 82, in get_output_data
return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
File "B:\ComfyUI\execution.py", line 75, in map_node_over_list
results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
File "B:\ComfyUI\custom_nodes\ComfyUI-Diffusers\nodes.py", line 264, in load_stream
stream.load_lcm_lora(lcm_lora)
File "C:\Users\john_\AppData\Local\Programs\Python\Python310\lib\site-packages\streamdiffusion\pipeline.py", line 87, in load_lcm_lora
self.pipe.load_lora_weights(
File "B:\ComfyUI\custom_nodes\ComfyUI-DiffusersStableCascade\src\diffusers\src\diffusers\loaders\lora.py", line 107, in load_lora_weights
raise ValueError("PEFT backend is required for this method.")

Mismatch error when I try to run inference with the small models

Hi, when I run the small-small models for inpainting with the default images provided, I encounter this mismatch error:

Traceback (most recent call last):
  File "/home/marco/StableCascade/cn.py", line 104, in <module>
    for (sampled_c, _, _) in tqdm(sampling_c, total=extras.sampling_configs['timesteps']):
  File "/home/marco/miniconda3/envs/casc/lib/python3.9/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/marco/StableCascade/gdf/__init__.py", line 71, in sample
    pred, pred_unconditional = model(torch.cat([x, x], dim=0), noise_cond.repeat(2), **model_inputs).chunk(2)
  File "/home/marco/miniconda3/envs/casc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/marco/miniconda3/envs/casc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/marco/StableCascade/modules/stage_c.py", line 246, in forward
    level_outputs = self._down_encode(x, r_embed, clip, cnet)
  File "/home/marco/StableCascade/modules/stage_c.py", line 182, in _down_encode
    x = x + nn.functional.interpolate(next_cnet, size=x.shape[-2:], mode='bilinear',
RuntimeError: The size of tensor a (1536) must match the size of tensor b (2048) at non-singleton dimension 1

I also printed the two tensors involved in this error (stage_c.py):

torch.Size([8, 1536, 24, 24])
torch.Size([8, 2048, 24, 24])

Anyone with the same problem? Maybe it's my fault but I don't know how to solve it.
Thanks

Use 'wget -c ...' to allow wget to resume interrupted downloads

Please change https://github.com/Stability-AI/StableCascade/blob/master/models/download_models.sh to allow wget to detect files that have already been fully or partly downloaded; in other words, change each 'wget' to 'wget -c'. This prevents a long re-download for people who already have some of the required files, or whose prior download was interrupted.

This is an otherwise harmless change: it checks whether a file is already present before overwriting it, and resumes an interrupted download if a prior attempt was cut off for any reason. Those of us with dodgy Internet connections need this feature.

ModuleNotFoundError: No module named 'inference.utils'

I am trying to run the Image-to-Image notebook, but the following imports cannot be found:
from inference.utils import *
from core.utils import load_or_fail
from train import WurstCoreC, WurstCoreB

None of these modules can be located: inference.utils, core, train

ModuleNotFoundError: No module named 'inference.utils'
ModuleNotFoundError: No module named 'core'
ModuleNotFoundError: No module named 'train'
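These are top-level packages of this repository, so the notebook needs the repo root on its Python path (the Gradio section above sets PYTHONPATH=./ for the same reason). A possible workaround cell, assuming a hypothetical clone location:

# Hypothetical fix: put the cloned repo root on sys.path before the repo imports.
import sys
sys.path.insert(0, "/path/to/StableCascade")   # adjust to where you cloned the repo

from inference.utils import *
from core.utils import load_or_fail
from train import WurstCoreC, WurstCoreB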

Can't train LoRA with a local dataset; Parameter validation failed: Invalid bucket name "file:"

According to the README in the train folder, the config supports local files.

webdataset_path:
  - s3://path/to/your/first/dataset/on/s3
  - file:/path/to/your/local/dataset.tar

However, when I run the training script, I get the following error, and the script seems to be stuck in an infinite loop trying to copy from AWS.

Output (config included):

**STARTIG JOB WITH CONFIG:**
adaptive_loss_weight: null
allow_tf32: true
backup_every: 1000
batch_size: 32
bucketeer_random_ratio: 0.05
captions_getter: null
checkpoint_extension: safetensors
checkpoint_path: /tmp/cascade/chk
clip_image_model_name: openai/clip-vit-large-patch14
clip_text_model_name: laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
dataset_filters: null
dist_file_subfolder: ''
dtype: null
effnet_checkpoint_path: models/effnet_encoder.safetensors
ema_beta: null
ema_iters: null
ema_start_iters: null
experiment_id: stage_c_3b_lora
generator_checkpoint_path: models/stage_c_bf16.safetensors
grad_accum_steps: 4
image_size: 768
lora_checkpoint_path: null
lr: 0.0001
model_version: 3.6B
module_filters:
- .attn
multi_aspect_ratio:
- 1/1
- 1/2
- 1/3
- 2/3
- 3/4
- 1/5
- 2/5
- 3/5
- 4/5
- 1/6
- 5/6
- 9/16
output_path: /tmp/cascade/out
previewer_checkpoint_path: models/previewer.safetensors
rank: 4
save_every: 100
train_tokens:
- - '[mbl]'
  - ^cat</w>
training: true
updates: 10000
use_fsdp: false
wandb_entity: k8tems
wandb_project: StableCascade
warmup_updates: 1
webdataset_path:
- file:/tmp/mbl_2024_02_14_13_12.tar

------------------------------------

**INFO:**
adaptive_loss: null
ema_loss: null
iter: 0
total_steps: 0
train_tokens: null
wandb_run_id: pegmc3ny

------------------------------------

['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']

Parameter validation failed:
Invalid bucket name "file:": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
Training with batch size 32 (8/GPU)
['dataset', 'dataloader', 'iterator']
**DATA:**
dataloader: DataLoader
dataset: WebDataset
iterator: Bucketeer
training: NoneType

------------------------------------


Unknown options: -

Unknown options: -

Unknown options: -
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: OSError("(('aws s3 cp {  } -',), {'shell': True, 'bufsize': 8192}): exit 255 (read) {}", <webdataset.gopen.Pipe object at 0x7f55e963ae50>, 'pipe:aws s3 cp {  } -')
  warnings.warn(repr(exn))

The logs go on forever, with most of the lines being "Unknown options: -".

How would I go about training this into a game engine

Let's say I could build a database of game frames and keypresses. How could I use that to build this into a model that acts like the game when given prompts that contain game state information, a current game frame, and a button press.

The game state would have to be interpreted from the image into parseable text, so it probably would require some sort of vlm in addition to interpret the image based on the current game state rather than just captioning the image.

I'm sure a full stack design would be more optimal. Like a single end to end image + structured text to image + structured text model.

Did I just discover a new ai game engine bruhhhh can't wait for the general one.

Garbled up image using IPEX (Intel Extension for PyTorch)

Using the notebook from inference/text_to_image.ipynb and replacing mentions of cuda with xpu made it possible to generate images on an Intel Arc GPU, but the Stage C images are garbled and are upscaled either to a black image or to the same garbled mess.

It is working on CPU.

image

TypeError: argument of type 'NoneType' is not iterable

When running the text_to_image.ipynb notebook, it gets to the third cell, "Load Extras & Models", and then throws this error. I get the same error if I copy/paste the code into a new .py script and run it outside Jupyter. Any ideas? I did set up a new clean virtual environment using the provided requirements.txt.

TypeError Traceback (most recent call last)
Cell In[3], line 3
1 # SETUP MODELS & DATA
2 extras = core.setup_extras_pre()
----> 3 models = core.setup_models(extras)
4 models.generator.eval().requires_grad_(False)
5 print("STAGE C READY")

File D:\Tests\StableCascade\train\train_c.py:128, in WurstCore.setup_models(self, extras)
126 effnet = EfficientNetEncoder()
127 effnet_checkpoint = load_or_fail(self.config.effnet_checkpoint_path)
--> 128 effnet.load_state_dict(effnet_checkpoint if 'state_dict' not in effnet_checkpoint else effnet_checkpoint['state_dict'])
129 effnet.eval().requires_grad_(False).to(self.device)
130 del effnet_checkpoint

TypeError: argument of type 'NoneType' is not iterable

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

I'm trying to train a LoRA but it gives me an error:

!python train_c_lora.py configs/training/finetune_c_3b_lora.yaml

Launching Script
Traceback (most recent call last):
  File "/kaggle/working/StableCascade/train_c_lora.py", line 326, in <module>
    device=torch.device(int(os.environ.get("SLURM_LOCALID")))
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

Inpainting

at this section in the example notebook:

batch_size = 4
url = "https://cdn.discordapp.com/attachments/1121232062708457508/1204787053892603914/cat_dog.png?ex=65d60061&is=65c38b61&hm=37c3d179a39b1eca4b8894e3c239930cedcbb965da00ae2209cca45f883f86f4&"
images = resize_image(download_image(url)).unsqueeze(0).expand(batch_size, -1, -1, -1)

batch = {'images': images}

mask = None

mask = torch.ones(batch_size, 1, images.size(2), images.size(3)).bool()

outpaint = False
threshold = 0.2

with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    cnet, cnet_input = core.get_cnet(batch, models, extras, mask=mask, outpaint=outpaint, threshold=threshold)
    cnet_uncond = cnet

show_images(batch['images'])
show_images(cnet_input)

I received this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in <cell line: 13>()
12
13 with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
---> 14 cnet, cnet_input = core.get_cnet(batch, models, extras, mask=mask, outpaint=outpaint, threshold=threshold)
15 cnet_uncond = cnet
16

/content/StableCascade/train/train_c_controlnet.py in get_cnet(self, batch, models, extras, cnet_input, **kwargs)
141 with torch.no_grad():
142 if cnet_input is None:
--> 143 cnet_input = extras.controlnet_filter(images, **kwargs)
144 if isinstance(cnet_input, tuple):
145 cnet_input, cnet_input_preview = cnet_input

TypeError: SREffnetFilter.__call__() got an unexpected keyword argument 'mask'

I am unable to generate the masks for the next steps.
How should I fix this?

Thanks for providing the notebooks; my text2image/image2image Colab version runs great. I'm trying to build out the remaining features.

If float32 download option is selected, subsequent code fails

I executed the download script "models/download_models.sh" with these arguments: "essential big-big float32". This doesn't download the "bf16" variants, which in turn causes the inference/text_to_image.ipynb script to fail, because it expects the "bf16" variants to be present and doesn't handle the exception.

Some users may want only one or the other variant of the weights, not both, but this currently triggers an unhandled exception.

Quality small vs large model

Has anyone already compared the quality of the small models vs the big models?
I'm quite interested in the difference. If other people would also like to have more info on this, I can test it and report back here.

RuntimeError: cutlassF: no kernel found to launch!

I have been trying to run the model on both Kaggle and Colab, but I always seem to get the following error:
RuntimeError: cutlassF: no kernel found to launch!

The code I am running:

import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

device = "cuda"
num_images_per_prompt = 2

prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16).to(device)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", torch_dtype=torch.float16).to(device)

prompt = "Anthropomorphic cat dressed as a pilot"
negative_prompt = "lower "

prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=num_images_per_prompt,
    num_inference_steps=20
)
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings.half(),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10
).images

# Now decoder_output is a list with your PIL images
Dependencies I am installing:
!pip3 install git+https://github.com/kashif/diffusers.git@a3dc21385b7386beb3dab3a9845962ede6765887
!pip install invisible_watermark transformers accelerate safetensors

File not found (but it is there!)

Things that went great:

  1. Cloning git clone https://github.com/Stability-AI/StableCascade
  2. installing a venv python3 -m venv ./venv
  3. activating the venv source ./venv/bin/activate
  4. installing the python notebook ./venv/bin/pip3 install jupyter
  5. installing the requirements ./venv/bin/pip3 install -r requirements.txt
  6. downloading the inference files cd models; bash download_models.sh essential big-big float32
  7. configuring the notebook to allow me to connect (the machine with the GPU is headless) ./venv/bin/python3 -m notebook --generate-config + https://stackoverflow.com/questions/39155953/exposing-python-jupyter-on-lan
  8. Running the notebook. ./venv/bin/python3 -m notebook --ip 192.168.1.5 --port 8888
  9. Setting the notebook to "trusted"

Where I got stuck: the second code block couldn't read the file, even though I checked and it is there.

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[6], line 3
      1 # SETUP STAGE C
      2 config_file = 'configs/inference/stage_c_3b.yaml'
----> 3 with open(config_file, "r", encoding="utf-8") as file:
      4     loaded_config = yaml.safe_load(file)
      6 core = WurstCoreC(config_dict=loaded_config, device=device, training=False)

File ~/StableCascade/venv/lib/python3.11/site-packages/IPython/core/interactiveshell.py:310, in _modified_open(file, *args, **kwargs)
    303 if file in {0, 1, 2}:
    304     raise ValueError(
    305         f"IPython won't let you open fd={file} by default "
    306         "as it is likely to crash IPython. If you know what you are doing, "
    307         "you can use builtins' open."
    308     )
--> 310 return io_open(file, *args, **kwargs)

FileNotFoundError: [Errno 2] No such file or directory: 'configs/inference/stage_c_3b.yaml'

ModuleNotFoundError: No module named 'gdf'

When I try to train model C using finetune_c_3b.yaml, the following error is reported. I'm sure that when I installed the required libraries in the virtual environment with python3 -m pip install -r requirements.txt, no errors were reported. How can I resolve this issue?

python3 train/train_c.py configs/training/finetune_c_3b.yaml
Traceback (most recent call last):
  File "F:\app\StableCascade\train\train_c.py", line 11, in <module>
    from gdf import GDF, EpsilonTarget, CosineSchedule
ModuleNotFoundError: No module named 'gdf'

When I use a custom dataset, training doesn't start after many "didn't find ['jpg', 'png'] in ['__key__', '__url__', 'txt']" warnings.

Training doesn't seem to be starting at all after a series of warning messages stating, "didn't find x in y".

output:

/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['jpg', 'png'] in ['__key__', '__url__', 'txt']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'png']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'png']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'png']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'png']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'png']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['jpg', 'png'] in ['__key__', '__url__', 'txt']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['jpg', 'png'] in ['__key__', '__url__', 'txt']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'png']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['jpg', 'png'] in ['__key__', '__url__', 'txt']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['jpg', 'png'] in ['__key__', '__url__', 'txt']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['jpg', 'png'] in ['__key__', '__url__', 'txt']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'png']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'png']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['jpg', 'png'] in ['__key__', '__url__', 'txt']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['jpg', 'png'] in ['__key__', '__url__', 'txt']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'jpg']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'jpg']")
  warnings.warn(repr(exn))
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'jpg']")
  warnings.warn(repr(exn))
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 904/904 [00:00<00:00, 581kB/s]
vocab.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 862k/862k [00:00<00:00, 39.3MB/s]
merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 30.6MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.22M/2.22M [00:00<00:00, 32.8MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 389/389 [00:00<00:00, 244kB/s]
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.88k/4.88k [00:00<00:00, 3.02MB/s]
pytorch_model.bin.index.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 120k/120k [00:00<00:00, 61.9MB/s]
Downloading shards:   0%|                                                                                                                                  | 0/2 [00:00<?, ?it/s/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'jpg']")14M/9.99G [00:06<02:14, 70.3MB/s]
  warnings.warn(repr(exn))
                                                                                                                                                                                /usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'jpg']")50G/9.99G [00:21<02:34, 54.8MB/s]
  warnings.warn(repr(exn))
                                                                                                                                                                                /usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'jpg']")77G/9.99G [00:26<02:09, 63.4MB/s]
  warnings.warn(repr(exn))
pytorch_model-00001-of-00002.bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 9.99G/9.99G [02:28<00:00, 67.2MB/s]
pytorch_model-00002-of-00002.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 169M/169M [00:02<00:00, 68.9MB/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:31<00:00, 75.80s/it]
Loading checkpoint shards:   0%|                                                                                                                           | 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.9/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.03s/it]
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.52k/4.52k [00:00<00:00, 1.98MB/s]
model.safetensors:  29%|██████████████████████████████████▋                                                                                   | 503M/1.71G [00:01<00:03, 394MB/s]/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: ValueError("didn't find ['txt'] in ['__key__', '__url__', 'jpg']")
  warnings.warn(repr(exn))
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.71G/1.71G [00:04<00:00, 354MB/s]
Updating tokens: [(49408, '[mbl]')]
LoRA training 128 layers
['tokenizer', 'text_model', 'generator', 'effnet', 'previewer', 'lora']
**MODELS:**
effnet: EfficientNetEncoder - trainable params 0
generator: StageC - trainable params 3592249360
generator_ema: NoneType - Not a nn.Module
image_model: CLIPVisionModelWithProjection - trainable params 0
lora: ModuleDict - trainable params 3147008
previewer: Previewer - trainable params 0
text_model: CLIPTextModelWithProjection - trainable params 1280
tokenizer: CLIPTokenizerFast - Not a nn.Module
training: NoneType - Not a nn.Module

------------------------------------

['lora']
**OPTIMIZERS:**
generator: NoneType
lora: AdamW
training: NoneType

------------------------------------

[]
**SCHEDULERS:**
lora: GradualWarmupScheduler
training: NoneType

------------------------------------

['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']
['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']
**EXTRAS:**
clip_preprocess: "Compose(\n    Resize(size=224, interpolation=bicubic, max_size=None,\
  \ antialias=warn)\n    CenterCrop(size=(224, 224))\n    Normalize(mean=(0.48145466,\
  \ 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))\n)"
effnet_preprocess: "Compose(\n    Normalize(mean=(0.485, 0.456, 0.406), std=(0.229,\
  \ 0.224, 0.225))\n)"
gdf: <gdf.GDF object at 0x7f6f872458e0>
sampling_configs: '{''cfg'': 5, ''sampler'': <gdf.samplers.DDPMSampler object at 0x7f6f87245a00>,
  ''shift'': 1, ''timesteps'': 20}'
training: None
transforms: "Compose(\n    ToTensor()\n    Resize(size=768, interpolation=bilinear,\
  \ max_size=None, antialias=True)\n    SmartCrop(\n  (saliency_model): MicroResNet(\n\
  \    (downsampler): Sequential(\n      (0): ReflectionPad2d((4, 4, 4, 4))\n    \
  \  (1): Conv2d(3, 8, kernel_size=(9, 9), stride=(4, 4))\n      (2): InstanceNorm2d(8,\
  \ eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)\n      (3): ReLU()\n\
  \      (4): ReflectionPad2d((1, 1, 1, 1))\n      (5): Conv2d(8, 16, kernel_size=(3,\
  \ 3), stride=(2, 2))\n      (6): InstanceNorm2d(16, eps=1e-05, momentum=0.1, affine=True,\
  \ track_running_stats=False)\n      (7): ReLU()\n      (8): ReflectionPad2d((1,\
  \ 1, 1, 1))\n      (9): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2))\n    \
  \  (10): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)\n\
  \      (11): ReLU()\n    )\n    (residual): Sequential(\n      (0): ResBlock(\n\
  \        (resblock): Sequential(\n          (0): ReflectionPad2d((1, 1, 1, 1))\n\
  \          (1): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))\n          (2):\
  \ InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)\n\
  \          (3): ReLU()\n          (4): ReflectionPad2d((1, 1, 1, 1))\n         \
  \ (5): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))\n          (6): InstanceNorm2d(32,\
  \ eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)\n        )\n\
  \      )\n      (1): Conv2d(32, 64, kernel_size=(1, 1), stride=(1, 1), groups=32,\
  \ bias=False)\n      (2): ResBlock(\n        (resblock): Sequential(\n         \
  \ (0): ReflectionPad2d((1, 1, 1, 1))\n          (1): Conv2d(64, 64, kernel_size=(3,\
  \ 3), stride=(1, 1))\n          (2): InstanceNorm2d(64, eps=1e-05, momentum=0.1,\
  \ affine=True, track_running_stats=False)\n          (3): ReLU()\n          (4):\
  \ ReflectionPad2d((1, 1, 1, 1))\n          (5): Conv2d(64, 64, kernel_size=(3, 3),\
  \ stride=(1, 1))\n          (6): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True,\
  \ track_running_stats=False)\n        )\n      )\n    )\n    (segmentator): Sequential(\n\
  \      (0): ReflectionPad2d((1, 1, 1, 1))\n      (1): Conv2d(64, 16, kernel_size=(3,\
  \ 3), stride=(1, 1))\n      (2): InstanceNorm2d(16, eps=1e-05, momentum=0.1, affine=True,\
  \ track_running_stats=False)\n      (3): ReLU()\n      (4): Upsample2d()\n     \
  \ (5): ReflectionPad2d((4, 4, 4, 4))\n      (6): Conv2d(16, 1, kernel_size=(9, 9),\
  \ stride=(1, 1))\n      (7): Sigmoid()\n    )\n  )\n)\n)"

------------------------------------

**TRAINING STARTING...**
STARTING AT STEP: 1/40000
  0%|                                                                                                                                                  | 0/40000 [00:00<?, ?it/s]

ValueError: Trying to set a tensor of shape torch.Size([16, 1536, 1, 1]) in "weight" (which has shape torch.Size([16, 2048, 1, 1])), this look incorrect.

I try to load the small-small models but it generates an error


['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']
models/stage_c_lite_bf16.safetensors
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-4-1d475d0c6014>](https://localhost:8080/#) in <cell line: 3>()
      1 # SETUP MODELS & DATA
      2 extras = core.setup_extras_pre()
----> 3 models = core.setup_models(extras)
      4 models.generator.eval().requires_grad_(False)
      5 print("STAGE C READY")

1 frames
[/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py](https://localhost:8080/#) in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
    343     if value is not None:
    344         if old_value.shape != value.shape:
--> 345             raise ValueError(
    346                 f'Trying to set a tensor of shape {value.shape} in "{tensor_name}" (which has shape {old_value.shape}), this look incorrect.'
    347             )

ValueError: Trying to set a tensor of shape torch.Size([16, 1536, 1, 1]) in "weight" (which has shape torch.Size([16, 2048, 1, 1])), this look incorrect.

stage_c_3b.yaml

# GLOBAL STUFF
model_version: 3.6B
dtype: bfloat16

effnet_checkpoint_path: models/effnet_encoder.safetensors
previewer_checkpoint_path: models/previewer.safetensors
generator_checkpoint_path: models/stage_c_lite_bf16.safetensors

RuntimeError during Model State Dictionary Loading in text_to_image.ipynb

I encountered a RuntimeError while trying to run the code in text_to_image.ipynb, indicating an issue with loading the model's state dictionary. The error reports missing keys and also mentions the presence of unexpected keys.

Upon attempting to load the model's state dictionary, a RuntimeError is thrown, indicating several missing keys as well as unexpected keys in the state dictionary. The error message is as follows:

RuntimeError: Error(s) in loading state_dict for StageA:
Missing key(s) in state_dict: ...
Unexpected key(s) in state_dict: ...

I am seeking assistance to resolve the RuntimeError encountered while loading the model's state dictionary. Any suggestions on how to address the missing keys and handle the unexpected keys would be greatly appreciated.

Below is the full output from PowerShell:

cuda:0 ['model_version', 'effnet_checkpoint_path', 'previewer_checkpoint_path'] ['model_version', 'stage_a_checkpoint_path', 'effnet_checkpoint_path'] ['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess'] Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]C:\Users\YQBen\anaconda3\envs\SC\lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() return self.fget.__get__(instance, owner)() Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.49it/s] ['tokenizer', 'text_model', 'generator', 'effnet', 'previewer'] STAGE C READY ['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess'] Traceback (most recent call last): File "g:\rise\ai\StableCascade\StableCascade\test.py", line 38, in <module> models_b = core_b.setup_models(extras_b, skip_clip=True) File "g:\rise\ai\StableCascade\StableCascade\train\train_b.py", line 150, in setup_models stage_a.load_state_dict(stage_a_checkpoint if 'state_dict' not in stage_a_checkpoint else stage_a_checkpoint['state_dict']) File "C:\Users\YQBen\anaconda3\envs\SC\lib\site-packages\torch\nn\modules\module.py", line 2152, in load_state_dict raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( RuntimeError: Error(s) in loading state_dict for StageA: Missing key(s) in state_dict: "in_block.1.weight", "in_block.1.bias", "down_blocks.0.gammas", "down_blocks.0.depthwise.1.weight", "down_blocks.0.depthwise.1.bias", "down_blocks.0.channelwise.0.weight", "down_blocks.0.channelwise.0.bias", "down_blocks.0.channelwise.2.weight", "down_blocks.0.channelwise.2.bias", "down_blocks.1.weight", "down_blocks.1.bias", "down_blocks.2.gammas", "down_blocks.2.depthwise.1.weight", "down_blocks.2.depthwise.1.bias", "down_blocks.2.channelwise.0.weight", "down_blocks.2.channelwise.0.bias", "down_blocks.2.channelwise.2.weight", "down_blocks.2.channelwise.2.bias", "down_blocks.3.0.weight", "down_blocks.3.1.weight", "down_blocks.3.1.bias", "down_blocks.3.1.running_mean", "down_blocks.3.1.running_var", "vquantizer.codebook.weight", "up_blocks.0.0.weight", "up_blocks.0.0.bias", "up_blocks.1.gammas", "up_blocks.1.depthwise.1.weight", "up_blocks.1.depthwise.1.bias", "up_blocks.1.channelwise.0.weight", "up_blocks.1.channelwise.0.bias", "up_blocks.1.channelwise.2.weight", "up_blocks.1.channelwise.2.bias", "up_blocks.2.gammas", "up_blocks.2.depthwise.1.weight", "up_blocks.2.depthwise.1.bias", "up_blocks.2.channelwise.0.weight", "up_blocks.2.channelwise.0.bias", "up_blocks.2.channelwise.2.weight", "up_blocks.2.channelwise.2.bias", "up_blocks.3.gammas", "up_blocks.3.depthwise.1.weight", "up_blocks.3.depthwise.1.bias", "up_blocks.3.channelwise.0.weight", "up_blocks.3.channelwise.0.bias", "up_blocks.3.channelwise.2.weight", "up_blocks.3.channelwise.2.bias", "up_blocks.4.gammas", "up_blocks.4.depthwise.1.weight", "up_blocks.4.depthwise.1.bias", "up_blocks.4.channelwise.0.weight", "up_blocks.4.channelwise.0.bias", "up_blocks.4.channelwise.2.weight", "up_blocks.4.channelwise.2.bias", "up_blocks.5.gammas", "up_blocks.5.depthwise.1.weight", "up_blocks.5.depthwise.1.bias", "up_blocks.5.channelwise.0.weight", "up_blocks.5.channelwise.0.bias", 
"up_blocks.5.channelwise.2.weight", "up_blocks.5.channelwise.2.bias", "up_blocks.6.gammas", "up_blocks.6.depthwise.1.weight", "up_blocks.6.depthwise.1.bias", "up_blocks.6.channelwise.0.weight", "up_blocks.6.channelwise.0.bias", "up_blocks.6.channelwise.2.weight", "up_blocks.6.channelwise.2.bias", "up_blocks.7.gammas", "up_blocks.7.depthwise.1.weight", "up_blocks.7.depthwise.1.bias", "up_blocks.7.channelwise.0.weight", "up_blocks.7.channelwise.0.bias", "up_blocks.7.channelwise.2.weight", "up_blocks.7.channelwise.2.bias", "up_blocks.8.gammas", "up_blocks.8.depthwise.1.weight", "up_blocks.8.depthwise.1.bias", "up_blocks.8.channelwise.0.weight", "up_blocks.8.channelwise.0.bias", "up_blocks.8.channelwise.2.weight", "up_blocks.8.channelwise.2.bias", "up_blocks.9.gammas", "up_blocks.9.depthwise.1.weight", "up_blocks.9.depthwise.1.bias", "up_blocks.9.channelwise.0.weight", "up_blocks.9.channelwise.0.bias", "up_blocks.9.channelwise.2.weight", "up_blocks.9.channelwise.2.bias", "up_blocks.10.gammas", "up_blocks.10.depthwise.1.weight", "up_blocks.10.depthwise.1.bias", "up_blocks.10.channelwise.0.weight", "up_blocks.10.channelwise.0.bias", "up_blocks.10.channelwise.2.weight", "up_blocks.10.channelwise.2.bias", "up_blocks.11.gammas", "up_blocks.11.depthwise.1.weight", "up_blocks.11.depthwise.1.bias", "up_blocks.11.channelwise.0.weight", "up_blocks.11.channelwise.0.bias", "up_blocks.11.channelwise.2.weight", "up_blocks.11.channelwise.2.bias", "up_blocks.12.gammas", "up_blocks.12.depthwise.1.weight", "up_blocks.12.depthwise.1.bias", "up_blocks.12.channelwise.0.weight", "up_blocks.12.channelwise.0.bias", "up_blocks.12.channelwise.2.weight", "up_blocks.12.channelwise.2.bias", "up_blocks.13.weight", "up_blocks.13.bias", "up_blocks.14.gammas", "up_blocks.14.depthwise.1.weight", "up_blocks.14.depthwise.1.bias", "up_blocks.14.channelwise.0.weight", "up_blocks.14.channelwise.0.bias", "up_blocks.14.channelwise.2.weight", "up_blocks.14.channelwise.2.bias", "out_block.0.weight", "out_block.0.bias". Unexpected key(s) in state_dict: "blocks.0.bias", "blocks.0.weight", "blocks.11.bias", "blocks.11.num_batches_tracked", "blocks.11.running_mean", "blocks.11.running_var", "blocks.11.weight", "blocks.12.bias", "blocks.12.weight", "blocks.14.bias", "blocks.14.num_batches_tracked", "blocks.14.running_mean", "blocks.14.running_var", "blocks.14.weight", "blocks.15.bias", "blocks.15.weight", "blocks.17.bias", "blocks.17.num_batches_tracked", "blocks.17.running_mean", "blocks.17.running_var", "blocks.17.weight", "blocks.18.bias", "blocks.18.weight", "blocks.2.bias", "blocks.2.num_batches_tracked", "blocks.2.running_mean", "blocks.2.running_var", "blocks.2.weight", "blocks.20.bias", "blocks.20.num_batches_tracked", "blocks.20.running_mean", "blocks.20.running_var", "blocks.20.weight", "blocks.21.bias", "blocks.21.weight", "blocks.23.bias", "blocks.23.num_batches_tracked", "blocks.23.running_mean", "blocks.23.running_var", "blocks.23.weight", "blocks.24.bias", "blocks.24.weight", "blocks.3.bias", "blocks.3.weight", "blocks.5.bias", "blocks.5.num_batches_tracked", "blocks.5.running_mean", "blocks.5.running_var", "blocks.5.weight", "blocks.6.bias", "blocks.6.weight", "blocks.8.bias", "blocks.8.num_batches_tracked", "blocks.8.running_mean", "blocks.8.running_var", "blocks.8.weight", "blocks.9.bias", "blocks.9.weight".

ControlNet for semantic segmentation

Congratulations on your work and dedication to research! It's impressive to see the progress you are making. I would like to know if, during the development of your project, the application of ControlNet for semantic segmentation was considered. If so, could you share if you encountered any difficulties inherent to this approach? Thank you for your attention.

Some issues with the effectiveness of image reconstructions

As you can see, the reconstructions are surprisingly close, even for small details. Such reconstructions are not possible with a standard VAE etc.

I compared the reconstructions of Stable Cascade and a standalone VAE on 512x512 sample images, and the results are as follows:

origin image:
image

StableCascade stage B&A:
image

VAE:
image

It seems that a standard VAE can achieve this kind of reconstruction quality. Does the README imply that a standard VAE cannot achieve this at a 24x24 latent size?

Always getting a CUDA out-of-memory error when training LoRA despite fixing the batch_size

Here are my logs:

STARTIG JOB WITH CONFIG:
adaptive_loss_weight: null
allow_tf32: true
backup_every: 1000
batch_size: 4
bucketeer_random_ratio: 0.05
captions_getter: null
checkpoint_extension: safetensors
checkpoint_path: output
clip_image_model_name: openai/clip-vit-large-patch14
clip_text_model_name: laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
dataset_filters: null
dist_file_subfolder: ''
dtype: null
effnet_checkpoint_path: models/effnet_encoder.safetensors
ema_beta: null
ema_iters: null
ema_start_iters: null
experiment_id: stage_c_3b_lora
generator_checkpoint_path: models/stage_c_bf16.safetensors
grad_accum_steps: 4
image_size: 768
lora_checkpoint_path: null
lr: 0.0001
model_version: 3.6B
module_filters:
  - .attn
multi_aspect_ratio:
  - 1/1
  - 1/2
  - 1/3
  - 2/3
  - 3/4
  - 1/5
  - 2/5
  - 3/5
  - 4/5
  - 1/6
  - 5/6
  - 9/16
output_path: output
previewer_checkpoint_path: models/previewer.safetensors
rank: 4
save_every: 100
train_tokens:
  - ['[fernando]', '^dog']
training: true
updates: 10000
use_fsdp: false
wandb_entity: quocanh34
wandb_project: StableCascade
warmup_updates: 1
webdataset_path: file:data/fernando.tar

INFO:
adaptive_loss: null
ema_loss: null
iter: 0
total_steps: 0
train_tokens: null
wandb_run_id: 7spfifem


['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']
Training with batch size 4 (1/GPU)
['dataset', 'dataloader', 'iterator']
DATA:
dataloader: DataLoader
dataset: WebDataset
iterator: Bucketeer
training: NoneType


Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:07<00:00, 3.88s/it]
Updating tokens: [(49408, '[fernando]')]
LoRA training 128 layers
['tokenizer', 'text_model', 'generator', 'effnet', 'previewer', 'lora']
MODELS:
effnet: EfficientNetEncoder - trainable params 0
generator: StageC - trainable params 3592249360
generator_ema: NoneType - Not a nn.Module
image_model: CLIPVisionModelWithProjection - trainable params 0
lora: ModuleDict - trainable params 3147008
previewer: Previewer - trainable params 0
text_model: CLIPTextModelWithProjection - trainable params 1280
tokenizer: CLIPTokenizerFast - Not a nn.Module
training: NoneType - Not a nn.Module


['lora']
OPTIMIZERS:
generator: NoneType
lora: AdamW
training: NoneType


[]
SCHEDULERS:
lora: GradualWarmupScheduler
training: NoneType


['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']
['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']
EXTRAS:
clip_preprocess: "Compose(\n Resize(size=224, interpolation=bicubic, max_size=None,
\ antialias=warn)\n CenterCrop(size=(224, 224))\n Normalize(mean=(0.48145466,
\ 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))\n)"
effnet_preprocess: "Compose(\n Normalize(mean=(0.485, 0.456, 0.406), std=(0.229,
\ 0.224, 0.225))\n)"
gdf: <gdf.GDF object at 0x7f40dd905d80>
sampling_configs: '{''cfg'': 5, ''sampler'': <gdf.samplers.DDPMSampler object at 0x7f40dd907430>,
''shift'': 1, ''timesteps'': 20}'
training: None
transforms: "Compose(\n ToTensor()\n Resize(size=768, interpolation=bilinear,
\ max_size=None, antialias=True)\n SmartCrop(\n (saliency_model): MicroResNet(\n
\ (downsampler): Sequential(\n (0): ReflectionPad2d((4, 4, 4, 4))\n
\ (1): Conv2d(3, 8, kernel_size=(9, 9), stride=(4, 4))\n (2): InstanceNorm2d(8,
\ eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)\n (3): ReLU()\n
\ (4): ReflectionPad2d((1, 1, 1, 1))\n (5): Conv2d(8, 16, kernel_size=(3,
\ 3), stride=(2, 2))\n (6): InstanceNorm2d(16, eps=1e-05, momentum=0.1, affine=True,
\ track_running_stats=False)\n (7): ReLU()\n (8): ReflectionPad2d((1,
\ 1, 1, 1))\n (9): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2))\n
\ (10): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)\n
\ (11): ReLU()\n )\n (residual): Sequential(\n (0): ResBlock(\n
\ (resblock): Sequential(\n (0): ReflectionPad2d((1, 1, 1, 1))\n
\ (1): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))\n (2):
\ InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)\n
\ (3): ReLU()\n (4): ReflectionPad2d((1, 1, 1, 1))\n
\ (5): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))\n (6): InstanceNorm2d(32,
\ eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)\n )\n
\ )\n (1): Conv2d(32, 64, kernel_size=(1, 1), stride=(1, 1), groups=32,
\ bias=False)\n (2): ResBlock(\n (resblock): Sequential(\n
\ (0): ReflectionPad2d((1, 1, 1, 1))\n (1): Conv2d(64, 64, kernel_size=(3,
\ 3), stride=(1, 1))\n (2): InstanceNorm2d(64, eps=1e-05, momentum=0.1,
\ affine=True, track_running_stats=False)\n (3): ReLU()\n (4):
\ ReflectionPad2d((1, 1, 1, 1))\n (5): Conv2d(64, 64, kernel_size=(3, 3),
\ stride=(1, 1))\n (6): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True,
\ track_running_stats=False)\n )\n )\n )\n (segmentator): Sequential(\n
\ (0): ReflectionPad2d((1, 1, 1, 1))\n (1): Conv2d(64, 16, kernel_size=(3,
\ 3), stride=(1, 1))\n (2): InstanceNorm2d(16, eps=1e-05, momentum=0.1, affine=True,
\ track_running_stats=False)\n (3): ReLU()\n (4): Upsample2d()\n
\ (5): ReflectionPad2d((4, 4, 4, 4))\n (6): Conv2d(16, 1, kernel_size=(9, 9),
\ stride=(1, 1))\n (7): Sigmoid()\n )\n )\n)\n)"


TRAINING STARTING...
STARTING AT STEP: 1/40000
0%| | 0/40000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/workspace/StableCascade/train/train_c_lora.py", line 332, in
warpcore(single_gpu=True)
File "/workspace/StableCascade/./core/init.py", line 360, in call
self.train(data, extras, models, optimizers, schedulers)
File "/workspace/StableCascade/./train/base.py", line 254, in train
loss, loss_adjusted = self.forward_pass(data, extras, models)
File "/workspace/StableCascade/train/train_c_lora.py", line 275, in forward_pass
pred = models.generator(noised, noise_cond, **conditions)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/StableCascade/./modules/stage_c.py", line 244, in forward
level_outputs = self._down_encode(x, r_embed, clip, cnet)
File "/workspace/StableCascade/./modules/stage_c.py", line 186, in _down_encode
x = block(x, clip)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/StableCascade/./modules/common.py", line 85, in forward
x = x + self.attention(self.norm(x), kv, self_attn=self.self_attn)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/StableCascade/./modules/common.py", line 23, in forward
x = self.attn(x, kv, kv, need_weights=False)[0]
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 1243, in forward
self.in_proj_weight, self.in_proj_bias,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/parametrize.py", line 369, in get_parametrized
return parametrization()
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/parametrize.py", line 266, in forward
x = self0
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/StableCascade/./modules/lora.py", line 20, in forward
return original_weights + lora_weights
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB. GPU 0 has a total capacty of 23.65 GiB of which 47.19 MiB is free. Process 1577312 has 23.59 GiB memory in use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 124.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF`
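For what it's worth, a quick back-of-the-envelope estimate (sketch below; parameter counts are taken from the MODELS log above, and the bf16/fp32 dtypes are assumptions) suggests the frozen 3.6B Stage C weights account for roughly 7 GB and the LoRA optimizer state is negligible, so most of the 23.6 GB reported in the OOM is activations at image_size 768 with batch_size 4. Lowering image_size, or dropping batch_size to 1 and raising grad_accum_steps, is the usual first thing to try.

# Rough memory estimate (sketch). Parameter counts come from the MODELS log
# above; dtypes are assumptions (bf16 frozen weights, fp32 AdamW state).
frozen_params = 3_592_249_360   # StageC generator (frozen)
lora_params = 3_147_008         # trainable LoRA parameters

weights_gb = frozen_params * 2 / 1e9              # bf16 = 2 bytes per parameter
adamw_gb = lora_params * (4 + 4 + 4 + 4) / 1e9    # param + grad + 2 Adam moments, fp32

print(f"frozen Stage C weights: ~{weights_gb:.1f} GB")    # ~7.2 GB
print(f"LoRA params + AdamW state: ~{adamw_gb:.2f} GB")   # ~0.05 GB
# The remainder of the ~23 GB in the OOM message is dominated by activations,
# which scale with batch_size * image_size^2.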

A few questions about the training process

I am wondering about how you guys code, debug, and train this model.

  1. Are you writing code and debugging the model on a local machine or a cloud machine? For example, if our GPU is not big enough, we might need to use something like Lightning Studio for cloud development.
  2. I want to know how to replicate the entire training from scratch and how costly that is. What training set was used to train this model? Can it be disclosed?
  3. I am assuming the architectures of Stable Cascade and Stable Diffusion 1.5 are different enough that all extensions such as ControlNet or LoRA need to be re-trained. Do you need to re-train ControlNet for this model?
  4. Is the training code provided in this repository roughly the same code you used to train Stable Cascade?

Thanks! Your effort in coding this model is greatly appreciated.

Speed slow compared to SDXL: is this normal?

Generation takes minutes on my setup, which isn't much hardware: I get under a minute with SDXL, but a full round trip through the first and second stages here takes about 3 minutes, sometimes 7. Would an 8 GB RTX 4070 top out at about 3-4 it/s at best, or is something wrong, and if so, how would I fix it?
Any help appreciated. No help understood.

Have a great day!
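If you want an it/s figure that is directly comparable to the numbers other people quote, a small helper like the sketch below works with any per-step callable; wiring it up to a single denoising step (or a whole pipeline call) from whichever notebook or app you use is left to you.

import time
import torch

def throughput(step_fn, iters: int = 20) -> float:
    """Rough iterations per second for `step_fn`, syncing CUDA around the loop."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return iters / (time.time() - start)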

Ideas for improving Stable Cascade's fine details.

I wish you lots of success with this fantastic new release! It's an incredible achievement for prompt coherence and complex hands and feet! I am stunned to see overlapping hands holding a flower, and seeing all the correct fingers. This new technology is amazing and you should be very proud. Amazingly good job by everyone involved!

Since this is a research project, I am curious. The model is very good already. But humans tend to look a bit like plastic due to the super smooth skin. Do you think that the "soft, smooth-skinned, airbrushed" look of skin will be curable via further finetuning? Or is that some kind of limitation of the small, internal latent space? Or maybe even the training data or dimensions?

I would guess that it is fixable via further refinement of the stages that add the final details (Stage B seems to be the fine details stage?).

Alternatively, users can of course add a typical "detailer" stage after the final image stage, to get crisp details. But I guess that such a tweak wouldn't be needed if Stable Cascade can be slightly revised to become better at details.

Requirement torch==2.1.2+cu118 not found

Running pip install -r requirements.txt on a Mac M1, python 3.11.5, pip==23.2.1, results in:

ERROR: Could not find a version that satisfies the requirement torch==2.1.2+cu118 (from versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0)
ERROR: No matching distribution found for torch==2.1.2+cu118

Same goes for torchvision==0.16.2+cu118. Dropping the +cu118 makes it install.
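The +cu118 wheels are CUDA builds, which are simply not published for macOS, so dropping the suffix is the right fix. After installing, a quick sanity check of which backend you actually ended up with (nothing repo-specific):

import torch

print(torch.__version__)                   # e.g. 2.1.2 with no +cu118 suffix on macOS
print(torch.cuda.is_available())           # False on Apple Silicon
print(torch.backends.mps.is_available())   # True on an M1 with a recent PyTorch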

Scratch Training

Thanks for providing the great code.

I am trying to train from scratch on my own data; please let me know how to set up the config.
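Not an official answer, but as a rough starting point while you wait: the inference snippets elsewhere in this thread construct WurstCoreC with training=False, and the training entry points in train/ take the same kind of config dict with training=True. A sketch of that wiring follows; the import path is assumed to mirror the repo's notebooks, and the config filename is an assumption, so point it at whichever YAML under configs/training/ matches the stage and model size you want, with webdataset_path set to your own data.

import yaml
import torch

from train import WurstCoreC  # import assumed to follow the repo's notebooks

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumed filename: edit a copy of a training YAML under configs/training/
# so that webdataset_path, experiment_id, etc. match your setup.
with open("configs/training/stage_c_3b.yaml", "r", encoding="utf-8") as f:
    train_config = yaml.safe_load(f)

core = WurstCoreC(config_dict=train_config, device=device, training=True)

# The training scripts launch the loop by calling the core object,
# compare `warpcore(single_gpu=True)` in the LoRA traceback earlier in this thread.
core(single_gpu=True)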

Advanced Gradio APP + Auto Windows, RunPod & Linux Installer - 1 Click - Not An Issue

You can download our scripts here : https://www.patreon.com/posts/98410661

I have developed a one-click installer and Gradio app for this amazing new model.

This app works many times faster than the Jupyter notebook shared here. It uses the Diffusers library.

It automatically downloads everything, including the models.

This model can be considered the next level of Stable Diffusion.

The Gradio app I developed supports low VRAM and works great even on 8 GB GPUs.

It saves every generated image automatically to the outputs folder, along with many other improvements.

Kaggle is not working right now due to an FP16 bug, which I have reported. Hopefully the notebook will work great once that is fixed.

At batch size 4 and 1536x1280 resolution, throughput is 1.7 it/s on an RTX 4090.

At batch size 1 and 1024x1024 resolution, it is 12.14 it/s (encoder) / 10.6 it/s (decoder) on an RTX 4090.

So one 1024x1024 image takes about 4 seconds on an RTX 4090.

ValueError: Trying to set a tensor of shape torch.Size([77, 1280]) in "weight" (which has shape torch.Size([77, 512])), this look incorrect.

I have try "pip install git+https://github.com/kashif/diffusers.git@wuerstchen-v3",but still occur the things:

A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.autotrackable has been moved to tensorflow.python.trackable.autotrackable. The old module will be deleted in version 2.11.
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.base has been moved to tensorflow.python.trackable.base. The old module will be deleted in version 2.11.
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.base_delegate has been moved to tensorflow.python.trackable.base_delegate. The old module will be deleted in version 2.11.
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.graph_view has been moved to tensorflow.python.checkpoint.graph_view. The old module will be deleted in version 2.11.
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.util has been moved to tensorflow.python.checkpoint.checkpoint. The old module will be deleted in version 2.11.
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.checkpoint_management has been moved to tensorflow.python.checkpoint.checkpoint_management. The old module will be deleted in version 2.9.
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.resource has been moved to tensorflow.python.trackable.resource. The old module will be deleted in version 2.11.
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.asset has been moved to tensorflow.python.trackable.asset. The old module will be deleted in version 2.11.
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.python_state has been moved to tensorflow.python.trackable.python_state. The old module will be deleted in version 2.11.
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.saving.checkpoint_options has been moved to tensorflow.python.checkpoint.checkpoint_options. The old module will be deleted in version 2.11.
Loading pipeline components...: 0%
0/3 [00:00<?, ?it/s]

ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_4928\1705133580.py in
5 num_images_per_prompt = 2
6 model_path= r"D:\all_models_archives\stable-cascade"
----> 7 prior = StableCascadePriorPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16, use_safetensors =True).to(device)
8 decoder = StableCascadeDecoderPipeline.from_pretrained(model_path, torch_dtype=torch.float16, use_safetensors =True).to(device)
9

~\anaconda3\lib\site-packages\huggingface_hub\utils\_validators.py in _inner_fn(*args, **kwargs)
116 kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
117
--> 118 return fn(*args, **kwargs)
119
120 return _inner_fn # type: ignore

~\anaconda3\lib\site-packages\diffusers\pipelines\pipeline_utils.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
1261 else:
1262 # load sub model
-> 1263 loaded_sub_model = load_sub_model(
1264 library_name=library_name,
1265 class_name=class_name,

~\anaconda3\lib\site-packages\diffusers\pipelines\pipeline_utils.py in load_sub_model(library_name, class_name, importable_classes, pipelines, is_pipeline_module, pipeline_class, torch_dtype, provider, sess_options, device_map, max_memory, offload_folder, offload_state_dict, model_variants, name, from_flax, variant, low_cpu_mem_usage, cached_folder, revision)
529 # check if the module is in a subdirectory
530 if os.path.isdir(os.path.join(cached_folder, name)):
--> 531 loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
532 else:
533 # else load from the root directory

~\AppData\Roaming\Python\Python39\site-packages\transformers\modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
3478 offload_index,
3479 error_msgs,
-> 3480 ) = cls._load_pretrained_model(
3481 model,
3482 state_dict,

~\AppData\Roaming\Python\Python39\site-packages\transformers\modeling_utils.py in _load_pretrained_model(cls, model, state_dict, loaded_keys, resolved_archive_file, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, _fast_init, low_cpu_mem_usage, device_map, offload_folder, offload_state_dict, dtype, is_quantized, keep_in_fp32_modules)
3868 if low_cpu_mem_usage:
3869 if not is_fsdp_enabled() or is_fsdp_enabled_and_dist_rank_0():
-> 3870 new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
3871 model_to_load,
3872 state_dict,

~\AppData\Roaming\Python\Python39\site-packages\transformers\modeling_utils.py in _load_state_dict_into_meta_model(model, state_dict, loaded_state_dict_keys, start_prefix, expected_keys, device_map, offload_folder, offload_index, state_dict_folder, state_dict_index, dtype, is_quantized, is_safetensors, keep_in_fp32_modules)
741 elif not is_quantized:
742 # For backward compatibility with older versions of accelerate
--> 743 set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
744 else:
745 if param.dtype == torch.int8 and param_name.replace("weight", "SCB") in state_dict.keys():

~\anaconda3\lib\site-packages\accelerate\utils\modeling.py in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
343 if value is not None:
344 if old_value.shape != value.shape:
--> 345 raise ValueError(
346 f'Trying to set a tensor of shape {value.shape} in "{tensor_name}" (which has shape {old_value.shape}), this look incorrect.'
347 )

ValueError: Trying to set a tensor of shape torch.Size([77, 1280]) in "weight" (which has shape torch.Size([77, 512])), this look incorrect.
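One thing worth checking (a guess, not a confirmed fix): [77, 1280] matches the position-embedding width of the CLIP-ViT-bigG text encoder Stable Cascade uses, while [77, 512] is what a default, smaller CLIP text config produces, so the text_encoder folder inside your local model directory may be incomplete or from the wrong snapshot. A quick look at its config, assuming the standard diffusers/transformers folder layout:

import json
import os

model_path = r"D:\all_models_archives\stable-cascade"  # path from the traceback above

cfg_file = os.path.join(model_path, "text_encoder", "config.json")
with open(cfg_file, "r", encoding="utf-8") as f:
    cfg = json.load(f)

# Expect 1280 for the CLIP-ViT-bigG text encoder; 512 here would explain
# the [77, 512] position-embedding shape in the error message.
print(cfg.get("hidden_size"), cfg.get("projection_dim"))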

TypeError: argument of type 'NoneType' is not iterable

Same error as #6 but different code location.

I set up a new environment as per requirements.txt and downloaded all the models into models/ (11 files, 38.6 GB). I ran a local script using the same code as in text_to_image.ipynb, and it gives this error.

cuda:0
['model_version', 'effnet_checkpoint_path', 'previewer_checkpoint_path']
['model_version', 'stage_a_checkpoint_path', 'effnet_checkpoint_path']
['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']
Traceback (most recent call last):
  File "D:\Tests\StableCascade\stablecascade.py", line 38, in <module>
    models = core.setup_models(extras)
  File "D:\Tests\StableCascade\train\train_c.py", line 128, in setup_models
    effnet.load_state_dict(effnet_checkpoint if 'state_dict' not in effnet_checkpoint else effnet_checkpoint['state_dict'])
TypeError: argument of type 'NoneType' is not iterable

Any ideas? Thanks.

I can run the more basic script/code from
https://huggingface.co/stabilityai/stable-cascade
but not the code from the text_to_image.ipynb
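In case it helps others hitting the same TypeError: it appears to mean that load_or_fail returned None for the EfficientNet checkpoint, i.e. the file at effnet_checkpoint_path was not found where the YAML expects it, relative to the directory you run the script from. A quick sanity check against the same config keys the log prints above, as a sketch:

import os
import yaml

checks = [
    ("configs/inference/stage_c_3b.yaml", ("effnet_checkpoint_path", "previewer_checkpoint_path")),
    ("configs/inference/stage_b_3b.yaml", ("stage_a_checkpoint_path", "effnet_checkpoint_path")),
]
for config_path, keys in checks:
    with open(config_path, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    for key in keys:
        path = cfg.get(key)
        print(config_path, key, "->", path, "exists:", bool(path) and os.path.exists(path))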

OutOfMemoryError - is there a --lowmem style option?

My hardware:

  • torch.cuda.get_device_name(0) = NVIDIA GeForce GTX 1080 Ti
  • torch.cuda.get_device_properties(0).total_memory = 10.91 GB

I get an OutOfMemoryError in the notebook at line
models_b = core_b.setup_models(extras_b, skip_clip=True)

I'm coming from Automatic1111, which does run SDXL models (just barely!). Any hints on getting this to fit?
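These notebooks don't expose a --lowvram style switch as far as I can tell, but the diffusers port of Stable Cascade mentioned elsewhere in this thread supports the usual offloading helpers, which is probably the easiest way to squeeze it onto an 11 GB card. A sketch: the model IDs are the Hugging Face ones, the dtypes mirror the earlier snippet in this thread, and on a pre-Ampere card like the 1080 Ti you may need to fall back to float32 if half precision misbehaves.

import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
)
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
)

# Keeps submodules on the CPU and moves each one to the GPU only while it
# runs (requires the `accelerate` package).
prior.enable_model_cpu_offload()
decoder.enable_model_cpu_offload()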

RuntimeError: Windows not yet supported for torch.compile

Is it simply because the Windows environment does not support torch.compile?
Even though it can still generate images, is there any way to fix this? Thanks!


RuntimeError Traceback (most recent call last)
Cell In[4], line 2
1 models = WurstCoreC.Models(
----> 2 **{**models.to_dict(), 'generator': torch.compile(models.generator, mode="reduce-overhead", fullgraph=True)}
3 )
5 models_b = WurstCoreB.Models(
6 **{**models_b.to_dict(), 'generator': torch.compile(models_b.generator, mode="reduce-overhead", fullgraph=True)}
7 )

File ~\anaconda3\envs\casada\lib\site-packages\torch\__init__.py:1723, in compile(model, fullgraph, dynamic, backend, mode, options, disable)
1720 else:
1721 backend = _TorchCompileWrapper(backend, mode, options, dynamic)
-> 1723 return torch._dynamo.optimize(backend=backend, nopython=fullgraph, dynamic=dynamic, disable=disable)(model)

File ~\anaconda3\envs\casada\lib\site-packages\torch\_dynamo\eval_frame.py:583, in optimize(backend, nopython, guard_export_fn, guard_fail_fn, disable, dynamic)
548 def optimize(
549 backend="inductor",
550 *,
(...)
555 dynamic=None,
556 ):
557 """
558 The main entrypoint of TorchDynamo. Do graph capture and call
559 backend() to optimize extracted graphs.
(...)
581 ...
582 """
--> 583 check_if_dynamo_supported()
584 # Note: The hooks object could be global instead of passed around, however that would make
585 # for a confusing API usage and plumbing story wherein we nest multiple .optimize calls.
586 # There is some prior art around this, w/r/t nesting backend calls are enforced to be the same
587 # compiler, however, this feels onerous for callback and hooks, and it feels better to give our users an
588 # easier to understand UX at the cost of a little more plumbing on our end.
589 hooks = Hooks(guard_export_fn=guard_export_fn, guard_fail_fn=guard_fail_fn)

File ~\anaconda3\envs\casada\lib\site-packages\torch\_dynamo\eval_frame.py:535, in check_if_dynamo_supported()
533 def check_if_dynamo_supported():
534 if sys.platform == "win32":
--> 535 raise RuntimeError("Windows not yet supported for torch.compile")
536 if sys.version_info >= (3, 12):
537 raise RuntimeError("Python 3.12+ not yet supported for torch.compile")

RuntimeError: Windows not yet supported for torch.compile

(Screenshot 2024-02-14 235542)
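Since torch.compile is only an optional speed-up in the notebook, the simplest workaround is to skip that cell on Windows (or guard it) and keep using the uncompiled generators. A sketch of such a guard; the commented line shows where it would slot into the notebook's cell:

import sys
import torch

def maybe_compile(module: torch.nn.Module) -> torch.nn.Module:
    """Compile the module where torch.compile is supported, otherwise return it unchanged."""
    if sys.platform == "win32":
        return module  # torch.compile raises RuntimeError on Windows in this torch version
    try:
        return torch.compile(module, mode="reduce-overhead", fullgraph=True)
    except RuntimeError:
        return module

# models = WurstCoreC.Models(**{**models.to_dict(),
#                               'generator': maybe_compile(models.generator)})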

Do we have to download the models manually? Error at the line models = core.setup_models(extras)

I am making a Gradio app and having an issue at model load.

The error happens in train_c.py, which makes no sense to me.

The error happens at this line of code:

models = core.setup_models(extras)

image

def initialize_models_and_extras():
    global core, core_b, extras, models, extras_b, models_b

    # Load the Stage C and Stage B inference configurations
    config_file = 'configs/inference/stage_c_3b.yaml'
    with open(config_file, "r", encoding="utf-8") as file:
        loaded_config = yaml.safe_load(file)

    config_file_b = 'configs/inference/stage_b_3b.yaml'
    with open(config_file_b, "r", encoding="utf-8") as file:
        loaded_config_b = yaml.safe_load(file)

    # Initialize Stage C
    core = WurstCoreC(config_dict=loaded_config, device=device, training=False)
    extras = core.setup_extras_pre()
    models = core.setup_models(extras)
    models.generator.eval().requires_grad_(False)

    # Initialize Stage B, reusing Stage C's tokenizer and text model
    core_b = WurstCoreB(config_dict=loaded_config_b, device=device, training=False)
    extras_b = core_b.setup_extras_pre()
    models_b = core_b.setup_models(extras_b, skip_clip=True)
    models_b = WurstCoreB.Models(
        **{**models_b.to_dict(), 'tokenizer': models.tokenizer, 'text_model': models.text_model}
    )
    models_b.generator.bfloat16().eval().requires_grad_(False)

Cannot set up models

I am trying to follow along; I'll write out what I have.

I have an M1 Air 8GB (small I know).
I installed Docker
I am running this:

> docker run -p 10000:8888 quay.io/jupyter/scipy-notebook:2024-01-15
Unable to find image 'quay.io/jupyter/scipy-notebook:2024-01-15' locally
2024-01-15: Pulling from jupyter/scipy-notebook

I started by creating an install.ipynb at the root and writing two commands:

pip install -r requirements.txt

This initially failed and I had to remove the +cu118 from a few requirements, and then it worked.

Then I ran:

!bash models/download_models.sh essential big-big bfloat16

Which did the job:

Downloading Essential Models (EfficientNet, Stage A, Previewer)
stage_a.safetensors 100%[===================>]  70.24M  45.7MB/s    in 1.5s    
previewer.safetenso 100%[===================>]  15.21M  41.4MB/s    in 0.4s    
effnet_encoder.safe 100%[===================>]  77.73M  55.2MB/s    in 1.4s    
Downloading Large Stage B & Large Stage C
stage_b_bf16.safete 100%[===================>]   2.91G  59.5MB/s    in 52s     
stage_c_bf16.safete 100%[===================>]   6.68G  51.8MB/s    in 2m 25s  

Then I went to the text_to_image.ipynb file:

cpu

Load Config:

['model_version', 'effnet_checkpoint_path', 'previewer_checkpoint_path']
['model_version', 'stage_a_checkpoint_path', 'effnet_checkpoint_path']

Load Extras and Models:

['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 3
      1 # SETUP MODELS & DATA
      2 extras = core.setup_extras_pre()
----> 3 models = core.setup_models(extras)
      4 models.generator.eval().requires_grad_(False)
      5 print("STAGE C READY")

File ~/StableCascade/train/train_c.py:128, in WurstCore.setup_models(self, extras)
    126 effnet = EfficientNetEncoder()
    127 effnet_checkpoint = load_or_fail(self.config.effnet_checkpoint_path)
--> 128 effnet.load_state_dict(effnet_checkpoint if 'state_dict' not in effnet_checkpoint else effnet_checkpoint['state_dict'])
    129 effnet.eval().requires_grad_(False).to(self.device)
    130 del effnet_checkpoint

TypeError: argument of type 'NoneType' is not iterable
