
break-a-scene's Introduction

Break-A-Scene: Extracting Multiple Concepts from a Single Image [SIGGRAPH Asia 2023]

Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, Dani Lischinski

Given a single image with multiple concepts, annotated by loose segmentation masks, our method can learn a distinct token for each concept, and use natural language guidance to re-synthesize the individual concepts or combinations of them in various contexts.

Applications

Image Variations

Entangled Scene Decomposition

Background Extraction

Local Editing by Example

Installation

Install the conda virtual environment:

conda env create -f environment.yml
conda activate break-a-scene
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
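
Before training, it may be worth sanity-checking the environment; a minimal check along these lines (expected versions follow the pinned install above):

import torch

# Quick sanity check for the pinned environment above.
print(torch.__version__)          # expected: 1.13.1
print(torch.version.cuda)         # expected: 11.7
print(torch.cuda.is_available())  # expected: True on a machine with a CUDA GPU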

Usage

Step 1 - Extracting concepts from a single image

Create a new folder containing your single image, named img.jpg, and the corresponding masks, one mask per concept (e.g., mask0.png, mask1.png). See the example folder in examples/creature.
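
As a quick sanity check on the folder layout, a minimal sketch (check_instance_dir is a hypothetical helper, not part of this repository):

from pathlib import Path

def check_instance_dir(path, num_of_assets):
    # Expect a single image named img.jpg plus one mask per concept.
    root = Path(path)
    assert (root / "img.jpg").is_file(), "missing img.jpg"
    for i in range(num_of_assets):
        assert (root / f"mask{i}.png").is_file(), f"missing mask{i}.png"

check_instance_dir("examples/creature", 3)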

Then, train the model by running:

python train.py \
  --instance_data_dir INPUT_PATH \
  --num_of_assets NUMBER_OF_CONCEPTS \
  --initializer_tokens TOKEN0 TOKEN1 TOKEN2 \
  --class_data_dir PATH_TO_PRIOR_DIR \
  --phase1_train_steps 400 \
  --phase2_train_steps 400 \
  --output_dir OUTPUT_DIR

where --instance_data_dir is the path to the input folder, --num_of_assets is the number of concepts to extract, --initializer_tokens is an optional list of words describing the concepts (it can be omitted, but the model may produce better results with a proper initialization), --class_data_dir is a path to a folder of general images used for the prior preservation loss (if the folder does not exist, the script will generate the images for you, and they can be reused in future runs), --phase1_train_steps and --phase2_train_steps are the number of training steps in each phase, and --output_dir is the path where the trained model is saved.

For example:

python train.py \
  --instance_data_dir examples/creature \
  --num_of_assets 3 \
  --initializer_tokens creature bowl stone \
  --class_data_dir inputs/data_dir \
  --phase1_train_steps 400 \
  --phase2_train_steps 400 \
  --output_dir outputs/creature

Step 2 - Generating images

After training, a new model will be saved in OUTPUT_DIR with an extended vocabulary containing the additional concept tokens <asset0> ... <assetN>, where N = NUMBER_OF_CONCEPTS - 1. For example, in the above case there will be 3 additional tokens: <asset0>, <asset1>, and <asset2>.
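
If you want to confirm that the tokens were added, you can inspect the saved tokenizer; a sketch, assuming the model is saved in the standard diffusers pipeline layout with a tokenizer subfolder:

from transformers import CLIPTokenizer

# Load the tokenizer saved alongside the fine-tuned model.
tokenizer = CLIPTokenizer.from_pretrained("outputs/creature", subfolder="tokenizer")
asset_tokens = sorted(t for t in tokenizer.get_vocab() if t.startswith("<asset"))
print(asset_tokens)  # e.g. ['<asset0>', '<asset1>', '<asset2>']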

Now, you can generate images using:

python inference.py \
  --model_path TRAINED_MODEL_PATH \
  --prompt PROMPT \
  --output_path DESTINATION_PATH

For example, in the aforementioned case:

python inference.py \
  --model_path "outputs/creature" \
  --prompt "a photo of <asset0> at the beach" \
  --output_path "outputs/result.jpg"

Or:

python inference.py \
  --model_path "outputs/creature" \
  --prompt "an oil painting of <asset1> and <asset2>" \
  --output_path "outputs/result.jpg"

Local editing by example (optional)

After extracting the concepts, if you are interested in locally editing another image, you can use Blended Latent Diffusion as follows:

  1. Download the image editing script scripts/text_editing_stable_diffusion.py from the Blended Latent Diffusion repository.
  2. Use the fine-tuned model along with the desired concept that you want to add:
python text_editing_stable_diffusion.py \
  --model_path TRAINED_MODEL_PATH \
  --prompt PROMPT \
  --init_image THE_IMAGE_TO_EDIT \
  --mask THE_AREA_TO_EDIT \
  --output_path DESTINATION_PATH

For example, in the aforementioned case:

python text_editing_stable_diffusion.py \
  --model_path "outputs/creature" \
  --prompt "a photo of <asset2>" \
  --init_image "examples/chicken/img.jpg" \
  --mask "examples/chicken/mask2.png" \
  --output_path "outputs/result.jpg"

Citation

If you find this useful for your research, please cite the following:

@inproceedings{avrahami2023bas,
  author = {Avrahami, Omri and Aberman, Kfir and Fried, Ohad and Cohen-Or, Daniel and Lischinski, Dani},
  title = {Break-A-Scene: Extracting Multiple Concepts from a Single Image},
  year = {2023},
  isbn = {9798400703157},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3610548.3618154},
  doi = {10.1145/3610548.3618154},        
  booktitle = {SIGGRAPH Asia 2023 Conference Papers},
  articleno = {96},
  numpages = {12},
  keywords = {textual inversion, multiple concept extraction, personalization},
  location = {Sydney, NSW, Australia},
  series = {SA '23}
}

Disclaimer

This is not an officially supported Google product.

break-a-scene's People

Contributors

omriav

break-a-scene's Issues

Is prior preservation loss used for the results in the paper?

Thanks for your wonderful work.
The paper describes only a reconstruction loss and an attention-map loss. However, the provided code also includes a prior preservation loss, which requires 100 (default) images for each concept. Is prior preservation loss used for the results presented in the paper? Why is this loss not mentioned in the paper but present in the code? Thanks again.

Releasing a model

Thanks for this very nice work! Are you planning to release a pretrained model from your work?

Learning concepts from multiple images

Hi,

Thanks for your great work. The instructions mention having a single image in the instance_data_dir folder along with the corresponding masks. I was wondering whether there is a way to learn concepts corresponding to a fixed number of masks across multiple images, such that the mask ids across images correspond to the same object/background viewed from different viewpoints. The original textual_inversion method, for example, allows learning a single concept from multiple images, but of course without masks.

I look forward to your reply.

Conversion of diffusion weights to ckpt

Hi,
I tried to convert the fine-tuned model from the Hugging Face diffusers format to a ckpt file, in order to try it with the AUTOMATIC1111 webui.
I used the script convert_diffusers_to_sd.py, based on the discussion in huggingface/diffusers#672.

I get the following error when loading it in the webui.

Traceback (most recent call last):
  File "/home/styldod/ws/Github/stable-diffusion-webui/modules/shared.py", line 605, in set
    self.data_labels[key].onchange()
  File "/home/styldod/ws/Github/stable-diffusion-webui/modules/call_queue.py", line 13, in f
    res = func(*args, **kwargs)
  File "/home/styldod/ws/Github/stable-diffusion-webui/webui.py", line 226, in <lambda>
    shared.opts.onchange("sd_model_checkpoint", wrap_queued_call(lambda: modules.sd_models.reload_model_weights()), call=False)
  File "/home/styldod/ws/Github/stable-diffusion-webui/modules/sd_models.py", line 556, in reload_model_weights
    load_model_weights(sd_model, checkpoint_info, state_dict, timer)
  File "/home/styldod/ws/Github/stable-diffusion-webui/modules/sd_models.py", line 286, in load_model_weights
    model.load_state_dict(state_dict, strict=False)
  File "/home/styldod/ws/Github/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LatentDiffusion:
	size mismatch for model.diffusion_model.input_blocks.1.1.proj_in.weight: copying a param with shape torch.Size([320, 320]) from checkpoint, the shape in current model is torch.Size([320, 320, 1, 1]).
	size mismatch for model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([320, 768]).
	size mismatch for model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([320, 768]).
	size mismatch for model.diffusion_model.input_blocks.1.1.proj_out.weight: copying a param with shape torch.Size([320, 320]) from checkpoint, the shape in current model is torch.Size([320, 320, 1, 1]).
	[... the same proj_in/proj_out and attn2.to_k/to_v size mismatches repeat for every remaining cross-attention block of the UNet (input, middle, and output blocks) ...]
	size mismatch for cond_stage_model.transformer.text_model.embeddings.token_embedding.weight: copying a param with shape torch.Size([49409, 1024]) from checkpoint, the shape in current model is torch.Size([49408, 768]).
	size mismatch for cond_stage_model.transformer.text_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([77, 1024]) from checkpoint, the shape in current model is torch.Size([77, 768]).
	[... analogous 1024-vs-768 size mismatches repeat for the self-attention, layer-norm, and MLP weights of every text-encoder layer; the log is truncated at this point ...]
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.self_attn.out_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.self_attn.out_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm1.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc1.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3072]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc2.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm2.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for cond_stage_model.transformer.text_model.final_layer_norm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for cond_stage_model.transformer.text_model.final_layer_norm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).

Command used for conversion - python convert_diffusers_to_sd.py --model_path $model_dir --checkpoint_path $ckpt_path
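Editor's note on the log above: every reported pair is 1024 vs. 768, which is consistent with an SD 2.x checkpoint (OpenCLIP text encoder, hidden size 1024) being loaded against an SD 1.x inference config (CLIP ViT-L/14, hidden size 768) — i.e., the converted .ckpt is probably fine but is being opened with the wrong config. One way to sidestep the conversion entirely is to sample straight from the diffusers-format model that training saves; a minimal sketch, assuming the standard diffusers API and a placeholder output path:

# Minimal sketch, assuming train.py saved a diffusers-format pipeline to
# "outputs/creature" (a placeholder path). Loading it directly avoids the
# diffusers-to-ckpt conversion and its config mismatch altogether.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "outputs/creature", torch_dtype=torch.float16
).to("cuda")
pipe("an oil painting of <asset0>").images[0].save("result.jpg")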

Question about the tool for generating masks

Hi there,

Great work! I am curious about how to generate masks for this method in a convenient way. It would be helpful if you could provide the details of the tool you used for generating masks. Is it an off-the-shelf segmentation model or a manual labeling tool?
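Editor's note, hedged: the method only requires loose masks, so any segmentation tool should work. Below is a sketch using Meta's Segment Anything as one off-the-shelf option; this is an assumption, not necessarily the authors' tool, and the click coordinates are placeholders.

# Hedged sketch: produce one loose mask per concept with Segment Anything (SAM).
# Any segmentation model or manual annotation yields masks in the same format.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

image = cv2.cvtColor(cv2.imread("img.jpg"), cv2.COLOR_BGR2RGB)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # sam.to("cuda") optional
predictor = SamPredictor(sam)
predictor.set_image(image)

clicks = [(200, 300), (420, 150)]  # one foreground click per concept (placeholder coordinates)
for i, (x, y) in enumerate(clicks):
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),  # 1 = foreground point
        multimask_output=False,
    )
    cv2.imwrite(f"mask{i}.png", (masks[0] * 255).astype(np.uint8))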

How to avoid "CUDA out of memory"

Hi,
Thank you for your great work! When I run train.py on the example data (creature), I get the following error, and I would like to know whether this script can run on a single RTX 3090 with 24 GB of VRAM.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 23.70 GiB total capacity; 22.26 GiB already allocated; 89.44 MiB free; 22.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Looking forward to your reply! Thanks!
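Editor's note, hedged: train.py appears to accept several standard memory-saving flags (they show up verbatim in a command reported in a later issue on this page), so a reasonable first attempt on a 24 GB card is to enable 8-bit Adam, gradient accumulation, and xFormers attention:

python train.py \
  --instance_data_dir INPUT_PATH \
  --num_of_assets NUMBER_OF_CONCEPTS \
  --use_8bit_adam \
  --gradient_accumulation_steps 2 \
  --enable_xformers_memory_efficient_attention \
  --output_dir OUTPUT_DIR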

Concept extraction issue

I have tried many examples, and it worked fine. However, sometimes the model fails to extract the concepts and "mixes" them into a single token. For example, consider the "rabbit" case in the example directory. After training, the prompt "a photo of a <asset0>" still produces a photo of both <asset0> and <asset1>, as shown below:
[screenshot omitted: the generated image contains both concepts]

I found that training for fewer steps in phase 2 could help sometimes, but it doesn't always help. Do you have any suggestions to address the concept extraction failure cases?

Inference Time Details

Really great work! I couldn't find inference-time details in the paper on these two topics:

  1. Since the UNet is fine-tuned for all concepts at once, is it fair to say that we don't need to maintain a "copy" of the UNet for each concept, so the extra storage cost per concept/user is close to zero?
  2. How does the inference speed compare to other methods such as DreamBooth and HyperDreamBooth?

GPU Memory Requirements for Training

I want to express my appreciation for your outstanding work on this project!

I'm curious about the minimum GPU memory requirements for training, and whether multi-GPU training is supported.

Thank you for your time and assistance!

How to extract the background?

An extremely interesting job!!!😎

I have a question: how do I extract the background? Do I also need a separate mask for it?
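Editor's note, hedged: the training code does not appear to register a dedicated background token, so one plausible approach is to treat the background as just another asset by giving it its own mask — e.g., the complement of the union of the concept masks — and increasing --num_of_assets by one so it receives its own <assetN> token:

# Hedged sketch: build a background mask as the complement of the union of the
# existing concept masks. File names follow the repository's maskN.png pattern;
# the count of 2 concept masks is a placeholder.
import cv2
import numpy as np

masks = [cv2.imread(f"mask{i}.png", cv2.IMREAD_GRAYSCALE) for i in range(2)]
union = np.zeros_like(masks[0])
for m in masks:
    union = np.maximum(union, m)
background = np.where(union > 127, 0, 255).astype(np.uint8)
cv2.imwrite("mask2.png", background)  # the background becomes one more asset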

Separate two anime characters

Hello, I have tried many times to separate two anime characters, including increasing or reducing the number of phase-2 steps and enlarging lambda_attention (0.01, 0.05, 0.1), but I always fail. Can you give me some advice?

CUDA out of memory

Using the arguments below:

python3 /home/anasrezklinux/test_pycharm_link/main/break_a_scene/break_a_scene_git_repository/train.py --instance_data_dir /mnt/c/Users/noobw/PycharmProjects/pythonProject/photos/break_scene_photo --num_of_assets 2 --initializer_tokens ['person', 'person'] --class_data_dir /home/anasrezklinux/test_pycharm_link/main/break_a_scene/person --phase1_train_steps 12 --phase2_train_steps 12 --use_8bit_adam --gradient_accumulation_steps 2 --enable_xformers_memory_efficient_attention --output_dir /home/anasrezklinux/test_pycharm_link/main/break_a_scene

I am having the following OOM error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 12.00 GiB of which 0 bytes is free. Process 162525 has 17179869184.00 GiB memory in use. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 17.76 GiB is allocated by PyTorch, and 315.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management

Do you have any suggestions on how I could circumvent this? My GPU has 12 GB of VRAM.
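Editor's note, hedged: beyond the memory-saving flags already present in the command above, the allocator hint printed in the error message itself is cheap to try, though it only reduces fragmentation and cannot create VRAM; whether 12 GB suffices for this training script at all is a separate question the log does not settle.

# Set the allocator hint suggested by the error, then re-run the same command.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True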

Question about background extraction

Thank you for sharing the excellent work.

After training completes, which token refers to the background, and how can I extract it?
I might have missed it in the code, but there seems to be no step that registers a background token.

Do I need to add it myself, like other assets?

error when using stabilityai/stable-diffusion-2-1 to train

Hello,

When using stabilityai/stable-diffusion-2-1 to train instead of stabilityai/stable-diffusion-2-1-base I get the following error:

Traceback (most recent call last):
  File "/root/break-a-scene/train.py", line 1524, in <module>
    SpatialDreambooth()
  File "/root/break-a-scene/train.py", line 627, in __init__
    self.main()
  File "/root/break-a-scene/train.py", line 1152, in main
    model_pred = model_pred * downsampled_mask
RuntimeError: The size of tensor a (96) must match the size of tensor b (64) at non-singleton dimension 3

I made sure that the input image and masks are 768 x 768 (the same size as the class images generated by stabilityai/stable-diffusion-2-1).

Please share any suggestions on how I can debug/fix this.

Thanks
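Editor's note, hedged: the 96-vs-64 mismatch is exactly what you would expect when SD 2.1 runs at 768 px (96 x 96 latents) while the masks are downsampled to a hard-coded 64 x 64 (512/8). A sketch of resolution-aware mask downsampling follows; the names mirror the traceback, but this is an assumption about the training code, not a verified patch.

# Hedged sketch: resize masks to the latent's actual spatial size instead of
# assuming 64x64. The toy shapes mimic SD 2.1 at 768 px.
import torch
import torch.nn.functional as F

def downsample_mask_to_latent(mask: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
    """Resize a (B, 1, H, W) pixel-space mask to the latent's (h, w)."""
    return F.interpolate(mask, size=latent.shape[-2:], mode="nearest")

latent = torch.randn(1, 4, 96, 96)   # model_pred-like tensor at 768 px
mask = torch.ones(1, 1, 768, 768)    # pixel-space concept mask
print(downsample_mask_to_latent(mask, latent).shape)  # torch.Size([1, 1, 96, 96])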

support xl?

This is a very nice project. I was just wondering: does it support SD XL?

Thank you

The dataset images in Break-A-Scene.

Hi, thanks for your great work.

Could you release the dataset images, or the scripts for selecting them from COCO, used in your paper?

As mentioned in the paper, there are 5,400 images per baseline.

Inpainting training and inference

Thanks for sharing the awesome work.
I am able to train and run inference with the text-to-image model. How can I do the same for an inpainting model?

class_prompt parameter setting

  1. How should the class_prompt parameter be set for class data generation? Its help string in train.py reads: 'The prompt to specify images in the same class as provided instance images.' (A hedged example follows this list.)
  2. How important is the prior preservation loss?
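Editor's note, hedged: this follows the DreamBooth convention, where the class prompt is a generic description of the instance's class and drives generation of the prior-preservation images in --class_data_dir. A sketch, assuming the parameter is exposed as a --class_prompt flag:

python train.py \
  --instance_data_dir INPUT_PATH \
  --num_of_assets NUMBER_OF_CONCEPTS \
  --class_prompt "a photo of a creature" \
  --class_data_dir PATH_TO_PRIOR_DIR \
  --output_dir OUTPUT_DIR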

asset understanding

@omriav Could you please clarify the questions below?
1. What is an "asset" here? Is it the text token for each concept? If so, in what format is it stored in the output folder?
2. Training is required, right? We can't do direct inference even if each image contains a single concept?

"
After training, a new model will be saved in OUTPUT_DIR with an extended vocabulary that contains the additional concepts ... where N = NUMBER_OF_CONCEPTS - 1. For example, in the above case, there will be additional 3 tokens , and ."

Improved Generation of Celebrity Images Compared to Other Humans as Concepts?

Given the abundance of high-quality and diverse images of celebrities available for training, is it plausible that the model might generate better or more accurate images of celebrities (extracted as a concept during training) compared to other less photographed or publicized individuals?

In my tests, when I used an image where one of the concepts was a celebrity, 3 out of 5 generated images looked exactly like the human concept. I saw similar results with an image from the VITON-HD dataset, but when I used my personal image, the generated images bore no likeness to my face at all.

Is this expected behavior? If so, does that mean that using LoRA for fine-tuning is a better fit for this use case?

Any insights or references to related work would be highly appreciated. Thank you for your time and consideration!
