prompt_injection's Introduction

Prompt Injection Node for ComfyUI

This custom node for ComfyUI allows you to inject specific prompts at specific blocks of the Stable Diffusion UNet, providing fine-grained control over the generated image. It is based on the concept that the model's content/subject understanding is primarily contained within the MID0 and MID1 blocks, as demonstrated in the B-LoRA paper (implicit content-style separation).

Features

• Inject different prompts into specific UNet blocks
• Three different node variations for flexible workflow integration
• Customize the learning rate of specific blocks to focus on content, lighting, style, or other aspects
• Potential for developing a "Mix of Experts" approach by swapping blocks on-the-fly based on prompt content

Usage

1. Add the prompt_injection.py node to your ComfyUI custom nodes directory
2. In your ComfyUI workflow, connect the desired node variation based on your input preferences
3. Specify the prompts for each UNet block you want to customize
4. Connect the output to the rest of your workflow and generate the image

Node Variations

• Prompt Injection (Single Prompt): Injects a single prompt into the specified UNet blocks
• Prompt Injection (Multiple Prompts): Allows injecting different prompts into each specified UNet block
• Prompt Injection (Prompt Dictionary): Accepts a dictionary of block names and their corresponding prompts
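For reference, the dictionary/advanced-style variation addresses blocks with a "block:index,weight" syntax in its locations field, one entry per line, with the weight optional and defaulting to 1.0 (see the AdvancedPromptInjection class further below). An illustrative value, with the 0.8 weight chosen purely as an example:

output:0,1.0
output:1,0.8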

Example

Injecting the prompt "white cat" into the OUTPUT0 and OUTPUT1 blocks, while using the prompt "blue dog" for all other blocks, results in an image with the composition of the "blue dog" prompt but with a cat as the subject/content. Acknowledgements

• Modified and simplified version of the node from: https://github.com/pamparamm/sd-perturbed-attention
• Inspired by discussions and findings shared by @Mobioboros and @DataVoid

Future Work

• Investigate the location of different concepts (e.g., lighting) within the UNet blocks
• Develop a "guts diagram" of the SDXL UNet to understand where each aspect is stored
• Explore the use of different learning rates for specific blocks during fine-tuning or LoRA training
• Implement a "Mix of Experts" approach by swapping blocks on-the-fly based on prompt content

Feel free to contribute, provide feedback, and share your findings!

prompt_injection's People

Contributors

datacte, comfy-pr-bot, moonride303, cubiq

prompt_injection's Issues

Block learning rates

I'd love to test this, as I am currently training a bunch of photographic subject and style LoRAs and would love to better understand which areas are affected. Can you give me some headers?

Made a version to influence SVD (please help me test)

Disclaimer: This does not work yet.

Posting this here because this repo helped me a ton (and the other fork).

Using this repo's ideas, and what I learned about CLIP text embeddings and injecting weight and bias layers with images and text, I actually got CLIP conditioning working to some extent for injecting into SVD. I updated svd_img2vid_conditioning (will create a repo).

Here I've included the various SVD model probing results I got back, where I fed dummy data into the model's methods to scan for hidden inputs and outputs in the SVD model while exploring this.

I updated prompt_injection.py with an Attn2 Prompt Injection node. After review, this is all wrong, but maybe it can be used as a framework. This week I'm reading pipeline_stable_video_diffusion.py to learn how to actually do the embeddings right, and I will update this in the future.

import comfy.model_patcher
import comfy.samplers
import torch
import torch.nn.functional as F

def build_patch(patchedBlocks, weight=1.0, sigma_start=0.0, sigma_end=1.0):
    def prompt_injection_patch(n, context_attn1: torch.Tensor, value_attn1, extra_options):
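        # ComfyUI calls attn2 patches on each cross-attention layer with
        # (q, k/context, v, extra_options) and expects three tensors back (q, k, v).
        # For blocks listed in patchedBlocks, and only while sigma is inside the
        # configured range, the injected conditioning replaces the key/value
        # context; otherwise everything is passed through unchanged.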
        (block, block_index) = extra_options.get('block', (None,None))
        sigma = extra_options["sigmas"].detach().cpu()[0].item() if 'sigmas' in extra_options else 999999999.9
        
        batch_prompt = n.shape[0] // len(extra_options["cond_or_uncond"])

        if sigma <= sigma_start and sigma >= sigma_end:
            if (block and f'{block}:{block_index}' in patchedBlocks and patchedBlocks[f'{block}:{block_index}']):
                if context_attn1.dim() == 3:
                    c = context_attn1[0].unsqueeze(0)
                else:
                    c = context_attn1[0][0].unsqueeze(0)
                b = patchedBlocks[f'{block}:{block_index}'][0][0].repeat(c.shape[0], 1, 1).to(context_attn1.device)
                out = torch.stack((c, b)).to(dtype=context_attn1.dtype)
                out = out.repeat(1, batch_prompt, 1, 1) * weight  # apply the injection weight once

                return n, out, out 

        return n, context_attn1, value_attn1
    return prompt_injection_patch

def build_svd_patch(patchedBlocks, weight=1.0, sigma_start=0.0, sigma_end=1.0):
    def prompt_injection_patch(n, context_attn1: torch.Tensor, value_attn1, extra_options):
        (block, block_index) = extra_options.get('block', (None, None))
        sigma = extra_options["sigmas"].detach().cpu()[0].item() if 'sigmas' in extra_options else 999999999.9

        if sigma_start <= sigma <= sigma_end:
            if block and f'{block}:{block_index}' in patchedBlocks and patchedBlocks[f'{block}:{block_index}']:
                if context_attn1.dim() == 3:
                    c = context_attn1[0].unsqueeze(0)
                else:
                    c = context_attn1[0][0].unsqueeze(0)
                b = patchedBlocks[f'{block}:{block_index}'][0][0].repeat(c.shape[0], 1, 1).to(context_attn1.device)
                
                # Interpolate to match the sizes
                if c.size() != b.size():
                    b = F.interpolate(b.unsqueeze(0), size=c.size()[1:], mode='nearest').squeeze(0)
                
                out = torch.cat((c, b), dim=-1).to(dtype=context_attn1.dtype) * weight
                return n, out  # NOTE: experimental; attn2 patches normally return three values (q, k, v)
        return n, context_attn1, value_attn1  # Ensure exactly three values are returned

    return prompt_injection_patch

class SVDPromptInjection:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {"model": ("MODEL",)},
            "optional": {
                "all": ("CONDITIONING",),
                "time_embed": ("CONDITIONING",),
                "label_emb": ("CONDITIONING",),
                "input_blocks_0": ("CONDITIONING",),
                "input_blocks_1": ("CONDITIONING",),
                "input_blocks_2": ("CONDITIONING",),
                "input_blocks_3": ("CONDITIONING",),
                "input_blocks_4": ("CONDITIONING",),
                "input_blocks_5": ("CONDITIONING",),
                "input_blocks_6": ("CONDITIONING",),
                "input_blocks_7": ("CONDITIONING",),
                "input_blocks_8": ("CONDITIONING",),
                "middle_block_0": ("CONDITIONING",),
                "middle_block_1": ("CONDITIONING",),
                "middle_block_2": ("CONDITIONING",),
                "output_blocks_0": ("CONDITIONING",),
                "output_blocks_1": ("CONDITIONING",),
                "output_blocks_2": ("CONDITIONING",),
                "output_blocks_3": ("CONDITIONING",),
                "output_blocks_4": ("CONDITIONING",),
                "output_blocks_5": ("CONDITIONING",),
                "output_blocks_6": ("CONDITIONING",),
                "output_blocks_7": ("CONDITIONING",),
                "output_blocks_8": ("CONDITIONING",),
                "weight": ("FLOAT", {"default": 1.0, "min": -2.0, "max": 5.0, "step": 0.05}),
                "start_at": ("FLOAT", {"default": 0.0, "min": 0.0, "max": 1.0, "step": 0.001}),
                "end_at": ("FLOAT", {"default": 1.0, "min": 0.0, "max": 1.0, "step": 0.001}),
            }
        }

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "patch"
    CATEGORY = "advanced/model"

    def patch(self, model: comfy.model_patcher.ModelPatcher, all=None, time_embed=None, label_emb=None, input_blocks_0=None, input_blocks_1=None, input_blocks_2=None, input_blocks_3=None, input_blocks_4=None, input_blocks_5=None, input_blocks_6=None, input_blocks_7=None, input_blocks_8=None, middle_block_0=None, middle_block_1=None, middle_block_2=None, output_blocks_0=None, output_blocks_1=None, output_blocks_2=None, output_blocks_3=None, output_blocks_4=None, output_blocks_5=None, output_blocks_6=None, output_blocks_7=None, output_blocks_8=None, weight=1.0, start_at=0.0, end_at=1.0):
        if not any((all, time_embed, label_emb, input_blocks_0, input_blocks_1, input_blocks_2, input_blocks_3, input_blocks_4, input_blocks_5, input_blocks_6, input_blocks_7, input_blocks_8, middle_block_0, middle_block_1, middle_block_2, output_blocks_0, output_blocks_1, output_blocks_2, output_blocks_3, output_blocks_4, output_blocks_5, output_blocks_6, output_blocks_7, output_blocks_8)):
            return (model,)

        m = model.clone()
        sigma_start = m.get_model_object("model_sampling").percent_to_sigma(start_at)
        sigma_end = m.get_model_object("model_sampling").percent_to_sigma(end_at)

        patchedBlocks = {}
        blocks = {
            'time_embed': [0],
            'label_emb': [0],
            'input_blocks': list(range(9)),
            'middle_block': list(range(3)),
            'output_blocks': list(range(9))
        }

        for block in blocks:
            for index in blocks[block]:
                block_name = f"{block}_{index}"
                value = locals().get(block_name, None)
                if value is None:
                    value = all
                if value is not None:
                    patchedBlocks[f"{block}:{index}"] = value

        m.set_model_attn2_patch(build_svd_patch(patchedBlocks, weight=weight, sigma_start=sigma_start, sigma_end=sigma_end))

        return (m,)

class PromptInjection:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "model": ("MODEL",),
            },
            "optional": {
                "all":  ("CONDITIONING",),
                "input_4":  ("CONDITIONING",),
                "input_5":  ("CONDITIONING",),
                "input_7":  ("CONDITIONING",),
                "input_8":  ("CONDITIONING",),
                "middle_0": ("CONDITIONING",),
                "output_0": ("CONDITIONING",),
                "output_1": ("CONDITIONING",),
                "output_2": ("CONDITIONING",),
                "output_3": ("CONDITIONING",),
                "output_4": ("CONDITIONING",),
                "output_5": ("CONDITIONING",),
                "weight": ("FLOAT", {"default": 1.0, "min": -2.0, "max": 5.0, "step": 0.05}),
                "start_at": ("FLOAT", {"default": 0.0, "min": 0.0, "max": 1.0, "step": 0.001}),
                "end_at": ("FLOAT", {"default": 1.0, "min": 0.0, "max": 1.0, "step": 0.001}),
            }
        }

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "patch"

    CATEGORY = "advanced/model"

    def patch(self, model: comfy.model_patcher.ModelPatcher, all=None, input_4=None, input_5=None, input_7=None, input_8=None, middle_0=None, output_0=None, output_1=None, output_2=None, output_3=None, output_4=None, output_5=None, weight=1.0, start_at=0.0, end_at=1.0):
        if not any((all, input_4, input_5, input_7, input_8, middle_0, output_0, output_1, output_2, output_3, output_4, output_5)):
            return (model,)

        m = model.clone()
        sigma_start = m.get_model_object("model_sampling").percent_to_sigma(start_at)
        sigma_end = m.get_model_object("model_sampling").percent_to_sigma(end_at)

        patchedBlocks = {}
        blocks = {'input': [4, 5, 7, 8], 'middle': [0], 'output': [0, 1, 2, 3, 4, 5]}

        for block in blocks:
            for index in blocks[block]:
                value = locals()[f"{block}_{index}"] if locals()[f"{block}_{index}"] is not None else all
                if value is not None:
                    patchedBlocks[f"{block}:{index}"] = value

        m.set_model_attn2_patch(build_patch(patchedBlocks, weight=weight, sigma_start=sigma_start, sigma_end=sigma_end))

        return (m,)

class SimplePromptInjection:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "model": ("MODEL",),
            },
            "optional": {
                "block": (["input:4", "input:5", "input:7", "input:8", "middle:0", "output:0", "output:1", "output:2", "output:3", "output:4", "output:5"],),
                "conditioning": ("CONDITIONING",),
                "weight": ("FLOAT", {"default": 1.0, "min": -2.0, "max": 5.0, "step": 0.05}),
                "start_at": ("FLOAT", {"default": 0.0, "min": 0.0, "max": 1.0, "step": 0.001}),
                "end_at": ("FLOAT", {"default": 1.0, "min": 0.0, "max": 1.0, "step": 0.001}),
            }
        }

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "patch"

    CATEGORY = "advanced/model"

    def patch(self, model: comfy.model_patcher.ModelPatcher, block, conditioning=None, weight=1.0, start_at=0.0, end_at=1.0):
        if conditioning is None:
            return (model,)

        m = model.clone()
        sigma_start = m.get_model_object("model_sampling").percent_to_sigma(start_at)
        sigma_end = m.get_model_object("model_sampling").percent_to_sigma(end_at)

        m.set_model_attn2_patch(build_patch({f"{block}": conditioning}, weight=weight, sigma_start=sigma_start, sigma_end=sigma_end))

        return (m,)

class AdvancedPromptInjection:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "model": ("MODEL",),
            },
            "optional": {
                "locations": ("STRING", {"multiline": True, "default": "output:0,1.0\noutput:1,1.0"}),
                "conditioning": ("CONDITIONING",),
                "start_at": ("FLOAT", {"default": 0.0, "min": 0.0, "max": 1.0, "step": 0.001}),
                "end_at": ("FLOAT", {"default": 1.0, "min": 0.0, "max": 1.0, "step": 0.001}),
            }
        }

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "patch"

    CATEGORY = "advanced/model"

    def patch(self, model: comfy.model_patcher.ModelPatcher, locations: str, conditioning=None, start_at=0.0, end_at=1.0):
        if not conditioning:
            return (model,)

        m = model.clone()
        sigma_start = m.get_model_object("model_sampling").percent_to_sigma(start_at)
        sigma_end = m.get_model_object("model_sampling").percent_to_sigma(end_at)

        for line in locations.splitlines():
            line = line.strip().strip('\n')
            weight = 1.0
            if ',' in line:
                line, weight = line.split(',')
                line = line.strip()
                weight = float(weight)
            if line:
                m.set_model_attn2_patch(build_patch({f"{line}": conditioning}, weight=weight, sigma_start=sigma_start, sigma_end=sigma_end))

        return (m,)


NODE_CLASS_MAPPINGS = {
    "PromptInjection": PromptInjection,
    "SimplePromptInjection": SimplePromptInjection,
    "AdvancedPromptInjection": AdvancedPromptInjection,
    "SVDPromptInjection": SVDPromptInjection
}

NODE_DISPLAY_NAME_MAPPINGS = {
    "PromptInjection": "Attn2 Prompt Injection",
    "SimplePromptInjection": "Attn2 Prompt Injection (simple)",
    "AdvancedPromptInjection": "Attn2 Prompt Injection (advanced)",
    "SVDPromptInjection": "Attn2 SVD Prompt Injection"
}

Here are a bunch of SVD probe results for the hidden inputs/outputs I mentioned above. I was mostly interested in CLIPTextTransformer and CLIPVisionTransformer. Edit: I did find a way to get CLIP conditioning working with it to some extent, plus additional guidance image inputs; I will make a repo soon:

Added path: C:/Users/NewPC/Downloads for folder: checkpoints
config.json: 100%|████████████████████████████████████████████████████████████████████████| 4.19k/4.19k [00:00<?, ?B/s]
pytorch_model.bin: 100%|████████████████████████████████████████████████████████████| 605M/605M [00:48<00:00, 12.5MB/s]
C:\Users\NewPC\Downloads\venv sim\venv\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
preprocessor_config.json: 100%|███████████████████████████████████████████████████████████████| 316/316 [00:00<?, ?B/s]
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████| 592/592 [00:00<?, ?B/s]
vocab.json: 100%|███████████████████████████████████████████████████████████████████| 862k/862k [00:00<00:00, 5.51MB/s]
merges.txt: 100%|███████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 1.55MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████| 2.22M/2.22M [00:01<00:00, 1.88MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████| 389/389 [00:00<?, ?B/s]
Loading checkpoint from path: C:/Users/NewPC/Downloads/svd_merge_with_motionctrl_50-2.safetensors
Method: add_module
  Input Types: N/A
  Return Type: N/A
  Error: module name should be a string. Got NoneType
Method: apply
  Input Types: N/A
  Return Type: N/A
  Error: 'NoneType' object is not callable
Method: bfloat16
  Input Types: {}
  Return Type: WrappedModel
  Output Sample: WrappedModel(
  (model): Module()
  (clip_model): CLIPModel(
    (text_model): CLIPTextTransformer(
      (embeddings): CLIPTextEmbeddings(
        (token_embedding): Embedding(49408, 512)
        (position_embedding): Embedding(77, 512)
      )
      (encoder): CLIPEncoder(
        (layers): ModuleList(
          (0-11): 12 x CLIPEncoderLayer(
            (self_attn): CLIPAttention(
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (mlp): CLIPMLP(
              (activation_fn): QuickGELUActivation()
              (fc1): Linear(in_features=512, out_features=2048, bias=True)
              (fc2): Linear(in_features=2048, out_features=512, bias=True)
            )
            (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
      (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (vision_model): CLIPVisionTransformer(
      (embeddings): CLIPVisionEmbeddings(
        (patch_embedding): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
        (position_embedding): Embedding(50, 768)
      )
      (pre_layrnorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (encoder): CLIPEncoder(
        (layers): ModuleList(
          (0-11): 12 x CLIPEncoderLayer(
            (self_attn): CLIPAttention(
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (mlp): CLIPMLP(
              (activation_fn): QuickGELUActivation()
              (fc1): Linear(in_features=768, out_features=3072, bias=True)
              (fc2): Linear(in_features=3072, out_features=768, bias=True)
            )
            (layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
      (post_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
    (visual_projection): Linear(in_features=768, out_features=512, bias=False)
    (text_projection): Linear(in_features=512, out_features=512, bias=False)
  )
)
Method: buffers
  Input Types: {'recurse': 'NoneType'}
  Return Type: generator
  Output Sample: <generator object Module.buffers at 0x0000029E07A993F0>
Method: children
  Input Types: {}
  Return Type: generator
  Output Sample: <generator object Module.children at 0x0000029E07A99460>
Method: cpu
  Input Types: {}
  Return Type: WrappedModel
  Output Sample: WrappedModel(...)  [identical module summary to the bfloat16 output above]
Method: cuda
  Input Types: {'device': 'NoneType'}
  Return Type: WrappedModel
  Output Sample: WrappedModel(...)  [identical module summary to the bfloat16 output above]
Method: double
  Input Types: {}
  Return Type: WrappedModel
  Output Sample: WrappedModel(...)  [identical module summary to the bfloat16 output above]
Method: eval
  Input Types: {}
  Return Type: WrappedModel
  Output Sample: WrappedModel(...)  [identical module summary to the bfloat16 output above]
Method: extra_repr
  Input Types: {}
  Return Type: str
  Output Sample:
Method: float
  Input Types: {}
  Return Type: WrappedModel
  Output Sample: WrappedModel(...)  [identical module summary to the bfloat16 output above]
Method: forward
  Input Types: N/A
  Return Type: N/A
  Error: The image to be converted to a PIL image contains values outside the range [0, 1], got [-4.921847820281982, 4.285804271697998] which cannot be converted to uint8.
Method: get_buffer
  Input Types: N/A
  Return Type: N/A
  Error: 'NoneType' object has no attribute 'rpartition'
Method: get_extra_state
  Input Types: N/A
  Return Type: N/A
  Error: Reached a code path in Module.get_extra_state() that should never be called. Please file an issue at https://github.com/pytorch/pytorch/issues/new?template=bug-report.yml to report this bug.
Method: get_parameter
  Input Types: N/A
  Return Type: N/A
  Error: 'NoneType' object has no attribute 'rpartition'
Method: get_submodule
  Input Types: N/A
  Return Type: N/A
  Error: 'NoneType' object has no attribute 'split'
Method: half
  Input Types: {}
  Return Type: WrappedModel
  Output Sample: WrappedModel(...)  [identical module summary to the bfloat16 output above]
Method: ipu
  Input Types: N/A
  Return Type: N/A
  Error: PyTorch is not linked with support for ipu devices
Method: load_state_dict
  Input Types: N/A
  Return Type: N/A
  Error: Expected state_dict to be dict-like, got <class 'NoneType'>.
Method: modules
  Input Types: {}
  Return Type: generator
  Output Sample: <generator object Module.modules at 0x0000029E07A994D0>
Method: named_buffers
  Input Types: {'prefix': 'NoneType', 'recurse': 'NoneType', 'remove_duplicate': 'NoneType'}
  Return Type: generator
  Output Sample: <generator object Module.named_buffers at 0x0000029E07A995B0>
Method: named_children
  Input Types: {}
  Return Type: generator
  Output Sample: <generator object Module.named_children at 0x0000029E07A99620>
Method: named_modules
  Input Types: {'memo': 'NoneType', 'prefix': 'NoneType', 'remove_duplicate': 'NoneType'}
  Return Type: generator
  Output Sample: <generator object Module.named_modules at 0x0000029E07A99690>
Method: named_parameters
  Input Types: {'prefix': 'NoneType', 'recurse': 'NoneType', 'remove_duplicate': 'NoneType'}
  Return Type: generator
  Output Sample: <generator object Module.named_parameters at 0x0000029E07A99700>
Method: parameters
  Input Types: {'recurse': 'NoneType'}
  Return Type: generator
  Output Sample: <generator object Module.parameters at 0x0000029E07A997E0>
Method: register_backward_hook
  Input Types: {'hook': 'NoneType'}
  Return Type: RemovableHandle
  Output Sample: <torch.utils.hooks.RemovableHandle object at 0x0000029DFCD4B790>
Method: register_buffer
  Input Types: N/A
  Return Type: N/A
  Error: buffer name should be a string. Got NoneType
Method: register_forward_hook
  Input Types: {'hook': 'NoneType', 'prepend': 'NoneType', 'with_kwargs': 'NoneType'}
  Return Type: RemovableHandle
  Output Sample: <torch.utils.hooks.RemovableHandle object at 0x0000029E06913A30>
Method: register_forward_pre_hook
  Input Types: {'hook': 'NoneType', 'prepend': 'NoneType', 'with_kwargs': 'NoneType'}
  Return Type: RemovableHandle
  Output Sample: <torch.utils.hooks.RemovableHandle object at 0x0000029E06913850>
Method: register_full_backward_hook
  Input Types: N/A
  Return Type: N/A
  Error: Cannot use both regular backward hooks and full backward hooks on a single Module. Please use only one of them.
Method: register_full_backward_pre_hook
  Input Types: {'hook': 'NoneType', 'prepend': 'NoneType'}
  Return Type: RemovableHandle
  Output Sample: <torch.utils.hooks.RemovableHandle object at 0x0000029E06913A90>
Method: register_load_state_dict_post_hook
  Input Types: {'hook': 'NoneType'}
  Return Type: RemovableHandle
  Output Sample: <torch.utils.hooks.RemovableHandle object at 0x0000029E06911720>
Method: register_module
  Input Types: N/A
  Return Type: N/A
  Error: module name should be a string. Got NoneType
Method: register_parameter
  Input Types: N/A
  Return Type: N/A
  Error: parameter name should be a string. Got NoneType
Method: register_state_dict_pre_hook
  Input Types: {'hook': 'NoneType'}
  Return Type: RemovableHandle
  Output Sample: <torch.utils.hooks.RemovableHandle object at 0x0000029E06913CD0>
Method: requires_grad_
  Input Types: N/A
  Return Type: N/A
  Error: requires_grad_(): argument 'requires_grad' (position 1) must be bool, not NoneType
Method: set_extra_state
  Input Types: N/A
  Return Type: N/A
  Error: Reached a code path in Module.set_extra_state() that should never be called. Please file an issue at https://github.com/pytorch/pytorch/issues/new?template=bug-report.yml to report this bug.
Method: share_memory
  Input Types: {}
  Return Type: WrappedModel
  Output Sample: WrappedModel(...)  [identical module summary to the bfloat16 output above]
Method: state_dict
  Input Types: N/A
  Return Type: N/A
  Error: Module.state_dict() got an unexpected keyword argument 'args'
Method: to
  Input Types: N/A
  Return Type: N/A
  Error: to() received an invalid combination of arguments - got (args=NoneType, kwargs=NoneType, ), but expected one of:
 * (torch.device device, torch.dtype dtype, bool non_blocking, bool copy, *, torch.memory_format memory_format)
 * (torch.dtype dtype, bool non_blocking, bool copy, *, torch.memory_format memory_format)
 * (Tensor tensor, bool non_blocking, bool copy, *, torch.memory_format memory_format)

Method: to_empty
  Input Types: {'device': 'NoneType'}
  Return Type: WrappedModel
  Output Sample: WrappedModel(...)  [identical module summary to the bfloat16 output above]
Method: train
  Input Types: N/A
  Return Type: N/A
  Error: training mode is expected to be boolean
Method: type
  Input Types: N/A
  Return Type: N/A
  Error: _has_compatible_shallow_copy_type(): argument 'from' (position 2) must be Tensor, not str
Method: xpu
  Input Types: N/A
  Return Type: N/A
  Error: PyTorch is not linked with support for xpu devices
Method: zero_grad
  Input Types: {'set_to_none': 'NoneType'}
  Return Type: NoneType
  Output Sample: None

(venv) C:\Users\NewPC\Downloads\venv sim>

2nd PROBE

Current Probe Results

The current probe provides a detailed breakdown of the results, revealing both successful and failed method executions:
Successful Methods

Methods that executed successfully include add_module, apply, bfloat16, buffers, children, cpu, cuda, double, eval, extra_repr, float, forward, get_submodule, half, ipu, modules, named_buffers, named_children, named_modules, named_parameters, parameters, register_backward_hook, register_forward_hook, register_forward_pre_hook, register_full_backward_pre_hook, register_load_state_dict_post_hook, register_module, register_state_dict_pre_hook, requires_grad_, share_memory, to, train, type, xpu, and zero_grad.

Error-Prone Methods

Common Errors:
    NoneType attribute errors: Methods like get_buffer, get_parameter, get_submodule, register_buffer, and register_parameter failed due to encountering NoneType attributes.
    Specific argument errors: Methods like register_full_backward_hook and set_extra_state failed due to issues with method-specific arguments.
    Internal errors: Methods like load_state_dict, state_dict, and to_empty encountered internal errors related to unexpected keyword arguments or invalid combination of arguments.

Improving the Probing Script

To enhance the probing script further, consider the following updates:

Default Values for Specific Arguments: Ensure that the methods requiring specific argument types are provided with appropriate default values.
Enhanced Error Handling: Add more descriptive error messages and handle specific exceptions gracefully.

Here is an updated function to handle method-specific arguments more effectively:
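(The function itself was not included in the original post; the sketch below is one plausible shape for it. Helper names like default_args_for and probe_method, and the specific default values, are illustrative assumptions, not part of the repository.)

import torch

def default_args_for(name):
    """Best-guess default arguments for methods that cannot be called bare."""
    defaults = {
        "add_module": ("probe_child", torch.nn.Identity()),
        "register_buffer": ("probe_buffer", torch.zeros(1)),
        "register_parameter": ("probe_param", torch.nn.Parameter(torch.zeros(1))),
        "requires_grad_": (False,),
        "train": (True,),
        "to": (torch.device("cpu"),),
        "get_submodule": ("clip_model",),  # submodule name taken from the probe output above
    }
    return defaults.get(name, ())

def probe_method(model, name):
    """Call a single method with best-guess defaults and report the outcome."""
    method = getattr(model, name)
    try:
        result = method(*default_args_for(name))
        print(f"Method: {name}")
        print(f"  Return Type: {type(result).__name__}")
    except Exception as exc:  # keep probing even if this call fails
        print(f"Method: {name}")
        print(f"  Error: {exc}")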

4th PROBE
From the latest probe execution, we can extract and summarize several key learnings and issues:
Successful Method Probes:

Methods without Arguments:
    bfloat16, cpu, cuda, double, eval, float, half, ipu, share_memory, train, type, to_empty, xpu, zero_grad: These methods were successfully executed and returned a WrappedModel instance or NoneType where appropriate.
    buffers, children, modules, named_buffers, named_children, named_modules, named_parameters, parameters: These methods returned generator objects successfully, indicating they can iterate over model components.

Partially Successful Method Probes:

Methods with Simple Arguments:
    register_backward_hook, register_buffer, register_forward_hook, register_forward_pre_hook, register_full_backward_hook, register_full_backward_pre_hook, register_load_state_dict_post_hook, register_module, register_parameter, register_state_dict_pre_hook: These methods returned RemovableHandle objects or similar, indicating they can accept simple callable arguments.

Error-Inducing Methods:

Methods Requiring Specific Arguments:
    add_module, apply, get_buffer, get_extra_state, get_parameter, get_submodule, load_state_dict, register_buffer, set_extra_state, state_dict, to: These methods failed due to missing or inappropriate arguments.
    train: This method expected a boolean argument.
    forward: Although successfully executed with given inputs, the initial probe indicated a possible mismatch in expected input sizes or missing parameters.

Key Findings from forward Method:

Forward Method Outputs:
    Successfully executed with specific input types (e.g., init_image: Tensor, width: int, height: int, video_frames: int, motion_bucket_id: int, fps: int, augmentation_level: float).
    Returned a dictionary containing 'positive', 'negative', and 'latent' tensors, confirming the primary functionality and output structure (a minimal sketch of such a call is shown below).
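A minimal sketch of that forward call, assuming a probed WrappedModel instance is already loaded as model; the argument names mirror the input types listed above, while the frame size and other values are arbitrary examples and the exact signature of the wrapper is an assumption:

import torch

# Hypothetical re-run of the successful forward probe described above.
out = model.forward(
    init_image=torch.rand(1, 3, 576, 1024),  # dummy conditioning image
    width=1024, height=576,
    video_frames=16, motion_bucket_id=127, fps=8,
    augmentation_level=0.0,
)
for key, value in out.items():  # expected keys: 'positive', 'negative', 'latent'
    print(key, type(value))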

Next Steps and Recommendations:

Provide Appropriate Arguments:
    For methods like add_module, apply, get_buffer, etc., determine and provide appropriate arguments. Default test values or actual data inputs can be used to handle these methods correctly.

Enhance Probing Script:
    Update the probing script to handle methods with specific arguments and provide default values where necessary. Improve error handling to offer more specific messages.
    Example update for handling method arguments:
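One possible shape for that update (a sketch only, reusing the hypothetical probe_method / default_args_for helpers from the earlier snippet; none of these names exist in the repository):

def probe_all_methods(model):
    """Iterate over the wrapper's public callables and probe each one."""
    for name in dir(model):
        if name.startswith("_"):
            continue
        attr = getattr(model, name)
        if callable(attr):
            probe_method(model, name)  # reports return type or the raised error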

Potential Holes and Additional Inputs:

To fully explore the potential holes in the model and test additional inputs, consider the following:

Additional Input Types:
    Based on common practices in video generation models, incorporate more varied input types such as:
        noise: A tensor representing Gaussian noise.
        motion_vector: A tensor representing motion vectors in the video frames.
        style_transfer: A tensor for style transfer operations.
        keyframes: A tensor for keyframe extraction and interpolation.
        depth_map: A tensor representing depth information of the video frames.
        semantic_map: A tensor representing semantic segmentation maps.

Testing Return Types:
    Extend the testing to check the return types from various methods and ensure the correct handling of outputs like:
        CONDITIONING: Ensure the output is correctly conditioned.
        LATENT: Verify the latent space representation is accurate.
        FEATURES: Check for additional feature maps or tensors.

Based on the findings from your recent script execution, we have gained a clearer understanding of the model's methods and their input-output types. Here is a summary of the new insights compared to previous findings:

Successful Method Probes:
    bfloat16, cpu, cuda, double, eval, float, half, ipu, requires_grad_, share_memory, to, train, xpu, zero_grad:
        These methods return a WrappedModel instance, indicating they can be successfully executed without additional parameters. This was a confirmation of their utility in model manipulation and evaluation contexts.

Partially Successful Method Probes:
    buffers, children, modules, named_buffers, named_children, named_modules, named_parameters, parameters:
        These methods return a generator object, which is useful for iterating over different components of the model.

Error-Inducing Methods:
    Several methods like add_module, apply, get_buffer, get_extra_state, get_parameter, get_submodule, load_state_dict, register_*, set_extra_state, type, to_empty:
        These methods require additional arguments and cannot be executed with default or empty parameters. They are crucial for more advanced operations but need specific inputs.

Insights from forward Method:
    The forward method was successfully executed with specific input types and returned a dictionary containing 'positive', 'negative', and 'latent' tensors. This confirms the primary functionality of your model in terms of its output structure.

Model Summary:
    The model summary indicated an output shape of [16, 4, 72, 128] for the WrappedModel, with the total estimated size of the model being 7.08 MB. This suggests that the model is relatively lightweight in terms of memory usage.

Error Handling Improvements:
    The script successfully catches and prints errors for methods that require additional parameters, providing a clearer path for debugging and further development.

Next Steps:

Addressing Errors in Probed Methods:
    For methods like add_module, apply, get_buffer, etc., you need to provide appropriate parameters when calling these methods. Consider defining these parameters or setting default test values to handle them correctly.

Extending Probing and Summarization:
    Extend the probing script to handle more complex scenarios, such as testing with actual data inputs rather than randomly generated tensors. This could provide deeper insights into the model's behavior under real-world conditions.

Further Debugging:
    Focus on the methods that failed due to missing arguments. Determine the necessary parameters and retry probing these methods with appropriate inputs.

Based on your latest probe execution, we have gathered the following insights:

Successful Method Probes:
    bfloat16, cpu, cuda, double, eval, float, half, ipu, requires_grad_, share_memory, to_empty, train, type, xpu, zero_grad:
        These methods return a WrappedModel instance, indicating they can be successfully executed without additional parameters.

Partially Successful Method Probes:
    buffers, children, modules, named_buffers, named_children, named_modules, named_parameters, parameters:
        These methods return a generator object, which is useful for iterating over different components of the model.

Error-Inducing Methods:
    Methods like add_module, apply, get_buffer, get_extra_state, get_parameter, get_submodule, load_state_dict, register_*, set_extra_state, state_dict, to:
        These methods require additional arguments and cannot be executed with default or empty parameters.

Insights from forward Method:
    The forward method was successfully executed with specific input types and returned a dictionary containing 'positive', 'negative', and 'latent' tensors. This confirms the primary functionality of your model in terms of its output structure.

Error Handling Improvements:
    The script successfully catches and prints errors for methods that require additional parameters, providing a clearer path for debugging and further development.

Next Steps:

Addressing Errors in Probed Methods:
    For methods like add_module, apply, get_buffer, etc., you need to provide appropriate parameters when calling these methods. Consider defining these parameters or setting default test values to handle them correctly.

Extending Probing and Summarization:
    Extend the probing script to handle more complex scenarios, such as testing with actual data inputs rather than randomly generated tensors. This could provide deeper insights into the model's behavior under real-world conditions.

Further Debugging:
    Focus on the methods that failed due to missing arguments. Determine the necessary parameters and retry probing these methods with appropriate inputs.

Update the Probing Script:
    Enhance the script to provide more specific error messages and handle various input types. Here’s an updated snippet for the probing script:
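(The snippet itself was not attached to the issue; the sketch below illustrates one way to make the error messages more specific, reporting the expected signature of a failing method via inspect. The helper name describe_failure is an illustrative assumption.)

import inspect

def describe_failure(method, exc):
    """Print the failing method's signature alongside the raised error."""
    try:
        sig = inspect.signature(method)
    except (TypeError, ValueError):  # some C-implemented methods expose no signature
        sig = "<signature unavailable>"
    print(f"  Expected Signature: {sig}")
    print(f"  Error: {type(exc).__name__}: {exc}")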

SD3 negative injection?

Since some people have recently found success improving human anatomy by adding NSFW terms to the negative prompt when generating with SD3, can this method be used to solve the weird human anatomy problem of SD3 Medium?
