mlc-ai / web-stable-diffusion

Bringing stable diffusion models to web browsers. Everything runs inside the browser with no server support.

Home Page: https://mlc.ai/web-stable-diffusion

License: Apache License 2.0

Python 20.64% HTML 0.30% Shell 0.26% JavaScript 2.57% Jupyter Notebook 76.23%
webgpu deep-learning stable-diffusion web-assembly webml tvm

web-stable-diffusion's Introduction

Web Stable Diffusion

This project brings stable diffusion models to web browsers. Everything runs inside the browser with no server support. To our knowledge, this is the world’s first stable diffusion running completely in the browser. Please check out our demo webpage to try it out.

You are also more than welcome to check out Web LLM if you are interested in deploying LLM-based chatbots to the browser.

Browser screenshot

We have been seeing amazing progress in AI models recently. Thanks to the open-source effort, developers can now easily compose open-source models to accomplish amazing tasks. Stable diffusion enables the automatic creation of photorealistic images, as well as images in various styles, from text input. These models are usually big and compute-heavy, which means all computation requests have to be piped to (GPU) servers when building web applications on top of them. Additionally, most of the workloads have to run on specific types of GPUs where popular deep-learning frameworks are readily available.

This project takes a step toward changing that status quo and bringing more diversity to the ecosystem. There are many reasons to move some (or all) of the computation to the client side: it can reduce costs for the service provider and enhance personalization and privacy protection. Personal computers (and even mobile devices) are developing in a direction that makes this possible; the client side is getting quite powerful.

Building special client apps for those applications is one option (which we also support), but wouldn’t it be even more amazing if we could simply open a browser and bring AI natively to your browser tab? The ecosystem has some level of readiness: WebAssembly allows us to port lower-level runtimes onto the web, and WebGPU, which is maturing quickly, enables native GPU execution in the browser.

We are just seeing the necessary elements coming together on the client side, both in terms of hardware and the browser ecosystem. Still, there are big hurdles to cross, to name a few:

  • We need to bring the models to an environment without the relevant GPU-accelerated Python frameworks.
  • Most AI frameworks rely heavily on optimized compute libraries that are maintained by hardware vendors. We need to start from zero. To get the maximum benefit, we might also need to produce variants per client environment.
  • We need to plan memory usage carefully so we can fit the models into memory.

We do not want to do this for just one model. Instead, we would like to present a repeatable, hackable, composable workflow that enables anyone to easily develop and optimize these models in a Python-first environment and universally deploy them everywhere, including the web.

Get Started

We have a Jupyter notebook that walks you through all the stages, including

  • explain the key points of web ML model deployment and how we address them,
  • import the stable diffusion model,
  • optimize the model,
  • build the model,
  • deploy the model locally with native GPU runtime, and
  • deploy the model on web with WebGPU runtime.

If you want to go through these steps in command line, please follow the commands below:

Commands
  • Install TVM Unity. You can either

    • use pip3 install mlc-ai-nightly -f https://mlc.ai/wheels to install the TVM Unity wheel, or
    • follow TVM’s documentation to build from source. Please use git checkout origin/unity to check out the TVM Unity branch after git clone.
  • To import, optimize and build the stable diffusion model:

    python3 build.py

    By default build.py takes apple/m2-gpu as the build target. You can also specify a CUDA target via

    python3 build.py --target cuda
  • To deploy the model locally with native GPU runtime:

    python3 deploy.py --prompt "A photo of an astronaut riding a horse on mars."

    You can substitute the prompt with your own, and optionally use --negative-prompt "Your negative prompt" to specify a negative prompt.

  • To deploy the model on the web with the WebGPU runtime, the last section “Deploy on web” of the walkthrough notebook lists the full instructions, which you can refer to. We also provide the same list of plain instructions here:

    Instructions

    First, let’s install all the prerequisites:

    1. emscripten. It is an LLVM-based compiler that compiles C/C++ source code to WebAssembly.
      • Follow the installation instructions to install the latest emsdk.
      • Source emsdk_env.sh by source path/to/emsdk_env.sh, so that emcc is reachable from PATH and the command emcc works.
    2. Rust.
    3. wasm-pack. It helps build Rust-generated WebAssembly, which is used for the tokenizer in our case.
    4. Install jekyll by following the official guides. It is the package we use for the website.
    5. Install jekyll-remote-theme by command
      gem install jekyll-remote-theme
    6. Install Chrome Canary. It is a developer version of Chrome that enables the use of WebGPU.

    We can verify a successful installation by trying out emcc, jekyll, and wasm-pack in the terminal.

    Then, prepare all the necessary dependencies for web build:

    ./scripts/prep_deps.sh

    We can now build the model for the WebGPU backend and export the executable to disk in the WebAssembly file format by running

    python3 build.py --target webgpu

    The last thing to do is to set up the site with

    ./scripts/local_deploy_site.sh

    With the site set up, you can go to localhost:8888/ in Chrome Canary to try out the demo on your local machine. Don’t forget to use

    /Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --enable-dawn-features=disable_robustness

    to launch Chrome Canary to turn off the robustness check from Chrome.

How?

The key technology here is machine learning compilation (MLC). Our solution is built on the shoulders of the open-source ecosystem, including PyTorch, Hugging Face diffusers and tokenizers, Rust, wasm, and WebGPU. The main flow is built on Apache TVM Unity, an exciting ongoing development in the Apache TVM project:

  • We take Runway’s stable diffusion v1-5 models from the Hugging Face diffusers library.
  • We use TorchDynamo and Torch FX to capture key model components into an IRModule in TVM (a minimal capture sketch follows this list).
  • Each function in TVM’s IRModule can be further transformed to generate runnable code that can be deployed universally in any environment supported by a minimal TVM runtime (JavaScript being one of them).
  • TensorIR and MetaSchedule are used to build automated solutions to generate optimized programs. These transformations are tuned on a specific device through native GPU runtimes and then used to generate optimized GPU shaders. We provide a database that records these transformations so new builds can be done without tuning.
  • We build static memory planning optimizations to reuse memory across multiple layers.
  • We use Emscripten and TypeScript to build a TVM web runtime that can deploy the generated modules.
  • We also leverage the wasm port of the Rust tokenizers library from Hugging Face.
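
To make the TorchDynamo capture step concrete, here is a minimal sketch. It assumes the mlc-ai-nightly TVM Unity wheel and PyTorch are installed; TinyModel is a hypothetical stand-in for a real component such as the CLIP text encoder, not code from this repository:

import torch
import tvm
from tvm.relax.frontend.torch.dynamo import dynamo_capture_subgraphs

class TinyModel(torch.nn.Module):
    # Hypothetical stand-in for a traced model component.
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel()
x = torch.rand((1, 64), dtype=torch.float32)

# Capture the forward pass into a Relax function inside a TVM IRModule;
# keep_params_as_input=True lifts the weights out as explicit inputs.
mod = dynamo_capture_subgraphs(model.forward, x, keep_params_as_input=True)
print(mod)  # IRModule containing a "subgraph_0" function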

workflow

All parts of this workflow are done in Python, except, of course, the last part, which builds a 400-loc JavaScript app that connects everything together. Bringing up new models this way is also a fun process of interactive development.

All of this is made possible by the open-source ecosystem we leverage. Specifically, we make heavy use of TVM Unity, an exciting recent development in the TVM project that enables a Python-first interactive MLC development experience, letting us easily compose new optimizations, all in Python, and incrementally bring our app to the web. TVM Unity also provides an easy way to compose new solutions in the ecosystem; for example, we can easily bring other WebGPU shader generators or shader libraries into this workflow in the future.
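
As a rough illustration of that Python-first flow, the following sketch shows how lowering passes, a pre-tuned MetaSchedule database, and the final build compose in plain Python. It assumes the mlc-ai-nightly wheel, an IRModule `mod` produced by the tracing step, and a tuning database under log_db/; it is a simplified approximation, not the exact pipeline in build.py:

import tvm
from tvm import meta_schedule as ms
from tvm import relax

def optimize_and_build(mod: tvm.IRModule, target_str: str = "apple/m2-gpu"):
    target = tvm.target.Target(target_str)
    # Lower high-level Relax operators into TensorIR functions that can be scheduled.
    mod = relax.transform.LegalizeOps()(mod)
    # Apply schedules recorded in the tuning database so no re-tuning is needed.
    db = ms.database.create(work_dir="log_db")
    with target, db, tvm.transform.PassContext(opt_level=3):
        mod = relax.transform.MetaScheduleApplyDatabase()(mod)
        ex = relax.build(mod, target)
    return ex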

Comparison with Native GPU Runtime, Limitations, and Opportunities

Besides the WebGPU runtime, we also provide options for native deployment with a local GPU runtime. These options can be used both as a tool to deploy in a native environment and as a reference point for comparing native GPU driver performance with WebGPU.

WebGPU works by translating WGSL (WebGPU Shading Language) shaders to native shaders. So, in theory, we can reach zero gap between the WebGPU runtime and the native environment. If we directly use Chrome to check the current demo on Apple silicon, however, we find a performance degradation (about 3x). This is because Chrome’s WebGPU implementation inserts bound clips for every array index access, so that a[i] becomes a[min(i, a.size)]. Ideally, downstream shader compilers should be able to optimize the bound clipping away, but unfortunately that is not the case here. This gap can be fixed once the WebGPU implementation becomes more mature, checks the index access range, and drops such clipping.

You can get around this by launching Chrome with a special flag (thanks to the Dawn developers for providing the pointers): exit Chrome completely, then in the command line, type

/path/to/chrome-canary --enable-dawn-features=disable_robustness

Then you will find that the execution speed is as fast as in the native GPU environment. We anticipate this problem will be resolved as WebGPU matures.

We are just seeing the dawn of what we believe to be an eruption. WebGPU is still evolving (though it is getting close to shipping this year), is only available through Chrome Canary, and can be unstable. It also still comes with limitations, such as supporting only FP32 (the FP16 shader extension is in the spec but not yet implemented). The stable diffusion model here requires a GPU with a decent amount of RAM (8GB). We have only tested our solution on Apple silicon so far. There are also opportunities to support advanced optimizations such as FlashAttention and quantization to further improve the performance of the system.

These are opportunities to bring several-fold performance improvements to the current solution. We believe many of them can be tackled in the near future. A single component of this solution can still be useful on its own; for example, one can choose to deploy just the text encoder part of the model. Additionally, the same Python-first development and universal deployment workflow can be used to bring ML models to other environments, such as new hardware or mobile cases. Finally, the same machine learning compilation stack is shared with server-class use cases and can be used to optimize server workloads as well.

Acknowledgement

This project is made possible thanks to collaboration with

CMU School of Computer Science Catalyst MLC OctoML

This project is only possible thanks to the shoulders of the open-source ecosystems that we stand on. We want to thank the Apache TVM community and the developers of the TVM Unity effort. We want to thank the open-source ML community members who make these models publicly available, and the PyTorch and Hugging Face communities that make them accessible. We would like to thank Mithril Security for their wasm port of the tokenizer. We also would like to thank the WebAssembly, Emscripten, Rust, and WebGPU communities. Finally, thanks to the Dawn developers, who provide timely answers to questions on Chrome.

web-stable-diffusion's People

Contributors

ehsanmok, eltociear, g-schaffer, ganler, guoyaol, masahi, masterjh5574, ranqiangjun, sunzj, tqchen


web-stable-diffusion's Issues

Can we have builds to go along with the updates?

I saw a rather interesting update commit 2cbd64d and would like to try it out, but I have an RX 580 and couldn't get it to build without CUDA in WSL. Could someone do a dist build and/or update the github pages example?

Thanks!

Demo fails on latest tvm/unity branch

I'm currently going through the web-stable-diffusion demo, and I'm finding that the clip_to_text_embeddings function isn't adding functions to the exported IRModule, causing the assert to fail. This issue seems to have started appearing after March 20th, and I'm working to track down the issue in the repo:

This seems to happen on macOS and Linux. I haven't tried on Windows because PyTorch doesn't support compiled models on Windows yet.

I'm not sure if there's been an API change on TVM, or if this is a TVM bug or an issue with the demo.

Cannot build stable diffusion model: "BackendCompilerFailed: backend='_capture' raised AssertionError"

I tried building the stable diffusion model using the walkthrough.ipynb notebook or the build.py file, but when I run the "Combine every piece together" part :

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = clip_to_text_embeddings(pipe)
unet = unet_latents_to_noise_pred(pipe, torch_dev_key)
vae = vae_to_image(pipe)
concat_embeddings = concat_embeddings()
image_to_rgba = image_to_rgba()
schedulers = [
    dpm_solver_multistep_scheduler_steps(),
    trace.PNDMScheduler.scheduler_steps()
]

mod: tvm.IRModule = utils.merge_irmodules(
    clip,
    unet,
    vae,
    concat_embeddings,
    image_to_rgba,
    *schedulers,
)

Both results in the same error:

  File "/usr/local/lib/python3.10/dist-packages/torch/__init__.py", line 1565, in __call__
    return self.compiler_fn(model_, inputs_, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tvm/relax/frontend/torch/dynamo.py", line 151, in _capture
    mod_ = from_fx(
  File "/usr/local/lib/python3.10/dist-packages/tvm/relax/frontend/torch/fx_translator.py", line 1387, in from_fx
    return TorchFXImporter().from_fx(
  File "/usr/local/lib/python3.10/dist-packages/tvm/relax/frontend/torch/fx_translator.py", line 1282, in from_fx
    assert output is not None
BackendCompilerFailed: backend='_capture' raised:
AssertionError: 


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

It seems there is a problem with TorchDynamo

Also a somewhat unrelated error, but I couldn't install the CUDA version of the mlc/tvm package:

!python3 -m pip install mlc-ai-nightly-cu116 -f https://mlc.ai/wheels

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://mlc.ai/wheels
ERROR: Could not find a version that satisfies the requirement mlc-ai-nightly-cu116 (from versions: none)
ERROR: No matching distribution found for mlc-ai-nightly-cu116

Both errors can be reproduced by running the notebook on google colab

TVMError: Check failed: (!name_supply_->ContainsName(global_symbol_value)) is false: IRModule contains duplicate global symbol: main

Hello, everyone.
I followed the Get Started section of https://github.com/mlc-ai/web-stable-diffusion and ran:
web-stable-diffusion git:(main) python build.py
and got the following error:

Automatically configuring target: metal -keys=metal,gpu -max_function_args=31 -max_num_threads=256 -max_shared_memory_per_block=32768 -max_threads_per_block=1024 -thread_warp_size=32
Load cached module from dist/mod_cache_before_build.pkl and skip tracing. You can use --use-cache=0 to retrace
[13:09:57] /Users/qiyei2016/work/AI/tvm/src/relax/transform/meta_schedule.cc:166: Warning: Tuning record is not found for primfunc: matmul2
[13:09:57] /Users/qiyei2016/work/AI/tvm/src/relax/transform/meta_schedule.cc:166: Warning: Tuning record is not found for primfunc: fused_matmul1_multiply1
[13:09:57] /Users/qiyei2016/work/AI/tvm/src/relax/transform/meta_schedule.cc:166: Warning: Tuning record is not found for primfunc: fused_transpose5_reshape4
[13:09:57] /Users/qiyei2016/work/AI/tvm/src/relax/transform/meta_schedule.cc:166: Warning: Tuning record is not found for primfunc: fused_reshape3_transpose3_transpose4
[13:09:57] /Users/qiyei2016/work/AI/tvm/src/relax/transform/meta_schedule.cc:166: Warning: Tuning record is not found for primfunc: fused_reshape2_transpose_transpose1
[13:09:57] /Users/qiyei2016/work/AI/tvm/src/relax/transform/meta_schedule.cc:166: Warning: Tuning record is not found for primfunc: group_norm1
[13:09:57] /Users/qiyei2016/work/AI/tvm/src/relax/transform/meta_schedule.cc:166: Warning: Tuning record is not found for primfunc: fused_conv2d2_add1_add2_divide_divide
[13:09:58] /Users/qiyei2016/work/AI/tvm/src/relax/transform/meta_schedule.cc:166: Warning: Tuning record is not found for primfunc: fused_reshape3_transpose3
[13:09:58] /Users/qiyei2016/work/AI/tvm/src/relax/transform/meta_schedule.cc:166: Warning: Tuning record is not found for primfunc: softmax
[13:09:58] /Users/qiyei2016/work/AI/tvm/src/relax/transform/meta_schedule.cc:166: Warning: Tuning record is not found for primfunc: transpose
Traceback (most recent call last):
  File "/Users/qiyei2016/work/AI/web-stable-diffusion/build.py", line 169, in <module>
    build(mod, ARGS)
  File "/Users/qiyei2016/work/AI/web-stable-diffusion/build.py", line 137, in build
    ex = relax.build(mod_deploy, args.target)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/qiyei2016/work/AI/tvm/python/tvm/relax/vm_build.py", line 335, in build
    mod = pipeline(mod)
          ^^^^^^^^^^^^^
  File "/Users/qiyei2016/work/AI/tvm/python/tvm/ir/transform.py", line 238, in __call__
    return _ffi_transform_api.RunPass(self, mod)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/qiyei2016/work/AI/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/Users/qiyei2016/work/AI/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/Users/qiyei2016/work/AI/tvm/python/tvm/relax/pipeline.py", line 99, in _pipeline
    mod = seq(mod)
          ^^^^^^^^
  File "/Users/qiyei2016/work/AI/tvm/python/tvm/ir/transform.py", line 238, in __call__
    return _ffi_transform_api.RunPass(self, mod)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/qiyei2016/work/AI/tvm/python/tvm/ir/transform.py", line 307, in _pass_func
    return inst.transform_module(mod, ctx)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/qiyei2016/work/AI/tvm/python/tvm/relax/backend/dispatch_sort_scan.py", line 163, in transform_module
    return sort_scan_dispater.builder_.finalize()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/qiyei2016/work/AI/tvm/python/tvm/relax/block_builder.py", line 682, in finalize
    return _ffi_api.BlockBuilderFinalize(self)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/qiyei2016/work/AI/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/Users/qiyei2016/work/AI/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "/Users/qiyei2016/work/AI/tvm/src/relax/transform/normalize.cc", line 234
TVMError: Check failed: (!name_supply_->ContainsName(global_symbol_value)) is false: IRModule contains duplicate global symbol: main

The environment is:
OS: macOS 14.4
CPU: M1 Max
Xcode: 15.4

building for target webgpu results in "ValueError: At least one GPU backend is expected to be enabled"

Using version https://github.com/mlc-ai/web-stable-diffusion/tree/ce0c2fbd0fffd7ee39e7be9da34052a8809d98db

environment: Ubuntu 22 LTS server without a graphics card.

Executing

python3 build.py --target webgpu

causes the following error:

Traceback (most recent call last):
  File "build.py", line 153, in <module>
    torch_dev_key = utils.detect_available_torch_device()
  File "web_stable_diffusion/utils.py", line 14, in detect_available_torch_device
    raise ValueError("At least one GPU backend is expected to be enabled")
ValueError: At least one GPU backend is expected to be enabled

See

raise ValueError("At least one GPU backend is expected to be enabled")
.

Is it possible to enable a GPU backend in torch even if the building system environment does not provide that GPU backend?

The operation failed for an operation-specific reason

I get the error "Generate error, OperationError: The operation failed for an operation-specific reason" when trying to use the demo of both this (WebSD) and WebLLM. My computer has a 12th Gen Intel(R) Core(TM) i7-12700F CPU, 2x 8GB DDR4 memory, 1x SSD, 2x HDD, and an NVIDIA GeForce RTX 3060. It is running Windows 10 (Edition: Windows 10 Home, Version: 22H2, Installed on: 7/15/2023, OS build: 19045.3208, Experience: Windows Feature Experience Pack 1000.19041.1000.0), with Google Chrome Version 115.0.5790.110 (Official Build) (64-bit).

Attached is a screenshot of WebSD after it gave the error and froze.
Screenshot 2023-08-03 215302

AssertionError: Unsupported function type position_ids

Hi, I am trying to run the build example build.py but I am getting the error:

(webgpu-env) user@le-big-mac repos/web-stable-diffusion (main *) » python3 build.py
Automatically configuring target: metal -keys=metal,gpu -max_function_args=31 -max_num_threads=256 -max_shared_memory_per_block=32768 -max_threads_per_block=1024 -thread_warp_size=32
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["bos_token_id"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["eos_token_id"]` will be overriden.
Traceback (most recent call last):
  File "~Dropbox/repos/web-stable-diffusion/build.py", line 157, in <module>
    mod, params = trace_models(torch_dev_key)
  File "~Dropbox/repos/web-stable-diffusion/build.py", line 80, in trace_models
    clip = trace.clip_to_text_embeddings(pipe)
  File "~Dropbox/repos/web-stable-diffusion/web_stable_diffusion/trace/model_trace.py", line 27, in clip_to_text_embeddings
    mod = dynamo_capture_subgraphs(
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/tvm/relax/frontend/torch/dynamo.py", line 198, in dynamo_capture_subgraphs
    compiled_model(*params, **kwargs)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 408, in _fn
    return fn(*args, **kwargs)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 569, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 671, in _convert_frame
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 377, in _convert_frame_assert
    return _compile(
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 595, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 243, in time_wrapper
    r = func(*args, **kwargs)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 512, in compile_inner
    out_code = transform_code_object(code, transform)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 150, in _fn
    return fn(*args, **kwargs)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 477, in transform
    tracer.run()
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2120, in run
    super().run()
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 815, in run
    and self.step()
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 778, in step
    getattr(self, inst.opname)(inst)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2235, in RETURN_VALUE
    self.output.compile_subgraph(
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 880, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1025, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 243, in time_wrapper
    r = func(*args, **kwargs)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1096, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1077, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/torch/__init__.py", line 1655, in __call__
    return self.compiler_fn(model_, inputs_, **self.kwargs)
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/tvm/relax/frontend/torch/dynamo.py", line 184, in _capture
    mod_ = from_fx(
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/tvm/relax/frontend/torch/fx_translator.py", line 1635, in from_fx
    return TorchFXImporter().from_fx(
  File "~miniconda3/envs/webgpu-env/lib/python3.10/site-packages/tvm/relax/frontend/torch/fx_translator.py", line 1520, in from_fx
    func_name in self.convert_map
torch._dynamo.exc.BackendCompilerFailed: backend='_capture' raised:
AssertionError: Unsupported function type position_ids

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

OS: macOS Sonoma 14.0
Device: M1 Mac
Python: 3.10

Jupyter walkthrough trace_models

This can be fixed with the following patch:
diff --git a/python/tvm/relax/frontend/torch/fx_translator.py b/python/tvm/relax/frontend/torch/fx_translator.py
index fde31af60..392ff2b39 100644
--- a/python/tvm/relax/frontend/torch/fx_translator.py
+++ b/python/tvm/relax/frontend/torch/fx_translator.py
@@ -177,6 +177,23 @@ class TorchFXImporter:
             return self._call_binary_op(
                 relax.op.add, relax.const(lhs, dtype=rhs.struct_info.dtype), rhs
             )
+        elif isinstance(lhs, int):
+            return self._call_binary_op(
+                relax.op.add, relax.const(lhs, dtype="int64"), rhs
+            )
+        elif isinstance(rhs, int):
+            return self._call_binary_op(
+                relax.op.add, lhs, relax.const(rhs, dtype="int64")
+            )
+        elif isinstance(lhs, float):
+            return self._call_binary_op(
+                relax.op.add, relax.const(lhs, dtype="float32"), rhs
+            )
+        elif isinstance(rhs, float):
+            return self._call_binary_op(
+                relax.op.add, lhs, relax.const(rhs, dtype="float32")
+            )
+
         return lhs + rhs
 
     def _max(self, node: fx.node.Node) -> relax.Expr:

Originally posted by @haili-tian in #36 (comment)

After adding this patch, clip_to_text_embeddings(pipe) works, but executing vae_to_image(pipe) runs into another error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
File /opt/anaconda/envs/venv-mlc/lib/python3.10/site-packages/torch/_dynamo/output_graph.py:670, in OutputGraph.call_user_compiler(self, gm)
    669 else:
--> 670     compiled_fn = compiler_fn(gm, self.fake_example_inputs())
    671 _step_logger()(logging.INFO, f"done compiler function {name}")

File /opt/anaconda/envs/venv-mlc/lib/python3.10/site-packages/torch/_dynamo/debug_utils.py:1055, in wrap_backend_debug.<locals>.debug_wrapper(gm, example_inputs, **kwargs)
   1054 else:
-> 1055     compiled_gm = compiler_fn(gm, example_inputs)
   1057 return compiled_gm

File /opt/anaconda/envs/venv-mlc/lib/python3.10/site-packages/tvm/relax/frontend/torch/dynamo.py:161, in dynamo_capture_subgraphs.<locals>._capture(graph_module, example_inputs)
    160 input_info = [(tuple(tensor.shape), str(tensor.dtype)) for tensor in example_inputs]
--> 161 mod_ = from_fx(
    162     graph_module,
    163     input_info,
    164     keep_params_as_input=keep_params_as_input,
    165     unwrap_unit_return_tuple=True,
    166 )
    167 new_name = f"subgraph_{len(mod.get_global_vars())}"

File /opt/anaconda/envs/venv-mlc/lib/python3.10/site-packages/tvm/relax/frontend/torch/fx_translator.py:1492, in from_fx(model, input_info, keep_params_as_input, unwrap_unit_return_tuple, no_bind_return_tuple)
   1404 """Convert a PyTorch FX GraphModule to a Relax program
   1405 
   1406 Parameters
   (...)
   1490 check the placeholder rows in the beginning of the tabular.
   1491 """
-> 1492 return TorchFXImporter().from_fx(
   1493     model, input_info, keep_params_as_input, unwrap_unit_return_tuple, no_bind_return_tuple
   1494 )

File /opt/anaconda/envs/venv-mlc/lib/python3.10/site-packages/tvm/relax/frontend/torch/fx_translator.py:1377, in TorchFXImporter.from_fx(self, model, input_info, keep_params_as_input, unwrap_unit_return_tuple, no_bind_return_tuple)
   1375 func_name = node.name.rstrip("0123456789_")
   1376 assert (
-> 1377     func_name in self.convert_map
   1378 ), f"Unsupported function type {func_name}"
   1379 self.env[node] = self.convert_map[func_name](node)

AssertionError: Unsupported function type conv2d

The above exception was the direct cause of the following exception:

BackendCompilerFailed                     Traceback (most recent call last)
Cell In[13], line 1
----> 1 vae = vae_to_image(pipe)

Cell In[10], line 22, in vae_to_image(pipe)
     19 vae_to_image = VAEModelWrapper(vae)
     21 z = torch.rand((1, 4, 64, 64), dtype=torch.float32)
---> 22 mod = dynamo_capture_subgraphs(
     23     vae_to_image.forward,
     24     z,
     25     keep_params_as_input=True,
     26 )
     27 assert len(mod.functions) == 1
     29 return tvm.IRModule({"vae": mod["subgraph_0"]})

File /opt/anaconda/envs/venv-mlc/lib/python3.10/site-packages/tvm/relax/frontend/torch/dynamo.py:175, in dynamo_capture_subgraphs(model, *params, **kwargs)
    172 compiled_model = torch.compile(model, backend=_capture)
    174 with torch.no_grad():
--> 175     compiled_model(*params, **kwargs)
    177 return mod

It seems conv2d is treated as a call_function instead of a call_module, so it does not exist in self.convert_map, but I do not understand why conv2d becomes a function instead of a module.

Exporting Stable Diffusion's VAE model fails

I'm using torch 2.1.0.dev20230425+cpu and diffusers 0.16 to build stable diffusion v1_5, but I got the following error:
assert ( AssertionError: Unsupported function type scaled_dot_product_attention

I printed the symbolically traced graph, and I found the VAE module is using the torch._C._nn.scaled_dot_product_attention op:

scaled_dot_product_attention = torch._C._nn.scaled_dot_product_attention(permute, permute_1, permute_2, dropout_p = 0.0, is_causal = False); permute_1 = permute_2 = None
I can attach the code that makes this happen:
output.zip

Fails on Windows (and Android) with "maxStorageBufferBindingSize exceeds limit. requested=1024MB, limit=128MB."

Hi,
testing the latest Chrome Canary on Windows and Android:
Windows: RTX 4070
Android: Poco F3
Both fail with:
"Find an error initializing the WebGPU device Error: Cannot initialize runtime because of requested maxStorageBufferBindingSize exceeds limit. requested=1024MB, limit=128MB."

I also changed the launch flags to:
--enable-dawn-features=disable_robustness
but that doesn't make a difference.
Thanks!

Loading models from disk

First, this is extremely cool.

Second: I note that there are an absolute ton of models derived from Stable Diffusion around. Depending on your device, you might be able to fit a few of them in the cache TVM uses, but it requires duplicating the entire model, which is a waste of disk space if you have a local copy. (This matters a lot when you have dozens of models i.e. hundreds of GBs' worth.) And if it gets evicted from the cache, or try to run it in another browser or an incognito window or whatever, you have to download it again.

It would be nice if you could keep your model files on disk and load them into memory without making a copy (either using the file system API or drag-and-drop). Is that something that could happen? Poking around a bit, it seems like maybe that would require a change to TVM, but I don't know for sure.

(Someday I'd like to get to a place where whole fancy UIs for stable diffusion, including all of the computation, can be hosted on Github Pages, ideally multiple such UIs. And for multiple UIs to be practical, they'll need to be able to load models from disk, since they can't share the cache.)

Is it possible to adjust the 512x512 to a different height/width?

I have been adding this amazing software to my personal website and currently I have it making new wallpapers every few minutes. But what I would love is if I could sync up the height/width of the viewport with the image generation. I see some hardcoded 512's but I am unable to build this without CUDA (at least in WSL). Is it possible on the JavaScript side to set these values? Thanks!

Can I auto-tune SD models by myself?

with tvm.transform.PassContext(opt_level=3):
    ex = relax.build(mod_deploy, args.target)
args.target: "cuda"
pip install -I mlc_ai_nightly_cu121 -f https://mlc.ai/wheels

but I get the errors below:

Traceback (most recent call last):
File "web-stable-diffusion/build.py", line 184, in
build(mod, ARGS)
File "web-stable-diffusion/build.py", line 151, in build
ex = relax.build(mod_deploy, args.target)
File "/usr/local/lib/python3.8/dist-packages/tvm/relax/vm_build.py", line 338, in build
return _vmlink(builder, target, tir_mod, ext_libs, params, system_lib=system_lib)
File "/usr/local/lib/python3.8/dist-packages/tvm/relax/vm_build.py", line 242, in _vmlink
lib = tvm.build(
File "/usr/local/lib/python3.8/dist-packages/tvm/driver/build_module.py", line 281, in build
rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
File "tvm/_ffi/_cython/./packed_func.pxi", line 331, in tvm._ffi._cy3.core.PackedFuncBase.call
File "tvm/_ffi/_cython/./packed_func.pxi", line 262, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 251, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 181, in tvm._ffi._cy3.core.CHECK_CALL
tvm._ffi.base.TVMError: Traceback (most recent call last):
10: TVMFuncCall
9: ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_15TypedPackedFuncIFNS0_6ModuleERKNS0_3MapINS_6TargetENS_8IRModuleEvvEES7_EE17AssignTypedLambdaINS_UlSB_S7_E4_EEEvT_SsEUlRKNS0_7TVMArgsEPNS0_11TVMRetValueEE_EEE4CallEPKS1_SH_SL
8: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
7: tvm::SplitMixedModule(tvm::IRModule, tvm::Target const&, tvm::Target const&)
6: tvm::ApplyPasses(tvm::IRModule, tvm::transform::Sequential)
5: tvm::transform::Pass::operator()(tvm::IRModule) const
4: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
3: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
2: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
1: tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
0: ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_15TypedPackedFuncIFNS_8IRModuleES5_NS_9transform11PassContextEEE17AssignTypedLambdaIZNS_3tir9transform12VerifyMemoryEvEUlS5_S7_E_EEvT_EUlRKNS0_7TVMArgsEPNS0_11TVMRetValueEE_EEE4CallEPKS1_SF_SJ
Did you forget to bind?
Variable B is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable A is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable matmul is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable matmul is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable matmul is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
File "/workspace/tvm/src/tir/analysis/verify_memory.cc", line 205
RuntimeError: Memory verification failed with the following errors:
from tvm.script import tir as T

@T.prim_func
def matmul20(A: T.Buffer((T.int64(2), T.int64(256), T.int64(1280)), "float32"), B: T.Buffer((T.int64(1280), T.int64(1280)), "float32"), matmul: T.Buffer((T.int64(2), T.int64(256), T.int64(1280)), "float32")):
    T.func_attr({"global_symbol": "matmul20", "op_pattern": 4, "target": T.target({"arch": "sm_86", "host": {"keys": ["cpu"], "kind": "llvm", "tag": ""}, "keys": ["cuda", "gpu"], "kind": "cuda", "max_num_threads": 1024, "tag": "", "thread_warp_size": 32}), "tir.noalias": T.bool(True)})
    for i0, i1, i2, k in T.grid(2, 256, 1280, 1280):
        cse_var_2: T.int32 = i0 * 327680 + i1 * 1280
        cse_var_1: T.int32 = cse_var_2 + i2
        matmul_1 = T.Buffer((T.int64(655360),), data=matmul.data)
        if k == 0:
            matmul_1[cse_var_1] = T.float32(0)
        A_1 = T.Buffer((T.int64(655360),), data=A.data)
        B_1 = T.Buffer((T.int64(1638400),), data=B.data)
        matmul_1[cse_var_1] = matmul_1[cse_var_1] + A_1[cse_var_2 + k] * B_1[k * 1280 + i2]

[MetaSchedule] [CUDA target] Did you forget to bind?

Currently, the parameters I am using is as follows:

def do_all_tune(mod, target):
    tunning_dir = "gpu3090"
    tunning_record = "database_tuning_record.json"
    tunning_workload = "database_workload.json"
    cooldown_interval = 150
    trial_cnt = 2000

    local_runner = ms.runner.LocalRunner(cooldown_sec=cooldown_interval, timeout_sec=10)
    database = ms.tir_integration.tune_tir(
        mod=mod,
        target=target,
        work_dir=tunning_dir,
        max_trials_global=trial_cnt,
        max_trials_per_task=2,
        runner=local_runner,
        special_space={},
    )
    if os.path.exists(tunning_record):
        os.remove(tunning_record)
    if os.path.exists(tunning_workload):
        os.remove(tunning_workload)
    database.dump_pruned(
        ms.database.JSONDatabase(
            path_workload=tunning_workload,
            path_tuning_record=tunning_record,
        )
    )

Could you kindly share the parameters you are using to generate the log? I'm curious to know.

Check failed: kNumAttrs == attrs.size() (2 vs. 1) : ValueError: Incorrect kNumAttrs for instruction: Split

When I tried to build the model on my Mac, I got an error:

python3 build.py
Automatically configuring target: metal -keys=metal,gpu -max_function_args=31 -max_num_threads=256 -max_shared_memory_per_block=32768 -max_threads_per_block=1024 -thread_warp_size=32
Load cached module from dist/mod_cache_before_build.pkl and skip tracing. You can use --use-cache=0 to retrace
Traceback (most recent call last):
  File "/Users/b03/Desktop/old/build.py", line 195, in <module>
    build(mod, ARGS)
  File "/Users/b03/Desktop/old/build.py", line 130, in build
    db = ms.database.create(work_dir=args.db_path)
  File "/opt/anaconda3/envs/web2/lib/python3.10/site-packages/tvm/meta_schedule/database/database.py", line 417, in create
    return JSONDatabase(*args, **kwargs)
  File "/opt/anaconda3/envs/web2/lib/python3.10/site-packages/tvm/meta_schedule/database/json_database.py", line 86, in __init__
    self.__init_handle_by_constructor__(
  File "tvm/_ffi/_cython/./object.pxi", line 132, in tvm._ffi._cy3.core.ObjectBase.__init_handle_by_constructor__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 288, in tvm._ffi._cy3.core.ConstructorCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/opt/anaconda3/envs/web2/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "/Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/support/parallel_for.cc", line 128
RuntimeError: parallel_for_dynamic error with [17:46:32] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/meta_schedule/database/json_database.cc:198: ValueError: Unable to parse TuningRecord, on line 7 of file log_db/database_tuning_record.json. The workload is:
# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(rxplaceholder: T.Buffer((T.int64(1), T.int64(4), T.int64(64), T.int64(64)), "float32"), T_multiply: T.Buffer((T.int64(1), T.int64(4), T.int64(64), T.int64(64)), "float32")):
        T.func_attr({"op_pattern": 0, "tir.noalias": T.bool(True)})
        # with T.block("root"):
        for ax0, ax1, ax2, ax3 in T.grid(T.int64(1), T.int64(4), T.int64(64), T.int64(64)):
            with T.block("T_multiply"):
                v_ax0, v_ax1, v_ax2, v_ax3 = T.axis.remap("SSSS", [ax0, ax1, ax2, ax3])
                T.reads(rxplaceholder[v_ax0, v_ax1, v_ax2, v_ax3])
                T.writes(T_multiply[v_ax0, v_ax1, v_ax2, v_ax3])
                T_multiply[v_ax0, v_ax1, v_ax2, v_ax3] = T.float32(0.041666667908430099) * rxplaceholder[v_ax0, v_ax1, v_ax2, v_ax3]
The JSONObject of TuningRecord is:
[T.int64(6), [[[["GetBlock", [], ["T_multiply", "main"], ["b0"]], ["GetLoops", ["b0"], [], ["l1", "l2", "l3", "l4"]], ["Fuse", ["l1", "l2", "l3", "l4"], [T.int64(1)], ["l5"]], ["SampleCategorical", [], [[T.int64(32), T.int64(64), T.int64(128), T.int64(256), T.int64(512), T.int64(1024)], [T.float64(0.16666666666666666), T.float64(0.16666666666666666), T.float64(0.16666666666666666), T.float64(0.16666666666666666), T.float64(0.16666666666666666), T.float64(0.16666666666666666)]], ["v6"]], ["Split", ["l5", "None", "v6"], [T.int64(1)], ["l7", "l8"]], ["Bind", ["l7"], ["blockIdx.x"], []], ["Bind", ["l8"], ["threadIdx.x"], []], ["EnterPostproc", [], [], []]], [[T.int64(3), T.int64(0)]]], [T.float64(2.767325769084694e-05)], {"thread_warp_size": T.int64(32), "host": {"mtriple": "arm64-apple-macos", "tag": "", "kind": "llvm", "mcpu": "apple-latest", "keys": ["arm_cpu", "cpu"]}, "max_threads_per_block": T.int64(1024), "max_function_args": T.int64(31), "max_num_threads": T.int64(256), "tag": "", "max_shared_memory_per_block": T.int64(32768), "kind": "metal", "keys": ["metal", "gpu"]}, [["TENSOR", "float32", [T.int64(1), T.int64(4), T.int64(64), T.int64(64)]], ["TENSOR", "float32", [T.int64(1), T.int64(4), T.int64(64), T.int64(64)]]]]]
The error message is:
[17:46:32] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/meta_schedule/database/database.cc:167: ValueError: Unable to parse the JSON object: [[[["GetBlock", [], ["T_multiply", "main"], ["b0"]], ["GetLoops", ["b0"], [], ["l1", "l2", "l3", "l4"]], ["Fuse", ["l1", "l2", "l3", "l4"], [T.int64(1)], ["l5"]], ["SampleCategorical", [], [[T.int64(32), T.int64(64), T.int64(128), T.int64(256), T.int64(512), T.int64(1024)], [T.float64(0.16666666666666666), T.float64(0.16666666666666666), T.float64(0.16666666666666666), T.float64(0.16666666666666666), T.float64(0.16666666666666666), T.float64(0.16666666666666666)]], ["v6"]], ["Split", ["l5", "None", "v6"], [T.int64(1)], ["l7", "l8"]], ["Bind", ["l7"], ["blockIdx.x"], []], ["Bind", ["l8"], ["threadIdx.x"], []], ["EnterPostproc", [], [], []]], [[T.int64(3), T.int64(0)]]], [T.float64(2.767325769084694e-05)], {"thread_warp_size": T.int64(32), "host": {"mtriple": "arm64-apple-macos", "tag": "", "kind": "llvm", "mcpu": "apple-latest", "keys": ["arm_cpu", "cpu"]}, "max_threads_per_block": T.int64(1024), "max_function_args": T.int64(31), "max_num_threads": T.int64(256), "tag": "", "max_shared_memory_per_block": T.int64(32768), "kind": "metal", "keys": ["metal", "gpu"]}, [["TENSOR", "float32", [T.int64(1), T.int64(4), T.int64(64), T.int64(64)]], ["TENSOR", "float32", [T.int64(1), T.int64(4), T.int64(64), T.int64(64)]]]]
The error is: [17:46:32] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/tir/schedule/primitive/.././instruction_traits.h:387: InternalError: Check failed: kNumAttrs == attrs.size() (2 vs. 1) : ValueError: Incorrect kNumAttrs for instruction: Split

Environments:
Mac M2
Python 3.10
mlc-ai-nightly 0.15.dev315

Demo site generation error: Required limit

screenshot 2023-03-17 at 5 30 24 PM

Find an error initializing the WebGPU device OperationError: Required limit (1073741824) is greater than the supported limit (268435456). - While validating maxBufferSize - While validating required limits

How to get around this error?

[bug] "ms.database.create(work_dir=args.db_path)", JSONReader: cannot find field purity

root@Precision-3660:web-stable-diffusion# python3 build.py --target cuda
Load cached module from dist/mod_cache_before_build.pkl and skip tracing. You can use --use-cache=0 to retrace
Traceback (most recent call last):
File "web-stable-diffusion/build.py", line 175, in
build(mod, ARGS)
File "web-stable-diffusion/build.py", line 136, in build
db = ms.database.create(work_dir=args.db_path)
File "/root/miniconda3/envs/mlc-llm/lib/python3.10/site-packages/tvm/meta_schedule/database/database.py", line 417, in create
return JSONDatabase(args, **kwargs)
File "/root/miniconda3/envs/mlc-llm/lib/python3.10/site-packages/tvm/meta_schedule/database/json_database.py", line 86, in init
self.init_handle_by_constructor(
File "tvm/_ffi/_cython/./object.pxi", line 132, in tvm._ffi._cy3.core.ObjectBase.init_handle_by_constructor
File "tvm/_ffi/_cython/./packed_func.pxi", line 287, in tvm._ffi._cy3.core.ConstructorCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 276, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./base.pxi", line 181, in tvm._ffi._cy3.core.CHECK_CALL
ValueError: Traceback (most recent call last):
4: TVMFuncCall
3: _ZN3tvm7runtime13PackedFun
2: tvm::runtime::TypedPackedFunc<tvm::meta_schedule::Database (tvm::runtime::String, tvm::runtime::String, bool, tvm::runtime::String)>::AssignTypedLambda<tvm::meta_schedule::Database (
)(tvm::runtime::String, tvm::runtime::String, bool, tvm::runtime::String)>(tvm::meta_schedule::Database ()(tvm::runtime::String, tvm::runtime::String, bool, tvm::runtime::String), std::string)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
1: tvm::meta_schedule::Database::JSONDatabase(tvm::runtime::String, tvm::runtime::String, bool, tvm::runtime::String)
0: tvm::meta_schedule::Workload::FromJSON(tvm::runtime::ObjectRef const&) [clone .cold]
8: TVMFuncCall
7: _ZN3tvm7runtime13PackedFun
6: tvm::runtime::TypedPackedFunc<tvm::meta_schedule::Database (tvm::runtime::String, tvm::runtime::String, bool, tvm::runtime::String)>::AssignTypedLambda<tvm::meta_schedule::Database ()(tvm::runtime::String, tvm::runtime::String, bool, tvm::runtime::String)>(tvm::meta_schedule::Database ()(tvm::runtime::String, tvm::runtime::String, bool, tvm::runtime::String), std::string)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
5: tvm::meta_schedule::Database::JSONDatabase(tvm::runtime::String, tvm::runtime::String, bool, tvm::runtime::String)
4: tvm::meta_schedule::Workload::FromJSON(tvm::runtime::ObjectRef const&)
3: tvm::LoadJSON(std::string)
2: tvm::JSONAttrSetter::Set(tvm::runtime::ObjectPtrtvm::runtime::Object, tvm::JSONNode)
1: void tvm::JSONAttrSetter::ParseValue(char const*, bool*) const
0: tvm::JSONAttrSetter::GetValue(char const*) const [clone .part.0]
File "/workspace/tvm/src/meta_schedule/database/database.cc", line 66
ValueError: Unable to parse the JSON object: ["4686030152544265303", "UqcAAAAAAAB..."
The error is: [19:14:28] /workspace/tvm/src/node/serialization.cc:375: JSONReader: cannot find field purity


How to use fp16 precison version of Stable Diffusion 1.5?

When I use the fp16 version, I get a capture error on my 3090 Ti.

pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", revision='fp16', torch_dtype=torch.float16, local_files_only=True
)

RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

ValueError: Multiple tensors as index not yet supported

I installed web-stable-diffusion according to the official documentation, but it couldn't run. It seemed to be caused by changes in relax.

/web-stable-diffusion# python3 build.py --target cuda
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 17.38it/s]
/root/anaconda3/envs/mlc/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py:181: FutureWarning: The configuration file of this scheduler: LCMScheduler {
"_class_name": "LCMScheduler",
"_diffusers_version": "0.24.0",
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"beta_start": 0.00085,
"clip_sample": false,
"clip_sample_range": 1.0,
"dynamic_thresholding_ratio": 0.995,
"num_train_timesteps": 1000,
"original_inference_steps": 50,
"prediction_type": "epsilon",
"rescale_betas_zero_snr": false,
"sample_max_value": 1.0,
"set_alpha_to_one": true,
"steps_offset": 0,
"thresholding": false,
"timestep_scaling": 10.0,
"timestep_spacing": "leading",
"trained_betas": null
}
is outdated. steps_offset should be set to 1 instead of 0. Please make sure to update the config accordingly as leaving steps_offset might led to incorrect results in future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very nice if you could open a Pull request for the scheduler/scheduler_config.json file
deprecate("steps_offset!=1", "1.0.0", deprecation_message, standard_warn=False)
Traceback (most recent call last):
File "/root/zrx/web-stable-diffusion/build.py", line 158, in
mod, params = trace_models(torch_dev_key)
File "/root/zrx/web-stable-diffusion/build.py", line 81, in trace_models
clip = trace.clip_to_text_embeddings(pipe)
File "/root/zrx/web-stable-diffusion/web_stable_diffusion/trace/model_trace.py", line 27, in clip_to_text_embeddings
mod = dynamo_capture_subgraphs(
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/tvm/relax/frontend/torch/dynamo.py", line 198, in dynamo_capture_subgraphs
compiled_model(*params, **kwargs)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
File "/root/zrx/web-stable-diffusion/web_stable_diffusion/trace/model_trace.py", line 20, in forward
text_embeddings = self.clip(text_input_ids)[0]
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 800, in forward
return self.text_model(
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 697, in forward
causal_attention_mask = _create_4d_causal_attention_mask(
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
result = inner_convert(frame, cache_size, hooks, frame_state)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
return fn(*args, **kwargs)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
return _compile(
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
out_code = transform_code_object(code, transform)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 458, in transform
tracer.run()
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2069, in run
super().run()
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 719, in run
and self.step()
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 683, in step
getattr(self, inst.opname)(inst)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in RETURN_VALUE
self.output.compile_subgraph(
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 857, in compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
File "/root/anaconda3/envs/mlc/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 957, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1024, in call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/dynamo/output_graph.py", line 1009, in call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/torch/init.py", line 1607, in call
return self.compiler_fn(model
, inputs
, **self.kwargs)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/tvm/relax/frontend/torch/dynamo.py", line 184, in capture
mod
= from_fx(
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/tvm/relax/frontend/torch/fx_translator.py", line 1635, in from_fx
return TorchFXImporter().from_fx(
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/tvm/relax/frontend/torch/fx_translator.py", line 1522, in from_fx
self.env[node] = self.convert_map[func_name](node)
File "/root/anaconda3/envs/mlc/lib/python3.9/site-packages/tvm/relax/frontend/torch/fx_translator.py", line 1291, in _getitem
raise ValueError("Multiple tensors as index not yet supported")
torch._dynamo.exc.BackendCompilerFailed: backend='_capture' raised:
ValueError: Multiple tensors as index not yet supported

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True

[12:20:18] /workspace/tvm/src/relax/ir/block_builder.cc:64: Warning: BlockBuilder destroyed with remaining blocks!
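
(Added context, not part of the original report.) The failure comes from the Relax FX importer's __getitem__ handling, which does not yet translate advanced indexing with more than one tensor index; newer transformers versions build the CLIP causal attention mask with exactly that pattern. A minimal illustrative sketch of the pattern:

import torch

x = torch.randn(4, 4)
rows = torch.tensor([0, 1])
cols = torch.tensor([2, 3])

ok = x[rows]         # a single tensor index
bad = x[rows, cols]  # "multiple tensors as index" -- the pattern the capture rejects

A common workaround (an assumption, not verified here) is to pin diffusers/transformers to the versions the repository was developed against, so the traced graph avoids this indexing pattern.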

Image Save/Copy Issue - Grabs Previous Image Instead of Current One

When copying or saving an image on the demo webpage, the page always grabs the previous image instead of the currently displayed one. Additionally, attempting to copy or save the first generated image results in a transparent 512x512 image.

Steps to reproduce:

  1. Generate an image.
  2. Attempt to copy or save the first generated image.
  3. Generate another image.
  4. Attempt to copy or save the second image.

Expected results:

  • The first image should be copied/saved as a proper image.
  • The second image should be copied/saved.

Actual results:

  • The first image results in a transparent 512x512 image when copied/saved.
  • The first image is copied/saved instead of the second image.

Environment:

  • Operating System: Windows 11
  • Browser: Chrome Canary Version 114.0.5715.0 (Official Build) canary (64-bit), launched with --enable-dawn-features=disable_robustness
  • GPU: Radeon RX 7900 XTX

Please investigate this issue, as it prevents users from properly saving or copying the most recent image.

Not an issue, but a request

I was wondering if you could make something like what I have attached publicly available (it is inside the zip Offline WEBSD.zip, since GitHub doesn't allow uploading HTML files): an offline copy of Web SD in one HTML file using data URLs. Not having something like this that can work offline almost defeats the purpose of doing everything client side.

raise TypeError(f"Please convert {rhs} with `const` first")

While loading the model, I get

`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
Traceback (most recent call last):
  File "/srv/workspace/anaconda3/envs/web_sd/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 670, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.fake_example_inputs())
  File "/srv/workspace/anaconda3/envs/web_sd/lib/python3.9/site-packages/torch/_dynamo/debug_utils.py", line 1055, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/srv/workspace/framework/tvm/python/tvm/relax/frontend/torch/dynamo.py", line 151, in _capture
    mod_ = from_fx(
  File "/srv/workspace/framework/tvm/python/tvm/relax/frontend/torch/fx_translator.py", line 1421, in from_fx
    return TorchFXImporter().from_fx(
  File "/srv/workspace/framework/tvm/python/tvm/relax/frontend/torch/fx_translator.py", line 1308, in from_fx
    self.env[node] = self.convert_map[func_name](node)
  File "/srv/workspace/framework/tvm/python/tvm/relax/frontend/torch/fx_translator.py", line 180, in _add
    return lhs + rhs
  File "/srv/workspace/framework/tvm/python/tvm/relax/expr.py", line 155, in __add__
    return _binary_op_helper(self, other, _op_ffi_api.add)  # type: ignore
  File "/srv/workspace/framework/tvm/python/tvm/relax/expr.py", line 104, in _binary_op_helper
    raise TypeError(f"Please convert {rhs} with `const` first")
TypeError: Please convert 1 with `const` first
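
(Added context, not part of the original report.) The TypeError is raised by Relax expression arithmetic, which does not implicitly promote a plain Python scalar and expects an explicit relax.const. A minimal illustrative sketch of the distinction:

from tvm import relax

x = relax.Var("x", relax.TensorStructInfo((1, 4), "float32"))

# x + 1 raises "Please convert 1 with `const` first"; wrapping the scalar works.
y = x + relax.const(1, "float32")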

VAE mod.functions Assert

Hi, MLC team! I'm trying to reproduce your model with TVM Unity and the newest LLVM branch. When I run the build.py script like this:

python build.py --target cuda

It triggers the assertion assert len(mod.functions) == 1.

When I print len(mod.functions), I find it is 8, but we only keep mod["subgraph_0"]. I am confused: why is this assert set?

def vae_to_image(pipe) -> tvm.IRModule:
    class VAEModelWrapper(torch.nn.Module):
        def __init__(self, vae):
            super().__init__()
            self.vae = vae

        def forward(self, latents):
            latents = 1 / 0.18215 * latents
            z = self.vae.post_quant_conv(latents)
            image = self.vae.decoder(z)
            image = (image / 2 + 0.5).clamp(min=0, max=1)
            image = (image.permute(0, 2, 3, 1) * 255).round()
            return image

    vae = pipe.vae
    vae_to_image = VAEModelWrapper(vae)

    z = torch.rand((1, 4, 64, 64), dtype=torch.float32)
    mod = dynamo_capture_subgraphs(
        vae_to_image.forward,
        z,
        keep_params_as_input=True,
    )
    assert len(mod.functions) == 1

    return tvm.IRModule({"vae": mod["subgraph_0"]})

After I commented out this assert and saved the params, I ran python3 deploy.py --prompt "A photo of an astronaut riding a horse on mars." --device-name cuda to use this model. It triggered a runtime error:

RuntimeError: Check failed: input_shape[i] == reg (4 vs. 512) : ErrorContext(fn=image_to_rgba, loc=param[0], param=x, annotation=R.Tensor((1, 512, 512, 3), dtype="float32"))  match_cast error,  shape[1] mismatch to specified constant.

Maybe this is because of the assert I ignored earlier?
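
(Added diagnostic sketch, not part of the original report.) Listing what dynamo actually captured shows whether the VAE decode was split across several subgraphs; if it was, keeping only subgraph_0 drops part of the graph, which would be consistent with the (1, 4, ...) vs. (1, 512, 512, 3) shape mismatch seen later.

# Hypothetical inspection sketch: print the captured subgraphs and their
# parameter counts before deciding which one(s) to keep.
for gv in mod.functions:
    func = mod[gv]
    print(gv.name_hint, len(func.params))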

ControlNet support

Hello!
Amazing project! Hats off, this work could really allow a lot of AI projects to provide value and entertainment to users without costing an arm and a leg.

I've been prototyping a user generated content system for the last month or so and a core requirement for it is ControlNet support.

This isn't necessarily a request for support to be added (though that would be amazing); it is also a check on whether TVM Unity and this project are capable of having that support added.

I noticed the lack of img2img as well and wasn't sure whether that is out of reach for the moment.

Thanks again, really excited about this project!

UI/UX & System Design

Hi @tqchen ,

Do you have plans to build a better web app, or even a polished template for applications based on large models?

Thanks,

Max

How to use SD 2.1?

Excited to see support for Stable Diffusion v2.1 in #29. As I don't have the hardware to build this, would it be possible to run the GitHub Action to create the needed files?

I am not sure I understand what would need to change in order to move from v1.5 to v2.1. I imagine I would need a different .wasm file and tokenizer.json, as well as to point cacheUrl somewhere else.

Any help would be much appreciated as I would love to see the quality of images with the newer model. Thanks!

FYI: @masahi, @spectrometerHBH & @tqchen

PS: Sorry for this issue sounding so similar to mlc-ai/web-llm#89, but it's the same ask, so I figured it didn't need much changing.
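
(Not an authoritative answer, just a sketch of where the checkpoint choice lives on the Python side.) The build script traces a Hugging Face pipeline, so moving to v2.1 starts with pointing from_pretrained at the 2.1 weights and rebuilding; the web page then needs the regenerated .wasm, parameter shards, and the matching tokenizer.json. The checkpoint id below is an assumption:

from diffusers import StableDiffusionPipeline

# Hypothetical change for a 2.1 build: swap the traced checkpoint, then re-run
# build.py so the wasm and parameter artifacts are regenerated for the new model.
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")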

Is there a way to set/change the seed at runtime?

I've been playing around with the code and have been trying to set my own seed to control the image outputs, but have not had any luck so far. Any clue or information that could point me in the right direction would be greatly appreciated! Thank you.
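
(Not a documented knob, just a sketch of where randomness usually enters on the Python side.) Assuming the deployment path draws the initial latent noise with a torch-style RNG, seeding that generator makes runs reproducible; the browser demo would need the analogous change wherever its latent noise is produced:

import torch

# Hypothetical sketch: derive the initial latents from a seeded generator so the
# same prompt and seed reproduce the same image.
generator = torch.Generator().manual_seed(42)
latents = torch.randn((1, 4, 64, 64), generator=generator, dtype=torch.float32)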

ModuleNotFoundError: No module named 'typing_extensions'

Using version https://github.com/mlc-ai/web-stable-diffusion/tree/ce0c2fbd0fffd7ee39e7be9da34052a8809d98db, I wanted to set up web-stable-diffusion "to deploy the model on web with WebGPU runtime" (as outlined in the corresponding README.md section).

The step scripts/prep_deps.sh failed with

ModuleNotFoundError: No module named 'typing_extensions'

I was able to resolve this issue by installing the typing-extensions module with

pip install typing-extensions

Executing scripts/prep_deps.sh again then finished successfully.

So maybe typing-extensions should be declared as a requirement.

Web Worker and WASM?

I wonder if the GUI would have fewer responsiveness issues if a Web Worker were used.

And likewise if WASM were used?

Assistance Required with tvmjs Integration and webgpu.get_fmap Error

Firstly, I'd like to express my admiration for the remarkable work done on this project. The advancements and capabilities it offers are truly impressive.

I've been diligently following the provided "walkthrough.ipynb" to familiarize myself with the pipeline. Unfortunately, I encountered an issue with the trace part, which seems to malfunction, possibly due to updates in the diffusers library. To circumvent this, I opted for a simplified network module as demonstrated below:

from torch import nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x + 1

Following this, I proceeded to convert the network to ONNX format and subsequently to IR:

import onnx
import torch
from tvm.relax.frontend.onnx import from_onnx

trace = torch.jit.trace(net, input.to(dtype).to(device))

torch.onnx.export(
    trace, input.to(dtype).to(device), "test/net.onnx", verbose=True, input_names=["input"], output_names=["input"],
)
# Exported graph: graph(%input.1 : Float(1, 3, strides=[3, 1], requires_grad=0, device=cpu)):
#   %/Constant_output_0 : Float(requires_grad=0, device=cpu) = onnx::Constant[value={1}, onnx_name="/Constant"](), scope: Net:: # /tmp/ipykernel_3004661/4051825751.py:6:0
#   %input : Float(1, 3, strides=[3, 1], requires_grad=0, device=cpu) = onnx::Add[onnx_name="/Add"](%input.1, %/Constant_output_0), scope: Net:: # /tmp/ipykernel_3004661/4051825751.py:6:0
#   return (%input)

# ============= Diagnostic Run torch.onnx.export version 2.0.0+cu117 =============
# verbose: False, log level: Level.ERROR
# ======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
onnx_model_path = "test/net.onnx"
model = onnx.load(onnx_model_path)
tvm_model = from_onnx(model, keep_params_in_input=True)
tvm_model

# # from tvm.script import ir as I
# # from tvm.script import relax as R

# @I.ir_module
# class Module:
#     @R.function
#     def main(input_1: R.Tensor((1, 3), dtype="float32")) -> R.Tensor((1, 3), dtype="float32"):
#         R.func_attr({"num_input": 1})
#         with R.dataflow():
#             gv: R.Tensor((1, 3), dtype="float32") = R.add(input_1, R.const(1, "float32"))
#             R.output(gv)
#         return gv

After that, I compiled it to wasm:

import tvm
from tvm import relax

tvm_model, model_params = relax.frontend.detach_params(tvm_model)  # no params actually
target = tvm.target.Target(
    "webgpu", host="llvm -mtriple=wasm32-unknown-unknown-wasm"
)
ex = relax.build(mod=tvm_model, target=target)
ex.export_library("test/net.wasm")

Finally, I used the following JS to run it:

const tvmjs = require("./public/dist/tvmjs.bundle.js");
const EmccWASI = require("./public/dist/tvmjs_runtime.wasi.js");


window.tvmjs = tvmjs

async function asyncInitTVM() {


    const wasmSource = await (
        await fetch("./public/net.wasm")
    ).arrayBuffer();


    logger = function (message) {
        console.log(message);
    };

    const tvm = await tvmjs.instantiate(
        new Uint8Array(wasmSource),
        new EmccWASI(),
        logger
    );

    const output = await tvmjs.detectGPUDevice();
    if (output !== undefined) {
        var label = "WebGPU";
        if (output.adapterInfo.description.length != 0) {
            label += " - " + output.adapterInfo.description;
        } else {
            label += " - " + output.adapterInfo.vendor;
        }
        console.log("Initialize GPU device: " + label);
        tvm.initWebGPU(output.device);
    } else {
        console.log("This browser env do not support WebGPU");
    }



    tvm.withNewScope(() => {
        device = tvm.webgpu();
        // device = tvm.cpu();
        vm = tvm.detachFromCurrentScope(tvm.createVirtualMachine(device));
        net = tvm.detachFromCurrentScope(vm.getFunction("main"));
    })

    await tvm.asyncLoadWebGPUPipelines(vm.getInternalModule());

    const input_cpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", tvm.cpu()).copyFrom([1, 1, 1])
        )
    });
    const input_gpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", device)
        )
    });

    input_gpu.copyFrom(input_cpu);
    await tvm.webgpu().sync();
    console.log("input_cpu", input_cpu.toArray());

    tvm.withNewScope(() => {
        output_gpu = net(input_gpu);
        output_gpu = tvm.detachFromCurrentScope(output_gpu);
    });


    const output_cpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", tvm.cpu()).copyFrom([2, 3, 4])
        )
    });

    output_cpu.copyFrom(output_gpu);
    await tvm.webgpu().sync();
    console.log("output_cpu", output_cpu.toArray());

}

asyncInitTVM()

However, I've hit a roadblock during the execution phase, particularly at await tvm.asyncLoadWebGPUPipelines(vm.getInternalModule());, where the console outputs the following error:

tvmjs.bundle.js:1863  Uncaught (in promise) Error: Cannot find function webgpu.get_fmap
    at Module.getFunction (tvmjs.bundle.js:1863:23)
    at Instance.eval (tvmjs.bundle.js:2791:38)
    at Generator.next (<anonymous>)
    at eval (tvmjs.bundle.js:28:75)
    at new Promise (<anonymous>)
    at __awaiter (tvmjs.bundle.js:24:16)
    at Instance.asyncLoadWebGPUPipelines (tvmjs.bundle.js:2786:20)
    at asyncInitTVM (main.js:48:15)

In addition, I found that when I use llvm as the build target instead of webgpu, use tvm.cpu() as the device, and skip this line, the example works.

Given the scarcity of detailed documentation and tutorials on integrating custom networks with tvmjs, especially regarding WebGPU support, I find myself in need of your expertise and guidance.

Could you please help me identify any potential missteps in my approach? I am particularly interested in ensuring that my network can be successfully operated using tvmjs and would greatly appreciate any insights or suggestions you might have.

Thank you very much for your time and assistance.
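
(An assumption based on how TVM's web builds are typically packaged, not a confirmed diagnosis of this report.) asyncLoadWebGPUPipelines looks up WebGPU shader metadata inside the wasm, and that metadata is only linked in when the module is exported through TVM's Emscripten helper rather than a plain export_library call. A hedged sketch of that export step, assuming the same ex object as above:

from tvm.contrib import emcc

# Hypothetical export sketch: link the module with the tvmjs Emscripten runtime
# so the WebGPU shader map the JS runtime asks for is embedded in the wasm.
ex.export_library("test/net.wasm", fcompile=emcc.create_tvmjs_wasm)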

Huge performance gap between TVM and TRT on Stable Diffusion v1.5

GPU: NVIDIA RTX 3090 Ti.

  1. First, I used the log db in the repo; it takes 3.7 s to get a result.
  2. Then I tried tuning myself using meta-schedule (with the trial count set to 50,000); it takes 2.5 s.

But on TensorRT v8.6, one iteration of the UNet takes only 25 ms, versus 96 ms with TVM (USE_CUBLAS=ON; USE_CUDNN=ON; CUDA version 12.1).

I wonder why the latency gap on the Stable Diffusion model is so large between TVM and TensorRT.
BTW, a few weeks ago I got a different result between TVM and TRT, where an in-house model auto-tuned by TVM achieved excellent inference latency (nearly on par with TensorRT 8.5).

Do you have any ideas about it? Thanks in advance.
