
RunPod Containers

This repository contains the Dockerfiles for the RunPod containers used for our official templates. Resulting containers are available on Docker Hub.

Container               RunPod Template
fast-stable-diffusion   RunPod Fast Stable Diffusion
kasm-desktop            RunPod Desktop
vscode-server           RunPod VS Code Server
discoart                RunPod Disco Diffusion

Changes

The containers serverless-automatic and sd-auto-abdbarho have been removed from this repository. The worker replacement can be found in the runpod-workers/worker-a1111 repository.

Container Requirements

Dependencies

The following dependencies are required as part of RunPod platform functionality.

  • nginx - Required for proxying ports to the user.
  • openssh-server - Required for SSH access to the container.
  • jupyterlab (pip install jupyterlab) - Required for JupyterLab access to the container.

runpod.yaml

Each container folder needs to have a runpod.yaml file. This file contains version info as well as the services to be run. The runpod.yaml file should be formatted as follows:

version: '1.0.0'
services:
  - name: 'service1'
    port: 9000
    proxy_port: 9001
  - name: 'service2'
    port: 9002
    proxy_port: 9003
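To illustrate the port/proxy_port relationship, a matching nginx server block for service1 above might look like the following. This is a hypothetical sketch; the actual proxy configuration is generated by the platform from runpod.yaml.

```nginx
# Hypothetical sketch: proxy_port 9001 fronting service1's internal port 9000.
# The real config is produced by the RunPod platform, not hand-written.
server {
    listen 9001;
    location / {
        proxy_pass http://127.0.0.1:9000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```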

README

Every container folder needs its own README.md file. This file is displayed both on Docker Hub and in the README section of the template on the RunPod website. Additionally, if the container opens a port other than 8888 that is passed through the proxy and the service is not yet running, the README is displayed to the user.

Building Containers

buildx bake

To build all bake targets locally:

docker buildx bake

To build and push the resulting images:

docker buildx bake --push

Alternatively, docker build can be used directly. It should be run from the root of the repository, not from the container folder, as follows:

docker build -t runpod/<container-name>:<version> -f <container-name>/Dockerfile .
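As a concrete example, the command can be parameterized in a small script. The container name and version below are placeholders, not values prescribed by this repository:

```shell
#!/bin/sh
# Example: building one container from the repository root.
# CONTAINER_NAME and VERSION are placeholders -- substitute your own.
CONTAINER_NAME="vscode-server"
VERSION="1.0.0"
IMAGE_TAG="runpod/${CONTAINER_NAME}:${VERSION}"

# The actual build (run from the repo root so the build context includes
# shared files referenced by the Dockerfile):
#   docker build -t "$IMAGE_TAG" -f "${CONTAINER_NAME}/Dockerfile" .
echo "$IMAGE_TAG"
```

Running the build from the repository root matters because Dockerfiles here may COPY shared files that live outside the individual container folder.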

Contributors

abdullin, alpayariyak, ashleykleynhans, camenduru, dependabot[bot], flash-singh0, justinmerrell, kodxana, slmagus, zhl146


Issues

Add Torch Vision to Comfy UI

Hello,

Template: https://github.com/runpod/containers/tree/main/official-templates/stable-diffusion-comfyui

User comment:

I'm trying to install ComfyUI Manager the standard way, with git clone into the custom_nodes folder, and it doesn't appear in the UI. I don't know of any other way. Am I missing something?

OK, never mind. I figured it out. I had to install torchvision. The extension is 1.5 MB and it's the basic one that lets you download other extensions, so it would be convenient to include it.

Thanks!
JM

Container Cleanup

  • Single start.sh script that can be used for all of the containers
  • Implement NGINX proxy referencing README.md
  • Contains the NGINX config

start.sh

  • The script will add SSH key
  • Launch Jupyter
  • Launch NGINX for Proxy
  • Make container ENV variables available in SSH sessions
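The steps above can be sketched as a single start.sh. This is an illustrative draft, not the official script; the env variable names (PUBLIC_KEY, JUPYTER_PASSWORD) are assumptions based on common RunPod templates:

```shell
#!/bin/sh
# Hypothetical start.sh sketch -- function bodies are illustrative only.
# PUBLIC_KEY and JUPYTER_PASSWORD are assumed env variable names.

setup_ssh() {
    # Add the user's SSH key if one was provided, then start sshd.
    if [ -n "$PUBLIC_KEY" ]; then
        mkdir -p ~/.ssh && chmod 700 ~/.ssh
        echo "$PUBLIC_KEY" >> ~/.ssh/authorized_keys
        chmod 600 ~/.ssh/authorized_keys
        service ssh start
    fi
}

start_jupyter() {
    # Launch JupyterLab in the background on the standard port.
    jupyter lab --allow-root --no-browser --port=8888 --ip=* \
        --ServerApp.token="$JUPYTER_PASSWORD" &
}

start_nginx() {
    # Launch NGINX for the port proxy.
    service nginx start
}

export_env_for_ssh() {
    # Persist container ENV so SSH sessions see the same variables.
    printenv | sed 's/^\(.*\)$/export \1/g' > /etc/rp_environment
    echo '. /etc/rp_environment' >> ~/.bashrc
}

# Only run when invoked with --run, so the functions can also be sourced.
if [ "${1:-}" = "--run" ]; then
    setup_ssh
    export_env_for_ssh
    start_jupyter
    start_nginx
    sleep infinity
fi
```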

Unable to build comfyui image

I'm trying to build the image locally by running:

docker build -t runpod/stable-diffusion-comfyui-custom -f official-templates/stable-diffusion-comfyui/Dockerfile .

from the root of the repository (all files have full permissions as well).

Regardless, I'm getting the following error:

[+] Building 1.1s (12/12) FINISHED                                              docker:default
 => [internal] load .dockerignore                                               0.0s
 => => transferring context: 2B                                                 0.0s
 => [internal] load build definition from Dockerfile                            0.0s
 => => transferring dockerfile: 2.93kB                                          0.0s
 => ERROR [internal] load metadata for docker.io/library/scripts:latest         1.0s
 => CANCELED [internal] load metadata for docker.io/nvidia/cuda:11.8.0-base-ubuntu22.04   1.0s
 => ERROR [internal] load metadata for docker.io/library/proxy:latest           1.0s
 => CANCELED [internal] load metadata for docker.io/runpod/stable-diffusion:models-1.0.0  1.0s
 => CANCELED [internal] load metadata for docker.io/runpod/stable-diffusion-models:2.1    1.0s
 => [auth] library/proxy:pull token for registry-1.docker.io                    0.0s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                      0.0s
 => [auth] library/scripts:pull token for registry-1.docker.io                  0.0s
 => [auth] runpod/stable-diffusion-models:pull token for registry-1.docker.io   0.0s
 => [auth] runpod/stable-diffusion:pull token for registry-1.docker.io          0.0s
------
 > [internal] load metadata for docker.io/library/scripts:latest:
------
------
 > [internal] load metadata for docker.io/library/proxy:latest:
------
Dockerfile:70
--------------------
  68 |     # Start Scripts
  69 |     COPY pre_start.sh /pre_start.sh
  70 | >>> COPY --from=scripts start.sh /
  71 |     RUN chmod +x /start.sh
  72 |     
--------------------
ERROR: failed to solve: scripts: pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed

Can you please help me figure out what I'm doing wrong? Keep in mind that I have not modified anything in the Dockerfile yet :( (I also successfully did docker login.)

Adding --api support to oobabooga/text-generation-web-ui

It would be very nice to have an environment variable to launch oobabooga (https://github.com/runpod/containers/blob/main/oobabooga/start.sh) with the API. See https://github.com/oobabooga/text-generation-webui#api for more information on this.

Otherwise, the only way to launch text-generation-web-ui with the API is to build my own docker image from scratch.

An alternative would be an ARGS environment variable to let us pass whatever we need to the python app.

Thank you for your consideration :)

ComfyUI base doesn't build

This is the second time I've tried to use a base image to host on runpod, and it's the second time it hasn't worked. It's frustrating. Please fix

Upgrade PyTorch version to 2.1.0 and CUDA 12.1.1

Hi RunPod, it would be great if you could either upgrade the current PyTorch/CUDA template to the new versions, or create a new template with the newer PyTorch and CUDA, since some libraries depend on them.

Automatic WEBUI - custom .ckpt or safe tensor not loaded

I just followed the tutorial on RunPod Automatic WEBUI, and the custom safetensors file specified is not loaded when running inference. Instead, the default SD model is loaded.

To get the custom model to load properly, we need to set it via the sdapi/v1/options endpoint:

"sd_model_checkpoint": "Anything-V3.0-pruned.ckpt [2700c435]",

response = requests.post(url=f'{url}/sdapi/v1/options', json=option_payload)
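The same call can be sketched with curl against the A1111 options endpoint. The URL below is a placeholder for the pod's proxied address, and the checkpoint name is the one from the report above:

```shell
#!/bin/sh
# Sketch: switching the active checkpoint via A1111's /sdapi/v1/options.
# URL is a placeholder -- substitute your pod's proxied endpoint.
URL="http://127.0.0.1:3000"
OPTION_PAYLOAD='{"sd_model_checkpoint": "Anything-V3.0-pruned.ckpt [2700c435]"}'

# Apply the option; subsequent txt2img/img2img calls then use this model:
#   curl -sS -X POST "$URL/sdapi/v1/options" \
#        -H 'Content-Type: application/json' -d "$OPTION_PAYLOAD"
echo "$OPTION_PAYLOAD"
```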

runpod/base needs a cuda 12.1.1 version

The 12.1.0 container is EOL (https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md) and will be deleted soon.

This is printed on boot:

2024-04-17T06:32:00.716216802Z *************************
2024-04-17T06:32:00.716263593Z ** DEPRECATION NOTICE! **
2024-04-17T06:32:00.716538501Z *************************
2024-04-17T06:32:00.716629960Z THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
2024-04-17T06:32:00.716661327Z     https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

cuda/cudnn mismatch in TF instances

Hi RunPod!

I am experiencing issues with the performance of TensorFlow on your A100 80GB machines. The problems seem to originate from an apparent version mismatch between CUDA, cuDNN, and cuBLAS, which is not aligning properly with the version of TensorFlow currently utilized on your systems.

Additionally, I have noticed significantly slow training times on my setups that are beyond what is normally expected. This sluggish performance is particularly noticeable when compared with a 40GB Colab A100 machine which often even outperforms your 1 A100 80GB setup.

Here are the error messages I am receiving:

When initiating training on my single A100 80GB machine:

2023-07-15 00:38:25.585795: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
Could not load symbol cublasGetSmCountTarget from libcublas.so.11. Error: /usr/local/cuda/lib64/libcublas.so.11: undefined symbol: cublasGetSmCountTarget
2023-07-15 00:38:25.781595: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

And also, when I prepared my data, model, and everything else on my 4xA100 80GB machine a while back:

2023-06-22 20:07:13.541513: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
Could not load symbol cublasGetSmCountTarget from libcublas.so.11. Error: /usr/local/cuda/lib64/libcublas.so.11: undefined symbol: cublasGetSmCountTarget
2023-06-22 20:07:13.881071: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-06-22 20:07:14.015923: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
2023-06-22 20:07:14.563243: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
2023-06-22 20:07:15.052808: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401

I picked RunPod as my go-to choice when I decided to move on from Colab, thanks to the potential I saw in your platform. Despite the current, let's call them, firmware challenges, I'm hopeful that you get your systems up to date and fixed.

All the best!

Building wheel for pycairo (pyproject.toml): finished with status 'error' when installing ControlNet v1.1.142

This error appears when relaunching the webui process after installing ControlNet v1.1.142.
Running: runpod/stable-diffusion:web-automatic-5.0.0

2023-05-06T18:42:40.063374181Z   Building wheel for pycairo (pyproject.toml): finished with status 'error'
2023-05-06T18:42:40.063378201Z Failed to build pycairo
2023-05-06T18:42:40.063381781Z 
2023-05-06T18:42:40.063385181Z stderr:   error: subprocess-exited-with-error
2023-05-06T18:42:40.063388871Z   
2023-05-06T18:42:40.063392351Z   × Building wheel for pycairo (pyproject.toml) did not run successfully.
2023-05-06T18:42:40.063398041Z   │ exit code: 1
2023-05-06T18:42:40.063401791Z   ╰─> [12 lines of output]
2023-05-06T18:42:40.063405601Z       running bdist_wheel
2023-05-06T18:42:40.063409130Z       running build
2023-05-06T18:42:40.063412730Z       running build_py
2023-05-06T18:42:40.063416320Z       creating build
2023-05-06T18:42:40.063419860Z       creating build/lib.linux-x86_64-cpython-310
2023-05-06T18:42:40.063423460Z       creating build/lib.linux-x86_64-cpython-310/cairo
2023-05-06T18:42:40.063427140Z       copying cairo/__init__.py -> build/lib.linux-x86_64-cpython-310/cairo
2023-05-06T18:42:40.063430940Z       copying cairo/__init__.pyi -> build/lib.linux-x86_64-cpython-310/cairo
2023-05-06T18:42:40.063434690Z       copying cairo/py.typed -> build/lib.linux-x86_64-cpython-310/cairo
2023-05-06T18:42:40.063438410Z       running build_ext
2023-05-06T18:42:40.063441890Z       'pkg-config' not found.
2023-05-06T18:42:40.063445430Z       Command ['pkg-config', '--print-errors', '--exists', 'cairo >= 1.15.10']
2023-05-06T18:42:40.063449340Z       [end of output]
2023-05-06T18:42:40.063452820Z   
2023-05-06T18:42:40.063456250Z   note: This error originates from a subprocess, and is likely not a problem with pip.
2023-05-06T18:42:40.063460050Z   ERROR: Failed building wheel for pycairo
2023-05-06T18:42:40.063463640Z ERROR: Could not build wheels for pycairo, which is required to install pyproject.toml-based projects

It can be solved by installing the following before attempting to run the ControlNet installer:

apt-get install libcairo2 libcairo2-dev

env variable for --ServerApp.preferred_dir=/workspace

Is it possible to use the volume mount path env variable for --ServerApp.preferred_dir (or --notebook-dir), so the user can start in their volume mount path? In my case that is /content instead of /workspace.

jupyter lab --allow-root --no-browser --port=8888 --ip=* --ServerApp.terminado_settings='{"shell_command":["/bin/bash"]}' --ServerApp.token=$JUPYTER_PASSWORD --ServerApp.allow_origin=* --ServerApp.preferred_dir=/workspace
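One way the launch script could support this is a shell parameter-expansion default. WORKSPACE_DIR and NOTEBOOK_DIR are hypothetical variable names used for illustration:

```shell
#!/bin/sh
# Sketch: derive the Jupyter working directory from an env variable,
# falling back to /workspace. WORKSPACE_DIR is a hypothetical name.
NOTEBOOK_DIR="${WORKSPACE_DIR:-/workspace}"

# The launch command would then use the derived directory:
#   jupyter lab --allow-root --no-browser --port=8888 --ip=* \
#       --ServerApp.token="$JUPYTER_PASSWORD" \
#       --ServerApp.preferred_dir="$NOTEBOOK_DIR"
echo "$NOTEBOOK_DIR"
```

With this in place, a user could set WORKSPACE_DIR=/content on the pod to start Jupyter in /content without rebuilding the image.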


TLB training fails

When trying to train on some models like Lykon/Dreamshape, it fails with an OS error saying it cannot find config.json. Please check it.

New versions (>v6?) of A1111 fail to use the `--xformers` flag (`no module 'xformers'. Processing without...` in startup logs)

I noticed that image generation was significantly slower in new versions of the runpod official A1111 image. Looking into it, it seems like it's due to xformers not being installed, or not loading correctly for whatever reason.

To reproduce (giving my specific steps, but I think it'd occur on secure cloud, and non-3090 machines too):

  1. Go to community cloud
  2. Select a 3090
  3. Deploy with official v10 "Runpod Stable Diffusion" template (runpod/stable-diffusion:web-ui-10.0.0)
  4. Look at the startup logs. Observe that generating a single image with 20 steps takes about 1.3 seconds instead of 1.1 seconds with runpod/stable-diffusion:web-automatic-6.0.1, and that at the bottom of the A1111 UI it says xformers: N/A instead of xformers: <version number>

Here's a snippet from the startup logs:

2023-07-29T06:28:19.671508453Z 
2023-07-29T06:28:19.671510697Z ---
2023-07-29T06:28:20.581231736Z Python 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
2023-07-29T06:28:20.581256173Z Version: v1.5.1
2023-07-29T06:28:20.581258698Z Commit hash: 68f336bd994bed5442ad95bad6b6ad5564a5409a
2023-07-29T06:28:20.581260371Z 
2023-07-29T06:28:20.581261924Z 
2023-07-29T06:28:20.581263447Z Launching Web UI with arguments: -f --port 3000 --xformers --skip-install --listen --enable-insecure-extension-access
2023-07-29T06:28:20.581282242Z no module 'xformers'. Processing without...
2023-07-29T06:28:20.581286390Z no module 'xformers'. Processing without...
2023-07-29T06:28:20.581287803Z No module 'xformers'. Proceeding without it.

TGI Docker file not able to access GPU?

I am creating a pod that uses HF's text-generation-inference (TGI) Docker container (see image_name below). I can create a pod successfully as long as I do not pass the --quantize parameter within docker_args. For example, if I pass docker_args="--model-id tiiuae/falcon-7b-instruct --num-shard 1 --quantize bitsandbytes", the container log contains:

2023-08-10T11:30:29.101220272-06:00 /opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.

and ends with the message:

2023-08-10T11:30:29.101315592-06:00 ValueError: quantization is not available on CPU

HF support's comment when I asked on GitHub: "it seems more to me that the GPU is not detected in the docker image, and that error message is bogus stemming from that. (I can run fine with 1.0.0 with bnb on a simple docker + gpu environment)." Another comment just made on the HF GitHub: "Something about shm not being properly set or something...". If I try the other quantization option, gptq, the container throws a signal 4. Is the container seeing the GPU? What is going on with bitsandbytes? Why signal 4? I am hoping to minimize the amount of memory and inference time. Help very much appreciated.

Here is my call to create_pod:

pod = runpod.create_pod(
    name=model_id,
    image_name="ghcr.io/huggingface/text-generation-inference:1.0.0",
    gpu_type_id=gpu_type,
    cloud_type=cloud_type,
    docker_args=f"--model-id {model_id} --num-shard {num_shard} -quantize {quantize}",
    gpu_count=gpu_count,
    volume_in_gb=volume_in_gb,
    container_disk_in_gb=5,
    ports="80/http",
    volume_mount_path="/data",
    # min_vcpu_count=2,
    # min_memory_in_gb=15,
)

The specs on my community pod: 1 x RTX 3090, 9 vCPU, 37 GB RAM.

Thank you.

Deployed image doesn't react to clicks such as 'Generate', 'Refresh', etc.

I just built (using docker buildx bake) and deployed an image of the Stable Diffusion web UI. Everything started fine, but the web UI doesn't react to any of my clicks. I skipped creating the runpod.yaml file, only because I don't understand its purpose and how to fill it in. I am quite new to this. Sorry if my problem is really silly. Would be happy for any help ^)
