
deepfacelab-docker's Introduction

DeepFaceLab/TensorFlow GPU-enabled Docker Container

Provides an NVIDIA GPU-enabled container with DeepFaceLab pre-installed, built on the Anaconda/TensorFlow container xychelsea/tensorflow:latest-gpu.

DeepFaceLab with TensorFlow

DeepFaceLab is an open source, TensorFlow-based project for creating face-swapped images and video. TensorFlow is an open source platform for machine learning: it provides tools, libraries, and community resources for researchers and developers to build and deploy machine learning applications. Anaconda is an open data science platform based on Python 3.

This container installs TensorFlow through the conda command, using a lightweight version of Anaconda (Miniconda) and the conda-forge repository, in the /usr/local/anaconda3 directory. The default user, anaconda, runs under the Tini init process (/usr/bin/tini) and comes with the conda command on its $PATH. Additional tags with NVIDIA/CUDA support and Jupyter Notebooks are available.

NVIDIA/CUDA GPU-enabled Containers

Two flavors provide an NVIDIA GPU-enabled container with TensorFlow pre-installed through Anaconda: one plain, and one with a Jupyter Notebooks server pre-installed.

Getting the containers

Vanilla DeepFaceLab

The base container builds on xychelsea/tensorflow:latest from the Anaconda 3 container stack (xychelsea/anaconda3:latest), with Tini as the init process. For the container with a /usr/bin/tini entry point, use:

docker pull xychelsea/deepfacelab:latest

With Jupyter Notebooks server pre-installed, pull with:

docker pull xychelsea/deepfacelab:latest-jupyter

DeepFaceLab with NVIDIA/CUDA GPU support

These are modified versions of the nvidia/cuda:latest container, with support for NVIDIA/CUDA graphics processing units and Tini as the init process. For the container with a /usr/bin/tini entry point, use:

docker pull xychelsea/deepfacelab:latest-gpu

With Jupyter Notebooks server pre-installed, pull with:

docker pull xychelsea/deepfacelab:latest-gpu-jupyter

Running the containers

To run the containers with the generic Docker application or NVIDIA-enabled Docker, use the docker run command with a volume named workspace mounted at /usr/local/deepfacelab/workspace.

Vanilla DeepFaceLab

docker run --rm -it \
    -v workspace:/usr/local/deepfacelab/workspace \
    xychelsea/deepfacelab:latest

With Jupyter Notebooks server pre-installed, run with:

docker run --rm -it -d \
    -v workspace:/usr/local/deepfacelab/workspace \
    -p 8888:8888 \
    xychelsea/deepfacelab:latest-jupyter

DeepFaceLab with NVIDIA/CUDA GPU support

docker run --gpus all --rm -it \
    -v workspace:/usr/local/deepfacelab/workspace \
    xychelsea/deepfacelab:latest-gpu /bin/bash

With Jupyter Notebooks server pre-installed, run with:

docker run --gpus all --rm -it -d \
    -v workspace:/usr/local/deepfacelab/workspace \
    -p 8888:8888 \
    xychelsea/deepfacelab:latest-gpu-jupyter
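
Because -d detaches the container, the Jupyter access token is not printed to your terminal. One way to recover it (assuming the server writes its startup URL to the container logs, as Jupyter normally does):

docker logs <container-id> 2>&1 | grep token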

Using DeepFaceLab

[TK]
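
While this section is still to come, the issue threads below exercise the shell scripts shipped with the container. A minimal workflow sketch, assuming the defaults from the Environment section (scripts symlinked at ~/scripts and source video placed in the workspace volume):

cd ~/scripts
./2_extract_PNG_from_video_data_src.sh   # extract frames from the source video
./4_data_src_extract_faces_S3FD.sh       # detect and align faces in the frames
./4.2_data_src_sort.sh                   # sort the extracted faceset
./6_train_Quick96.sh                     # train the Quick96 model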

Building the containers

To build the containers, with or without GPU support, use the deepfacelab-docker GitHub repository:

git clone https://github.com/xychelsea/deepfacelab-docker.git
cd deepfacelab-docker

Vanilla DeepFaceLab

The base container builds on xychelsea/tensorflow:latest from the Anaconda 3 container stack (xychelsea/anaconda3:latest), with Tini as the init process:

docker build -t deepfacelab:latest -f Dockerfile .

With Jupyter Notebooks server pre-installed, build with:

docker build -t deepfacelab:latest-jupyter -f Dockerfile.jupyter .

DeepFaceLab with NVIDIA/CUDA GPU support

docker build -t deepfacelab:latest-gpu -f Dockerfile.nvidia .

With Jupyter Notebooks server pre-installed, build with:

docker build -t deepfacelab:latest-gpu-jupyter -f Dockerfile.nvidia-jupyter .

Environment

The default environment uses the following configurable options:

ANACONDA_GID=100
ANACONDA_PATH=/usr/local/anaconda3
ANACONDA_UID=1000
ANACONDA_USER=anaconda
ANACONDA_ENV=deepfacelab
DEEPFACELAB_PATH=/usr/local/deepfacelab
DEEPFACELAB_HOME=$HOME/deepfacelab
DEEPFACELAB_WORKSPACE=$DEEPFACELAB_PATH/workspace
DEEPFACELAB_SCRIPTS=$DEEPFACELAB_PATH/scripts
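
These defaults can be changed when building your own image, assuming the Dockerfiles expose them as build arguments, or overridden at run time with docker run -e. For example, to match the container user to your host user (a hypothetical invocation, only valid if ANACONDA_UID and ANACONDA_GID are declared as ARGs):

docker build --build-arg ANACONDA_UID=$(id -u) --build-arg ANACONDA_GID=$(id -g) \
    -t deepfacelab:latest -f Dockerfile .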

deepfacelab-docker's People

Contributors

seaniedan, xychelsea


deepfacelab-docker's Issues

TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'

Choose one GPU idx.

[CPU] : CPU
[0] : Tesla P40
[1] : Tesla P40
[2] : Tesla P40
[3] : Tesla P40
[4] : Tesla P40
[5] : Tesla P40
[6] : Tesla P40

[2] Which GPU index to choose? : 2

[wf] Face type ( f/wf/head ?:help ) :
wf
[0] Max number of faces from image ( ?:help ) :
0
[512] Image size ( 256-2048 ?:help ) :
512
[90] Jpeg quality ( 1-100 ?:help ) :
90
[n] Write debug images to aligned_debug? ( y/n ) :
n
Performing manual extract...
Running on Tesla P40
0%| | 0/655 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/deepfacelab/main.py", line 345, in <module>
    arguments.func(arguments)
  File "/usr/local/deepfacelab/main.py", line 33, in process_extract
    Extractor.main( detector = arguments.detector,
  File "/usr/local/deepfacelab/mainscripts/Extractor.py", line 812, in main
    data = ExtractSubprocessor ([ ExtractSubprocessor.Data(Path(filename)) for filename in input_image_paths ], 'landmarks-manual', image_size, jpeg_quality, face_type, output_debug_path if output_debug else None, manual_window_size=manual_window_size, device_config=device_config).run()
  File "/usr/local/deepfacelab/core/joblib/SubprocessorBase.py", line 224, in run
    self.on_result (cli.host_dict, obj['data'], obj['result'])
  File "/usr/local/deepfacelab/mainscripts/Extractor.py", line 638, in on_result
    self.redraw()
  File "/usr/local/deepfacelab/mainscripts/Extractor.py", line 605, in redraw
    view_landmarks = (np.array(self.landmarks) * self.view_scale).astype(np.int).tolist()
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'

Docker build not building

Hello,

When I follow the instructions and run

docker build -t deepfacelab:latest-gpu-jupyter -f Dockerfile.nvidia-jupyter .

I get as far as:

Step 16/24 : RUN conda create -c nvidia -n deepfacelab python=3.7 cudnn=8.0.4 cudatoolkit=11.0.221
 ---> Running in 9489a15817fb
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /usr/local/anaconda3/envs/deepfacelab

  added / updated specs:
    - cudatoolkit=11.0.221
    - cudnn=8.0.4
    - python=3.7


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _libgcc_mutex-0.1          |      conda_forge           3 KB  conda-forge
    _openmp_mutex-4.5          |            1_gnu          22 KB  conda-forge
    ca-certificates-2020.12.5  |       ha878542_0         137 KB  conda-forge
    certifi-2020.12.5          |   py37h89c1867_1         143 KB  conda-forge
    cudatoolkit-11.0.221       |       h6bb024c_0       953.0 MB  nvidia
    cudnn-8.0.4                |       cuda11.0_0       518.7 MB  nvidia
    ld_impl_linux-64-2.35.1    |       hea4e1c9_2         618 KB  conda-forge
    libffi-3.3                 |       h58526e2_2          51 KB  conda-forge
    libgcc-ng-9.3.0            |      h2828fa1_19         7.8 MB  conda-forge
    libgomp-9.3.0              |      h2828fa1_19         376 KB  conda-forge
    libstdcxx-ng-9.3.0         |      h6de172a_19         4.0 MB  conda-forge
    ncurses-6.2                |       h58526e2_4         985 KB  conda-forge
    openssl-1.1.1k             |       h7f98852_0         2.1 MB  conda-forge
    pip-21.1.1                 |     pyhd8ed1ab_0         1.1 MB  conda-forge
    python-3.7.10              |hffdb5ce_100_cpython        57.3 MB  conda-forge
    python_abi-3.7             |          1_cp37m           4 KB  conda-forge
    readline-8.1               |       h46c0cb4_0         295 KB  conda-forge
    setuptools-49.6.0          |   py37h89c1867_3         947 KB  conda-forge
    sqlite-3.35.5              |       h74cdb3f_0         1.4 MB  conda-forge
    tk-8.6.10                  |       h21135ba_1         3.2 MB  conda-forge
    wheel-0.36.2               |     pyhd3deb0d_0          31 KB  conda-forge
    xz-5.2.5                   |       h516909a_1         343 KB  conda-forge
    zlib-1.2.11                |    h516909a_1010         106 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        1.52 GB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-1_gnu
  ca-certificates    conda-forge/linux-64::ca-certificates-2020.12.5-ha878542_0
  certifi            conda-forge/linux-64::certifi-2020.12.5-py37h89c1867_1
  cudatoolkit        nvidia/linux-64::cudatoolkit-11.0.221-h6bb024c_0
  cudnn              nvidia/linux-64::cudnn-8.0.4-cuda11.0_0
  ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.35.1-hea4e1c9_2
  libffi             conda-forge/linux-64::libffi-3.3-h58526e2_2
  libgcc-ng          conda-forge/linux-64::libgcc-ng-9.3.0-h2828fa1_19
  libgomp            conda-forge/linux-64::libgomp-9.3.0-h2828fa1_19
  libstdcxx-ng       conda-forge/linux-64::libstdcxx-ng-9.3.0-h6de172a_19
  ncurses            conda-forge/linux-64::ncurses-6.2-h58526e2_4
  openssl            conda-forge/linux-64::openssl-1.1.1k-h7f98852_0
  pip                conda-forge/noarch::pip-21.1.1-pyhd8ed1ab_0
  python             conda-forge/linux-64::python-3.7.10-hffdb5ce_100_cpython
  python_abi         conda-forge/linux-64::python_abi-3.7-1_cp37m
  readline           conda-forge/linux-64::readline-8.1-h46c0cb4_0
  setuptools         conda-forge/linux-64::setuptools-49.6.0-py37h89c1867_3
  sqlite             conda-forge/linux-64::sqlite-3.35.5-h74cdb3f_0
  tk                 conda-forge/linux-64::tk-8.6.10-h21135ba_1
  wheel              conda-forge/noarch::wheel-0.36.2-pyhd3deb0d_0
  xz                 conda-forge/linux-64::xz-5.2.5-h516909a_1
  zlib               conda-forge/linux-64::zlib-1.2.11-h516909a_1010


Proceed ([y]/n)? 

Downloading and Extracting Packages
_libgcc_mutex-0.1    | 3 KB      | ########## | 100% 
readline-8.1         | 295 KB    | ########## | 100% 
wheel-0.36.2         | 31 KB     | ########## | 100% 
setuptools-49.6.0    | 947 KB    | ########## | 100% 
ca-certificates-2020 | 137 KB    | ########## | 100% 
openssl-1.1.1k       | 2.1 MB    | ########## | 100% 
python-3.7.10        | 57.3 MB   | ########## | 100% 
xz-5.2.5             | 343 KB    | ########## | 100% 
cudnn-8.0.4          | 518.7 MB  | ########## | 100% 
libgcc-ng-9.3.0      | 7.8 MB    | ########## | 100% 
certifi-2020.12.5    | 143 KB    | ########## | 100% 
libstdcxx-ng-9.3.0   | 4.0 MB    | ########## | 100% 
zlib-1.2.11          | 106 KB    | ########## | 100% 
libffi-3.3           | 51 KB     | ########## | 100% 
ncurses-6.2          | 985 KB    | ########## | 100% 
libgomp-9.3.0        | 376 KB    | ########## | 100% 
tk-8.6.10            | 3.2 MB    | ########## | 100% 
cudatoolkit-11.0.221 | 953.0 MB  |            |   0% 
ld_impl_linux-64-2.3 | 618 KB    | ########## | 100% 
sqlite-3.35.5        | 1.4 MB    | ########## | 100% 
pip-21.1.1           | 1.1 MB    | ########## | 100% 
python_abi-3.7       | 4 KB      | ########## | 100% 
_openmp_mutex-4.5    | 22 KB     | ########## | 100% 
TypeError('not all arguments converted during string formatting')

Any clues? I am still getting the 'no numpy' error with deepfacelab-docker:latest.

Can't get a bash shell

Hi Chelsea, I didn't realise your Twitch stream was showing you coding this, congratulations on 15k subscribers! Looks like you're building a wonderful community.

I'm trying to figure out why I don't get a shell with the container. I'm on CentOS and have no problems with other Docker images. I use
docker run --gpus all --rm -it -d -v workspace:/usr/local/deepfacelab/workspace deepfacelab:latest-gpu
...I must be fundamentally misunderstanding something, because I only get a token, like
d21aa890152af6ad81e267c8dae1f6bfc178fdb7ff3c25ec79766c1fb66cf756 and no bash shell. So I use
docker exec -it my_container bash
to get a shell, but then I have all kinds of permission issues.
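
The -d flag detaches the container, which is why docker run prints only the container ID. A minimal check (a sketch, assuming the GPU image built above): dropping -d keeps the container in the foreground and attaches the shell directly.

docker run --gpus all --rm -it \
    -v workspace:/usr/local/deepfacelab/workspace \
    deepfacelab:latest-gpu /bin/bash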

Documentation please!

Perhaps I'm misunderstanding something fundamental here about how to use these shell scripts - could you please write some very brief instructions on what to do after running the Docker image to run the scripts?

ModuleNotFoundError: No module named 'tensorflow'

That old missing tensorflow problem. Bit stumped by this one. I ran:

(deepfacelab) anaconda@554ac43cdd60:~/scripts$ ./4_data_src_extract_faces_S3FD.sh
[wf] Face type ( f/wf/head ?:help ) : ?
Full face / whole face / head. 'Whole face' covers full area of face include forehead. 'head' covers full head, but requires XSeg for src and dst faceset.
[wf] Face type ( f/wf/head ?:help ) : head
head
[0] Max number of faces from image ( ?:help ) : ?
If you extract a src faceset that has frames with a large number of faces, it is advisable to set max faces to 3 to speed up extraction. 0 - unlimited
[0] Max number of faces from image ( ?:help ) : 1
1
[768] Image size ( 256-2048 ?:help ) : ?
Output image size. The higher image size, the worse face-enhancer works. Use higher than 512 value only if the source image is sharp enough and the face does not need to be enhanced.
[768] Image size ( 256-2048 ?:help ) : 512
512
[90] Jpeg quality ( 1-100 ?:help ) : 90
90
[n] Write debug images to aligned_debug? ( y/n ) : y
Extracting faces...
Error while subprocess initialization: Traceback (most recent call last):
  File "/usr/local/deepfacelab/core/joblib/SubprocessorBase.py", line 62, in _subprocess_run
    self.on_initialize(client_dict)
  File "/usr/local/deepfacelab/mainscripts/Extractor.py", line 68, in on_initialize
    nn.initialize (device_config)
  File "/usr/local/deepfacelab/core/leras/nn.py", line 79, in initialize
    import tensorflow
ModuleNotFoundError: No module named 'tensorflow'

Not sure how to proceed. Any help welcomed!
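
A first diagnostic, assuming the deepfacelab conda environment is supposed to carry TensorFlow (as the conda build step quoted in the issue above suggests):

conda list -n deepfacelab | grep -i tensorflow
python -c "import tensorflow as tf; print(tf.__version__)"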

ModuleNotFoundError: No module named 'ffmpeg'

My love-hate relationship with ffmpeg continues!

I built the container with the downloaded and unzipped deepfacelab-docker-0.1.3. Inside that folder I ran:
docker build -t deepfacelab:latest-gpu-jupyter -f Dockerfile.nvidia .
After a really decadent cup of Yorkshire Gold (one sugar, no milk, thanks), I got:
Successfully tagged deepfacelab:latest-gpu

I didn't see any discernible errors regarding ffmpeg. However, inside the container, I cd to the scripts directory and try to run:

(deepfacelab) anaconda@aaab3d68ee7e:~/scripts$ ./2_extract_PNG_from_video_data_src.sh 
Traceback (most recent call last):
  File "/usr/local/deepfacelab/main.py", line 324, in <module>
    arguments.func(arguments)
  File "/usr/local/deepfacelab/main.py", line 181, in process_videoed_extract_video
    from mainscripts import VideoEd
  File "/usr/local/deepfacelab/mainscripts/VideoEd.py", line 3, in <module>
    import ffmpeg
ModuleNotFoundError: No module named 'ffmpeg'

I wonder if the hard work you've done in compiling ffmpeg has broken a python library?
(deepfacelab) anaconda@aaab3d68ee7e:~/scripts$ which ffmpeg
reports it exists in Bash world:
/usr/local/ffmpeg-nvenc/bin/ffmpeg
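
Note that import ffmpeg loads the Python binding (the ffmpeg-python package), which is separate from the ffmpeg binary found by which. Assuming the binding is simply absent from the environment:

pip install ffmpeg-python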

ModuleNotFoundError: No module named 'colorama'

I tear the cellophane off a shiny new Jupyter notebook, and run

!cd ~/scripts && ./4.2_data_src_util_add_landmarks_debug_images.sh
It's feeling better, but I get:

Traceback (most recent call last):
  File "/usr/local/deepfacelab/main.py", line 6, in <module>
    from core.leras import nn
  File "/usr/local/deepfacelab/core/leras/__init__.py", line 1, in <module>
    from .nn import nn
  File "/usr/local/deepfacelab/core/leras/nn.py", line 26, in <module>
    from core.interact import interact as io
  File "/usr/local/deepfacelab/core/interact/__init__.py", line 1, in <module>
    from .interact import interact
  File "/usr/local/deepfacelab/core/interact/interact.py", line 8, in <module>
    import colorama
ModuleNotFoundError: No module named 'colorama'

Let me know if you'd like me to investigate further.
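
Assuming the environment is simply missing the package, the usual fix applies:

pip install colorama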

CUDNN not installed

Again, this might be a problem with how I'm using the repo - there are no instructions - so I'm attaching a bash terminal log.

All is working great (ffmpeg, sorting scripts) until I try to train:

(deepfacelab) anaconda@8edf84a647e9:~/scripts$ ./6_train_Quick96_no_preview.sh 
Running trainer.

[new] No saved models found. Enter a name of a new model : 
new

Model first run.

Choose one or several GPU idxs (separated by comma).

[CPU] : CPU
  [0] : Quadro RTX 5000

[0] Which GPU indexes to choose? : 
0

Initializing models:   0%|                                                                                                               | 0/5 [00:00<?, ?it/s]
Error: No OpKernel was registered to support Op 'DepthToSpace' used by node DepthToSpace (defined at /deepfacelab/core/leras/ops/__init__.py:336)  with these attrs: [data_format="NCHW", block_size=2, T=DT_FLOAT]
Registered devices: [CPU]
Registered kernels:
  device='GPU'; T in [DT_QINT8]
  device='GPU'; T in [DT_HALF]
  device='GPU'; T in [DT_FLOAT]
  device='CPU'; T in [DT_VARIANT]; data_format in ["NHWC"]
  device='CPU'; T in [DT_RESOURCE]; data_format in ["NHWC"]
  device='CPU'; T in [DT_STRING]; data_format in ["NHWC"]
  device='CPU'; T in [DT_BOOL]; data_format in ["NHWC"]
  device='CPU'; T in [DT_COMPLEX128]; data_format in ["NHWC"]
  device='CPU'; T in [DT_COMPLEX64]; data_format in ["NHWC"]
  device='CPU'; T in [DT_DOUBLE]; data_format in ["NHWC"]
  device='CPU'; T in [DT_FLOAT]; data_format in ["NHWC"]
  device='CPU'; T in [DT_BFLOAT16]; data_format in ["NHWC"]
  device='CPU'; T in [DT_HALF]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT32]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT8]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT8]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT16]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT16]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT32]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT64]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT64]; data_format in ["NHWC"]

	 [[DepthToSpace]]

Errors may have originated from an input operation.
Input Source operations connected to node DepthToSpace:
 LeakyRelu_4 (defined at /deepfacelab/core/leras/archis/DeepFakeArchi.py:58)
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1358, in _run_fn
    self._extend_graph()
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1398, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'DepthToSpace' used by {{node DepthToSpace}} with these attrs: [data_format="NCHW", block_size=2, T=DT_FLOAT]
Registered devices: [CPU]
Registered kernels:
  device='GPU'; T in [DT_QINT8]
  device='GPU'; T in [DT_HALF]
  device='GPU'; T in [DT_FLOAT]
  device='CPU'; T in [DT_VARIANT]; data_format in ["NHWC"]
  device='CPU'; T in [DT_RESOURCE]; data_format in ["NHWC"]
  device='CPU'; T in [DT_STRING]; data_format in ["NHWC"]
  device='CPU'; T in [DT_BOOL]; data_format in ["NHWC"]
  device='CPU'; T in [DT_COMPLEX128]; data_format in ["NHWC"]
  device='CPU'; T in [DT_COMPLEX64]; data_format in ["NHWC"]
  device='CPU'; T in [DT_DOUBLE]; data_format in ["NHWC"]
  device='CPU'; T in [DT_FLOAT]; data_format in ["NHWC"]
  device='CPU'; T in [DT_BFLOAT16]; data_format in ["NHWC"]
  device='CPU'; T in [DT_HALF]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT32]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT8]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT8]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT16]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT16]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT32]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT64]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT64]; data_format in ["NHWC"]

	 [[DepthToSpace]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/deepfacelab/mainscripts/Trainer.py", line 46, in trainerThread
    model = models.import_model(model_class_name)(
  File "/usr/local/deepfacelab/models/ModelBase.py", line 189, in __init__
    self.on_initialize()
  File "/usr/local/deepfacelab/models/Model_Quick96/Model.py", line 222, in on_initialize
    model.init_weights()
  File "/usr/local/deepfacelab/core/leras/layers/Saveable.py", line 104, in init_weights
    nn.init_weights(self.get_weights())
  File "/usr/local/deepfacelab/core/leras/ops/__init__.py", line 48, in init_weights
    nn.tf_sess.run (ops)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 967, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1190, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'DepthToSpace' used by node DepthToSpace (defined at /deepfacelab/core/leras/ops/__init__.py:336)  with these attrs: [data_format="NCHW", block_size=2, T=DT_FLOAT]
Registered devices: [CPU]
Registered kernels:
  device='GPU'; T in [DT_QINT8]
  device='GPU'; T in [DT_HALF]
  device='GPU'; T in [DT_FLOAT]
  device='CPU'; T in [DT_VARIANT]; data_format in ["NHWC"]
  device='CPU'; T in [DT_RESOURCE]; data_format in ["NHWC"]
  device='CPU'; T in [DT_STRING]; data_format in ["NHWC"]
  device='CPU'; T in [DT_BOOL]; data_format in ["NHWC"]
  device='CPU'; T in [DT_COMPLEX128]; data_format in ["NHWC"]
  device='CPU'; T in [DT_COMPLEX64]; data_format in ["NHWC"]
  device='CPU'; T in [DT_DOUBLE]; data_format in ["NHWC"]
  device='CPU'; T in [DT_FLOAT]; data_format in ["NHWC"]
  device='CPU'; T in [DT_BFLOAT16]; data_format in ["NHWC"]
  device='CPU'; T in [DT_HALF]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT32]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT8]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT8]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT16]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT16]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT32]; data_format in ["NHWC"]
  device='CPU'; T in [DT_INT64]; data_format in ["NHWC"]
  device='CPU'; T in [DT_UINT64]; data_format in ["NHWC"]

	 [[DepthToSpace]]

Errors may have originated from an input operation.
Input Source operations connected to node DepthToSpace:
 LeakyRelu_4 (defined at /deepfacelab/core/leras/archis/DeepFakeArchi.py:58)

I have CUDA in Docker:

(deepfacelab) anaconda@8edf84a647e9:~/scripts$ nvidia-smi
Thu Jun  3 08:51:38 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:65:00.0  On |                  Off |
| 34%   34C    P8    18W / 230W |    853MiB / 16124MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|

And before running the above command, I installed CUDNN inside Docker:

(deepfacelab) anaconda@8edf84a647e9:~/scripts$ conda install -c conda-forge cudnn
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.9.2
  latest version: 4.10.1

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /usr/local/anaconda3/envs/deepfacelab

  added / updated specs:
    - cudnn


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2021.5.30  |       ha878542_0         136 KB  conda-forge
    certifi-2021.5.30          |   py38h578d9bd_0         141 KB  conda-forge
    cudatoolkit-11.2.2         |       he111cf0_8       877.3 MB  conda-forge
    cudnn-8.1.0.77             |       h90431f1_0       634.8 MB  conda-forge
    openssl-1.1.1k             |       h7f98852_0         2.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        1.48 GB

The following NEW packages will be INSTALLED:

  cudatoolkit        conda-forge/linux-64::cudatoolkit-11.2.2-he111cf0_8
  cudnn              conda-forge/linux-64::cudnn-8.1.0.77-h90431f1_0

The following packages will be UPDATED:

  ca-certificates                      2020.12.5-ha878542_0 --> 2021.5.30-ha878542_0
  certifi                          2020.12.5-py38h578d9bd_1 --> 2021.5.30-py38h578d9bd_0
  openssl                                 1.1.1j-h7f98852_0 --> 1.1.1k-h7f98852_0


Proceed ([y]/n)? y


Downloading and Extracting Packages
cudatoolkit-11.2.2   | 877.3 MB  | #################################################################################################################### | 100% 
openssl-1.1.1k       | 2.1 MB    | #################################################################################################################### | 100% 
certifi-2021.5.30    | 141 KB    | #################################################################################################################### | 100% 
ca-certificates-2021 | 136 KB    | #################################################################################################################### | 100% 
cudnn-8.1.0.77       | 634.8 MB  | #################################################################################################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: \ By downloading and using the CUDA Toolkit conda packages, you accept the terms and conditions of the CUDA End User License Agreement (EULA): https://docs.nvidia.com/cuda/eula/index.html

- By downloading and using the cuDNN conda packages, you accept the terms and conditions of the NVIDIA cuDNN EULA -
  https://docs.nvidia.com/deeplearning/cudnn/sla/index.html

done

I believe the repo installs CUDA correctly, but doesn't install CUDNN; both are requirements. I was looking here for the error. Any advice/help appreciated!
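
Since the manual conda install above resolves it, one way to bake the same fix into the image (a sketch, assuming the deepfacelab environment name used elsewhere in this repo):

conda install -y -c conda-forge -n deepfacelab cudatoolkit cudnn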

python 3.8 required?

The vanilla Docker image starts, but I can't run any scripts because apparently it's set to grab the wrong version of Python. I tried building the container and changing the Python version in the Dockerfile, but that breaks everything. Please update this to the latest version, since DeepFaceLab is read-only these days.

No GUI

Hi Chelsea,

Imagine my shock and surprise when I ran

bash 6_train_Quick96.sh
and got the error:

: cannot connect to X server
I managed to solve this with X11 forwarding. There are some security concerns, but if you're behind a firewall, you should be safe to put xhost + in the start.sh script and then run docker with these options (also in the start.sh script):

--net=host --ipc=host \
-e DISPLAY=$DISPLAY \
-v /tmp/.X11-unix:/tmp/.X11-unix \
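
Assembled into a full command, that might look like this (a sketch, assuming the GPU image and the default workspace volume):

xhost +
docker run --gpus all --rm -it \
    --net=host --ipc=host \
    -e DISPLAY=$DISPLAY \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    -v workspace:/usr/local/deepfacelab/workspace \
    deepfacelab:latest-gpu /bin/bash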

Then you can monitor your training, which made me VERY HAPPY.

...I'd be happy to run through this with you - thank you again for laying some fantastic groundwork in this repo.

Sean

jupyter notebook not running in deepfacelab-docker-0.1.3

Maybe I don't know what I'm doing, but I downloaded and unzipped deepfacelab-docker-0.1.3
inside that folder I ran
docker build -t deepfacelab:latest-gpu-jupyter -f Dockerfile.nvidia-jupyter .
After a few cups of tea, I got:
Successfully tagged deepfacelab:latest-gpu-jupyter
I then ran
docker run --gpus all --rm -it -d -v workspace:/usr/local/deepfacelab/workspace -p 8888:8888 deepfacelab:latest-gpu-jupyter
..and I see a token
When I go to http://0.0.0.0:8888/ I see no notebook. I run:
docker exec -it <MY_CONTAINER> jupyter notebook list
and get:
OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "exec: "jupyter": executable file not found in $PATH": unknown

..this is new, in previous versions of this docker repo, I could get the token and jupyter notebook ran.
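
One thing to check (a guess, assuming the jupyter binary lives under the Anaconda install at the ANACONDA_PATH default documented above, rather than on the exec shell's $PATH):

docker exec -it <MY_CONTAINER> /usr/local/anaconda3/bin/jupyter notebook list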

chmod: cannot access '/usr/local/deepfacelab/scripts/*.sh': No such file or directory

Thank you for providing this repo. However, I've had no luck with it so far.

I do:
git clone https://github.com/xychelsea/deepfacelab-docker
cd deepfacelab-docker/
docker build -t deepfacelab:latest-gpu-jupyter -f Dockerfile.nvidia-jupyter .

and I get:
...

Cloning into '/usr/local/deepfacelab'...
Updating files: 100% (203/203), done.
Removing intermediate container a233826f65dc
---> 3f3dc3e8728d
Step 17/22 : USER root
---> Running in 566a3149f694
Removing intermediate container 566a3149f694
---> 434e98af705f
Step 18/22 : RUN fix-permissions ${DEEPFACELAB_WORKSPACE} && chmod +x ${DEEPFACELAB_SCRIPTS}/*.sh && fix-permissions ${DEEPFACELAB_SCRIPTS} && ln -s ${DEEPFACELAB_PATH} ${HOME}/deepfacelab && ln -s ${DEEPFACELAB_WORKSPACE} ${HOME}/workspace && ln -s ${DEEPFACELAB_SCRIPTS} ${HOME}/scripts
---> Running in 1aa3a05f6ebb
chmod: cannot access '/usr/local/deepfacelab/scripts/*.sh': No such file or directory
The command '/bin/bash -o pipefail -c fix-permissions ${DEEPFACELAB_WORKSPACE} && chmod +x ${DEEPFACELAB_SCRIPTS}/*.sh && fix-permissions ${DEEPFACELAB_SCRIPTS} && ln -s ${DEEPFACELAB_PATH} ${HOME}/deepfacelab && ln -s ${DEEPFACELAB_WORKSPACE} ${HOME}/workspace && ln -s ${DEEPFACELAB_SCRIPTS} ${HOME}/scripts' returned a non-zero code: 1
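
The failure suggests the cloned DeepFaceLab repository no longer ships .sh files at ${DEEPFACELAB_SCRIPTS}. A more forgiving variant of that Dockerfile step (a sketch, not the repository's actual fix) would tolerate a scripts directory with no matching files:

RUN find ${DEEPFACELAB_SCRIPTS} -name '*.sh' -exec chmod +x {} +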

ModuleNotFoundError: No module named 'PIL'

Hi Chelsea, and apologies if I have misunderstood everything but... I was looking for a way to run the DeepFaceLab repo in Docker, started coding my own, figured out I'd better have a look around first, and chanced upon your Docker images.

I now realise you have a Twitch stream with 15.2K subscribers - amazing! Very entertaining and worth a watch. So I don't know if this repo is even meant to be used, but in the spirit of trying to assist, I respectfully present my latest finding!

I was trying to launch script 4.2_data_src_sort.sh :

(deepfacelab) anaconda@aaab3d68ee7e:~/scripts$ ./4.2_data_src_sort.sh 
Traceback (most recent call last):
  File "/usr/local/deepfacelab/main.py", line 324, in <module>
    arguments.func(arguments)
  File "/usr/local/deepfacelab/main.py", line 68, in process_sort
    from mainscripts import Sorter
  File "/usr/local/deepfacelab/mainscripts/Sorter.py", line 14, in <module>
    from core import imagelib, mathlib, pathex
  File "/usr/local/deepfacelab/core/imagelib/__init__.py", line 5, in <module>
    from .text import get_text_image, get_draw_text_lines
  File "/usr/local/deepfacelab/core/imagelib/text.py", line 3, in <module>
    from PIL import Image, ImageDraw, ImageFont
ModuleNotFoundError: No module named 'PIL'

I'm going to try and debug, and if I find a solution, would a pull request be useful to you? I imagine you could solve this a lot faster than me!
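
For reference, the PIL import name is provided by the Pillow package; assuming the environment is simply missing it, this should resolve the error:

pip install Pillow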

How to change defaults

In the readme, it says I can change the defaults:
ANACONDA_GID=100
ANACONDA_PATH=/usr/local/anaconda3
ANACONDA_UID=1000
ANACONDA_USER=anaconda

Where do I change these? Also, if I change the GID and UID to match the user on the bare metal running the container, will this solve the problem of root owning files produced from the docker container?

Just a (thank you) word!

Hello!
I am starting out with TensorFlow, and each time I start out I try to find projects that inspire me and help me out, and this repo is the first I came across! Wanted to make an issue just explaining my appreciation!

Thanks for making this Repo!
Wishes from Sweden!

//Will.

pull docker shows error

error message:
Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?

Why docker doesn't use all the VRAM on EC2 instance?

I launched a g4dn.4xlarge instance (with Ubuntu) on AWS hoping to use DeepFaceLab on it.

Instance Size    GPUs   vCPUs   Memory (GiB)   GPU memory (GiB)
g4dn.4xlarge     1      16      64             16

I successfully installed the nvidia-container-toolkit and launched a Docker container (which contains all the dependencies to run DeepFaceLab) with this command:

docker run --gpus all --rm -it \
    -v workspace:/usr/local/deepface/workspace \
    xychelsea/deepfacelab:latest-gpu /bin/bash

The container launched successfully. I started extracting images and the faceset. Then came the training phase; here is the output:

$ bash scripts/6_train_Quick96.sh 
Running trainer.

[new] No saved models found. Enter a name of a new model : quick96
quick96

Model first run.

Choose one or several GPU idxs (separated by comma).

[CPU] : CPU
  [0] : Tesla T4

[0] Which GPU indexes to choose? : 
0

Initializing models: 100%|###################################################################################################################################################| 5/5 [00:01<00:00,  3.31it/s]
Loading samples: 100%|################################################################################################################################################| 1222/1222 [00:02<00:00, 436.58it/s]
Loading samples: 100%|################################################################################################################################################| 1217/1217 [00:02<00:00, 523.04it/s]
============ Model Summary =============
==                                    ==
==        Model name: quick96_Quick96 ==
==                                    ==
== Current iteration: 0               ==
==                                    ==
==---------- Model Options -----------==
==                                    ==
==        batch_size: 4               ==
==                                    ==
==------------ Running On ------------==
==                                    ==
==      Device index: 0               ==
==              Name: Tesla T4        ==
==              VRAM: 0.02GB          ==
==                                    ==
========================================
Starting. Press "Enter" to stop training and save model.

Trying to do the first iteration. If an error occurs, reduce the model parameters.

Error: 2 root error(s) found.
  (0) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[gradients/Reshape_18_grad/Reshape/_579]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'decoder_dst/res0/conv1/weight/read':
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/deepfacelab/mainscripts/Trainer.py", line 58, in trainerThread
    debug=debug)
  File "/deepfacelab/models/ModelBase.py", line 193, in __init__
    self.on_initialize()
  File "/deepfacelab/models/Model_Quick96/Model.py", line 73, in on_initialize
    self.src_dst_trainable_weights = self.encoder.get_weights() + self.inter.get_weights() + self.decoder_src.get_weights() + self.decoder_dst.get_weights()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 77, in get_weights
    self.build()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
    self._build_sub(v[name],name)
  File "/deepfacelab/core/leras/models/ModelBase.py", line 35, in _build_sub
    layer.build()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
    self._build_sub(v[name],name)
  File "/deepfacelab/core/leras/models/ModelBase.py", line 33, in _build_sub
    layer.build_weights()
  File "/deepfacelab/core/leras/layers/Conv2D.py", line 61, in build_weights
    self.weight = tf.get_variable("weight", (self.kernel_size,self.kernel_size,self.in_ch,self.out_ch), dtype=self.dtype, initializer=kernel_initializer, trainable=self.trainable )
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1593, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1336, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 591, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 543, in _true_getter
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 961, in _get_single_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 260, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 221, in _variable_v1_call
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 199, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 2634, in default_variable_creator
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1668, in __init__
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1861, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name="read")
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 287, in identity
    ret = gen_array_ops.identity(input, name=name)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3943, in identity
    "Identity", input=input, name=name)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()

Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[{{node decoder_dst/res0/conv1/weight/read}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[{{node decoder_dst/res0/conv1/weight/read}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[gradients/Reshape_18_grad/Reshape/_579]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/deepfacelab/mainscripts/Trainer.py", line 129, in trainerThread
    iter, iter_time = model.train_one_iter()
  File "/usr/local/deepfacelab/models/ModelBase.py", line 474, in train_one_iter
    losses = self.onTrainOneIter()
  File "/usr/local/deepfacelab/models/Model_Quick96/Model.py", line 276, in onTrainOneIter
    warped_dst, target_dst, target_dstm)
  File "/usr/local/deepfacelab/models/Model_Quick96/Model.py", line 178, in src_dst_train
    self.target_dstm:target_dstm,
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[gradients/Reshape_18_grad/Reshape/_579]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'decoder_dst/res0/conv1/weight/read':
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/deepfacelab/mainscripts/Trainer.py", line 58, in trainerThread
    debug=debug)
  File "/deepfacelab/models/ModelBase.py", line 193, in __init__
    self.on_initialize()
  File "/deepfacelab/models/Model_Quick96/Model.py", line 73, in on_initialize
    self.src_dst_trainable_weights = self.encoder.get_weights() + self.inter.get_weights() + self.decoder_src.get_weights() + self.decoder_dst.get_weights()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 77, in get_weights
    self.build()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
    self._build_sub(v[name],name)
  File "/deepfacelab/core/leras/models/ModelBase.py", line 35, in _build_sub
    layer.build()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
    self._build_sub(v[name],name)
  File "/deepfacelab/core/leras/models/ModelBase.py", line 33, in _build_sub
    layer.build_weights()
  File "/deepfacelab/core/leras/layers/Conv2D.py", line 61, in build_weights
    self.weight = tf.get_variable("weight", (self.kernel_size,self.kernel_size,self.in_ch,self.out_ch), dtype=self.dtype, initializer=kernel_initializer, trainable=self.trainable )
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1593, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1336, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 591, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 543, in _true_getter
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 961, in _get_single_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 260, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 221, in _variable_v1_call
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 199, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 2634, in default_variable_creator
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1668, in __init__
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1861, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name="read")
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 287, in identity
    ret = gen_array_ops.identity(input, name=name)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3943, in identity
    "Identity", input=input, name=name)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()

I get an error telling me that resources are exhausted, while I have 64 GB of memory and 16 GB of VRAM. The Model Summary says VRAM: 0.02GB, yet I have far more VRAM on the g4dn.4xlarge instance.

What's the problem here?

Why can't I use all the VRAM available from the host in my Docker container?
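
One diagnostic sketch, assuming a TensorFlow version that provides the tf.config.experimental API: the VRAM: 0.02GB line in the Model Summary suggests TensorFlow is reporting almost no free memory on the device, so it is worth comparing what nvidia-smi and the Python process each see from inside the container:

nvidia-smi
python -c "import tensorflow as tf; print(tf.config.experimental.list_physical_devices('GPU'))"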
