thefoundryvisionmongers / nuke-ml-server
A Nuke client plug-in which connects to a Python server to allow Machine Learning inference in Nuke.
License: Apache License 2.0
Hi! Many thanks for this great project.
I'm trying to write some integration tests, but without much success. Here is what I tried so far, but I get the errors below.
image_node = nuke.nodes.Read(file = "/path/to/file.jpg")
ml_node = nuke.nodes.MLClient()
# Maybe something like this would trigger the node to connect to the server?
# ml_node.knobs()['connect'].execute()
ml_node.setInput(0, image_node)
nuke.execute(ml_node, 1, 2)
# RuntimeError: MLClient1 cannot be executed
# MLClient1 cannot be executed
Any advice is highly appreciated
Is it possible to use these tools without running as root (uid 0)?
Add the ability to load and unload models on the fly, while the server is running
Whenever I save a Nuke script with MLClient nodes and then load that script back, I get error messages: "MLClientXX.YY: no such knob", where YY is a custom attribute in my ML model.
As a result, all the fields go back to their default values, including the input that specifies the path to the pre-trained model.
Is that behaviour expected?
Is there something I can do to avoid losing information?
Or perhaps can you fix on your end?
Thank you!!
Just leaving this here to help people out.
When I type the name of my machine, or "localhost" in the "host" Input for the MLClient node it fails to connect giving me the error message:
Hostname is invalid
If I type the typical localhost IP 127.0.0.1 it fails with the message:
Could not connect to server. Please check your host / port numbers.
It will ONLY work if I type the IP address returned by the command line:
hostname -I | awk '{print $1}'
It would be great to resolve the host name.
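For reference, here is roughly what resolving the host name before connecting could look like. The MLClient itself is C++, so this Python sketch is only illustrative of the behaviour, not the actual client code:

```python
import socket

def resolve_host(host):
    """Resolve a host name (e.g. "localhost" or a machine name) to an
    IPv4 address string; an IP address passes through unchanged."""
    try:
        return socket.gethostbyname(host)
    except socket.gaierror:
        raise ValueError("Hostname is invalid: %s" % host)
```

Note that 127.0.0.1 failing to connect may be a separate issue: if the server binds only to the machine's external interface, the loopback address won't reach it even when resolution succeeds.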
If I have a few MLClient nodes, all pointing to the same model class with custom knobs, and then I hit the "Connect" button, then things change in unpredictable ways:
Ideally it would be less destructive, and would first try to retain the knobs and their values or expressions.
Hi folks,
I'm modifying the server to run with Python3, as my ML environments require it.
I think I may have run into some issues with protobuf. In my server output, I see the following:
Server -> Listening on port: 55555
Server -> Receiving message of size: 6
Server -> 6 bytes read
Server -> Message parsed
Server -> Received info request
Server -> Serializing message
Server -> Sending response message of size: 98
Server -> -----------------------------------------------
But the client output reports reading data of size 0 instead of 98:
Client -> Connected to 172.17.0.2
Client -> Sending info request
Client -> Created message
Client -> Serialized message
Client -> Created char array of length 6
Client -> Copied to char array
Client -> Message sent
Client -> Reading header data
Client -> Reading data of size: 0
Client -> Deserializing message
Client -> Closed connection
This is my first time using protobuf, so I thought I'd ask whether there's anything obvious I should be looking for as to why it works with Python 2 and not Python 3.
cheers
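A common Python 2 → 3 pitfall with this kind of socket code (a guess, since I don't know the exact cause here): in Python 3, `socket.recv()` returns `bytes`, so header code that mixes `str` and `bytes` can silently produce a size of 0. Packing the size as fixed-width binary with `struct` sidesteps the ambiguity. This sketch assumes a 4-byte big-endian size prefix for illustration; it is not necessarily the wire format nuke-ML-server actually uses:

```python
import struct

def pack_message(payload):
    # Prefix the serialized protobuf payload with a fixed-width,
    # big-endian 4-byte size header (assumed format, for illustration).
    return struct.pack(">I", len(payload)) + payload

def unpack_header(header):
    # In Python 3 this must operate on bytes, not str.
    return struct.unpack(">I", header)[0]
```

Either way, checking every place the server formats or parses the size header for a str/bytes mismatch would be my first step.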
Hi!
Is there a plan to add support for other kinds of native Nuke knobs?
I listed a few in the title just as example.
I wonder if it could check for the attribute "shape" and, if it's a tuple, map it to the appropriate knob? That way it would be compatible with PyTorch or NumPy tensors, for example. Just an idea.
Also along these lines, instead of using the data type as the only input, perhaps there could be a way to specify metadata about each input, which could distinguish between RGB and 3D position types, as well as provide min/max values for a float input.
I think all that could go a long way to create intuitive interfaces for ML models embedded in Nuke.
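To make the shape-mapping idea concrete, something like this on the server side could pick a knob type from an input's `shape` attribute. The knob names here (`FloatKnob`, `ColorKnob`, `ArrayKnob`) are hypothetical placeholders, not actual nuke-ML-server API:

```python
def knob_for_input(value):
    """Pick a (hypothetical) knob type from an input's shape attribute.
    Works for anything with a tuple-like .shape, e.g. NumPy or PyTorch tensors;
    values without a shape are treated as scalars."""
    shape = tuple(getattr(value, "shape", ()))
    if shape == ():
        return "FloatKnob"   # scalar
    if shape == (3,):
        return "ColorKnob"   # could equally be a 3D position, which is
                             # exactly why extra metadata would help
    return "ArrayKnob"
```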
First of all, thanks for this plugin! I managed to build and install it (I think), but when I try to drop the node into Nuke I get the following error:
/usr/local/home/fcole/.nuke/MLClient.so: undefined symbol: _ZNK2DD5Image2Op15input_longlabelB5cxx11Ei
I checked the build paths and it does seem like it is built against the Nuke version I am running (11.2v3). I've had a couple other Nuke installs on this machine, though, so wondering if this error could be caused by finding a stale library somewhere.
TCL expressions are another common trick used in comp Nuke scripts, and it would be great if they also worked on MLClient nodes.
They didn't work for my custom string knob, which would always receive the literal string containing the TCL expression.
Is that something that would need to happen in the server side? If so... could that be provided as an API to make it simpler to write wrappers for pre-trained models?
Hello,
During server installation for nuke-ml-server on Ubuntu 18.04, I get the following error when I run this command:
sudo docker build -t mlserver -f Dockerfile .
ERROR :
WARNING: Discarding https://files.pythonhosted.org/packages/4a/85/db5a2df477072b2902b0eb892feb37d88ac635d36245a72a6a69b23b383a/PyYAML-3.12.tar.gz#sha256=592766c6303207a20efc445587778322d7f73b161bd994f227adaa341ba212ab (from https://pypi.org/simple/pyyaml/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement pyyaml==3.12 (from versions: 3.10, 3.11, 3.12, 3.13b1, 3.13rc1, 3.13, 4.2b1, 4.2b2, 4.2b4, 5.1b1, 5.1b3, 5.1b5, 5.1, 5.1.1, 5.1.2, 5.2b1, 5.2, 5.3b1, 5.3, 5.3.1, 5.4b1, 5.4b2, 5.4, 5.4.1, 6.0b1, 6.0, 6.0.1)
ERROR: No matching distribution found for pyyaml==3.12
This error comes as part of the requirements installation for the 'detectron' repository: https://github.com/facebookresearch/Detectron/blob/main/requirements.txt
If you can give me any pointers on how to solve this that would be very helpful!
Kind regards,
Shashwat
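One possible workaround, judging from the log above (the PyYAML 3.12 sdist is discarded because its `setup.py` no longer builds under recent Python/setuptools): relax the pin in a local copy of the requirements file before building the image. Whether Detectron actually works with a newer PyYAML is untested, so treat this as a sketch:

```python
import re

def relax_pyyaml_pin(requirements_text):
    """Replace an exact pyyaml==3.12 pin with a minimum-version
    constraint that recent pip can still resolve."""
    return re.sub(r"pyyaml==3\.12", "pyyaml>=5.1",
                  requirements_text, flags=re.IGNORECASE)
```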
I just realized that if I instantiate more than one MLClient node for my custom class, both Nuke nodes seem to talk to a single instance of my BaseModel-derived class on the server side.
I was hoping that each instance of my BaseModel class would somehow be associated with one nuke node, so I could do things like, keep a reference to a pre-trained model and re-use it for each new input.
But as it is, it seems I would have to keep in my BaseModel-derived class some notion of a cache that keeps alive any model the user is using in that Nuke script. For example, say the user is comparing the outputs of two pre-trained models.
Would it be possible to send messages to the server when nodes are removed/added so that cache can purge some items?
Or better yet, would it be possible for the Server to instantiate one BaseModel object per nuke node?
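Server-side, one way to approximate per-node instances without changing the protocol much would be a registry keyed by some node identifier sent with each request. `node_id` here is a hypothetical field, not something the current message format carries:

```python
class ModelRegistry:
    """Keep one model instance per (node_id, model_name) pair."""
    def __init__(self, factory):
        self._factory = factory      # callable: model_name -> model instance
        self._instances = {}

    def get(self, node_id, model_name):
        key = (node_id, model_name)
        if key not in self._instances:
            self._instances[key] = self._factory(model_name)
        return self._instances[key]

    def release(self, node_id):
        # Purge all instances for a deleted node; this would need the
        # node-removed message suggested above to ever be called.
        for key in [k for k in self._instances if k[0] == node_id]:
            del self._instances[key]
```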
Is it possible to install the software without access to the internet if the files are downloaded in advance?
Many visual effects facilities have large investments in CPU-only render farms; is it possible to do inference on a distributed CPU render farm?
Cross posting from the community forum where I posted a simpler repro: https://community.foundry.com/discuss/topic/159878/continuous-render-for-planariop
I'm trying to modify the MLClient / Server so that the server can pass progressive updates to the PlanarIop in Nuke. The reason for this is that my model is an optimisation-based style transfer, which can take a few minutes to run through sufficient epochs. My hope was to pass an update through for, say, every 10 epochs, so that the user gets some kind of colour output quickly which then refines in front of them.
The only alternative I could think of was a button the user can press repeatedly to progressively refine the result, but that's not a great user experience.
Just tagging @ringdk as I'm not sure if this repo is still actively maintained. For what it's worth, I attempted to use Torchscript / Copycat, however a limitation there is that you cannot initialise an optimiser in Torchscript, which leaves my model dead in the water.
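For what it's worth, the server half of this could be restructured as a generator that yields an intermediate frame every N epochs; how those partial results would actually reach the PlanarIop is the open question, since the current request/response protocol expects a single reply. A minimal sketch, where `step` stands in for one optimisation epoch:

```python
def progressive_inference(image, step, total_epochs, epochs_per_update=10):
    """Run `total_epochs` of optimisation, yielding an intermediate
    result every `epochs_per_update` epochs and always at the end."""
    result = image
    for epoch in range(1, total_epochs + 1):
        result = step(result)
        if epoch % epochs_per_update == 0 or epoch == total_epochs:
            yield epoch, result
```

Each yielded result would then be serialized and pushed to the client as it arrives, instead of waiting for the final epoch.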
I have a custom ML model that has inputs, which MLClient created as dynamic knobs.
If I select that node and hit Ctrl+G then I get a pop-up with several error messages of the type:
<node>.<custom_knob>: no such knob
I open the group and introspect the MLClient node and I see it has lost all dynamic knobs and I have to click "Connect" to get them back, with all the values lost.
Hi Folks
The license for the project is listed as Apache License, Version 2.0. At the same time the readme and the MLClient node reads: "This is strictly non-commercial". These are legally in conflict. The Apache license you assigned to the code means I can pretty much do whatever I want with the code and sell it to whomever I want. Also the license is irrevocable.
Copied from the included Apache License 2.0 in this project:
- Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
You really should remove the "This is strictly non-commercial" part from the readme and the node.
Is it possible to do training from the nuke-ML-server?
The examples are inference in Caffe2 from Facebook.
Is it possible to pass labels or other ground-truth data to the ML model and have it learn from the toolset? It seems it is inference-only.
Where would the model checkpoints be stored?
Would the data rate be adequate?
Hey, I know that at least CUDA 10 and cuDNN 7.4 are required for Turing-based cards (the RTX range); has this been tested on Turing cards?
I'm sure this problem can get very complicated and may require custom implementations, but I was wondering whether you have intentions or ideas on how to manage the limited GPU resources across all MLClient nodes instantiated in a nuke scene.
Nuke MLClient nodes could be talking with different or same classes in the MLServer side, using any kind of back end (pytorch, tensorflow, ..).
This is somewhat related to issue #21 but it goes beyond that because it deals with all the classes used in a Nuke session.
Here's a broad idea that may be a good discussion starter:
It feels that this more general approach would make issue #21 irrelevant and it would deal with complex scenarios, including multi-gpu.
Do you see a benefit adding something like that to the MLServer API?
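As a discussion starter in code form: the server could gate inference behind a pool sized to the number of GPUs, handing each request a device index and blocking when all devices are busy. This is a framework-agnostic sketch of the idea, not a claim about how MLServer schedules work today:

```python
import queue

class GpuPool:
    """Hand out GPU indices to concurrent inference requests,
    blocking when all devices are in use."""
    def __init__(self, num_gpus):
        self._free = queue.Queue()
        for i in range(num_gpus):
            self._free.put(i)

    def acquire(self):
        return self._free.get()   # blocks until a device is free

    def release(self, gpu_index):
        self._free.put(gpu_index)
```

Each BaseModel-derived class would then place its tensors on the device index it was handed, regardless of back end (PyTorch, TensorFlow, ...), which is what would make this multi-GPU aware.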
I'm trying to use the MLClient node as if it were a regular Nuke node, and this is another common need: copying and pasting a node should preserve its knob values. They are not copied, I guess because they are dynamic knobs. I think simple things like that would be expected in order to consider this approach viable for production.
Server -> Receiving message of size: 24883378
Server -> 24883378 bytes read
Server -> Message parsed
Server -> Received inference request
Server -> Requesting inference on model: densepose
Server -> Starting inference
WARNING:root:[====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information.
Server -> Exception caught on inference on model:
Server -> Serializing message
Server -> Sending response message of size: 18
Server -> ----------------------------------------------
But in the good news pile, MaskRCNN works.
I am using CentOS.
The Dockerfile is for Ubuntu. Can you make one for CentOS?
Is there a reason why CMake 3 is a requirement?
[kognat@vxfhost Server]$ sudo docker run --runtime=nvidia -v /home/kognat/dev/nuke-ML-server/Models:/workspace/ml-server/models:ro -it nuke-ml-magic:latest
[sudo] password for kognat:
root@a1b2b7bf4646:/workspace/ml-server# python
Python 2.7.16 |Anaconda, Inc.| (default, Mar 14 2019, 21:00:58)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
2019-05-25 07:49:46.583944: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
Aborted (core dumped)
Party over dude.
Sorry about being lazy but a picture says it all
protobuf-3.5.1 was compiled as follows
cd ~/dev/
wget https://github.com/protocolbuffers/protobuf/releases/download/v3.5.1/protobuf-cpp-3.5.1.tar.gz
tar -zxvf protobuf-cpp-3.5.1.tar.gz
cd protobuf-3.5.1/cmake
mkdir build && cd build
cmake3 .. -DCMAKE_INSTALL_PREFIX=~/opt/protobuf-3.5.1 -DCMAKE_POSITION_INDEPENDENT_CODE=ON
make -j12
make install
Then the Plugin was compiled as follows
cd ~/dev
git clone https://github.com/TheFoundryVisionmongers/nuke-ML-server
cd nuke-ML-server/build/
cmake3 .. -DCMAKE_INSTALL_PREFIX=~/opt/protobuf-3.5.1 -DNUKE_INSTALL_PATH=/usr/local/Nuke11.3v4/
make -j12
Then Nuke was run
export NUKE_PATH=/home/kognat/dev/nuke-ML-server/build/Plugins/Client
/usr/local/Nuke11.3v4/Nuke11.3
See screenshot attached.
Add an option to shut down the server, for example via a keyboard shortcut and/or some other mechanism.
Hi there! Thank you so much for sharing this implementation.
The current code lists the models under the directory in which the server was launched, and that seems somewhat restrictive.
Would it be possible for you to add support to multiple locations for the models that the server can see?
Ideally I would like to launch the server and specify, perhaps in an environment variable, multiple directories where it could search for models, separated by ":" (following Linux conventions).
And perhaps "baseModel.py" should live in a separate directory, so that one could point to its location using PYTHONPATH when launching the server; that would make it possible for all custom models, wherever they are, to import the base class.
How do you like that?
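For example, the server could honour a colon-separated search path from an environment variable; `ML_SERVER_MODEL_PATH` is a made-up name for illustration:

```python
import os

def model_search_dirs(env_var="ML_SERVER_MODEL_PATH", default="models"):
    """Return the list of directories to scan for models, taken from a
    colon-separated environment variable, falling back to a default."""
    raw = os.environ.get(env_var, default)
    return [d for d in raw.split(":") if d]
```

The server's model-discovery loop would then iterate over this list instead of a single hard-coded directory.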
The contents and definition of the dynamic knobs are serialized in the Nuke script correctly, and they are retrieved during load, but they aren't applied to the node until the node is evaluated (i.e. plugged into the viewer).
I believe this is not ideal, because:
In my mind, what should change is the moment where UI is updated. To me, it should only occur in three situations:
I don't think the UI should change at all during evaluation. If for any reason the UI is out of date with the actual model on the server (i.e. the names/types of input parameters are different), that should be detected during evaluation and an error displayed, letting the user know they should click "Connect" to refresh the UI.
That way one can load a Nuke script with MLClient nodes and, even if the server is down, all the knobs will retain their values or expressions. Loading the script, trying to infer, seeing errors in the viewer and saving back to file won't be a destructive operation.
Thanks for sharing this great project and its plugin for introducing deep learning into Nuke in simple steps.
I am also doing some similar work in Nuke, but my solution is the Blink API. The Blink API can use cross-platform GPU and CPU computing, and can work like OpenGL, which has been used for cross-platform inference (TensorFlow.js using WebGL). So it is suitable for building a cross-platform inference engine and is easy to deploy in the real world. As for using a full-pipeline deep learning framework, installing a full deep learning environment is still hard nowadays, especially for Windows users. Also, if users do not have an Nvidia graphics card, using the GPU might be a problem for them. Because of this, I think the Blink API (OpenGL/OpenCL, or an inference engine like TensorRT, TVM...) is more suitable for near-future use.
I have done a small experiment using the Blink API for inference, and it worked. The model is a sequential convolutional neural network with 10 conv2D layers, which can achieve most low-level visual tasks such as super-resolution, deblurring and denoising. The model was hard-coded in the C++ plugin. Complex model support can be enabled by introducing a computational graph. The conv2D implementation is the simplest convolution, but the speed is acceptable.
Here is the example code for the kernel (conv2D 9x9 64 with batch normalization and relu).
// Copyright (c) 2019 Hepesu Animation Toolkits Project. All Rights Reserved.
#define epsilon 1e-7
inline float batchNorm(float x, float mean, float var, float gamma, float beta){
return (x - mean) / (sqrt(var) + epsilon) * gamma + beta;
}
inline float relu(float x){
return max(0.0f, x);
}
kernel ConvBlockAKernel : public ImageComputationKernel<ePixelWise>
{
Image<eRead, eAccessRanged2D, eEdgeConstant> src;
Image<eWrite> dst;
param:
float weight[81];
float bias;
float gamma;
float beta;
float mean;
float var;
int outputChannel;
local:
int2 _filterOffset;
int2 _kernelSize;
void init()
{
_kernelSize[0] = 9;
_kernelSize[1] = 9;
int2 filterRadius(_kernelSize[0] / 2, _kernelSize[1] / 2);
_filterOffset[0] = -filterRadius[0];
_filterOffset[1] = -filterRadius[1];
src.setRange(-filterRadius[0], -filterRadius[1], filterRadius[0], filterRadius[1]);
}
void process() {
// Init value with 0.0
float value = 0.0;
for (int in_channel = 0; in_channel < src.kComps; in_channel++){
// Iterate in ks x ks range
for(int j = 0; j < _kernelSize[1]; j++) {
for(int i = 0; i < _kernelSize[0]; i++) {
value += weight[j * 9 + i] * src(i + _filterOffset[0], j + _filterOffset[1], in_channel);
}
}
}
// Add bias then bn and relu
dst(outputChannel >= dst.kComps ? dst.kComps - 1 : outputChannel) = relu(batchNorm(value + bias, mean, var, gamma, beta));
}
};
Here is the code for calling this kernel in plugin. I am using setParamValue to pass weights of the model. If the weights are huge, this can be done by passing them as an image source.
// Copyright (c) 2019 Hepesu Animation Toolkits Project. All Rights Reserved.
Blink::Kernel blockA(_convBlockAWideProgram, computeDevice, imagesIA, kBlinkCodegenDefault);
for (int outChan = 0; outChan < 64; ++outChan){
float weights[81];
for (int i = 0; i < 81; ++i)
weights[i] = dequantize(wide_block1_weight[outChan][i], wide_block1_weight_max, wide_block1_weight_min);
blockA.setParamValue("weight", weights, 81);
blockA.setParamValue("bias", dequantize(wide_block1_bias[outChan], wide_block1_bias_max, wide_block1_bias_min));
blockA.setParamValue("mean", dequantize(wide_block1_mean[outChan], wide_block1_mean_max, wide_block1_mean_min));
blockA.setParamValue("var", dequantize(wide_block1_var[outChan], wide_block1_var_max, wide_block1_var_min));
blockA.setParamValue("gamma", dequantize(wide_block1_gamma[outChan], wide_block1_gamma_max, wide_block1_gamma_min));
blockA.setParamValue("beta", dequantize(wide_block1_beta[outChan], wide_block1_beta_max, wide_block1_beta_min));
blockA.setParamValue("outputChannel", outChan);
blockA.iterate();
}
The performance can be further improved by using Winograd or GEMM, which most deep learning frameworks use. With these, inference can be supported on all platforms, even for Nuke 9, and users do not need to install any other software or drivers. Also, inference is done stripe by stripe, so it behaves like most nodes: users do not need to wait for the whole image, which may take a very long time and huge memory. But due to limitations of the Blink API, implementing a general inference engine (with a computational graph) is not easy. The GEMM and a complex computational graph might exceed the limits of the Blink API. So for the latest Nuke, I prefer to use the TensorRT inference engine to do the job.
The server solution is great for testing the newest deep learning technology, and I think this is the mainstream for the future. But it would be great if the inference could be done on the local machine, so customers can use deep learning tools just like other simple nodes.
I was looking at this issue
https://forums.docker.com/t/libc-incompatibilities-when-will-they-emerge/9895
Will this Dockerfile image run on the glibc 2.12 found in CentOS 6 images?