This project contains fused CUDA kernels for evaluating scene representation networks using custom activation functions, various input encodings and flexible layer specifications.
It is the successor of fV-SRN (https://github.com/shamanDevel/fV-SRN), my previous project; and adapts further idea from tiny-cuda-nn (https://github.com/NVlabs/tiny-cuda-nn), a concurrent development of such fused MLPs.
Features
QuickMLP (this project) | fV-SRN | tiny-cuda-nn |
---|---|---|
full CUDA code generation and compilation during runtime | partial CUDA runtime compilation | static CUDA compilation |
different sizes at every layer supported | limited layer sizes, all must be identical | flexible layer sizes, but all must be identical and limited to a pre-defined set |
supports bias in the fully-connected layers | supports bias | no bias supported |
fully-customizable and user-definable activation functions, can be different at every layer | fixed set of activation functions | fixed set of activation functions |
supports fused training+inference | only inference is fused | supports fused training+inference |
single kernel for input encoding + network evaluation | single kernel for input encoding + network evaluation | one kernel per input encoding, one kernel for the network |
In other words, this project builds upon the runtime compilation idea of fV-SRN to provide fully fused kernels with maximal flexibility. The user can freely customize the network layer sizes and activation functions and arbitrarily plug them together. To support training as well, we adapt ideas of tiny-cuda-nn how the weight update and backpropagation can be realized.
Example network specification
TBA
Performance
TBA
- CUDA 11.x (tested with 11.6)
- C++17-compatible compiler (tested with MSVC2019 and GCC 9.4.0)
- PyTorch for the bindings, tested with version 1.11, but should work with newer as well
Don't forget to clone this repository with submodules (git clone --recurse-submodules
). Or if you forgot to clone with submodules, you can initialize them afterwards with git submodule init & git submodule update
.
The C++ library is located in src_cpp/include
and src_cpp/src
.
To compile, include the root folder of QuickMLP as a subdirectory in CMake.
Then, link against qmlp::qmlp-library
.
Compile the PyTorch extension: Use setup.py
!
- Activate your python environment, if desired (virtualenv or conda)
- Go to the root directory of QuickMLP.
- Call
pip -e .
- Enjoy!
Note: right now, compilation is only possible in developer-mode, i.e. the files in this folder are directly used and not copied to the python installation. I haven't figured out yet how to copy the resource files (kernel sources) to the installation target in setup.py. Ideas, issues, PRs are welcome!
The following documentation is written for the Python bindings, but it holds true
for the C++ library as well. Just change the class names from snake_case
to CamelCase
and you'll have the associated C++ class / method.
Example json specification of the network and encoding:
{
"num_inputs": 3,
"num_outputs": 1,
"activation_specification": [
"qmlp/builtin-activations.json"
],
"encodings": [
{
"id": "identity",
"start_in": 0,
"n_in": 3
}
],
"network": [
{
"n_out": 16,
"bias": false,
"activation": "relu"
},
{
"n_out": 1,
"bias": false,
"activation": "relu"
}
],
"options": {} //<-- optional
}
TODO: json documentation for encoding+network
The compile options are key-value pairs in the options
field of the network specification.
The following options are available:
- overwrite_blocksize_inference [int]: overwrites the kernel block size for the network inference. If unspecified (or negative), the maximal size is used
- overwrite_blocksize_forward [int]: overwrites the kernel block size for the network forward kernel, i.e. inference with added storing of intermediate results for the backward kernels. If unspecified (or negative), the maximal size is used
- overwrite_blocksize_backward [int]: overwrites the kernel block size for the network backward kernel. If unspecified (or negative), the maximal size is used
- overwrite_blocksize_weight_update [int]: overwrites the kernel block size for the weight update kernels during the backward pass. If unspecified (or negative), the maximal size is used
- skew_shared_memory [bool]: If true, skew the shared memory. This reduces the bank conflicts but requires 1.5x the amount of shared memory. Default: false
- parallel_weight_update [bool]: If true, the weight update kernels are launched in separate channels, allowing for possible parallel execution. Default: true
Encodings:
- Identity
- 1D-6D Dense & Hash Grid
- Line Integration
- Spherical Harmonics
- Fourier Features
Activations:
- ReLU, CeLU, Sine, Identity
- Snake and other trigonometric ones
- sigmoid, ...
Network
- Fused forward evaluation
- Input Encoding + Network fusion
- Proper padding if input/output channels are not a multiple of 16
- Proper handling if the batch size is not a multiple of 16
- Gradients for the input
- Gradients for the weight matrices
- Gradients for the bias vector
QuickMLP is shipped under the permissive MIT license.
If you find bugs in the library, feel free to open an issue. I will continue to use this library in future projects and therefore continue to improve and extend this library. Of course, pull requests are more than welcome.