wolframrhodium / vapoursynth-bm3dcuda

BM3D denoising filter for VapourSynth, implemented in CUDA, AVX2, HIP and SYCL

License: GNU General Public License v2.0

Cuda 8.54% C++ 89.44% CMake 1.76% C 0.26%

vapoursynth-bm3dcuda's People

akarinvs blueswordm xusang77 jaded-encoding-thaumaturgy quietvoid quackdoc

vapoursynth-bm3dcuda's Issues

HIPify it/OpenCL release

Is it possible to get a HIP or OpenCL release for use with non-NVIDIA graphics cards?

Lacking group_size parameter

It can sometimes improve denoising quality, so please consider implementing it.
BTW, is there any technical difficulty in making it adjustable? There should be a similar value internally, right?

Deterministic output

Greetings, thank you for this plugin.

I have noticed that this filter has some form of nondeterminism, which unfortunately complicates my workflow. Is there any chance this could be looked into?

Linux 5.12.8
CUDA 11.3.0-2
Nvidia 465.31
GTX 1080
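
A report like this is easier to act on when it names the first frame that differs between two renders of the same script. The following is a minimal sketch, not part of the plugin: `frame_digest` and `first_mismatch` are hypothetical helper names, and it assumes each frame has already been dumped to a bytes-like object (for example by reading the raw output of two runs in frame-sized chunks).

```python
import hashlib
from itertools import zip_longest


def frame_digest(frame_bytes):
    """Return a SHA-256 hex digest of one frame's raw bytes."""
    return hashlib.sha256(frame_bytes).hexdigest()


def first_mismatch(run_a, run_b):
    """Return the index of the first frame that differs between two runs.

    run_a and run_b are iterables of bytes-like frames. A run that ends
    early counts as a mismatch at the first missing frame. Returns None
    when both runs are identical.
    """
    for index, (a, b) in enumerate(zip_longest(run_a, run_b)):
        if a is None or b is None:
            return index
        if frame_digest(a) != frame_digest(b):
            return index
    return None


if __name__ == "__main__":
    # Stand-in data; real input would be frames from two renders.
    print(first_mismatch([b"frame0", b"frame1"], [b"frame0", b"frameX"]))
```

Comparing digests rather than raw buffers also makes it cheap to keep a per-frame log from a known-good run and check later runs against it.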

Apparently RC3-rtc build crashes

I just tried the rtc build and it crashes vsedit silently.
gdb reports a segfault:

Starting program: C:\vs-r52.1-port\vsedit.exe
[New Thread 17896.0x4fc8]
...
[New Thread 17896.0x2a78]

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ffeb5796984 in ?? ()

BM3D Hangs For Non-power-of-2 block_step Values on RTX 2070 Super

The following script hangs for me on an RTX 2070 Super when block_step=3, 5, 6, or 7. It seems that any value that is not a power of 2 is problematic:

import vapoursynth as vs
core = vs.get_core()

# Easier debug
core.num_threads = 1

clip = core.std.BlankClip(width=720, height=480, format=vs.YUV444PS)

# Radius doesn't seem to matter
bm3d_r = 1

# cudaStreamSynchronize stops returning when block_step=3, 5, 6, or 7 (1, 2, 4 and 8 are fine)
# fast=False for debug
vbasic = core.bm3dcuda.BM3D(clip, radius=bm3d_r, block_step=3, fast=False).bm3d.VAggregate(radius=bm3d_r, sample=1)

vbasic.set_output(0)

This same script works fine for all values 1-8 on a GTX 970 in the same machine.

I'm new to CUDA but did manage to do some debugging. Hopefully it's useful.

When hung I see that the CPU is waiting on cudaStreamSynchronize() in BM3DGetFrame(); apparently waiting for the GPU to finish. Inspection with cuda-gdb shows that at least some of the CUDA threads are waiting on __shfl_xor_sync() at kernel.cu:459 while some others are waiting at kernel.cu:516 for __shfl_sync().

(cuda-gdb) where         
#0  0x0000555559715ad0 in __cuda_sm70_shflsync_idx_p ()
#1  0x000055555976f860 in _INTERNAL_53_tmpxft_00005f71_00000000_11_kernel_compute_86_cpp1_ii_d1368b98::__shfl_sync (mask=4294967295, 
    var=705, srcLane=0, width=8) at /usr/local/cuda-11.4/bin/../targets/x86_64-linux/include/sm_30_intrinsics.hpp:373
#2  0x000055555977ce20 in bm3d<true, true, true><<<(60,160,1),(32,1,1)>>> (res=0x7ffef8000000, src=0x7ffefe000000, width=720, height=480, 
    stride=768, sigma=1.69411767, block_step=3, bm_range=12, _radius=4, ps_num=2, ps_range=6, sigma_u=0.318892747, sigma_v=0.0600268729, 
    extractor=8) at kernel.cu:516
...
(cuda-gdb) where         
#0  0x0000555559753ad0 in __cuda_sm70_shflsync_bfly_p ()
#1  0x0000555559774500 in _INTERNAL_53_tmpxft_00005f71_00000000_11_kernel_compute_86_cpp1_ii_d1368b98::__shfl_xor_sync (mask=4294967295, 
    var=7.70065981e-07, laneMask=1, width=8) at /usr/local/cuda-11.4/bin/../targets/x86_64-linux/include/sm_30_intrinsics.hpp:449
#2  0x000055555977b530 in bm3d<true, true, true><<<(60,160,1),(32,1,1)>>> (res=0x7ffef8000000, src=0x7ffefe000000, width=720, height=480, 
    stride=768, sigma=1.69411767, block_step=3, bm_range=12, _radius=4, ps_num=2, ps_range=6, sigma_u=0.318892747, sigma_v=0.0600268729, 
    extractor=8) at kernel.cu:459

Experimentation reveals that compiling the CUDA kernel using arch=compute_60,code=sm_75 causes the problem to go away (as was recommended for diagnosis here https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/). The problem returns when arch is changed to compute_70. Additionally, modifying kernel.cu to use __activemask(), rather than the membermask variable, when __CUDA_ARCH__ >= 700 also causes the problem to stop.

Possibly related: I see in cuda-gdb that fewer than 32 threads appear to be active in the hung warps:

(cuda-gdb) info cuda warps
 Wp Active Lanes Mask Divergent Lanes Mask Active Physical PC Kernel   BlockIdx First Active ThreadIdx 
Device 0 SM 19
  0        0x00000000           0x00000000                n/a    n/a        n/a                    n/a 
  1        0x00ffffff           0xff000000 0x00000000000000d0      0  (58,12,0)                (0,0,0) 
* 2        0x00ffffff           0xff000000 0x00000000000000d0      0   (58,0,0)                (0,0,0) 
  3        0x00000000           0x00000000                n/a    n/a        n/a                    n/a 
  4        0x00ffffff           0xff000000 0x00000000000000d0      0   (58,4,0)                (0,0,0) 
  5        0x00000000           0x00000000                n/a    n/a        n/a                    n/a 
  6        0x00ffffff           0xff000000 0x00000000000000d0      0 (58,106,0)                (0,0,0) :

But the membermask value in these threads is -1. It would seem there's a problem in the calculation of membermask but I've not yet worked out what the calculation should be.
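
The mismatch can be made concrete with a little bit arithmetic on the two values reported above. This is only an illustrative sketch in Python, not the kernel's code; it assumes, as the NVIDIA blog post linked above describes for compute capability 7.0 and later, that every lane named in the mask passed to `__shfl_sync()` has to reach that same call. The helper `lanes` is a hypothetical name.

```python
FULL_WARP = 0xFFFFFFFF


def lanes(mask):
    """List the lane indices (0-31) whose bit is set in a warp mask."""
    return [lane for lane in range(32) if (mask >> lane) & 1]


# Values taken from the cuda-gdb session in this report.
active = 0x00FFFFFF          # "Active Lanes Mask" of a hung warp
membermask = -1 & FULL_WARP  # membermask == -1, viewed as 32-bit unsigned

# Lanes the shuffle is told to wait for, but which are not active.
absent = membermask & ~active & FULL_WARP

print(f"lanes named in membermask: {len(lanes(membermask))}")
print(f"lanes actually active:     {len(lanes(active))}")
print(f"named but not active:      {absent:#010x} -> lanes {lanes(absent)}")
```

The "named but not active" value works out to 0xff000000, the same as the Divergent Lanes Mask column: the mask names all 32 lanes while only lanes 0 to 23 (three groups of width 8) are executing the shuffle.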

Please let me know what other info or assistance I may provide in diagnosing this.

Thanks for your work on this project!

Config Info
NVIDIA Driver Version: 460.91.03
CUDA Version: 11.2
nvcc version: 11.4.100

Operating System: Linux Mint 19
Kernel: Linux 5.4.0-80-generic
Architecture: x86-64

VapourSynth Video Processing Library
Copyright (c) 2012-2020 Fredrik Mellbin
Core R52
API R3.6
Options: -

Bm3dSycl no devices found

EDIT: using bm3d = core.bm3dsycl.BM3Dv2(tools.depth(clip, 32), device_id=0, fast=True, radius=2) instead worked

I tried to compile and run the SYCL backend on Arch Linux, and it does not seem to work. My Intel GPU (Arc A380) runs at 100% compute load for a couple of seconds before the filter fails with this error:

Error: Failed to retrieve frame 0 with error: BM3D: Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) -1 (PI_ERROR_DEVICE_NOT_FOUND)
Output 12 frames in 14.05 seconds (0.85 fps)

The full PKGBUILD and patch I used are linked at the bottom. I'm not sure whether I compiled it wrong, whether it's a bug, or whether it's a configuration issue with Arch.

To compile, I applied this patch:

diff --git a/sycl_source/CMakeLists.txt b/sycl_source/CMakeLists.txt
index 6a86aeb..5152460 100644
--- a/sycl_source/CMakeLists.txt
+++ b/sycl_source/CMakeLists.txt
@@ -6,6 +6,8 @@ endif()
 
 project(BM3DSYCL LANGUAGES CXX)
 
+find_package(IntelDPCPP REQUIRED)
+
 find_package(IntelSYCL REQUIRED CONFIG)
 
 add_library(bm3dsycl SHARED source.cpp kernel.cpp)

Then I compiled with:

  export PATH=/opt/intel/oneapi/compiler/2023.2.0/linux/bin:$PATH
  export IntelSYCL_DIR=/opt/intel/oneapi/compiler/2023.2.0/linux/IntelSYCL

  #clear cflags 
  unset CFLAGS 
  unset CXXFLAGS
  unset LDFLAGS
  unset LTOFLAGS
  unset MAKEFLAGS
  unset DEBUG_CFLAGS
  unset DEBUG_CXXFLAGS

  cmake -S "${_plug}" -B build \
    -DCMAKE_BUILD_TYPE=None \
    -DCMAKE_INSTALL_PREFIX=/usr \
    -DCMAKE_INSTALL_LIBDIR=lib/vapoursynth \
    -DCMAKE_SKIP_RPATH=ON \
    -DVAPOURSYNTH_INCLUDE_DIRECTORY="$(pkg-config --cflags vapoursynth | sed 's|-I||g')" \
    -DENABLE_HIP=Off \
    -DENABLE_CPU=Off \
    -DENABLE_SYCL=ON \
    -DENABLE_CUDA=OFF \
    -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

  cmake --build build

I then tried this script:

import vapoursynth as vs
import vstools as tools
core = vs.core
clip = core.lsmas.LWLibavSource(source="Pacific-rim.webm")

bm3d = core.bm3dsycl.BM3Dv2(tools.depth(clip, 32), None, .3, 4, 8, 1, device_id=1, fast=False)

bm3d.set_output(0)

Finally, I ran it and got this output:

vspipe -c y4m bm3d.vpy out.y4m
Error: Failed to retrieve frame 0 with error: BM3D: Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) -1 (PI_ERROR_DEVICE_NOT_FOUND)
Output 12 frames in 14.05 seconds (0.85 fps)
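
Since the failing script passes device_id=1 while the call in the EDIT note uses device_id=0, one way to check which ids are usable is to probe them in order. The sketch below is a generic helper under stated assumptions, not something the plugin provides: `first_working_device` and `try_device` are hypothetical names, and `try_device` is whatever callable builds the filter for a given id and requests a frame (the error above is only raised at frame retrieval, so building the filter alone would not be enough).

```python
def first_working_device(try_device, device_ids=range(4)):
    """Try each device id in turn and return the first one that works.

    try_device(device_id) should raise an exception on failure.
    Returns (device_id, errors), where errors maps each failed id to
    its error message; device_id is None if every id failed.
    """
    errors = {}
    for device_id in device_ids:
        try:
            try_device(device_id)
        except Exception as exc:
            errors[device_id] = str(exc)
        else:
            return device_id, errors
    return None, errors


# Hypothetical use inside a VapourSynth script (assumes `core` and a
# 32-bit float `clip` already exist):
#
#   def try_device(device_id):
#       core.bm3dsycl.BM3Dv2(clip, device_id=device_id).get_frame(0)
#
#   device_id, errors = first_working_device(try_device)
```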

Misc:
/opt/intel/oneapi/dev-utilities/2021.9.0/bin/oneapi-cli version
v0.2.0-7-gab4eb7c822

PKGBUILD.txt
sycl-patch.txt

Linux Build Fails from AUR

When building from the AUR, I get this build error:

==> Starting build()...
-- The CXX compiler identification is GNU 11.1.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at /usr/share/cmake/Modules/CMakeDetermineCUDACompiler.cmake:179 (message):
  Failed to find nvcc.

  Compiler requires the CUDA toolkit.  Please set the CUDAToolkit_ROOT
  variable.
Call Stack (most recent call first):
  source/CMakeLists.txt:3 (project)
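
The message says CMake could not locate nvcc and suggests setting CUDAToolkit_ROOT. A quick way to see whether nvcc is installed but simply not on PATH is sketched below. This is an assumption-laden helper, not part of the build: `find_nvcc` and `toolkit_root` are hypothetical names, and the default directories `/opt/cuda/bin` and `/usr/local/cuda/bin` are only common install locations that may not match a given system.

```python
import os
import shutil


def find_nvcc(search_dirs=("/opt/cuda/bin", "/usr/local/cuda/bin")):
    """Return the path to nvcc, or None if it cannot be found.

    PATH is checked first; search_dirs are fallback locations
    (assumed defaults, adjust for the local install).
    """
    found = shutil.which("nvcc")
    if found:
        return found
    return shutil.which("nvcc", path=os.pathsep.join(search_dirs))


def toolkit_root(nvcc_path):
    """Derive the toolkit root from <root>/bin/nvcc."""
    return os.path.dirname(os.path.dirname(os.path.abspath(nvcc_path)))


if __name__ == "__main__":
    nvcc = find_nvcc()
    if nvcc is None:
        print("nvcc not found; the CUDA toolkit may not be installed")
    else:
        # The printed value could be passed to CMake as -DCUDAToolkit_ROOT=...
        print(toolkit_root(nvcc))
```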

AVS+ Support

This issue tracks information related to AviSynth+ support. Please discuss it in the doom9 thread.

RGB2OPP with radius?

With radius=0, I can pass an RGB clip and it is automatically converted to OPP format. With radius=1, it fails. If I convert manually with RGB2OPP, it throws: "Python exception: BM3D_RTC: only constant format 32bit float input supported"

Is it possible to run in OPP format with radius=1?

The plugin does not get detected

I installed the plugin but it is not detected. I tried all versions.
I get:

There is no attribute or namespace named...
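
That message is what VapourSynth's Python module raises when a namespace is not registered on the core, so a first step is to check which of the plugin's namespaces actually loaded. The sketch below is a minimal check under stated assumptions: `missing_namespaces` is a hypothetical helper, and the default names are the namespaces that appear elsewhere on this page (bm3dcuda, bm3dhip, bm3dsycl); the build actually installed may register a different one.

```python
def missing_namespaces(core, expected=("bm3dcuda", "bm3dhip", "bm3dsycl")):
    """Return the expected plugin namespaces that are absent from core.

    Relies on attribute lookup failing with AttributeError for an
    unknown namespace, which is what hasattr() tests for.
    """
    return [name for name in expected if not hasattr(core, name)]


# Hypothetical use in a script:
#
#   import vapoursynth as vs
#   print(missing_namespaces(vs.core))
```

If the expected namespace is listed as missing, the library was either not found in the plugin directory or failed to load, for example because a runtime dependency is absent.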

Is HIP support planned for cpu only?

Or will HIP-capable GPUs be able to use it?

Possible lower than 32 bit??

Your plugin is very fast, but only when BM3D is used on its own. When BM3D is used as a mask for another denoiser (one that only has a CPU version), it is very slow because BM3D only supports 32-bit input. If you added support for bit depths lower than 32, I think it would be faster.

Assertion error on VS destruction when BM3D is not called

I have the core.bm3dhip plugin and it seems to cause this error when I close my previewer:

vspreview-rs: /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/rocclr/cmake/../thread/monitor.hpp:180: bool amd::Monitor::tryLock(): Assertion `thread != NULL && "cannot lock() from (null)"' failed.

Followed by a SIGABRT.

Whatever debug symbols I include, I haven't been able to get the trace from libbm3dhip.so, so all I know is this:

#6  0x00007fffa2a3ea62 in __hipUnregisterFatBinary () at /opt/rocm/lib/libamdhip64.so.5
#7  0x00007fffa42eeb02 in  () at /usr/lib/vapoursynth/libbm3dhip.so

It only happens when I close the previewer after a clip 1) had a frame requested and 2) did not use core.bm3dhip.BM3D at all.
If I add in a BM3D filter in the graph, then there's no error.

My current workaround is just to remove the plugin when not in use.

Reducing blocking?

During some simple tests, I noticed that higher values of block_step produce more blocking when sigma is high, and that block_step=8 causes some visible blocking even when sigma is relatively low (~5).

As mawen mentioned in this issue, HomeOfVapourSynthEvolution/VapourSynth-BM3D#18 (comment), using a smaller block_step in the final estimate reduces some artifacts (including blockiness).

Is it possible to do this in this implementation?
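
For illustration, the idea of a coarser basic estimate followed by a finer final estimate can be sketched as two chained calls. This is a sketch under assumptions, not a confirmed answer from the author: `two_pass_bm3d` is a hypothetical wrapper, and it assumes the filter function accepts `ref`, `sigma` and `block_step` keyword arguments (the calls elsewhere on this page pass a reference clip as the second argument and `block_step` by keyword). The step values are placeholders, not recommendations.

```python
def two_pass_bm3d(bm3d, clip, sigma, basic_step=8, final_step=4):
    """Run a basic estimate, then a final estimate with a smaller step.

    bm3d is the filter function to call (passed in so this helper does
    not depend on a particular backend). The basic estimate is fed to
    the second call as the reference clip.
    """
    basic = bm3d(clip, sigma=sigma, block_step=basic_step)
    return bm3d(clip, ref=basic, sigma=sigma, block_step=final_step)


# Hypothetical use in a VapourSynth script (assumes `core` and a
# 32-bit float `clip` already exist):
#
#   denoised = two_pass_bm3d(core.bm3dcuda.BM3D, clip, sigma=5)
```

A smaller step in the second call costs more processing time, since more reference blocks are evaluated per frame.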

ParallelFilter race condition

Sometimes, when using BM3DCUDA along with other CUDA filters, I run into the following error:

x265 [WARN]: detected ParallelFilter race condition on last row

I have read, in a few scattered threads on the web, that this usually happens with CUDA.

For example, with this script:

SetMemoryMax()
SetFilterMTMode("DEFAULT_MT_MODE", 2)
LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
DGSource("F:\In\Cowboy Bebop\26.dgi",ct=0,cb=0,cl=236,cr=236)
DGTelecide(mode=1, pthresh=3.5)
DGDecimate()
ConvertBits(32)
BM3D_CUDA(sigma=3, radius=2)
BM3D_VAggregate(radius=2)
fmtc_bitdepth (bits=10,dmode=8)
neo_f3kdb(range=15, Y=65, Cb=40, Cr=40, grainY=0, grainC=0, sample_mode=2, blur_first=true, dynamic_grain=false, mt=false, keep_tv_range=true)
Prefetch(8)

Is there anything that you can do?

Sigma arg array type error with new avs+ ver

Error here:
https://forum.doom9.org/showthread.php?p=1951979#post1951979

New AviSynth+ version:
https://forum.doom9.org/showthread.php?p=1951911#post1951911

Issue with bm3dhip

I am unsure of the exact cause, but bm3dhip currently has a major issue: it produces artifacts. The nature of the artifacting varies, but I was able to reproduce it with fast=True and fast=False, as well as with and without a reference clip. Tested on a 6900 XT with the latest AMD drivers. I believe it might be some sort of memory or cache issue, but I'm not sure. Simply by writing a video clip out as a lossless file, I found numerous places where it broke, some a single frame long and some spanning several frames.

bm3d = core.bm3dhip.BM3Dv2(depth(source, 32), None, .3, 4, 8, 1, device_id=1, fast=True)