wolframrhodium / vapoursynth-bm3dcuda
BM3D denoising filter for VapourSynth, implemented in CUDA, AVX2, HIP and SYCL
License: GNU General Public License v2.0
Is it possible to get a HIP release or an OpenCL release for use with non-NVIDIA graphics cards?
It can sometimes improve denoising quality; please consider implementing it.
BTW, is there any technical difficulty in making it adjustable? There should be a similar value internally, right?
Greetings, thank you for this plugin.
I have noticed that this filter shows some form of nondeterminism, which unfortunately complicates my workflow. Is there any chance this could be looked into?
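One way to narrow down a report like this is to render the same frames more than once and compare hashes of the raw plane data. The following is only a minimal sketch: `render` is a hypothetical callable supplied by the user that returns the planes of frame `n` as bytes-like objects, and nothing here is part of the plugin or of VapourSynth.

```python
import hashlib


def frame_digest(planes):
    """Return a SHA-256 hex digest over the raw bytes of one frame's planes."""
    h = hashlib.sha256()
    for plane in planes:
        h.update(bytes(plane))
    return h.hexdigest()


def find_mismatches(render, frame_numbers, runs=2):
    """Render each frame `runs` times and return the frame numbers
    whose digests differ between runs.

    `render` is a hypothetical callable: render(n) -> iterable of
    bytes-like planes for frame n.
    """
    mismatches = []
    for n in frame_numbers:
        digests = {frame_digest(render(n)) for _ in range(runs)}
        if len(digests) > 1:
            mismatches.append(n)
    return mismatches
```

An empty result over a range of frames suggests the output is stable for those frames; a non-empty one gives concrete frame numbers to attach to the report.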
Linux 5.12.8
CUDA 11.3.0-2
Nvidia 465.31
GTX 1080
I just tried the RTC build and it crashes vsedit silently.
gdb says that it's a segfault.
Starting program: C:\vs-r52.1-port\vsedit.exe
[New Thread 17896.0x4fc8]
...
[New Thread 17896.0x2a78]
Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ffeb5796984 in ?? ()
The following script hangs for me on an RTX 2070 Super when block_step=3, 5, 6, or 7. It would seem that any value that's not a power of 2 is problematic:
import vapoursynth as vs
core = vs.get_core()
# Easier debug
core.num_threads = 1
clip = core.std.BlankClip(width=720, height=480, format=vs.YUV444PS)
# Radius doesn't seem to matter
bm3d_r = 1
# cudaStreamSynchronize stops returning when block_step=3, 5, 6, or 7 (1, 2, 4 and 8 are fine)
# fast=False for debug
vbasic = core.bm3dcuda.BM3D(clip, radius=bm3d_r, block_step=3, fast=False).bm3d.VAggregate(radius=bm3d_r, sample=1)
vbasic.set_output(0)
This same script works fine for all values 1-8 on a GTX 970 in the same machine.
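The power-of-two observation can be checked against the reported values with a few lines of Python. This is only a sketch that classifies the block_step values 1 to 8; it does not touch the plugin, and the helper name is made up for illustration.

```python
def is_power_of_two(n):
    """True if n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


# Values taken from the report above.
reported_ok = [1, 2, 4, 8]
reported_hang = [3, 5, 6, 7]

for step in range(1, 9):
    kind = "power of two" if is_power_of_two(step) else "not a power of two"
    status = "hangs" if step in reported_hang else "ok"
    print(f"block_step={step}: {kind}, reported {status}")
```

For the range 1 to 8, the values that are not powers of two are exactly the ones reported to hang.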
I'm new to CUDA but did manage to do some debugging. Hopefully it's useful.
When hung I see that the CPU is waiting on cudaStreamSynchronize() in BM3DGetFrame(); apparently waiting for the GPU to finish. Inspection with cuda-gdb shows that at least some of the CUDA threads are waiting on __shfl_xor_sync() at kernel.cu:459 while some others are waiting at kernel.cu:516 for __shfl_sync().
(cuda-gdb) where
#0 0x0000555559715ad0 in __cuda_sm70_shflsync_idx_p ()
#1 0x000055555976f860 in _INTERNAL_53_tmpxft_00005f71_00000000_11_kernel_compute_86_cpp1_ii_d1368b98::__shfl_sync (mask=4294967295,
var=705, srcLane=0, width=8) at /usr/local/cuda-11.4/bin/../targets/x86_64-linux/include/sm_30_intrinsics.hpp:373
#2 0x000055555977ce20 in bm3d<true, true, true><<<(60,160,1),(32,1,1)>>> (res=0x7ffef8000000, src=0x7ffefe000000, width=720, height=480,
stride=768, sigma=1.69411767, block_step=3, bm_range=12, _radius=4, ps_num=2, ps_range=6, sigma_u=0.318892747, sigma_v=0.0600268729,
extractor=8) at kernel.cu:516
...
(cuda-gdb) where
#0 0x0000555559753ad0 in __cuda_sm70_shflsync_bfly_p ()
#1 0x0000555559774500 in _INTERNAL_53_tmpxft_00005f71_00000000_11_kernel_compute_86_cpp1_ii_d1368b98::__shfl_xor_sync (mask=4294967295,
var=7.70065981e-07, laneMask=1, width=8) at /usr/local/cuda-11.4/bin/../targets/x86_64-linux/include/sm_30_intrinsics.hpp:449
#2 0x000055555977b530 in bm3d<true, true, true><<<(60,160,1),(32,1,1)>>> (res=0x7ffef8000000, src=0x7ffefe000000, width=720, height=480,
stride=768, sigma=1.69411767, block_step=3, bm_range=12, _radius=4, ps_num=2, ps_range=6, sigma_u=0.318892747, sigma_v=0.0600268729,
extractor=8) at kernel.cu:459
Experimentation reveals that compiling the CUDA kernel using arch=compute_60,code=sm_75 causes the problem to go away (as was recommended for diagnosis here https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/). The problem returns when arch is changed to compute_70. Additionally, modifying kernel.cu to use __activemask(), rather than the membermask variable, when __CUDA_ARCH__ >= 700 also causes the problem to stop.
Possibly related: I see in cuda-gdb that fewer than 32 threads appear to be active in the hung warps:
(cuda-gdb) info cuda warps
Wp Active Lanes Mask Divergent Lanes Mask Active Physical PC Kernel BlockIdx First Active ThreadIdx
Device 0 SM 19
0 0x00000000 0x00000000 n/a n/a n/a n/a
1 0x00ffffff 0xff000000 0x00000000000000d0 0 (58,12,0) (0,0,0)
* 2 0x00ffffff 0xff000000 0x00000000000000d0 0 (58,0,0) (0,0,0)
3 0x00000000 0x00000000 n/a n/a n/a n/a
4 0x00ffffff 0xff000000 0x00000000000000d0 0 (58,4,0) (0,0,0)
5 0x00000000 0x00000000 n/a n/a n/a n/a
6 0x00ffffff 0xff000000 0x00000000000000d0 0 (58,106,0) (0,0,0) :
But the membermask value in these threads is -1. It would seem there's a problem in the calculation of membermask but I've not yet worked out what the calculation should be.
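To make the masks in the cuda-gdb output easier to read, they can be decoded with a small Python sketch (not part of the plugin). It assumes the usual convention that bit i of a mask corresponds to lane i, and it uses the width=8 seen in the backtraces to split the warp into four 8-lane groups. The helper names are made up for illustration.

```python
WARP_SIZE = 32
GROUP_WIDTH = 8  # the width=8 argument seen in the __shfl_sync backtraces


def lanes(mask):
    """Lane indices whose bit is set in a 32-bit mask."""
    return [lane for lane in range(WARP_SIZE) if (mask >> lane) & 1]


def groups(mask, width=GROUP_WIDTH):
    """Indices of the width-lane groups that contain at least one set lane."""
    return sorted({lane // width for lane in lanes(mask)})


active = 0x00FFFFFF           # "Active Lanes Mask" from info cuda warps
divergent = 0xFF000000        # "Divergent Lanes Mask" from info cuda warps
membermask = -1 & 0xFFFFFFFF  # membermask == -1, viewed as unsigned 32-bit

print("active lanes:    ", lanes(active))
print("divergent lanes: ", lanes(divergent))
print("divergent groups:", groups(divergent))
print("membermask:      ", membermask, hex(membermask))
```

Under these assumptions, lanes 0 to 23 are active and lanes 24 to 31 (the fourth 8-lane group) are divergent, while membermask names all 32 lanes and equals the mask=4294967295 shown in the backtraces. That is consistent with the suspicion about the membermask calculation, but it does not by itself show what the correct value should be.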
Please let me know what other info or assistance I may provide in diagnosing this.
Thanks for your work on this project!
Config Info
NVIDIA Driver Version: 460.91.03
CUDA Version: 11.2
nvcc version: 11.4.100
Operating System: Linux Mint 19
Kernel: Linux 5.4.0-80-generic
Architecture: x86-64
VapourSynth Video Processing Library
Copyright (c) 2012-2020 Fredrik Mellbin
Core R52
API R3.6
Options: -
EDIT: using bm3d = core.bm3dsycl.BM3Dv2(tools.depth(clip, 32), device_id=0, fast=True, radius=2) instead worked.
I tried to compile and run the SYCL backend on Arch Linux, and it does not seem to work. My Intel GPU (Arc A380) runs at 100% compute load for a couple of seconds before spitting out this error:
Error: Failed to retrieve frame 0 with error: BM3D: Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) -1 (PI_ERROR_DEVICE_NOT_FOUND)
Output 12 frames in 14.05 seconds (0.85 fps)
The full PKGBUILD and patch I used will be linked at the bottom. I'm not sure whether I compiled it wrong, whether it's a bug, or whether this is a configuration issue with Arch.
To compile, I applied this patch:
diff --git a/sycl_source/CMakeLists.txt b/sycl_source/CMakeLists.txt
index 6a86aeb..5152460 100644
--- a/sycl_source/CMakeLists.txt
+++ b/sycl_source/CMakeLists.txt
@@ -6,6 +6,8 @@ endif()
project(BM3DSYCL LANGUAGES CXX)
+find_package(IntelDPCPP REQUIRED)
+
find_package(IntelSYCL REQUIRED CONFIG)
add_library(bm3dsycl SHARED source.cpp kernel.cpp)
Then I compiled with:
export PATH=/opt/intel/oneapi/compiler/2023.2.0/linux/bin:$PATH
export IntelSYCL_DIR=/opt/intel/oneapi/compiler/2023.2.0/linux/IntelSYCL
#clear cflags
unset CFLAGS
unset CXXFLAGS
unset LDFLAGS
unset LTOFLAGS
unset MAKEFLAGS
unset DEBUG_CFLAGS
unset DEBUG_CXXFLAGS
cmake -S "${_plug}" -B build \
-DCMAKE_BUILD_TYPE=None \
-DCMAKE_INSTALL_PREFIX=/usr \
-DCMAKE_INSTALL_LIBDIR=lib/vapoursynth \
-DCMAKE_SKIP_RPATH=ON \
-DVAPOURSYNTH_INCLUDE_DIRECTORY="$(pkg-config --cflags vapoursynth | sed 's|-I||g')" \
-DENABLE_HIP=Off \
-DENABLE_CPU=Off \
-DENABLE_SYCL=ON \
-DENABLE_CUDA=OFF \
-DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build
I then tried this script:
import vapoursynth as vs
import vstools as tools
core = vs.core
clip = core.lsmas.LWLibavSource(source="Pacific-rim.webm")
bm3d = core.bm3dsycl.BM3Dv2(tools.depth(clip, 32), None, .3, 4, 8, 1, device_id=1, fast=False)
bm3d.set_output(0)
Finally, I ran it and got this output:
vspipe -c y4m bm3d.vpy out.y4m
Error: Failed to retrieve frame 0 with error: BM3D: Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) -1 (PI_ERROR_DEVICE_NOT_FOUND)
Output 12 frames in 14.05 seconds (0.85 fps)
Misc:
/opt/intel/oneapi/dev-utilities/2021.9.0/bin/oneapi-cli version
v0.2.0-7-gab4eb7c822
When building from the AUR, I get this build error:
==> Starting build()...
-- The CXX compiler identification is GNU 11.1.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at /usr/share/cmake/Modules/CMakeDetermineCUDACompiler.cmake:179 (message):
Failed to find nvcc.
Compiler requires the CUDA toolkit. Please set the CUDAToolkit_ROOT
variable.
Call Stack (most recent call first):
source/CMakeLists.txt:3 (project)
This issue tracks information related to AviSynth+ support. Please discuss in the Doom9 thread.
With radius=0, I can pass an RGB clip and it gets converted to OPP format automatically. With radius=1, it fails. If I try to convert manually with RGB2OPP, it throws: "Python exception: BM3D_RTC: only constant format 32bit float input supported"
Is it possible to run in OPP format with radius=1?
I installed the plugin but it does not get detected; I tried all versions. I get:
There is no attribute or namespace named...
Or will HIP-capable GPUs be able to use it?
Your plugin is very fast, but only when BM3D is used on its own. When BM3D is used to build a mask for another denoiser (one that only has a CPU version), it is very slow because BM3D only supports 32-bit input. If you added support for bit depths lower than 32, I think it would be faster.
I have the core.bm3dhip plugin and it seems to cause this error when I close my previewer:
vspreview-rs: /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/rocclr/cmake/../thread/monitor.hpp:180: bool amd::Monitor::tryLock(): Assertion `thread != NULL && "cannot lock() from (null)"' failed.
Followed by a SIGABRT.
Whatever debug symbols I include, I haven't been able to get the trace from libbm3dhip.so, so all I know is this:
#6 0x00007fffa2a3ea62 in __hipUnregisterFatBinary () at /opt/rocm/lib/libamdhip64.so.5
#7 0x00007fffa42eeb02 in () at /usr/lib/vapoursynth/libbm3dhip.so
And it only happens when closing the previewer when a clip 1) requested a frame 2) didn't use core.bm3dhip.BM3D at all.
If I add in a BM3D filter in the graph, then there's no error.
My current workaround is just to remove the plugin when not in use.
During some simple tests, I noticed that higher values of block_step produce more blocking when sigma is high, and that block_step=8 causes some visible blocking even when sigma is relatively low (~5).
As mawen mentioned in this issue: HomeOfVapourSynthEvolution/VapourSynth-BM3D#18 (comment), using a smaller block_step in the final estimate reduces some artifacts (including blockiness).
Is it possible to do this in this implementation?
Sometimes, when using BM3DCUDA along with other CUDA filters, I run into the following error:
x265 [WARN]: detected ParallelFilter race condition on last row
I have read in a few threads on the web that it usually happens with CUDA.
For example, when using this script:
SetMemoryMax()
SetFilterMTMode("DEFAULT_MT_MODE", 2)
LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
DGSource("F:\In\Cowboy Bebop\26.dgi",ct=0,cb=0,cl=236,cr=236)
DGTelecide(mode=1, pthresh=3.5)
DGDecimate()
ConvertBits(32)
BM3D_CUDA(sigma=3, radius=2)
BM3D_VAggregate(radius=2)
fmtc_bitdepth (bits=10,dmode=8)
neo_f3kdb(range=15, Y=65, Cb=40, Cr=40, grainY=0, grainC=0, sample_mode=2, blur_first=true, dynamic_grain=false, mt=false, keep_tv_range=true)
Prefetch(8)
Is there anything that you can do?
I am unsure of the exact cause, but there is currently a major issue with bm3dhip. The nature of the artifacting varies, but I was able to replicate it with fast=True and fast=False, as well as with and without a reference clip. Tested on a 6900 XT with the latest AMD drivers. I believe it might be some sort of memory or cache issue, but I'm not sure. Simply outputting a video clip as a lossless file, I found numerous places where it broke, some lasting a single frame and some lasting several frames.
bm3d = core.bm3dhip.BM3Dv2(depth(source, 32), None, .3, 4, 8, 1, device_id=1, fast=True)