
vramfs's Introduction

vramfs

Unused RAM is wasted RAM, so why not put the idle VRAM in your graphics card to work?

vramfs is a utility that uses the FUSE library to create a file system in VRAM. The idea is much the same as a ramdisk, except that it uses the video RAM of a discrete graphics card to store files. It is not intended for serious use, but it actually works fairly well, especially now that consumer GPUs with 4GB or more of VRAM are available.

On the developer's system, continuous read performance is ~2.4 GB/s and write performance ~2.0 GB/s, which is about 1/3 of what is achievable with a ramdisk. That is already decent for a device not designed for large data transfers to the host, but future development should aim to get closer to the PCI-e bandwidth limits. See the benchmarks section for more info.

Requirements

  • Linux with kernel 2.6+
  • FUSE development files
  • A graphics card with support for OpenCL 1.2

Building

First, install the OpenCL driver for your graphics card and verify that it's recognized as an OpenCL device by running clinfo. Then install the libfuse3-dev package or build it from source. You will also need pkg-config and the OpenCL development files (opencl-dev, opencl-clhpp-headers or equivalent packages), with at least version 1.2 of the OpenCL headers.

Just run make to build vramfs.

If you want to debug with valgrind, you should compile with the minimal fake OpenCL implementation to avoid filling your screen with warnings caused by the OpenCL driver:

  • valgrind: make DEBUG=1

Mounting

Mount a disk by running bin/vramfs <mountdir> <size>. The mountdir can be any empty directory. The size is the disk size in bytes. For more information, run bin/vramfs without arguments.

The recommended maximum size of a vramdisk is 50% of your VRAM. If you go over that, your driver or system may become unstable because it has to start swapping. For example, webpages in Chrome will stop rendering properly.

If the disk has been inactive for a while, the graphics card will likely lower its memory clock, which means it'll take a second to get up to speed again.

Implementation

The FUSE library is used to implement vramfs as a user space file system. This eases development and makes working with APIs such as OpenCL straightforward.

Basic architecture

Architecture overview

When the program is started, it checks for an OpenCL capable GPU and attempts to allocate the specified amount of memory. Once the memory has been allocated, the root entry object is created and a global reference to it is stored.

FUSE then forwards calls like stat, readdir and write to the file system functions. These locate the entry through the root entry using the specified path, and the required operations are then performed on the entry object. If the entry is a file object, the operation may lead to OpenCL clEnqueueReadBuffer or clEnqueueWriteBuffer calls to manipulate the data.
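The lookup step can be modelled in plain C++. This is a simplified sketch, not the actual vramfs code: `dir_t` and `find` follow the names used in the file system section, while `resolve` and the reduced class bodies are assumptions for illustration.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>

// Minimal model of the entry tree: a directory maps names to child entries.
struct entry_t {
    virtual ~entry_t() = default;
};

struct dir_t : entry_t {
    std::unordered_map<std::string, std::shared_ptr<entry_t>> children;

    std::shared_ptr<entry_t> find(const std::string& name) const {
        auto it = children.find(name);
        return it == children.end() ? nullptr : it->second;
    }
};

// Hypothetical helper: walk a path like "/foo/bar" down from the root entry.
std::shared_ptr<entry_t> resolve(std::shared_ptr<dir_t> root,
                                 const std::string& path) {
    std::shared_ptr<entry_t> current = root;
    std::stringstream ss(path);
    std::string part;
    while (std::getline(ss, part, '/')) {
        if (part.empty()) continue;  // skip leading/duplicate slashes
        auto dir = std::dynamic_pointer_cast<dir_t>(current);
        if (!dir) return nullptr;    // path component inside a non-directory
        current = dir->find(part);
        if (!current) return nullptr;
    }
    return current;
}
```

Every callback resolves the path this way before operating on the resulting entry object.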

When a file is created or opened, a file_session object is created to store the reference to the file object and any other data that is persistent between an fopen and fclose call.

VRAM block allocation

OpenCL is used to allocate blocks of memory on the graphics card by creating buffer objects. When a new disk is mounted, a pool of disk size / block size buffers is created and initialised with zeros. That is not just good practice; with some OpenCL drivers it's also required to check that the VRAM for the block is actually available. Unfortunately Nvidia cards don't support OpenCL 1.2, which means the clEnqueueFillBuffer call has to be simulated by copying from a preallocated buffer filled with zeros. Somewhat interestingly, this doesn't seem to make a difference in performance on cards that support both.

Writes to blocks are generally asynchronous, whereas reads are synchronous. Luckily, OpenCL guarantees in-order execution of commands by default, which means reads of a block will wait for the writes to complete. OpenCL 1.1 is completely thread safe, so no special care is required when sending commands.

Block objects are managed using a shared_ptr so that they can automatically reinsert themselves into the pool on destruction.
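The pool mechanism can be sketched in plain C++ (a simplified model: `block_pool`, `buffer_t` and their members are hypothetical names, and the real code hands out OpenCL buffers rather than plain structs):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for an OpenCL buffer handle (the real code stores cl::Buffer).
struct buffer_t { int id; };

// Free-list of preallocated blocks; shared_ptr deleters return blocks here.
// The pool must outlive every handle it hands out.
class block_pool {
public:
    explicit block_pool(int count) {
        for (int i = 0; i < count; i++) free_.push_back(buffer_t{i});
    }

    // Hand out a block wrapped in a shared_ptr whose custom deleter
    // reinserts the block into the free list on destruction.
    std::shared_ptr<buffer_t> allocate() {
        if (free_.empty()) return nullptr;
        auto* raw = new buffer_t(free_.back());
        free_.pop_back();
        return std::shared_ptr<buffer_t>(raw, [this](buffer_t* b) {
            free_.push_back(*b);  // block returns to the pool
            delete b;
        });
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<buffer_t> free_;
};
```

With this pattern, a file dropping its last reference to a block automatically makes the VRAM available for other files, with no explicit free call.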

File system

The file system is a tree of entry_t objects with members for attributes like the parent directory, mode and access time. Each type of entry has its own subclass that derives from it: file_t, dir_t and symlink_t. The main file that implements all of the FUSE callbacks has a permanent reference to the root directory entry.

The file_t class contains extra write, read and size methods and manages the blocks to store the file data.

The dir_t class has an extra unordered_map that maps names to entry_t references for quick child lookup using its member function find.

Finally, the symlink_t class has an extra target string member that stores the target path of the symlink.

All of the entry objects are also managed using shared_ptr so that an object and its data (e.g. file blocks) are automatically deallocated when they're unlinked and no process holds a file handle to them anymore. This can also be used to easily implement hard links later on.

The classes use getter/setter functions to automatically update the access, modification and change times at the appropriate moment. For example, calling the children member function of dir_t changes the access time and change time of the directory.
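As a sketch of that pattern (simplified: the timestamp fields are reduced to three members, and the int stand-in for child entries is an assumption):

```cpp
#include <cassert>
#include <ctime>
#include <string>
#include <unordered_map>

// Accessors update the relevant POSIX timestamps as a side effect,
// so callers can never forget to.
class dir_t {
public:
    // Reading the child map counts as an access.
    const std::unordered_map<std::string, int>& children() {
        atime_ = std::time(nullptr);
        return children_;
    }

    // Mutating the child map updates modification and change times.
    void add_child(const std::string& name, int entry) {
        children_[name] = entry;
        mtime_ = ctime_ = std::time(nullptr);
    }

    std::time_t atime() const { return atime_; }
    std::time_t mtime() const { return mtime_; }

private:
    std::unordered_map<std::string, int> children_;
    std::time_t atime_ = 0, mtime_ = 0, ctime_ = 0;
};
```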

Thread safety

Unfortunately most of the operations are not thread safe, so all of the FUSE callbacks share a mutex to ensure that only one thread is mutating the file system at a time. The exceptions are read and write, which will temporarily release the lock while waiting for a read or write to complete.
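The pattern can be sketched with a std::mutex (a minimal model: `locked_read` and the byte counting are hypothetical, and the real callbacks operate on entry objects and OpenCL buffers):

```cpp
#include <cassert>
#include <mutex>

std::mutex fs_mutex;  // shared by all FUSE callbacks in this sketch

// Model of the read/write pattern: hold the lock while touching file
// system metadata, release it while waiting on a slow transfer, then
// re-acquire it before updating metadata again.
int locked_read(int& bytes_done) {
    std::unique_lock<std::mutex> lock(fs_mutex);
    // ... look up the entry and compute which blocks to read ...
    lock.unlock();       // let other callbacks run during the transfer
    bytes_done += 4096;  // stand-in for the blocking OpenCL read
    lock.lock();         // re-acquire before updating metadata
    // ... update access time and file session state ...
    return bytes_done;
}
```

Releasing the lock during the transfer is what lets concurrent reads and writes on different files overlap despite the single global mutex.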

Benchmarks

The system used for testing has the following specifications:

  • OS: Ubuntu 14.04.1 LTS (64 bit)
  • CPU: Intel Core i5-2500K @ 4.0 GHz
  • RAM: 8GB DDR3-1600
  • GPU: AMD R9 290 4GB (Sapphire Tri-X)

Performance of continuous read, write and write+sync has been measured for different block allocation sizes by creating a new 2GiB disk for each new size and reading/writing a 2GiB file.

The disk is created using:

bin/vramfs /tmp/vram 2G

And the file is written and read using the dd command:

# write
dd if=/dev/zero of=/tmp/vram/test bs=128K count=16000

# write+sync
dd if=/dev/zero of=/tmp/vram/test bs=128K count=16000 conv=fdatasync

# read
dd if=/tmp/vram/test of=/dev/null bs=128K count=16000

These commands were repeated 5 times for each block size and then averaged to produce the results shown in the graph. No block sizes lower than 32KiB could be tested because the driver would fail to allocate that many OpenCL buffers. This may be solved in the future by using subbuffers.
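The repeated runs can be scripted. Below is a hypothetical harness (assuming GNU dd and awk) that prints the throughput of each run; it writes to a scratch directory with a small count so it can be tried without a mounted vramdisk, so substitute the vramfs mount point and the full count for real measurements.

```shell
# Repeat the write+sync test 5 times and print the throughput dd reports.
target=$(mktemp -d)   # stand-in for the vramfs mount point
for i in 1 2 3 4 5; do
    dd if=/dev/zero of="$target/test" bs=128K count=64 conv=fdatasync 2>&1 |
        awk '/copied/ { print $(NF-1), $NF }'
done
rm -rf "$target"
```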

Performance for different block sizes

Although 128KiB blocks offer the highest performance, 64KiB may be preferable because of the lower space overhead.

Future ideas

  • Implement RAID-0 for SLI/Crossfire setups

License

The MIT License (MIT)

Copyright (c) 2014 Alexander Overvoorde

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to
deal in the Software without restriction, including without limitation the
rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN THE SOFTWARE.

vramfs's People

Contributors

atrate, bisqwit, genuineaster, imatrisciano, oblomov, overv, ric96, ympavlov


vramfs's Issues

Access from GPU kernel

This project is really cool, I like the idea to use as much as possible fast resources.

If I understand it correctly, if I write an implementation to access the file directly from OpenCL or CUDA, I will still need to go through the host since it uses FUSE kernel integration, correct? So that doesn't make much sense.

Is there any way, or an easy extension, that would make the file accessible directly from the GPU? Maybe it's not even a good idea at all, it just came to my head.

Thanks for discussion.

Ladislav

Ubuntu 22.04 and permissions

  1. On Ubuntu 22.04 it is not possible to load vramfs without elevation. You have to raise the limits in "/etc/security/limits.conf", e.g.:
User hard memlock unlimited
User soft memlock unlimited
User hard rtprio unlimited
User soft rtprio unlimited

Where "User" is the name of your user account.

  2. This is a suggestion. The FUSE directory is not accessible by other users. Therefore you run into problems if you start vramfs without elevation and a tool runs under the root account in this directory (or the other way round). In this situation it is helpful to add
fuse_opt_add_arg(&args, "-oallow_other");

at the end of vramfs.cpp. In "/etc/fuse.conf" you have to remove the "#" before "user_allow_other". After this the vramfs directory is accessible by everyone. Maybe there are security concerns, but on a private computer I see no problems.

Resizable bar

Does enabling resizable bar impact performance?
Can the support for it be added?

Multiple process usage

Is this thread-safe in itself? What happens if the OS writes/reads myfile.txt while I have a process also reading/writing myfile.txt?

What if different files are accessed concurrently?

Do we get the graphics card's asynchronous features (like using PCI-e in both directions, reading and writing concurrently)?

I have 2x K420 cards and a GT1030 card. Can I use 3 instances and somehow join them all for 6GB (2GB each) of memory without losing thread safety? Can I have 2-3 instances on 1 graphics card to get more threading?

Use as SWAP

Was wondering if it could be possible to host a swap partition within vramfs or somehow patch vramfs to make it work as a swap partition?

My drive is encrypted, therefore I don't use swap partitions... but if this thing could give me 3GB or so of a swap-like fs, we could be onto something...

Do you think it could work without FUSE, natively?

Oh, and great idea behind vramfs, really neat!

High memory usage

I'm observing about 6.59% RAM usage by the vramfs process on a 16 GiB system. I'm allocating 8 GiB of VRAM. Is this normal?

Can allocate more memory than my GPU has

I have a Lenovo ideapad 720-15IKB with 16 GB of internal storage and a Radeon RX 560 with 4 GB of dedicated GDDR5 RAM.

I'm using amdgpu and opencl-mesa as drivers.

I installed vramfs on Manjaro via the AUR.

Strangely, I can allocate more than 4GB of storage to a ramdisk. htop shows the elevated RAM usage even though it shouldn't show GPU RAM usage.

When I allocate less than 5GB to the ramdisk, elevated RAM usage is still shown in htop. Is this expected behavior?

error: could not allocate more than 0 bytes

Hi, what does that mean?

$ bin/vramfs vram1/ 2G
allocating vram...
error: could not allocate more than 0 bytes
cleaning up...

$ bin/vramfs
usage: vramfs <mountdir> <size> [-d <device>] [-f]

  mountdir    - directory to mount file system, must be empty
  size        - size of the disk in bytes
  -d <device> - specifies identifier of device to use
  -f          - flag that forces mounting, with a smaller size if needed

The size may be followed by one of the following multiplicative suffixes: K=1024, KB=1000, M=1024*1024, MB=1000*1000, G=1024*1024*1024, GB=1000*1000*1000. It's rounded up to the nearest multiple of the block size.

device list:
0: GeForce GT 730

File permissions lost when copying files or when unpacking tarballs

The vramfs is mounted to ~/.cache/vramfs

$ ls -l test.sh
-rwxr-xr-x 1 user user 32 14. Feb 11:31 test.sh*
$ cp test.sh ~/.cache/vramfs
$ ls -l ~/.cache/vramfs/
-rw-r--r-- 1 user user 32 14. Feb 12:03 test.sh
$ mkdir ~/.cache/testdir
$ cp test.sh ~/.cache/testdir
$ ls -l ~/.cache/testdir/
-rwxr-xr-x 1 user user 32 14. Feb 12:05 test.sh*

cp -p preserves the permissions, but why do they get lost in the first place? I wanted to do some testing on how fast a vramfs would be as a build directory compared to an SSD and a tmpfs in real RAM.

I'm using an RX480 with 8 GiB GDDR5, on an up-to-date arch linux using the opencl-amdgpu-pro-orca opencl driver.

Unrelated testing results you might find interesting, and I don't know where else to put them:
Interestingly, I'm getting really good write speed and random write IOPS on the drive, but the read speed isn't great (around 1GiB/s using 128k blocks) and especially the random read IOPS is atrocious, worse than my 6 year old sata SSD.

command to test:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randread

Result:
4k blocksize read: IOPS=19.2k, bw=74.9MiB/s
128k blocksize read: IOPS=4354, BW=544MiB/s
4k blocksize write: IOPS=73.5k, BW=287MiB/s
128k blocksize write: IOPS=12.2k, BW=1521MiB/s

and on my SSD:
4k blocksize read: IOPS=75.4k, BW=294MiB/s
128k blocksize read: IOPS=4145, BW=518MiB/s
4k blocksize write: IOPS=6184, BW=24.2MiB/s
128k blocksize write: IOPS=609, BW=76.2MiB/s

Using the basic dd test I also get worse read than write speed.

Truncate

Hi,
Nice work, I never used FUSE before. Anyway, the truncate function in vramfs.cpp does:

file->size(size);

And it's nice for this function to delegate to the class. Anyway, the size() function in the file class:

    void file_t::size(size_t new_size) {
        if (new_size < _size) {
            free_blocks(new_size);
        }

        _size = new_size;

        mtime(util::time());
    }

Suppose you are using the standard shell command truncate on a new file or to increase the size of an existing one: since you only check whether the new size is less than the _size member field, you end up with a new or enlarged "truncated" file without the correct blocks allocated. Then, when using the file again, you get a segfault. Try it yourself.
Would a solution be to write a while loop that allocates all the needed blocks with the alloc_block function?
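A sketch of that suggested fix (plain C++ with the block bookkeeping mocked out as a vector; the real file_t manages OpenCL buffers, and alloc_block's failure handling here is an assumption):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t block_size = 128 * 1024;

// Mock of the file's block list; the real file_t stores OpenCL buffers.
struct file_t {
    std::vector<int> blocks;
    std::size_t _size = 0;

    bool alloc_block() { blocks.push_back(0); return true; }  // mock
    void free_blocks(std::size_t new_size) {
        blocks.resize((new_size + block_size - 1) / block_size);
    }

    // Suggested fix: also allocate blocks when the file grows.
    void size(std::size_t new_size) {
        if (new_size < _size) {
            free_blocks(new_size);
        } else {
            while (blocks.size() * block_size < new_size) {
                if (!alloc_block()) break;  // out of VRAM
            }
        }
        _size = new_size;
    }
};
```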

Regards Massimo.

Vramfs gets stuck

When I run "vramfs /tmp/vram 3000MB -f", the output of the program is:
allocating vram...
mounted.
but the application never ends; it's stuck there forever and I have to terminate it with Ctrl+C.
And the VRAM doesn't get allocated.

I am using a vega 56 on arch linux with opencl-amd installed.

Does not compile

g++ -Wall -Wpedantic -Werror -std=c++11 -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse  -I include/ -march=native -O2 -flto -c -o build/memory.o src/memory.cpp
src/memory.cpp: In function 'int vram::memory::clear_buffer(const cl::Buffer&)':
src/memory.cpp:20:30: error: 'class cl::CommandQueue' has no member named 'enqueueFillBuffer'
                 return queue.enqueueFillBuffer(buf, 0, 0, block::size, nullptr, nullptr);
                              ^
src/memory.cpp: In function 'bool vram::memory::init_opencl()':
src/memory.cpp:42:35: error: 'getPlatformVersion' is not a member of 'cl::detail'
                 cl_uint version = cl::detail::getPlatformVersion(platform());
                                   ^
src/memory.cpp: In function 'int vram::memory::clear_buffer(const cl::Buffer&)':
src/memory.cpp:23:9: error: control reaches end of non-void function [-Werror=return-type]
         }
         ^
cc1plus: all warnings being treated as errors
Makefile:18: recipe for target 'build/memory.o' failed
make: *** [build/memory.o] Error 1

Distro: Arch Linux
opencl-headers ver.: 2:1.1.20110526-1
libcl ver.: 1.1-4

OpenCL 1.1 doesn't seem to contain enqueueFillBuffer (see https://www.khronos.org/files/opencl-1-1-quick-reference-card.pdf).
cl::detail looks like it doesn't exist at all (see https://www.khronos.org/files/OpenCLPP12-reference-card.pdf).
Resetting to commit 1bdcab1 makes it work again.

My guess at the problem: @Oblomov's changes require OpenCL 1.2 headers even though they work with older OpenCL implementations.

100% CPU usage

I got vramfs working nicely, but it continuously uses 100% CPU (one core) after I write something to the filesystem.

Here is a stack trace of the running thread:

#0  0x00007ffff6bb3a87 in sched_yield () from /usr/lib/libc.so.6
#1  0x00007ffff589a15f in ?? () from /usr/lib/libnvidia-opencl.so.1
#2  0x00007ffff59a4ca0 in ?? () from /usr/lib/libnvidia-opencl.so.1
#3  0x00007ffff6e9408c in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007ffff6bcbe7f in clone () from /usr/lib/libc.so.6

No suitable devices found.

I'm having this output

usage: vramfs <mountdir> <size> [-d <device>] [-f]

  mountdir    - directory to mount file system, must be empty
  size        - size of the disk in bytes
  -d <device> - specifies identifier of device to use
  -f          - flag that forces mounting, with a smaller size if needed

The size may be followed by one of the following multiplicative suffixes: K=1024, KB=1000, M=1024*1024, MB=1000*1000, G=1024*1024*1024, GB=1000*1000*1000. It's rounded up to the nearest multiple of the block size.

No suitable devices found.

Output of lspci -vvv

...
00:02.0 VGA compatible controller: Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 0b) (prog-if 00 [VGA controller])
	Subsystem: Acer Incorporated [ALI] Device 0866
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 46
	Region 0: Memory at b0000000 (64-bit, non-prefetchable) [size=4M]
	Region 2: Memory at a0000000 (64-bit, prefetchable) [size=256M]
	Region 4: I/O ports at 4000 [size=64]
	Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: i915
	Kernel modules: i915
...

terminate called after throwing an instance of 'std::regex_error'

/vramfs# make
g++ -Wall -Wpedantic -Werror -std=c++11 -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -I include/ -march=native -O2 -flto -c -o build/util.o src/util.cpp
g++ -Wall -Wpedantic -Werror -std=c++11 -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -I include/ -march=native -O2 -flto -c -o build/memory.o src/memory.cpp
g++ -Wall -Wpedantic -Werror -std=c++11 -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -I include/ -march=native -O2 -flto -c -o build/entry.o src/entry.cpp
g++ -Wall -Wpedantic -Werror -std=c++11 -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -I include/ -march=native -O2 -flto -c -o build/file.o src/file.cpp
g++ -Wall -Wpedantic -Werror -std=c++11 -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -I include/ -march=native -O2 -flto -c -o build/dir.o src/dir.cpp
g++ -Wall -Wpedantic -Werror -std=c++11 -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -I include/ -march=native -O2 -flto -c -o build/symlink.o src/symlink.cpp
g++ -Wall -Wpedantic -Werror -std=c++11 -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -I include/ -march=native -O2 -flto -c -o build/vramfs.o src/vramfs.cpp
g++ -o bin/vramfs build/util.o build/memory.o build/entry.o build/file.o build/dir.o build/symlink.o build/vramfs.o -pthread -lfuse -l OpenCL

root@ubuntu:~/vramfs# bin/vramfs /tmp/vram 2G
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error
Aborted

Crash on attempt to create FS

           PID: 7356 (vramfs)
           UID: 1000 (krutonium)
           GID: 1000 (krutonium)
        Signal: 11 (SEGV)
     Timestamp: Sat 2020-04-11 19:12:38 EDT (7min ago)
  Command Line: vramfs /tmp/vram 2G
    Executable: /usr/bin/vramfs
 Control Group: /user.slice/user-1000.slice/[email protected]/apps.slice/apps-org.gnome.Terminal.slice/vte-spawn-ee2e31b6-bcf7-4c0b-b57b-329fa7f74056.scope
          Unit: [email protected]
     User Unit: vte-spawn-ee2e31b6-bcf7-4c0b-b57b-329fa7f74056.scope
         Slice: user-1000.slice
     Owner UID: 1000 (krutonium)
       Boot ID: 215d1126f3114ff89e94226d040712f1
    Machine ID: 0de0781cf270482390fdc3f23193e730
      Hostname: GamingPC
       Storage: /var/lib/systemd/coredump/core.vramfs.1000.215d1126f3114ff89e94226d040712f1.7356.1586646758000000000000.lz4
       Message: Process 7356 (vramfs) of user 1000 dumped core.
                
                Stack trace of thread 7356:
                #0  0x00007f4d9e8b0c4d n/a (pipe_radeonsi.so + 0x18ec4d)
                #1  0x00007f4d9e8b14f3 n/a (pipe_radeonsi.so + 0x18f4f3)
                #2  0x00007f4d9e8a7cea n/a (pipe_radeonsi.so + 0x185cea)
                #3  0x00007f4da7c12303 n/a (libMesaOpenCL.so.1 + 0x50303)
                #4  0x00007f4da7c04ca3 n/a (libMesaOpenCL.so.1 + 0x42ca3)
                #5  0x00007f4da7c05421 n/a (libMesaOpenCL.so.1 + 0x43421)
                #6  0x00007f4da7c02235 n/a (libMesaOpenCL.so.1 + 0x40235)
                #7  0x00007f4da7c003f2 n/a (libMesaOpenCL.so.1 + 0x3e3f2)
                #8  0x00007f4da820ab7e clEnqueueCopyBuffer (libOpenCL.so.1 + 0xab7e)
                #9  0x00005605b4c8b294 n/a (vramfs + 0x1f294)
                #10 0x00005605b4c76028 n/a (vramfs + 0xa028)
                #11 0x00007f4da7cf4023 __libc_start_main (libc.so.6 + 0x27023)
                #12 0x00005605b4c7690e n/a (vramfs + 0xa90e)
                
                Stack trace of thread 7364:
                #0  0x00007f4da7ea4cf5 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfcf5)
                #1  0x00007f4d9e805cec n/a (pipe_radeonsi.so + 0xe3cec)
                #2  0x00007f4d9e8058e8 n/a (pipe_radeonsi.so + 0xe38e8)
                #3  0x00007f4da7e9e46f start_thread (libpthread.so.0 + 0x946f)
                #4  0x00007f4da7dcc3d3 __clone (libc.so.6 + 0xff3d3)
                
                Stack trace of thread 7357:
                #0  0x00007f4da7ea4cf5 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfcf5)
                #1  0x00007f4d9e805cec n/a (pipe_radeonsi.so + 0xe3cec)
                #2  0x00007f4d9e8058e8 n/a (pipe_radeonsi.so + 0xe38e8)
                #3  0x00007f4da7e9e46f start_thread (libpthread.so.0 + 0x946f)
                #4  0x00007f4da7dcc3d3 __clone (libc.so.6 + 0xff3d3)
                
                Stack trace of thread 7363:
                #0  0x00007f4da7ea4cf5 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfcf5)
                #1  0x00007f4d9e805cec n/a (pipe_radeonsi.so + 0xe3cec)
                #2  0x00007f4d9e8058e8 n/a (pipe_radeonsi.so + 0xe38e8)
                #3  0x00007f4da7e9e46f start_thread (libpthread.so.0 + 0x946f)
                #4  0x00007f4da7dcc3d3 __clone (libc.so.6 + 0xff3d3)
                
                Stack trace of thread 7365:
                #0  0x00007f4da7ea4cf5 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfcf5)
                #1  0x00007f4d9e805cec n/a (pipe_radeonsi.so + 0xe3cec)
                #2  0x00007f4d9e8058e8 n/a (pipe_radeonsi.so + 0xe38e8)
                #3  0x00007f4da7e9e46f start_thread (libpthread.so.0 + 0x946f)
                #4  0x00007f4da7dcc3d3 __clone (libc.so.6 + 0xff3d3)
                
                Stack trace of thread 7367:
                #0  0x00007f4da7ea4cf5 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfcf5)
                #1  0x00007f4d9e805cec n/a (pipe_radeonsi.so + 0xe3cec)
                #2  0x00007f4d9e8058e8 n/a (pipe_radeonsi.so + 0xe38e8)
                #3  0x00007f4da7e9e46f start_thread (libpthread.so.0 + 0x946f)
                #4  0x00007f4da7dcc3d3 __clone (libc.so.6 + 0xff3d3)
                
                Stack trace of thread 7366:
                #0  0x00007f4da7ea4cf5 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfcf5)

make fails

  1. Output of make:
david@Ubuntu-Main:~/vramfs$ make
g++ -Wall -Wpedantic -Werror -std=c++11 -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -I include/ -march=native -O2 -flto -c -o build/util.o src/util.cpp
g++ -Wall -Wpedantic -Werror -std=c++11 -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -I include/ -march=native -O2 -flto -c -o build/memory.o src/memory.cpp
In file included from src/memory.cpp:1:0:
include/memory.hpp:14:14: fatal error: CL/cl2.hpp: No existe el archivo o el directorio
     #include <CL/cl2.hpp>
  2. Output of make DEBUG=1:
david@Ubuntu-Main:~/vramfs$ make DEBUG=1
g++ -Wall -Wpedantic -Werror -std=c++11 -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -I include/ -g -DDEBUG -Wall -Werror -std=c++11 -c -o build/memory.o src/memory.cpp
src/memory.cpp: In function ‘int vram::memory::clear_buffer(const cl::Buffer&)’:
src/memory.cpp:24:61: error: binding reference of type ‘cl::Buffer&’ to ‘const cl::Buffer’ discards qualifiers
                 return queue.enqueueCopyBuffer(zero_buffer, buf, 0, 0, block::size, nullptr, nullptr);
                                                             ^~~
In file included from include/memory.hpp:12:0,
                 from src/memory.cpp:1:
include/CL/debugcl.hpp:80:13: note:   initializing argument 2 of ‘int cl::CommandQueue::enqueueCopyBuffer(const cl::Buffer&, cl::Buffer&, int, int, int, std::vector<cl::Event>*, cl::Event*)’
         int enqueueCopyBuffer(const Buffer& src, Buffer& dst, int offSrc, int offDst, int size, std::vector<cl::Event>* events, cl::Event* event) {
             ^~~~~~~~~~~~~~~~~
src/memory.cpp: In function ‘bool vram::memory::init_opencl()’:
src/memory.cpp:44:17: error: ‘cl_uint’ was not declared in this scope
                 cl_uint version = cl::detail::getPlatformVersion(platform());
                 ^~~~~~~
src/memory.cpp:44:17: note: suggested alternative: ‘cl_int’
                 cl_uint version = cl::detail::getPlatformVersion(platform());
                 ^~~~~~~
                 cl_int
src/memory.cpp:46:21: error: ‘version’ was not declared in this scope
                 if (version >= (1 << 16 | 2))
                     ^~~~~~~
src/memory.cpp:52:55: error: ‘CL_MEM_READ_ONLY’ was not declared in this scope
                     zero_buffer = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, block::size, zero_data, &r);
                                                       ^~~~~~~~~~~~~~~~
src/memory.cpp:52:55: note: suggested alternative: ‘CL_MEM_READ_WRITE’
                     zero_buffer = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, block::size, zero_data, &r);
                                                       ^~~~~~~~~~~~~~~~
                                                       CL_MEM_READ_WRITE
src/memory.cpp:52:74: error: ‘CL_MEM_COPY_HOST_PTR’ was not declared in this scope
                     zero_buffer = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, block::size, zero_data, &r);
                                                                          ^~~~~~~~~~~~~~~~~~~~
src/memory.cpp: In function ‘std::vector<std::__cxx11::basic_string<char> > vram::memory::list_devices()’:
src/memory.cpp:87:51: error: ‘class cl::Device’ has no member named ‘getInfo’
                     device_names.push_back(device.getInfo<CL_DEVICE_NAME>());
                                                   ^~~~~~~
src/memory.cpp:87:59: error: ‘CL_DEVICE_NAME’ was not declared in this scope
                     device_names.push_back(device.getInfo<CL_DEVICE_NAME>());
                                                           ^~~~~~~~~~~~~~
src/memory.cpp:87:59: note: suggested alternative: ‘CL_DEVICE_TYPE_GPU’
                     device_names.push_back(device.getInfo<CL_DEVICE_NAME>());
                                                           ^~~~~~~~~~~~~~
                                                           CL_DEVICE_TYPE_GPU
src/memory.cpp:87:75: error: expected primary-expression before ‘)’ token
                     device_names.push_back(device.getInfo<CL_DEVICE_NAME>());
                                                                           ^
Makefile:18: recipe for target 'build/memory.o' failed
make: *** [build/memory.o] Error 1
  1. Installed ocl-icd-opencl-dev (2.2.11-1ubuntu1) & libfuse-dev (2.9.7-1ubuntu1)

  2. OS: Ubuntu 18.04

  3. GPU: GTX 970

  4. GPU drivers: Nvidia 390.116

  5. Output of clinfo:

david@Ubuntu-Main:~/vramfs$ clinfo
Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 1.2 CUDA 9.1.84
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
  Platform Extensions function suffix             NV

  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     GeForce GTX 970
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  390.116
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 01:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               13
  Max clock frequency                             1329MHz
  Compute Capability (NV)                         5.2
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              4234018816 (3.943GiB)
  Error Correction support                        No
  Max memory allocation                           1058504704 (1009MiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        212992 (208KiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             4096x4096x4096 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 Yes
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  NVIDIA CUDA
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [NV]
  clCreateContext(NULL, ...) [default]            Success [NV]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1
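One lead worth checking against the environment above: the README asks for `libfuse3-dev` and OpenCL development headers with the C++ bindings (`opencl-clhpp-headers` or equivalent), while the report lists `libfuse-dev` 2.9.7, which is FUSE 2.x. Since `build/memory.o` is the object that compiles the OpenCL side, a missing OpenCL C++ header is another plausible cause. A minimal sketch to verify both prerequisites (the `/usr/include/CL/cl2.hpp` path is the usual Debian/Ubuntu install location and is an assumption here):

```shell
# Hedged checks for the two build prerequisites the README names.
# 1) FUSE 3 development files (libfuse-dev provides only FUSE 2.x):
pkg-config --modversion fuse3 \
  || echo "FUSE 3 dev files not found (README asks for libfuse3-dev)"
# 2) OpenCL C++ bindings header (assumed path from opencl-clhpp-headers):
ls /usr/include/CL/cl2.hpp 2>/dev/null \
  || echo "OpenCL C++ headers not found (README asks for opencl-clhpp-headers)"
```

If either check reports a missing component, installing the corresponding package and re-running `make` would rule that cause in or out before digging into the compiler error itself.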
