
raytracinginoneweekendincuda's Introduction

Ray Tracing in One Weekend in CUDA

This is yet another Ray Tracing in One Weekend clone, this time implemented in CUDA rather than plain C++. CUDA can speed the code up considerably: on my machine the C++ code renders the test image in 90 seconds, while the CUDA-accelerated code renders it in about 7 seconds.

Initial coding started in May 2018, and the result was posted to the NVIDIA Developer blog on November 5, 2018: https://devblogs.nvidia.com/accelerated-ray-tracing-cuda/

Background

Peter Shirley has written a few ebooks about ray tracing. You can find out more at http://in1weekend.blogspot.com/2016/01/ray-tracing-in-one-weekend.html. Note that as of April 2018 the books are pay-what-you-wish, and 50% of the proceeds go to not-for-profit programming education organizations. They are also available for $3 each on Amazon as Kindle downloads.

This repository contains code for converting the first ebook, "Ray Tracing in One Weekend," from C++ to CUDA. Depending on your CPU and GPU, the switch to CUDA can yield speedups of 10x or more! UPDATE: see Issue #2 for a further 2x improvement!

Before coding the ray tracer in CUDA, I recommend that you code the ray tracer in C++ first. You should understand the concepts well in a serial language, then translate that knowledge to CUDA. In fact, since CUDA uses C++, much of your code can be reused.

The C++ code that this repository is based on is at https://github.com/petershirley/raytracinginoneweekend. As of January 2020, the book and code are being updated and improved at https://github.com/RayTracing/raytracing.github.io/. This repository has not been updated to match those changes; the code matches the original 2016 book, which you can still download from http://in1weekend.blogspot.com/2016/01/ray-tracing-in-one-weekend.html. Further, I am basing this on the repo at https://github.com/pfranz/raytracinginoneweekend, which has each chapter as a separate git branch. This is very handy for checking out the code at each chapter.

Chapter List

Here are links to the git branch for each chapter. If you look at each branch's README.md you'll see some hints about what needed to be done. See the Makefile for the standard targets. Note that you'll want to adjust the GENCODE_FLAGS in the CUDA Makefiles for your specific graphics card architecture.

The master branch has the code as Peter Shirley presented it in C++. I added a Makefile so you can make out.jpg and compare the runtime to CUDA. To build the variants that use CUDA, check out one of the chapter branches, e.g. git checkout ch01_output_cuda

Colophon

Basic process (after Chapter 3) was:

# checkout original code & create a cuda branch
git checkout origin/chyy_yyy
git checkout chyy_yyy
git branch -m chyy_yyy_cuda
git mv main.cc main.cu
<checkin>
# grab the previous chapter's code as a starting point
cp chapterxx/* .
# edit & fix code
# checkin code
# save current code for next chapter
mkdir chapteryy
cp * chapteryy

raytracinginoneweekendincuda's People

Contributors

rogerallen

raytracinginoneweekendincuda's Issues

Why curand_init() with seed 1984 ?

I'm really curious why curand_init() with the seed 1984 gets the same result as drand48() on the CPU. How did you find the seed? Do they use the same algorithm to generate the random numbers? @rogerallen

define M_PI

From CUDA Education on the blog forum:

One last thing regarding the code: in the later chapters, the CUDA code uses the M_PI constant in camera.h. Please add the following snippet before the camera.h include so there are no build-time errors:

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

Error 716 keeps popping up

I have just started learning the CUDA framework and was going through this project. The solution works perfectly fine at low sample counts (<= 10 samples per pixel), but at higher sample counts it almost never works. I say "almost" because the error messages are sporadic in nature: at lower sample counts I occasionally get this kind of error too, just much less frequently.
I am not able to pin down why exactly this is happening.
I tried running the program with compute-sanitizer and I got this kind of message:

========= COMPUTE-SANITIZER
========= Invalid __local__ write of size 4 bytes
=========     at 0x3e60 in D:/CudaProjects/raytracinginoneweekendincuda-ch12_where_next_cuda/material.h:52:render(vec3 *, int, int, int, camera **, hitable **, curandStateXORWOW *)
=========     by thread (4,1,0) in block (128,1,0)
=========     Address 0xfffb7a is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuEventRecordWithFlags [0x7ffbc0ecc7b5]
=========                in C:\WINDOWS\system32\DriverStore\FileRepository\nvam.inf_amd64_4c9ded46d0fbe1f8\nvcuda64.dll
=========     Host Frame: [0x1d46]
=========                in D:\CudaProjects\raytracinginoneweekendincuda-ch12_where_next_cuda\a.exe
-- More --

I am assuming that this is some kind of misalignment error, although a TDR timeout could also be in the picture; but if it were a TDR timeout, why has the same code sometimes worked at higher sample counts?
I tried using align(16) annotations on my classes and even changed the float[3] member of the vec3 class to a float3 to try to avoid any misalignment issues, but I am still facing the same problem.
I have been trying to debug this error for quite some time now, and any help would be really appreciated.

I have CUDA 11.2 installed on my system. I am on Windows 11 with an NVIDIA GTX 1660 Ti graphics card.

About the design of ray tracing program (not a specific implementation issue)

Hi! Recently I've been reading your post on the NVIDIA Developer blog, and I found this repo. When I took a look at your code, I found something that contradicts my understanding of how to design and build a fast parallel GPU program. Here is my confusion: when I learned CUDA several months ago, I was told that GPUs generally cannot handle heavy logical branching (if-else structures), since branch prediction and scheduling are tricky even on a CPU. When there is an if-else structure in the code, it can lead to so-called warp divergence, i.e. the threads entering the true branch and the false branch are serialized, causing time overhead. I believed this so strongly that I have been trying to avoid writing if-else statements and for loops inside __device__ code, just to make my code (potentially) faster. However, your code seems to have A LOT of branches and for loops; here is an example (the __device__ function color in main.cu):

for(int i = 0; i < 50; i++) {
    hit_record rec;
    if ((*world)->hit(cur_ray, 0.001f, FLT_MAX, rec)) {
        ray scattered;
        vec3 attenuation;
        if(rec.mat_ptr->scatter(cur_ray, rec, attenuation, scattered, local_rand_state)) {
            cur_attenuation *= attenuation;
            cur_ray = scattered;
        }
        else {
            return vec3(0.0,0.0,0.0);
        }
    }
    else {
        vec3 unit_direction = unit_vector(cur_ray.direction());
        float t = 0.5f*(unit_direction.y() + 1.0f);
        vec3 c = (1.0f-t)*vec3(1.0, 1.0, 1.0) + t*vec3(0.5, 0.7, 1.0);
        return cur_attenuation * c;
    }
}

Since I intend to build my own customized ray tracer, I am confused about whether (and how) this can work well (be fast). Do you think I have some misunderstanding of warp divergence and the efficient design of GPU ray tracing programs? (Actually, I did expect heavy logic and tons of branches in a ray tracer's code, which is why I initially thought GPU acceleration of a ray tracer would be very hard.) You can consider me a CUDA newbie, and I really want to know what you think about these problems. Thanks!

ptxas' died due to "signal 11" (Invalid memory reference)

Hi,

When I attempted to write my own ray tracer using this code as a guide I encountered an error where ptxas would crash.

Unfortunately it is not reproducible now; I encountered the bug about a month ago. I have rewritten a lot of my code since then and have not been using source control. This is embarrassing, because it was so easy to reproduce just a month ago.

I went through my google search history and the error I was encountering was:
"ptxas' died due to "signal 11" (Invalid memory reference)"
"stack size for entry function cannot be statically determined"

Here is a similar error that I found when I was debugging it:
https://devtalk.nvidia.com/default/topic/808186/nvcc-error-ptxas-died-due-to-signal-11-invalid-memory-reference-/

It seemed to be caused by the dielectric material's scatter function. The scatter function you have is:
__device__ virtual bool scatter(const Ray& r_in, const HitRecord& rec, vec3& attenuation, Ray& scattered, curandState *local_rand_state) const = 0;

The scatter function I wrote was:
__device__ virtual bool scatter(const Ray& r_in, const HitRecord& rec, vec3& attenuation, Ray& scattered, curandState *local_rand_state, bool rand) const = 0;

Where "rand" is used in the dielectric function as:
rand < reflect_prob ? scattered = Ray(rec.p, reflected) : scattered = Ray(rec.p, refracted);

I am not sure why rand is a bool and not a float. It has been a while since I touched this code, but this is exactly what I did to work around the problem. I do have a record of that in a comment.

Here are the details on the system I am using:
[image: system details]

I was able to debug this by looking at the ptxas output. To do this I use the following compile command:
nvcc -ccbin g++ -I. -I/usr/local/apps/cuda/cuda-9.2/include -gencode arch=compute_35,code=sm_35 --ptxas-options="-v" -c $< -o $@ --verbose

CUDA doesn't accelerate my ray tracer

Hi Roger, sorry to bother you in such a way.

I like this project a lot, and I have been trying to port another, even simpler ray tracer, smallpt, to CUDA in basically the same way as this project. The results look correct, but the rendering times are basically the same for the CPU and CUDA versions on my laptop.

1. Reproducing the results:
$ git clone --recurse-submodules https://github.com/w3ntao/smallpt-cpu.git

# build
$ cd smallpt-cpu
$ mkdir build; cd build
$ cmake ..; make -j

# render
$ ./smallpt-cpu
rendering (64 spp) took 10.600 seconds.
image saved to `smallpt_cpu_64.png`
$ git clone --recurse-submodules https://github.com/w3ntao/smallpt-megakernel.git

# build
$ cd smallpt-megakernel
$ mkdir build; cd build
$ cmake ..; make -j

# render
$ ./smallpt-megakernel
rendering (64 spp) took 11.214 seconds.
image saved to `smallpt_megakernel_64.png`

It took around 10 seconds for the CPU ray tracer to render and around 11 seconds for the CUDA version. Both rendered images look visually correct.

[rendered image: smallpt_megakernel_64]

My laptop setup: Intel Core i7-10750H (6 cores, 12 threads), NVIDIA GeForce GTX 1650

2. The core function trace() (in main.cpp and main.cu)

This function is not complex; its pseudocode can be summarized as:

Vec3 trace(Ray camera_ray) {
    Vec3 radiance(0.0, 0.0, 0.0);
    Vec3 throughput(1.0, 1.0, 1.0);

    Ray ray <- camera_ray;
    for (int depth = 0;; ++depth) {
        hit_point, sphere = intersect(ray, scene)
        // compute hit_point and sphere by intersecting the ray with the scene

        if (hit_nothing) {
            // stop the loop if hit nothing
            break;
        }

        update(radiance, sphere.emission)        // update radiance according to sphere's emission
        update(throughput, sphere.material_type) // update throughput according to sphere's material_type

        extension_ray = spawn_ray(ray, hit_point, sphere.material_type)
        // compute extension_ray from previous ray, hit_point and sphere.material_type

        ray <- extension_ray
        // start next round bouncing with extension_ray
    }

    return radiance;
}

There are some mathematical and trigonometric details behind it, but I don't think they are very relevant to the current discussion.

Other than different memory management, the code is virtually identical for the CPU and CUDA implementations. For the CPU, each thread renders a batch of pixels, roughly 1/N of all pixels on an N-thread CPU. For the GPU, each thread renders one pixel, and the block dimension is 8x8.

First I thought (divergent) branches in the code could be hurting kernel function performance, so I changed 2 spheres (a mirror and a glass) to diffuse materials (main.cu: line 332 and main.cu: line 336); rendering time improved to just 9.9 seconds (from 11 seconds), still far from reaching the GPU's full potential.

[rendered image: smallpt_megakernel_64-diffuse]

I understand that such a naive port (from CPU functions) to CUDA is not optimized, but just like the ray-tracing-in-one-weekend-cuda project, I believe that even with such a sub-optimal approach the CUDA ray tracer should still outperform the CPU version by an order of magnitude or more. So I'm asking: what might I have done wrong with CUDA to get such low rendering performance?

Best,
Wentao

Modify curand_init for 2x performance improvement

Change

curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);

to

curand_init(1984+pixel_index, 0, 0, &rand_state[pixel_index]);

for 2x speedup. Some info at: https://docs.nvidia.com/cuda/curand/device-api-overview.html#performance-notes

The first call uses a fixed random seed, a different sequence id per pixel, and a fixed offset into that sequence. That creates a different sequence per thread (more overhead).

The second call uses a different seed per pixel with the same sequence id and offset. I think this means it only generates one sequence for all the threads, and the different seeds provide enough randomness without so much overhead.

I had tried this out when I originally created the code, but read the instructions too quickly & messed up. I modified the 3rd parameter, not the first. Doh!

CUDA error = 999 at main.cu:196 'cudaDeviceSynchronize()' with high resolution

After completing ch12,

int nx = 640;
int ny = 480;
int ns = 10;

finishes rendering in 5 seconds with no errors, but

int nx = 1200;
int ny = 800;
int ns = 10;

crashes after a few seconds with "CUDA error = 999 at main.cu:196 'cudaDeviceSynchronize()'".
(Also, my screen goes black for a second.)
This is reported by the checkCudaErrors(cudaGetLastError()); call after render<<<blocks, threads>>>(...).

A higher number of samples crashes the program as well:

int nx = 640;
int ny = 480;
int ns = 100;

I am using an NVIDIA Quadro K4200.
Is this the performance limit of my GPU?
