
Comments (20)

m4rs-mt avatar m4rs-mt commented on May 19, 2024 2

@Wowo51 Oh, this is really interesting. Without a detailed analysis of your program this will be hard to detect, since ILGPU produces essentially the same output as the Cuda compiler regarding PTX instructions:

__global__ void Kernel(float *c, const float *a, const float *b)
{
    int i = ...
    c[i] = a[i] + b[i];
    c[i] = a[i] - b[i];
    c[i] = a[i] * b[i];
    c[i] = a[i] / b[i];
}

using NVCC yields:

    add.f32     %f3, %f1, %f2;
    ...
    sub.f32     %f6, %f4, %f5;
    ...
    mul.f32     %f9, %f7, %f8;
    ...
    div.rn.f32  %f12, %f10, %f11;
    ...

and

__global__ void Kernel(double *c, const double *a, const double *b)
{
    int i = ...
    c[i] = a[i] + b[i];
    c[i] = a[i] - b[i];
    c[i] = a[i] * b[i];
    c[i] = a[i] / b[i];
}

using NVCC yields:

    add.f64     %fd3, %fd1, %fd2;
    ...
    sub.f64     %fd6, %fd4, %fd5;
    ...
    mul.f64     %fd9, %fd7, %fd8;
    ...
    div.rn.f64  %fd12, %fd10, %fd11;
    ...

ILGPU produces the same instructions. Maybe the problem is related to expressions that can be evaluated at compile time? If your kernels rely on compile-time constant expressions, you can experiment with the context flag DisableConstantPropagation.

from ilgpu.

adkruk avatar adkruk commented on May 19, 2024 1


Costigan avatar Costigan commented on May 19, 2024 1

I have a similar problem with XMath.Atan2().

I'm seeing a 0.8% error in the returned value for x/y inputs around 45, 135, 225 and 315 degrees.

This error occurs with the Cuda accelerator type and not with the CPU type. That is, the results are different and the results with the CPU type are correct to 6 or 7 places.

This 0.8% error is large enough to matter in my situation; it has been causing errors in my program's results since last summer, when I upgraded from 0.6. I'm using 0.7.1 now. I think there was some issue that forced me to upgrade or I would downgrade now, but I currently don't remember what that was.

You wrote: "In the future, we would like to implement mathematical functions that have the same accuracy with respect to ULPs on all accelerators."

First, ILGPU is great, and I'm very grateful that it exists. I understand this goal, but the cost in accuracy of the trigonometric functions is a breaking change for some applications. Could there perhaps be a way for the user to select the NVIDIA implementations?


MoFtZ avatar MoFtZ commented on May 19, 2024 1

hi @m4rs-mt, what are your thoughts on using CORDIC as an implementation for the XMath functions? Apparently it is straightforward to implement, and you can increase accuracy by running more iterations. I'm not sure how performant it would be on the GPU, however.

UPDATE: I have implemented a console application to calculate Sin/Cos with kernels for single and double precision. Running it will output the error margin between the system implementation and the CORDIC implementation.

Cordic.zip


MoFtZ avatar MoFtZ commented on May 19, 2024 1

@m4rs-mt, I found another libm implementation called OpenLibm.

It would still need to be ported to C#, but it appears to have a suitable/permissive license - it has already been ported to Rust. Even better, it also has implementations for both single and double precision.

As a trial, I successfully ported the sine function and got it working on the CPU accelerator. However, it uses local memory allocations for arrays of temporary values, so it will not run on the Cuda accelerator - and no doubt there are many other issues further down the track.

From another issue on this project, you said:

Please note that the next version will include 1D arrays in local memory 🔒

Do you have any further information about this?

  1. What operations will be supported?
    e.g. var temp = new double[20];

  2. Will it also support passing arguments to functions?
    e.g. public static double SomeFunction(double[] values) { /* do some calculations */ }


m4rs-mt avatar m4rs-mt commented on May 19, 2024 1

@MoFtZ I also saw this library. I played around with several possible workarounds to use the NVIDIA-specific functions in Cuda kernels. And finally... it seems to work 🔒

However, it could be really nice to have a software implementation for all functions that can be used on any accelerator. Because of different copyright and legal "challenges", it might be a good idea to convert these functions as part of another project (assuming we want to convert them).


m4rs-mt avatar m4rs-mt commented on May 19, 2024 1

@MoFtZ The upcoming array features will allow you to declare and instantiate 1D local arrays (as in 1.). You can also pass references to specific array elements and the whole array to other functions.


MoFtZ avatar MoFtZ commented on May 19, 2024 1

hi @Wowo51, have you tried updating to the latest ILGPU.Algorithms? Also, which functions are you using?

The XMath class was updated in ILGPU.Algorithms v0.8.0 to support greater accuracy.


Wowo51 avatar Wowo51 commented on May 19, 2024 1

I'm only using +, -, *, and / at this point, used as you normally would in code; I'm not using XMath. I'm using version 0.8. I'll have to dig into this further, so I'll post more info when I have it.


m4rs-mt avatar m4rs-mt commented on May 19, 2024

@adkruk Thank you for your report. The "correctness" of floating point numbers has always been a topic of debate on various platforms. Therefore, I think the title of this issue might be misleading: floating point numbers are by definition "wrong". I think the title should read "insufficient accuracy" ;)

Note that the actual math implementations have been moved to the ILGPU.Algorithms library, as they are not directly related to the core compiler functionality. I also wonder how you compare the output? It looks like you check that the results are exactly equal on all platforms. In general, this cannot work, because the ULP precision guarantees differ between target platforms.

Regarding your precision concerns: The old ILGPU version used a special NVIDIA library to implement special math functions. The current ILGPU.Algorithms library uses its own intrinsic implementations for all target platforms. You might ask yourself: why is this so? In the future, we would like to implement mathematical functions that have the same accuracy with respect to ULPs on all accelerators. However, ILGPU.Algorithms does not (currently) guarantee a specific accuracy in ULPs of the available special functions.

A more detailed analysis of your use cases shows the following findings:

  1. The inaccuracy is caused by the 'Exp' function, which is not surprising since it uses a fast approximation under the hood on the GPU (see summary).
  2. The inaccuracy could be caused by the Atomic.Add function, which should not be used for floats or doubles in general, since there is no guarantee about the order of operations on a particular target GPU. Since floating-point addition is not associative, you should never use atomic functions if you really want extremely accurate results. Changing your kernel code to
public static void SumOfSquares2(Index index, ArrayView<double> weights, ArrayView<double> target)
{
    var quantity = weights[index] * weights[index];
    target[index] = quantity;
}

shows that the results are actually the same on CPU and GPU. Again, this is not surprising, since the PTX specification guarantees an ULP error of 0 in this case (and ILGPU emits the desired instructions). Using an Atomic.Add operation currently generates a software loop to add the double values. I suspect that the older LLVM backend may have emitted a native PTX instruction for your target GPU (this is just a wild guess). However, I cannot reproduce your value of 4.9106901777409995 on either CPU or GPU, which is not surprising given the parallel nature of the calculations in general.

Summary:
You should ask yourself whether this precision is good enough. In the cases you showed, the precision is in the range of 10^-10 to 10^-12, which should be fine for almost all use cases. If you generally need higher precision for scientific calculations, either use higher-precision mathematical functions that you can control (a custom implementation), deterministic calculation orders, or integer fixed-point arithmetic. It also looks like you are using these kernels for neural networks; if so, the accuracy should be good enough.


adkruk avatar adkruk commented on May 19, 2024

I compared accuracy with Octave and Python. In the previous version of ILGPU, the GPU computations matched Octave and Python. Therefore, ILGPU looked very promising to me.

Please correct me if I'm wrong: the reason accuracy is worse on the NVIDIA platform is that, in the future, you want to have the same accuracy with respect to ULPs on all accelerators?

In my opinion, it's always better to have better accuracy rather than worse, no matter what the future plans are, isn't it?
Especially considering that for some developers the accuracy used to be better and is now worse.

Regarding Atomic.Add, I can live with how it works right now, and its accuracy is good enough.
But I expect basic mathematical functions like exp, sin, tan, etc. to be as accurate as possible. Therefore, if NVIDIA delivers its own accurate implementation, I would prefer to use it.

It would be better to let the user/developer decide which implementation (NVIDIA or intrinsic) they want to use. Is it possible to add some switch?


m4rs-mt avatar m4rs-mt commented on May 19, 2024

@adkruk: First of all: Happy New Year! There have been many changes to the ILGPU framework to support new intrinsic implementations on all platforms. One design decision was to enable full cross-platform portability without requiring an external SDK (like the CUDA SDK). However, this SDK provides the libdevice library of special intrinsic math implementations for NVIDIA GPUs, which is tied to a specific LLVM version. Since native LLVM support has been removed from the main project, it may be difficult to include these low-level implementations in the current compilation pipeline.

Most users of ILGPU care about portability, flexibility, and performance. Numerical precision beyond 10^-12 has not been an issue so far. Anyway, since you are requesting additional precision, we should focus on improving the precision of the currently implemented math functions within the ILGPU.Algorithms library. It is entirely possible to implement the functions so that they match the intended ULP precision. However, I would suggest doing this in a portable way; otherwise, kernels on AMD hardware (for example) will yield different results. To be honest, this is not the highest priority on my roadmap, but if you are interested in implementing this functionality in the ILGPU.Algorithms library, I would be happy to accept PRs.

Alternatively, I plan to add support for an AOT compiler that may use an LLVM backend for high-performance code generation. This pipeline could include support for the Cuda SDK as well. However, this will require some time to implement ;)


m4rs-mt avatar m4rs-mt commented on May 19, 2024

@adkruk Thank you for your very detailed analysis. To better understand what is causing these problems in terms of precision, we need to examine the generated instructions. Can you publish your evaluated code snippets? The main difference between both versions is that the new version uses a new backend and (probably) selects different instructions for all operations.


m4rs-mt avatar m4rs-mt commented on May 19, 2024

@Costigan Thank you very much for your feedback. The integration of the libdevice library could be very problematic (as mentioned above) and would bring many different limitations that we are trying to overcome with this project. First of all, portability across all platforms 🔒. The 'high-level' math functions are currently being implemented in the ILGPU.Algorithms project.

There are some conceptual solutions to the problem regarding trigonometric functions:

  1. Wait for a managed implementation of all math functions to appear in .NET.
  2. Implement these functions in a nicer and more proper way, following one of the available implementations (e.g. here and here), without violating legal constraints.
  3. Add an LLVM-based backend to support libdevice integration (might come in the future but requires additional development time).
  4. Use the upcoming SPIR-V backend, which offers these intrinsic functions by design.
  5. I recently discovered a way to "access" the internal mathematical functions without linking to the libdevice library. It is an ugly workaround, but it seems to work on all NVIDIA GPUs I have tested so far.

I will spend some time evaluating option 5, as it seems the most realistic in terms of the programming effort required and the likelihood of successful integration. I will publish updates once I have evaluated these potential workarounds. However, the ILGPU project is always looking for volunteers to implement such features as a new math library 🥇


Costigan avatar Costigan commented on May 19, 2024

@m4rs-mt Thank you for your reply. I've gotten going again by replacing calls to XMath.Atan2() with the implementation here: https://developer.download.nvidia.com/cg/atan2.html . I haven't checked the accuracy of the Sin and Cos functions yet. (Those three are all I'm using right now.)

I think what's going on is that the approximation you're using from https://de.wikipedia.org/wiki/Arkustangens_und_Arkuskotangens is a little too inaccurate. I'm using ILGPU for scientific calculations with floats, and the error was enough to cause problems.

I'm not sure what to suggest as the best solution. You could use an approximation that's a little better like the one above from NVIDIA, but that might cause problems for another user doing different scientific calculations. I would volunteer to help implement a solution, but I'm not a numerical methods person nor a GPU expert and could easily steer you in the wrong direction. Maybe I could help test a solution, though.

I didn't really focus on it above, but the values returned by Atan2() in the GPU implementation differ from the values returned by the CPU implementation. I said that I understood the desire to have the same results when running via Cuda and CPU, but I don't really. I imagine everyone using ILGPU cares much more about the Cuda version and uses the CPU version to debug. I think that non-determinism due to parallel execution could be at least as significant a cause of different results as differences close to float.Epsilon. When debugging my application (admittedly a sample of one), I don't care if the results returned by the Cuda and CPU trig functions are slightly different. If I'm using floats, I expect small errors, but I would like those errors to be as small as possible. If I wanted quicker, lower-accuracy versions of those functions, I would expect to find them in a separate library documented as low accuracy.


m4rs-mt avatar m4rs-mt commented on May 19, 2024

@MoFtZ Great work 🥇 I am happy to accept PRs for the ILGPU.Algorithms project implementing your functions 🔒. To answer your question: yes, that would make sense.

@MoFtZ @Costigan I have found out how to get access to most of NVIDIA's math function implementations without having to link the libdevice library. This workaround seems to work on my test machines. Anyway, in general it would be much nicer to have "better" implementations.


MoFtZ avatar MoFtZ commented on May 19, 2024

@m4rs-mt, with regards to making improvements to the XMath functions, what are your thoughts on consistency versus using the available implementation?

For example, Tan(double) on Cuda currently provides a minimum of 2 decimal places of accuracy across the range of input values that I tested.

I can implement a software alternative (using CORDIC) to provide a minimum of 12 decimal places of accuracy across the same input range. However, OpenCL does better, providing 14 decimal places of accuracy.

Should I be forcing all accelerators to use the CORDIC implementation, for consistency? Or, should I be "patching the holes" and use the CORDIC implementation only when it is better?

Also, I'm guessing that the accuracy of OpenCL is probably also driver dependent. Should we have a software fallback, with a way for the user to configure it at runtime?


Wowo51 avatar Wowo51 commented on May 19, 2024

Accuracy is essential for my use case, and it's likely to be essential for many types of AI workloads. It's not the error in a single calculation that's the problem; it's the way that error accumulates when you perform many calculations.

Consider a table with many independent-variable columns and a dependent-variable column. Your objective is to form a model that predicts the dependent column for each of the many rows of independent variables. So your app proposes a model, calculates a forecast, and compares it to the dependent column. The total error of the model is calculated by summing the errors for each row and averaging. If there is a little bit of round-off error in each row, those round-off errors will compound horridly when you do the summation, especially if the data set is large. This is going to be the case for many types of AI.

I'm only getting about 4-5 significant digits of agreement with Microsoft's C# calculations for a smallish data set, and I'm using doubles. I might have to use NOpenCL as an optional feature at this point, as my app is likely to fail altogether for larger data sets. My app is a multi-GPU symbolic regression app, so I would really be clipping its wings if I can't turn it loose on large data sets. I like your library; it's very easy to use and seems to work well in all regards but this one. Is it possible to bump accuracy up on your list of things to do? What do you think the easiest solution for me would be?


Wowo51 avatar Wowo51 commented on May 19, 2024

@m4rs-mt, thanks for your time. I'll dig into it further...


m4rs-mt avatar m4rs-mt commented on May 19, 2024

@Wowo51 After analyzing several other questions, we believe that the problem could also be related to #221. I will close this issue for now, as the originally described problem is solved.

