check_cuda_numerical_stability

A script that uses the reversibility principle of IREVNET to check whether your CUDA compute card produces numerically correct results.

How to check

This tool relies on the reversibility principle of IREVNET.
In the current implementation, the IREV module first computes an output from the input, and the same module is then run in reverse on that output to reconstruct the input.
By measuring the maximum difference between the input and the reconstructed input, the tool can judge whether the CUDA card produces numerical errors.
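The round trip described above can be illustrated with a minimal sketch. It is not the script's code: it runs on the CPU in pure Python, uses a simplified additive-coupling block in place of the IREV module, and the names `f`, `forward`, `inverse`, and `round_trip_error` are hypothetical.

```python
import random


def f(values):
    # Hypothetical residual function. It never has to be inverted itself,
    # so any deterministic function works here.
    return [0.5 * v * v + 0.1 for v in values]


def forward(x1, x2):
    # Additive coupling: y1 = x2, y2 = x1 + f(x2)
    y1 = x2
    y2 = [a + b for a, b in zip(x1, f(x2))]
    return y1, y2


def inverse(y1, y2):
    # Exact algebraic inverse: x2 = y1, x1 = y2 - f(y1)
    x2 = y1
    x1 = [a - b for a, b in zip(y2, f(y1))]
    return x1, x2


def max_abs_error(original, restored):
    # Largest element-wise absolute difference between two sequences
    return max(abs(a - b) for a, b in zip(original, restored))


def round_trip_error(x1, x2):
    # Run forward, then inverse, and report the worst reconstruction error
    y1, y2 = forward(x1, x2)
    r1, r2 = inverse(y1, y2)
    return max(max_abs_error(x1, r1), max_abs_error(x2, r2))


rng = random.Random(0)
x1 = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
x2 = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
error = round_trip_error(x1, x2)
print("max reconstruction error:", error)
```

On healthy hardware the only source of error in such a round trip is floating-point rounding, which is why a large reconstruction error points to a faulty computation.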

On a normal graphics card, the numerical error stays below 1e-5. On an abnormal graphics card, the numerical error occasionally exceeds 1e-3.
In my local tests, a normal graphics card passed a 2-hour run without any errors; I have not tried longer runs. With abnormal graphics cards, the test usually reports an error within 5-25 minutes.
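Because faults show up only occasionally, the check has to be repeated for a long time. The sketch below shows one way such a timed loop could look. It is an illustration, not the script's implementation: `run_stability_test` and `measure_error` are hypothetical names, and the failure threshold of 1e-3 is an assumption taken from the figures quoted above rather than from the script.

```python
import time


def run_stability_test(measure_error, minutes=30.0, threshold=1e-3,
                       clock=time.monotonic):
    """Call measure_error() repeatedly until time runs out or a check fails.

    measure_error is a hypothetical callable returning one reconstruction
    error. A value of minutes <= 0 means "run until a failure occurs".
    Returns (passed, iterations, worst_error).
    """
    deadline = None if minutes <= 0 else clock() + minutes * 60.0
    iterations = 0
    worst = 0.0
    while deadline is None or clock() < deadline:
        err = measure_error()
        iterations += 1
        worst = max(worst, err)
        if err > threshold:
            # Stop early: one error above the threshold fails the test
            return False, iterations, worst
    return True, iterations, worst


# Usage with a stand-in measurement that always reports a tiny error
passed, iterations, worst = run_stability_test(lambda: 1e-6, minutes=0.001)
print("Test passed" if passed else "Test failure")
```

Passing the clock in as a parameter keeps the loop easy to exercise without waiting for real time to elapse.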

Dependencies

python3
pytorch >= 1.1
argparse

Command

python _check_cuda_numerical_stability.py -h
usage: _check_cuda_numerical_stability.py [-h] [-i I] [-t T] [-bs BS]

Used to detect CUDA numerical stability problems.

optional arguments:
  -h, --help  show this help message and exit
  -i I        card id. Which cuda card do you want to test. default: 0
  -t T        minute. Test duration. When the setting is less than or equal to 0, it will not stop automatically.
              defaule: 30
  -bs BS      Test batch size when testing. defaule: 20

How to use

python _check_cuda_numerical_stability.py

This command starts a test immediately. By default, card 0 is tested for 30 minutes.
If no error occurs within 30 minutes, "Test passed" is output, which means your card probably has no problem.
Otherwise, the test stops early and "Test failure" is output, which means that your CUDA card may be faulty or may not be seated securely in its slot.

python _check_cuda_numerical_stability.py -i 1 -t 60

This command tests card 1 for 60 minutes.
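On a machine with several cards, the same command can be run once per card. The sketch below is one possible wrapper, not part of the tool: `build_command` and `check_cards` are hypothetical helpers, and it assumes the script prints "Test passed" to standard output on success.

```python
import subprocess
import sys

SCRIPT = "_check_cuda_numerical_stability.py"


def build_command(card_id, minutes):
    # Uses only the -i and -t options listed in the help text
    return [sys.executable, SCRIPT, "-i", str(card_id), "-t", str(minutes)]


def check_cards(card_ids, minutes=30, runner=subprocess.run):
    """Test each card in turn and return {card_id: passed}."""
    results = {}
    for card_id in card_ids:
        completed = runner(build_command(card_id, minutes),
                           capture_output=True, text=True)
        # Assumption: the success message appears on standard output
        results[card_id] = "Test passed" in completed.stdout
    return results


# Example call (runs the real script, so it is left commented out):
# print(check_cards([0, 1], minutes=60))
```

The cards are tested one after another, so the total run time is roughly the duration multiplied by the number of cards.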
