
Comments (7)

ttabi commented on July 17, 2024

Since you said that the problem does not occur with the proprietary driver, please post an nvidia-bug-report.log.gz with the proprietary driver installed, so that we can compare the two.

from open-gpu-kernel-modules.

Reverier-Xu commented on July 17, 2024

Using the proprietary driver:

Sun Apr 21 01:33:56 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.76                 Driver Version: 550.76         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   55C    P0             13W /   80W |       2MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

nvidia-bug-report.log.gz


RashadGasimli commented on July 17, 2024

Same here: I can't get any information about power usage with the latest drivers (both the proprietary driver and the open GPU kernel modules).


mtijanic commented on July 17, 2024

Hi there! The nvidia-bug-report.log from the original post shows:

Apr 19 22:36:49 Reverier-Arch kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  550.67  Release Build  (archlinux-builder@)  
[...snip...]
Apr 20 19:55:29 Reverier-Arch kernel: NVRM: API mismatch: the client has the version 550.76, but
                                       NVRM: this kernel module has the version 550.67.  Please
                                       NVRM: make sure that this kernel module and all NVIDIA driver
                                       NVRM: components have the same version.

but your nvidia-smi output shows Driver Version: 550.76. Are you sure these are from the same run? A mismatch between the kernel-mode and user-mode driver versions could easily cause this error. That can happen if you built the wrong version of the open driver from source. dkms shows you built 550.76, but you are somehow still loading 550.67.
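For anyone checking their own logs for the same symptom, here is a minimal sketch that pulls the two version numbers out of the "API mismatch" message quoted above. The function name `find_api_mismatch` and the regular expression are illustrative assumptions, not part of any NVIDIA tool, and the pattern is only matched against the message wording shown in this thread.

```python
import re

# Matches the two-line "API mismatch" message quoted above. The optional
# "NVRM:" prefix covers the continuation line as it appears in that log.
MISMATCH_RE = re.compile(
    r"the client has the version (?P<client>\d+(?:\.\d+)+), but\s+"
    r"(?:NVRM:\s+)?this kernel module has the version (?P<kernel>\d+(?:\.\d+)+)"
)


def find_api_mismatch(log_text):
    """Return (client_version, kernel_module_version), or None if absent."""
    match = MISMATCH_RE.search(log_text)
    if match is None:
        return None
    return match.group("client"), match.group("kernel")


# Sample taken from the log excerpt above (timestamps omitted).
SAMPLE_LOG = (
    "NVRM: API mismatch: the client has the version 550.76, but\n"
    "NVRM: this kernel module has the version 550.67.  Please\n"
    "NVRM: make sure that this kernel module and all NVIDIA driver\n"
    "NVRM: components have the same version.\n"
)

if __name__ == "__main__":
    print(find_api_mismatch(SAMPLE_LOG))
```

If the function returns a pair of different versions, the user-space components and the loaded kernel module are out of sync, which is the situation described above.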


Reverier-Xu commented on July 17, 2024


The driver versions were previously inconsistent, probably due to distro packaging issues; sorry about that. I can now confirm that the driver versions are consistent, but the issue remains. I regenerated the log file using nvidia-bug-report.sh:

nvidia-bug-report.log.gz

I can confirm that, according to the package manager, the dkms driver has the correct version, 550.76:

$ pacman -Qi nvidia-open-dkms 
Name            : nvidia-open-dkms
Version         : 550.76-3
Description     : NVIDIA open kernel modules
Architecture    : x86_64
URL             : https://github.com/NVIDIA/open-gpu-kernel-modules
Licenses        : GPL
Groups          : None
Provides        : nvidia-open  NVIDIA-MODULE
Depends On      : nvidia-utils=550.76  libglvnd  dkms
Optional Deps   : None
Required By     : None
Optional For    : None
Conflicts With  : nvidia-open  NVIDIA-MODULE
Replaces        : None
Installed Size  : 77.26 MiB
Packager        : Jan Alexander Steffens (heftig) <[email protected]>
Build Date      : Mon Apr 29 00:27:24 2024
Install Date    : Tue Apr 30 22:05:55 2024
Install Reason  : Explicitly installed
Install Script  : No
Validated By    : Signature
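The same check can be scripted. The sketch below parses `pacman -Qi` output like the listing above and compares the package version with the pinned `nvidia-utils` dependency. The helper names are hypothetical, and the parsing assumes the `Key : Value` layout shown here. Note that this only inspects package metadata; it says nothing about which module the kernel actually loaded.

```python
def parse_pacman_info(text):
    """Turn `pacman -Qi` output into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip lines without a "Key : Value" separator
            fields[key.strip()] = value.strip()
    return fields


def upstream_version(pkg_version):
    """Drop an optional 'epoch:' prefix and the '-pkgrel' suffix."""
    return pkg_version.split(":")[-1].rsplit("-", 1)[0]


def pinned_dependency(fields, name):
    """Return the version pinned as 'name=version' in Depends On, if any."""
    for item in fields.get("Depends On", "").split():
        if item.startswith(name + "="):
            return item.split("=", 1)[1]
    return None


# Abbreviated copy of the listing above.
SAMPLE_PACMAN = """\
Name            : nvidia-open-dkms
Version         : 550.76-3
URL             : https://github.com/NVIDIA/open-gpu-kernel-modules
Depends On      : nvidia-utils=550.76  libglvnd  dkms
"""

if __name__ == "__main__":
    info = parse_pacman_info(SAMPLE_PACMAN)
    module = upstream_version(info["Version"])
    utils = pinned_dependency(info, "nvidia-utils")
    print("consistent" if module == utils else "mismatch", module, utils)
```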


mtijanic commented on July 17, 2024

Thanks for the update! Looking at the new log, this really stands out:

[    6.999219] NVRM: GPU at PCI:0000:01:00: GPU-e8108ab1-bcb6-22ff-7cab-c21072716616
[    6.999223] NVRM: Xid (PCI:0000:01:00): 62, pid='<unknown>', name=<unknown>, 20262044 2027f08e 2027df5c 2022ec3e 20281296 2022adbe 00000000 00000000

Xid 62 is PMU_HALT_ERROR, which would make many of the power readings unavailable and could also lead to broader system instability. The GSP logs confirm as much. I've filed bug 4630466 so that our PMU experts can look into it.

In the meantime, could you please load nvidia.ko with NVreg_RmMsg=":" and try one more time? That should flood your dmesg with a lot of debug info, and hopefully some of it can help us narrow it down.
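Once dmesg is flooded with debug output, Xid events can be hard to spot by eye. As a rough sketch, the snippet below filters kernel log text for lines in the format of the Xid message quoted above and returns the PCI address and Xid number. The name `find_xids` and the regular expression are assumptions made for illustration, based only on that one sample line.

```python
import re

# Pattern modeled on the "NVRM: Xid (PCI:...): NN," line quoted above.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


def find_xids(dmesg_text):
    """Return a list of (pci_address, xid_number) for every Xid line found."""
    return [
        (m.group("pci"), int(m.group("xid")))
        for m in XID_RE.finditer(dmesg_text)
    ]


# Sample lines copied from the log excerpt above.
SAMPLE_DMESG = (
    "[    6.999219] NVRM: GPU at PCI:0000:01:00: "
    "GPU-e8108ab1-bcb6-22ff-7cab-c21072716616\n"
    "[    6.999223] NVRM: Xid (PCI:0000:01:00): 62, pid='<unknown>', "
    "name=<unknown>, 20262044 2027f08e 2027df5c 2022ec3e 20281296 "
    "2022adbe 00000000 00000000\n"
)

if __name__ == "__main__":
    for pci, xid in find_xids(SAMPLE_DMESG):
        print(f"Xid {xid} reported for GPU at PCI:{pci}")
```

In practice the input could be the saved output of `dmesg` or the kernel log section of nvidia-bug-report.log.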


Reverier-Xu commented on July 17, 2024

Arch Linux has pushed nvidia-open 550.78 to the repository, and it seems that this issue is solved. Thanks!

