Comments (4)

aritger commented on July 17, 2024

Thank you for the report. I've filed NVIDIA internal bug 4706166 to track this.

If you're willing to rebuild the open kernel modules, could you please apply this patch, and then upload the system log after the problem reproduces again? Thanks!

$ cat 0001-instrumentation-for-suspend-crash.patch 
From 44afc9067af6df0671724e37b8f2c2cde7386590 Mon Sep 17 00:00:00 2001
From: Andy Ritger <aritger@nvidia.com>
Date: Mon, 17 Jun 2024 15:03:14 -0700
Subject: [PATCH] instrumentation for suspend crash
X-NVConfidentiality: public

---
 kernel-open/nvidia/nv.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel-open/nvidia/nv.c b/kernel-open/nvidia/nv.c
index 99792de96307..bc003399cd83 100644
--- a/kernel-open/nvidia/nv.c
+++ b/kernel-open/nvidia/nv.c
@@ -3111,7 +3111,8 @@ nv_map_guest_pages(nv_alloc_t *at,
     if (pages == NULL)
     {
         nv_printf(NV_DBG_ERRORS,
-                  "NVRM: failed to allocate vmap() page descriptor table!\n");
+                  "NVRM: failed to allocate vmap() page descriptor table! (page_count: %d)\n", page_count);
+        dump_stack();
         return 0;
     }
 
@@ -3604,7 +3605,8 @@ void* NV_API_CALL nv_alloc_kernel_mapping(
             if (pages == NULL)
             {
                 nv_printf(NV_DBG_ERRORS,
-                          "NVRM: failed to allocate vmap() page descriptor table!\n");
+                          "NVRM: failed to allocate vmap() page descriptor table! (page_count:%d)\n", page_count);
+                dump_stack();
                 return NULL;
             }
 
-- 
2.44.0
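
Once the patched module has logged a failure, the new `page_count` value has to be found in the system log. The following is a minimal sketch of one way to do that, not part of the patch itself. It assumes the kernel log has been saved as plain text lines; the function name `find_page_counts` is hypothetical. The regular expression accepts both spacing variants that the patch prints (`page_count: %d` and `page_count:%d`).

```python
import re

# Matches the instrumented message from the patch above. The patch uses two
# slightly different spacings, so optional whitespace follows the colon.
PAGE_COUNT_RE = re.compile(
    r"NVRM: failed to allocate vmap\(\) page descriptor table! "
    r"\(page_count:\s*(-?\d+)\)"
)


def find_page_counts(log_lines):
    """Return the page_count from every instrumented failure line.

    Lines from an unpatched module (without the page_count suffix) are skipped.
    """
    counts = []
    for line in log_lines:
        match = PAGE_COUNT_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts
```

The function takes any iterable of strings, so an open log file can be passed to it directly. The `dump_stack()` output added by the patch follows each matching line in the log and still needs to be read by hand.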

from open-gpu-kernel-modules.

urbenlegend commented on July 17, 2024

Thanks for the patch. I am currently on the proprietary 555.58.02 module because I need to avoid the slowdowns in KDE caused by the GSP firmware, so I have not run into this sleep issue again. Once the GSP bug is resolved, I will switch to the open module again and apply the patch to see what's going on.

abfipes12 commented on July 17, 2024

I am on the proprietary NVIDIA 555.58.02-1 driver and I have the same problems that are described here. I use Arch Linux with an NVIDIA GeForce RTX™ 3050 Laptop GPU.

I have tried:
linux 6.9.7 (or) linux-lts 6.6.37
NVreg_EnableS0ixPowerManagement (or) NVreg_PreserveVideoMemoryAllocations on /var/tmp (over 250GB space left)
nvidia_drm.modeset 0 (or) 1 as boot parameter
nvidia_drm.fbdev 0 (or) 1 as boot parameter
X11 (or) Xwayland

Nothing helped (except module_blacklist=nvidia).

Jul 06 03:35:17 archlinux kernel: NVRM: failed to allocate vmap() page descriptor table!
Jul 06 03:35:17 archlinux kernel: NVRM: GPU at PCI:0000:01:00: GPU-887a46df-29b2-be1c-8c55-e637117338ba
Jul 06 03:35:17 archlinux kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=8874, name=kworker/u48:11, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080205b 0x4).
Jul 06 03:35:17 archlinux kernel: NVRM: GPU0 GSP RPC buffer contains function 76 (GSP_RM_CONTROL) and data 0x000000002080205b 0x0000000000000004.
Jul 06 03:35:17 archlinux kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Jul 06 03:35:17 archlinux kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
Jul 06 03:35:17 archlinux kernel: NVRM:      0    76   GSP_RM_CONTROL        0x000000002080205b 0x0000000000000004 0x00061c8a2d4a3f2a 0x0000000000000000          y
Jul 06 03:35:17 archlinux kernel: NVRM:     -1    47   UNLOADING_GUEST_DRIVE 0x0000000000000000 0x0000000000000000 0x00061c8a2d32031f 0x00061c8a2d34fd27 195080us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -2    10   FREE                  0x00000000c1e016c0 0x0000000000000000 0x00061c8a2d320088 0x00061c8a2d3202fc    628us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -3    10   FREE                  0x000000000000000a 0x0000000000000000 0x00061c8a2d31fa32 0x00061c8a2d320087   1621us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -4    10   FREE                  0x000000000000000b 0x0000000000000000 0x00061c8a2d31f763 0x00061c8a2d31f943    480us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -5    10   FREE                  0x0000000000000006 0x0000000000000000 0x00061c8a2d31f52a 0x00061c8a2d31f75e    564us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -6    10   FREE                  0x0000000000000002 0x0000000000000000 0x00061c8a2d31e4fe 0x00061c8a2d31f4fd   4095us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -7    10   FREE                  0x0000000000000005 0x0000000000000000 0x00061c8a2d31da4b 0x00061c8a2d31e4fb   2736us  
Jul 06 03:35:17 archlinux kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Jul 06 03:35:17 archlinux kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
Jul 06 03:35:17 archlinux kernel: NVRM:      0    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x00061c8a2d32b15b 0x00061c8a2d32b15c      1us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -1    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000028 0x00061c8a2d324d79 0x00061c8a2d324d7b      2us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -2    4111 PERF_BRIDGELESS_INFO_ 0x0000000000000000 0x0000000000000000 0x00061c8a2d2fb48c 0x00061c8a2d2fb48c           
Jul 06 03:35:17 archlinux kernel: NVRM:     -3    4111 PERF_BRIDGELESS_INFO_ 0x0000000000000000 0x0000000000000000 0x00061c8a2d24eef8 0x00061c8a2d24eef9      1us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -4    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x00061c8a2ce35a01 0x00061c8a2ce35a01           
Jul 06 03:35:17 archlinux kernel: NVRM:     -5    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x00061c8a2ce357ff 0x00061c8a2ce35800      1us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -6    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000027 0x00061c8a2ce33db1 0x00061c8a2ce33db3      2us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -7    4098 GSP_RUN_CPU_SEQUENCER 0x000000000000060a 0x0000000000003fe2 0x00061c8a2ce27b5b 0x00061c8a2ce28c8b   4400us  
Jul 06 03:35:17 archlinux kernel: CPU: 4 PID: 8874 Comm: kworker/u48:11 Tainted: P           OE      6.9.7-arch1-1 #1 44783200744f92500e6484c6d93590bc19db4a83
Jul 06 03:35:17 archlinux kernel: Hardware name: Micro-Star International Co., Ltd. Thin GF63 12UC/MS-16R8, BIOS E16R8IMS.111 03/21/2024
Jul 06 03:35:17 archlinux kernel: Workqueue: async async_run_entry_fn
Jul 06 03:35:17 archlinux kernel: Call Trace:
Jul 06 03:35:17 archlinux kernel:  <TASK>
Jul 06 03:35:17 archlinux kernel:  dump_stack_lvl+0x5d/0x80
Jul 06 03:35:17 archlinux kernel:  _nv012672rm+0x437/0x4b0 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv012592rm+0x74/0x330 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv046348rm+0x49f/0x7f0 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv049583rm+0xa1/0x150 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv045638rm+0x19e/0x1b0 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv047612rm+0x3fc/0x500 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv014430rm+0x42e/0x690 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv045777rm+0x26/0x30 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv000751rm+0x55/0x70 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv000750rm+0x21b/0x220 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv000701rm+0x2ad/0x300 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  rm_power_management+0x22c/0x260 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  ? wait_for_completion+0x91/0x170
Jul 06 03:35:17 archlinux kernel:  nv_power_management+0x92/0x170 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  nvidia_suspend+0x6c/0x100 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  nv_pmops_suspend+0x15/0x30 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  pci_pm_suspend+0x7c/0x170
Jul 06 03:35:17 archlinux kernel:  ? __pfx_pci_pm_suspend+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  dpm_run_callback+0x47/0x150
Jul 06 03:35:17 archlinux kernel:  device_suspend+0x141/0x510
Jul 06 03:35:17 archlinux kernel:  ? try_to_wake_up+0x76/0x660
Jul 06 03:35:17 archlinux kernel:  async_suspend+0x1d/0x30
Jul 06 03:35:17 archlinux kernel:  async_run_entry_fn+0x31/0x140
Jul 06 03:35:17 archlinux kernel:  process_one_work+0x18b/0x350
Jul 06 03:35:17 archlinux kernel:  worker_thread+0x2eb/0x410
Jul 06 03:35:17 archlinux kernel:  ? __pfx_worker_thread+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  kthread+0xcf/0x100
Jul 06 03:35:17 archlinux kernel:  ? __pfx_kthread+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  ret_from_fork+0x31/0x50
Jul 06 03:35:17 archlinux kernel:  ? __pfx_kthread+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  ret_from_fork_asm+0x1a/0x30
Jul 06 03:35:17 archlinux kernel:  </TASK>
Jul 06 03:35:17 archlinux kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=8874, name=kworker/u48:11, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a81 0x4).
Jul 06 03:35:17 archlinux kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=8874, name=kworker/u48:11, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a76 0x2).
Jul 06 03:35:17 archlinux kernel: NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:00 (printing 1 of every 30).  The GPU likely needs to be reset.
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to determine display capabilities
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to tear down Disp
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to determine display capabilities
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to tear down Disp
Jul 06 03:35:17 archlinux kernel: nvidia 0000:01:00.0: PM: pci_pm_suspend(): nv_pmops_suspend+0x0/0x30 [nvidia] returns -5
Jul 06 03:35:17 archlinux kernel: nvidia 0000:01:00.0: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x170 returns -5
Jul 06 03:35:17 archlinux kernel: nvidia 0000:01:00.0: PM: failed to suspend async: error -5
Jul 06 03:35:17 archlinux kernel: PM: Some devices failed to suspend, or early wake event detected
Jul 06 03:35:17 archlinux kernel: iwlwifi 0000:00:14.3: WRT: Invalid buffer destination
Jul 06 03:35:17 archlinux kernel: done.
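
For anyone reading the RPC history tables in a log like this one, here is a small sketch that parses a history row and recomputes the duration from the two hexadecimal timestamps. It assumes the row layout shown above (entry, function number, name, data0, data1, ts_start, ts_end) and that the timestamps are in microseconds, which is consistent with the printed `us` column. The names `parse_rpc_row` and `ROW_RE` are hypothetical. A `ts_end` of zero is treated as an RPC that never completed, as with entry 0 in the log above.

```python
import re

# One data row of the "RPC history" tables. The trailing duration and
# flag columns are ignored because the duration is recomputed below.
ROW_RE = re.compile(
    r"NVRM:\s+(-?\d+)\s+(\d+)\s+(\S+)\s+"
    r"0x([0-9a-fA-F]+)\s+0x([0-9a-fA-F]+)\s+"
    r"0x([0-9a-fA-F]+)\s+0x([0-9a-fA-F]+)"
)


def parse_rpc_row(line):
    """Parse one history row, or return None for any other log line."""
    match = ROW_RE.search(line)
    if match is None:
        return None
    ts_start = int(match.group(6), 16)
    ts_end = int(match.group(7), 16)
    return {
        "entry": int(match.group(1)),
        "function": int(match.group(2)),
        "name": match.group(3),
        "data0": int(match.group(4), 16),
        "data1": int(match.group(5), 16),
        # ts_end == 0 means no completion was recorded for this RPC.
        "duration_us": (ts_end - ts_start) if ts_end else None,
    }
```

Applied to the rows above, the `UNLOADING_GUEST_DRIVE` entry works out to 195080 microseconds, matching the printed value, while the final `GSP_RM_CONTROL` entry has no duration because its `ts_end` is zero.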

belegdol commented on July 17, 2024

I am also seeing this on an RTX 2070 with the proprietary 555.58.02 driver.
