nvidia 440xx Xid 61 breakpoint frequently hits

I am not sure exactly when this started, but since roughly three weeks ago I have frequently had to reboot my system after suddenly experiencing extreme mouse and keyboard lag (the mouse barely moves, and keystrokes take much longer than normal to register and get repeated on the command line). Unfortunately, I cannot outline a series of steps that reliably reproduces this on my system.

The problem happens most often while watching videos (at which point the video slows to a crawl while the audio stays fine).

After some careful digging with journalctl, I see these three lines as the last errors logged in each boot where I experience this:

kernel: NVRM: GPU at PCI:0000:03:00: GPU-a07257e8-bf8f-a3a9-9d79-b3c180f39e8c
kernel: NVRM: GPU Board Serial Number:
kernel: NVRM: Xid (PCI:0000:03:00): 61, pid=1275, 0cec(3098) 00000000 00000000
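
For anyone who wants to check their own logs for the same thing, here is a rough sketch of how such lines could be filtered out of the kernel journal. The function names xid_lines and xid_codes are made up for illustration, and the patterns only assume the line format shown above; other driver versions may log Xid events differently.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any package).

# Keep only the NVRM Xid lines from kernel log text read on stdin.
xid_lines() {
    grep -E 'NVRM: Xid \(PCI:[^)]*\): [0-9]+'
}

# Print just the Xid number from each such line (e.g. 61).
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Usage, assuming a systemd journal:
#   journalctl -k -b -1 | xid_lines    # kernel messages from the previous boot
#   journalctl -k | xid_codes | sort | uniq -c
```

The -b -1 form is useful here because the machine usually has to be rebooted before the log can be read comfortably.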

nvidia describes Xid 61 as:

Internal micro-controller breakpoint/warning
(newer drivers)

(according to: https://docs.nvidia.com/deploy/xid-errors/index.html)

I have the following relevant packages installed:

lib32-nvidia-440xx-utils          440.82-1                  multilib   101.0 MB
linux-latest-nvidia-440xx         5.6-2                     community  0 bytes
linux54-nvidia-440xx              440.82-13                 extra      14.8 MB
linux55-nvidia-440xx              440.82-6                  extra      14.8 MB
linux56-nvidia-440xx              440.82-14                 extra      14.8 MB
mhwd-nvidia-340xx                 340.108-1                 core       2.5 kB
mhwd-nvidia-390xx                 390.132-1                 core       1.9 kB
mhwd-nvidia-418xx                 418.113-1                 core       1.6 kB
mhwd-nvidia-430xx                 430.64-1.0                core       1.3 kB
mhwd-nvidia-435xx                 435.21-1.0                core       1.3 kB
mhwd-nvidia-440xx                 440.82-1                  core       1.4 kB
nvidia-440xx-utils                440.82-1                  extra      306.5 MB
opencl-nvidia-440xx               440.82-1                  extra      29.2 MB

I am currently using kernel 5.4, although I have experienced this in 5.6 as well (I stepped back to 5.4 in an attempt to step away from the problem, but it didn't help).

nvidia-smi provides the following:

Mon May 25 07:44:57 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:03:00.0  On |                  N/A |
|  0%   51C    P8    12W / 215W |    462MiB /  7979MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               

(Omitting the processes, as those aren't really that important to this problem, I suspect).

Is there a way to, I dunno, step back to some revision of the driver that won't do this, while still retaining all the speed, etc. that it had three weeks ago? Or at least get word to the appropriate developer that this is happening on some systems?

In the end, I addressed this by uninstalling the 440xx packages and installing the 410xx packages.

Since then, I haven't experienced the problem.

I should try to get a hold of someone at nvidia, I suspect, to help work through whatever may be wrong with the 440xx driver.

Grr.. no, this didn't actually work. It just took a little longer for the problem to manifest... but it did come back.

Ahh, okay, I'm dealing with a known (and very elusive) problem according to the nvidia forums.

If anyone else comes here, and you have a recent GeForce card, and a Ryzen 3000 series CPU, you might have this problem, too (Windows or Linux). I am going to try what I saw in the forums there, and prevent the card from changing power settings.

Heh, if I can figure out how.

Hi @trey.van.riper!

I'm experiencing the same issues with a Ryzen 3700X and a new EVGA 2060 KO.

Would you mind posting a link to the NVidia forum where you found these references? I was able to dig up this thread https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/304019/ryzen-conflicting-with-geforce-drivers/ however that seems a wee bit out of date.

Also, any luck getting things reconfigured?

Best,
John

In the developer's forum, there's a lot of discussion:

https://forums.developer.nvidia.com/search?q=xid%2061

Specifically, this thread helped me the most:

I created a 'lock-gpu.service' file and put it in /usr/lib/systemd/system/ to work around the problem:

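[Unit]
Description=Locks minimum GPU clock to 1620 MHz.

[Install]
WantedBy=multi-user.target

[Service]
Type=oneshot
RemainAfterExit=true
# Enable persistence mode so the driver stays loaded.
ExecStartPre=/usr/bin/nvidia-smi -pm 1
# Lock the GPU clocks to the range min,max (in MHz).
ExecStart=/usr/bin/nvidia-smi -lgc 1620,14000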
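
In case it helps anyone trying the same workaround, here is a minimal sketch of generating and enabling a unit like the one above. The helper name write_lock_unit is hypothetical, and the /etc/systemd/system/ destination in the usage comments is my assumption (it is the usual place for locally written units; the unit above was placed in /usr/lib/systemd/system/ instead). The clock values that suit a given card may differ from 1620.

```shell
#!/bin/sh
# Hypothetical helper: write a clock-locking unit to the path given as $3.
#   $1 = minimum GPU clock in MHz, $2 = maximum GPU clock in MHz
write_lock_unit() {
    min_mhz=$1
    max_mhz=$2
    dest=$3
    cat > "$dest" <<EOF
[Unit]
Description=Locks GPU clocks to the range ${min_mhz}-${max_mhz} MHz.

[Service]
Type=oneshot
RemainAfterExit=true
ExecStartPre=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -lgc ${min_mhz},${max_mhz}

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (as root; the destination path is an assumption, see above):
#   write_lock_unit 1620 14000 /etc/systemd/system/lock-gpu.service
#   systemctl daemon-reload
#   systemctl enable --now lock-gpu.service
#   nvidia-smi -q -d CLOCK    # inspect the current clocks afterwards
#   nvidia-smi -rgc           # reset the GPU clock lock if needed
```

Since the service is Type=oneshot with RemainAfterExit=true, systemctl status should keep showing it as active after the two nvidia-smi commands have run.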
