Vega overheating on Desktop

With all updates installed and the machine just sitting on the desktop (Xfce, stable branch, 4.15 kernel), my PC draws 30 W more than under Windows, and it all seems to come from the Vega card. It becomes very hot and eventually crashes the computer. The Vega is used as a secondary card; my main card, which drives the display, is a 1080 Ti. The Vega also has a display connected (the same display, because programs do not recognise the card if nothing is connected to it).

That's a pretty big issue, because the fan doesn't speed up: it stays at the speed it would have if the card were in power-save mode, even though the overheating is obvious (I can't keep my hand on it).

It would be great if a quick fix could prevent it from burning out. It seems the kernel just freezes and lets the GPU overheat further.

The quickest fix: disconnect it.

What follows is supposition as I don't have the card.

You could check which driver the card is using, e.g. inxi -G. If it's the wrong driver it's possible the sensor data is wrong or it doesn't know which fans are available where. inxi -Fxxxz will provide useful information for others looking at the thread.
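
If inxi does not list a driver for the card, the kernel also exposes the bound driver in sysfs. A minimal sketch, assuming the standard /sys/bus/pci/devices layout; the bus ID below is only a placeholder (use the one inxi or lspci reports, prefixed with the PCI domain, usually 0000), and the function name is made up for illustration:

```python
import os

def bound_driver(bus_id, sysfs_root="/sys"):
    """Return the name of the kernel driver bound to a PCI device, or None."""
    link = os.path.join(sysfs_root, "bus", "pci", "devices", bus_id, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or no such device
    # "driver" is a symlink to the driver's directory, e.g. .../drivers/amdgpu
    return os.path.basename(os.readlink(link))

if __name__ == "__main__":
    # Placeholder bus ID (assumption); replace with the card's real one.
    print(bound_driver("0000:04:00.0"))
```

If this prints None, no kernel driver has claimed the card at all, which would be worth mentioning in the thread.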

Check your TLP settings in /etc/default/tlp to see what power profile is set on the card; the current Manjaro defaults are:

# Radeon graphics clock speed (profile method): low, mid, high, auto, default;
# auto = mid on BAT, high on AC; default = use hardware defaults.

# Radeon dynamic power management method (DPM): battery, performance.

# Radeon DPM performance level: auto, low, high; auto is recommended.

I had high power draw and temperature with my Studio 1749 (HD5650) with the upstream default of RADEON_POWER_PROFILE_ON_AC=high; however, that uses the radeon driver, not amdgpu, though both share a common power management interface.
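
To see what the card is actually doing, rather than what TLP is configured to request, the values can be read back from sysfs. A rough sketch, assuming the usual amdgpu layout (power_dpm_force_performance_level in the device directory; temp1_input, fan1_input, pwm1 and pwm1_enable in its hwmon subdirectory). Not every file exists on every kernel, card1 is a guess at the card index, and the helper names are my own:

```python
import glob
import os

def read_value(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

def gpu_status(device_dir):
    """Collect power level, temperature and fan state for one DRM device."""
    status = {"perf_level": read_value(
        os.path.join(device_dir, "power_dpm_force_performance_level"))}
    # Sensors live in a hwmon subdirectory whose number can change per boot.
    for hwmon in glob.glob(os.path.join(device_dir, "hwmon", "hwmon*")):
        temp = read_value(os.path.join(hwmon, "temp1_input"))
        # temp1_input is reported in millidegrees Celsius.
        status["temp_c"] = int(temp) / 1000 if temp else None
        status["fan_rpm"] = read_value(os.path.join(hwmon, "fan1_input"))
        status["pwm"] = read_value(os.path.join(hwmon, "pwm1"))  # 0 to 255
        status["pwm_mode"] = read_value(os.path.join(hwmon, "pwm1_enable"))
    return status

if __name__ == "__main__":
    # card1 is an assumption; look in /sys/class/drm to find the Vega.
    print(gpu_status("/sys/class/drm/card1/device"))
```

A rising temp_c with a pwm value that never moves would confirm that the fan is not following the temperature.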

Result of inxi -G:

Resuming in non X mode: xdpyinfo not found. For package install advice run: inxi --recommends
Graphics:  Card-1: Intel HD Graphics 530
           Card-2: NVIDIA GP102 [GeForce GTX 1080 Ti]
           Card-3: Advanced Micro Devices [AMD/ATI] Vega 10 XT [Radeon RX Vega 64]
           Display Server: N/A driver: nvidia tty size: 80x24

and the other one inxi -Fxxxz :

Resuming in non X mode: xdpyinfo not found. For package install advice run: inxi --recommends
System:    Host: test-pc Kernel: 4.15.0-1-MANJARO x86_64 bits: 64 gcc: 7.2.1
           Desktop: N/A info: xfce4-panel dm: lightdm Distro: Manjaro Linux
Machine:   Device: desktop Mobo: MSI model: Z170A GAMING PRO CARBON (MS-7A12) v: 1.0 serial: N/A
           UEFI [Legacy]: American Megatrends v: 1.30 date: 05/12/2016
CPU:       Quad core Intel Core i7-6700K (-MT-MCP-) arch: Skylake-S rev.3 cache: 8192 KB
           flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 32076
           clock speeds: min/max: 800/4200 MHz 1: 1292 MHz 2: 3729 MHz 3: 3739 MHz 4: 3397 MHz 5: 3742 MHz
           6: 3740 MHz 7: 3862 MHz 8: 3697 MHz
Graphics:  Card-1: Intel HD Graphics 530 bus-ID: 00:02.0 chip-ID: 8086:1912
           Card-2: NVIDIA GP102 [GeForce GTX 1080 Ti] bus-ID: 01:00.0 chip-ID: 10de:1b06
           Card-3: Advanced Micro Devices [AMD/ATI] Vega 10 XT [Radeon RX Vega 64]
           bus-ID: 04:00.0 chip-ID: 1002:687f
           Display Server: N/A driver: nvidia tty size: 109x24
Audio:     Card-1 Advanced Micro Devices [AMD/ATI] Device aaf8
           driver: snd_hda_intel bus-ID: 04:00.1 chip-ID: 1002:aaf8
           Card-2 NVIDIA GP102 HDMI Audio Controller
           driver: snd_hda_intel bus-ID: 01:00.1 chip-ID: 10de:10ef
           Card-3 Intel Sunrise Point-H HD Audio driver: snd_hda_intel bus-ID: 00:1f.3 chip-ID: 8086:a170
           Sound: Advanced Linux Sound Architecture v: k4.15.0-1-MANJARO
Sensors:   System Temperatures: cpu: 41.5C mobo: 29.8C gpu: 0.0:53C
           Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 214 Uptime: 1:09 Memory: 996.0/64316.0MB Init: systemd v: 236 Gcc sys: 7.2.1 alt: 6
           Client: Shell (bash 4.4.121 running in xfce4-terminal) inxi: 2.3.56 

The power profiles are all set as in your reply.

Hmm... it only shows the nvidia driver which isn't useful when we're talking about the AMD card...

lsmod will give a full list of loaded driver modules.
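
lsmod just formats /proc/modules, so the same check can be scripted. A small sketch that filters the list down to the common GPU drivers; the set of names is my own pick, not an exhaustive list:

```python
def loaded_modules(path="/proc/modules"):
    """Return the set of loaded kernel module names (what lsmod lists)."""
    try:
        with open(path) as f:
            # Each line starts with the module name, then size, use count, ...
            return {line.split()[0] for line in f if line.strip()}
    except OSError:
        return set()  # file missing or unreadable

if __name__ == "__main__":
    # Driver names chosen for illustration (assumption).
    gpu_drivers = {"amdgpu", "radeon", "nvidia", "nouveau", "i915"}
    print(sorted(gpu_drivers & loaded_modules()))
```

Seeing both amdgpu and radeon in the output would be worth mentioning, since only one of them should be driving the Vega.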

The other suggestion is to switch to the testing or unstable branch to see if a newer kernel introduces any fixes - IIRC Vega support was new with kernel 4.15, so 4.15.0 will be pretty "raw" still.

I only have AMDGPU and lsmod also reports amdgpu, nothing fancy installed, only the defaults regarding drivers.

Maybe there is a way to report the bug to upstream/AMD?

I'm sure you already know, so just as a reminder: the behavior you describe can quickly end the life of your GPU or even your motherboard.
I know from personal experience; I lost a PCI GPU way back in the day. :grinning:

Disconnecting the card is not enough; I have to completely remove it from the PC... Anyway, better to report it upstream as fast as possible, before someone else burns their card?

Kernel 4.14 draws as much power or more, but at least it lets the fan run at full speed all the time. So you lose your ears, but at least not your GPU :wink:
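
Until a fixed kernel lands, pinning the fan at full speed by hand might be possible through the hwmon interface. A rough sketch, assuming the amdgpu hwmon files pwm1_enable (1 = manual, 2 = automatic) and pwm1 (0 to 255) exist and are writable as root. I have not tried this on a Vega, the function names are made up, and it will not help once the kernel has already frozen:

```python
import os

def force_fan_max(hwmon_dir):
    """Switch the fan to manual control and set it to full duty cycle."""
    # Order matters: enable manual mode first, then write the PWM value.
    with open(os.path.join(hwmon_dir, "pwm1_enable"), "w") as f:
        f.write("1")    # 1 = manual fan control
    with open(os.path.join(hwmon_dir, "pwm1"), "w") as f:
        f.write("255")  # 255 = 100% duty cycle

def restore_auto_fan(hwmon_dir):
    """Hand fan control back to the driver."""
    with open(os.path.join(hwmon_dir, "pwm1_enable"), "w") as f:
        f.write("2")    # 2 = automatic fan control
```

Run it as root and pass the card's real hwmon directory (something under /sys/class/drm/cardN/device/hwmon/); call restore_auto_fan with the same path to undo it.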
