If going to sleep with high memory usage, OOM killer seems to crash graphics

My system crashes sometimes when coming out of sleep.

I could still connect via SSH, but the monitors don't work at all.

I investigated, and the problem seems to be that free memory drops below the minimum value allowed by the kernel:

Node 0 Normal free:346376kB min:346620kB low:408864kB high:471108kB reserved_highatomic:0KB active_anon:204kB inactive_anon:38003516kB active_file:156kB inactive_file:1060kB unevictable:632kB writepending:252kB >
Mai 20 08:58:12 I-KNOW-YOU.torgato.de kernel:

free is below min, which will freeze user space until this is fixed and start the OOM killer.

Full trace of the error:

Mai 20 08:58:12 XXX: kworker/u32:34: page allocation failure: order:0, mode:0x104c02(GFP_NOIO|__GFP_HIGHMEM|__GFP_RETRY_MAYFAIL|__GFP_HARDWALL), nodemask=(null),cpuset=/,mems_allowed=0
Mai 20 08:58:12 XXX: CPU: 0 PID: 19898 Comm: kworker/u32:34 Tainted: P           OE     5.6.11-1-MANJARO #1
Mai 20 08:58:12 XXX: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Professional Gaming, BIOS P3.30 01/15/2018
Mai 20 08:58:12 XXX: Workqueue: events_unbound async_run_entry_fn
Mai 20 08:58:12 XXX: Call Trace:
Mai 20 08:58:12 XXX:  dump_stack+0x66/0x90
Mai 20 08:58:12 XXX:  warn_alloc.cold+0x78/0xdc
Mai 20 08:58:12 XXX:  __alloc_pages_slowpath+0xd91/0xdd0
Mai 20 08:58:12 XXX:  ? do_flush_tlb_all+0x16/0x30
Mai 20 08:58:12 XXX:  __alloc_pages_nodemask+0x2cd/0x320
Mai 20 08:58:12 XXX:  ttm_alloc_new_pages.isra.0+0x90/0x1e0 [ttm]
Mai 20 08:58:12 XXX:  ttm_page_pool_get_pages+0x16d/0x3b0 [ttm]
Mai 20 08:58:12 XXX:  ttm_pool_populate+0x1b2/0x450 [ttm]
Mai 20 08:58:12 XXX:  ttm_populate_and_map_pages+0x24/0x230 [ttm]
Mai 20 08:58:12 XXX:  ttm_tt_populate.part.0+0x1e/0x60 [ttm]
Mai 20 08:58:12 XXX:  ttm_tt_bind+0x48/0x60 [ttm]
Mai 20 08:58:12 XXX:  ttm_bo_handle_move_mem+0x29c/0x5a0 [ttm]
Mai 20 08:58:12 XXX:  ttm_bo_evict+0x189/0x200 [ttm]
Mai 20 08:58:12 XXX:  ? common_interrupt+0xa/0xf
Mai 20 08:58:12 XXX:  ttm_mem_evict_first+0x29c/0x3c0 [ttm]
Mai 20 08:58:12 XXX:  ttm_bo_force_list_clean+0x9e/0x1d0 [ttm]
Mai 20 08:58:12 XXX:  amdgpu_device_suspend+0x205/0x2e0 [amdgpu]
Mai 20 08:58:12 XXX:  pci_pm_suspend+0x74/0x160
Mai 20 08:58:12 XXX:  ? pci_pm_freeze+0xb0/0xb0
Mai 20 08:58:12 XXX:  dpm_run_callback+0x4f/0x180
Mai 20 08:58:12 XXX:  __device_suspend+0x121/0x4e0
Mai 20 08:58:12 XXX:  async_suspend+0x1b/0x90
Mai 20 08:58:12 XXX:  async_run_entry_fn+0x37/0x140
Mai 20 08:58:12 XXX:  process_one_work+0x1da/0x3d0
Mai 20 08:58:12 XXX:  worker_thread+0x4a/0x3d0
Mai 20 08:58:12 XXX:  kthread+0xfb/0x130
Mai 20 08:58:12 XXX:  ? process_one_work+0x3d0/0x3d0
Mai 20 08:58:12 XXX:  ? kthread_park+0x90/0x90
Mai 20 08:58:12 XXX:  ret_from_fork+0x22/0x40
Mai 20 08:58:12 XXX: Mem-Info:
Mai 20 08:58:12 XXX: active_anon:51 inactive_anon:9996810 isolated_anon:32
                                               active_file:39 inactive_file:267 isolated_file:0
                                               unevictable:164 dirty:61 writeback:2 unstable:0
                                               slab_reclaimable:61183 slab_unreclaimable:323204
                                               mapped:987253 shmem:491388 pagetables:54193 bounce:0
                                               free:152192 free_pcp:172 free_cma:0
Mai 20 08:58:12 XXX kernel: Node 0 active_anon:204kB inactive_anon:39987240kB active_file:156kB inactive_file:1068kB unevictable:656kB isolated(anon):128kB isolated(file):0kB mapped:3949012kB dirty:244kB writeback:8kB shmem:1965552kB shmem>
Mai 20 08:58:12 XXX kernel: Node 0 DMA free:15876kB min:16kB low:28kB high:40kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15892kB ml>
Mai 20 08:58:12 XXX kernel: lowmem_reserve[]: 0 3432 64218 64218 64218
Mai 20 08:58:12 XXX kernel: Node 0 DMA32 free:246516kB min:3608kB low:7120kB high:10632kB reserved_highatomic:0KB active_anon:0kB inactive_anon:1983724kB active_file:0kB inactive_file:8kB unevictable:24kB writepending:0kB present:3597696kB>
Mai 20 08:58:12 XXX kernel: lowmem_reserve[]: 0 0 60785 60785 60785
Mai 20 08:58:12 XXX kernel: Node 0 Normal free:346376kB min:346620kB low:408864kB high:471108kB reserved_highatomic:0KB active_anon:204kB inactive_anon:38003516kB active_file:156kB inactive_file:1060kB unevictable:632kB writepending:252kB >
Mai 20 08:58:12 XXX kernel: lowmem_reserve[]: 0 0 0 0 0
Mai 20 08:58:12 XXX kernel: Node 0 DMA: 1*4kB (U) 2*8kB (U) 1*16kB (U) 1*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15876kB
Mai 20 08:58:12 XXX kernel: Node 0 DMA32: 6579*4kB (ME) 1371*8kB (UME) 793*16kB (UME) 284*32kB (ME) 205*64kB (UME) 158*128kB (ME) 66*256kB (ME) 46*512kB (ME) 97*1024kB (UM) 7*2048kB (M) 0*4096kB = 246516kB
Mai 20 08:58:12 XXX kernel: Node 0 Normal: 86524*4kB (M) 11*8kB (M) 8*16kB (UM) 2*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 346376kB
Mai 20 08:58:12 XXX kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mai 20 08:58:12 XXX kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mai 20 08:58:12 XXX kernel: 491695 total pagecache pages
Mai 20 08:58:12 XXX kernel: 0 pages in swap cache
Mai 20 08:58:12 XXX kernel: Swap cache stats: add 34978, delete 34978, find 2936/3814
Mai 20 08:58:12 XXX kernel: Free swap  = 404948kB
Mai 20 08:58:12 XXX kernel: Total swap = 524284kB
Mai 20 08:58:12 XXX kernel: 16759935 pages RAM
Mai 20 08:58:12 XXX kernel: 0 pages HighMem/MovableOnly
Mai 20 08:58:12 XXX kernel: 293548 pages reserved
Mai 20 08:58:12 XXX kernel: 0 pages hwpoisoned
Mai 20 08:58:12 XXX kernel: [TTM] Buffer eviction failed
Mai 20 08:58:12 XXX kernel: Move buffer fallback to memcpy unavailable
Mai 20 08:58:12 XXX kernel: [TTM] Buffer eviction failed

This is followed by the graphics card crashing, I assume:

Mai 20 08:58:12 XXX kernel: ACPI: Waking up from system sleep state S3
Mai 20 08:58:12 XXX kernel: iommu ivhd0: AMD-Vi: Event logged [INVALID_DEVICE_REQUEST device=00:00.0 pasid=0x00000 address=0xfffffffdf8000000 flags=0x0a00]
Mai 20 08:58:12 XXX kernel: usb usb1: root hub lost power or was reset
Mai 20 08:58:12 XXX kernel: usb usb2: root hub lost power or was reset
Mai 20 08:58:12 XXX kernel: sd 7:0:0:0: [sde] Starting disk
Mai 20 08:58:12 XXX kernel: sd 4:0:0:0: [sdc] Starting disk
Mai 20 08:58:12 XXX kernel: sd 0:0:0:0: [sda] Starting disk
Mai 20 08:58:12 XXX kernel: sd 2:0:0:0: [sdb] Starting disk
Mai 20 08:58:12 XXX kernel: sd 5:0:0:0: [sdd] Starting disk
Mai 20 08:58:12 XXX kernel: serial 00:04: activated
Mai 20 08:58:12 XXX kernel: ata11: SATA link down (SStatus 0 SControl 300)
Mai 20 08:58:12 XXX kernel: ata9: SATA link down (SStatus 0 SControl 300)
Mai 20 08:58:12 XXX kernel: ata7: SATA link down (SStatus 0 SControl 300)
Mai 20 08:58:12 XXX kernel: ata10: SATA link down (SStatus 0 SControl 300)
Mai 20 08:58:12 XXX kernel: ata2: SATA link down (SStatus 0 SControl 300)
Mai 20 08:58:12 XXX kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mai 20 08:58:12 XXX kernel: usb 1-11: reset full-speed USB device number 4 using xhci_hcd
Mai 20 08:58:12 XXX kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Mai 20 08:58:12 XXX kernel: ata4.00: configured for UDMA/100
Mai 20 08:58:12 XXX kernel: nvme nvme0: 7/0/0 default/read/poll queues
Mai 20 08:58:12 XXX kernel: amdgpu 0000:0e:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Mai 20 08:58:12 XXX kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
Mai 20 08:58:12 XXX kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
Mai 20 08:58:12 XXX kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x80 returns -110
Mai 20 08:58:12 XXX kernel: PM: Device 0000:0e:00.0 failed to resume async: error -110
Mai 20 08:58:12 XXX kernel: usb 1-9: reset full-speed USB device number 3 using xhci_hcd
Mai 20 08:58:12 XXX kernel: OOM killer enabled.
Mai 20 08:58:12 XXX kernel: Restarting tasks ... done.
Mai 20 08:58:12 XXX kernel: thermal thermal_zone0: failed to read out thermal zone (-61)
Mai 20 08:58:12 XXX kernel: PM: suspend exit

I have 64 GB of memory, but I also run ZFS with these settings:

options zfs zfs_arc_min=4000000000
options zfs zfs_arc_max=40000000000
options zfs zfs_arc_sys_free=8000000000

Meaning it should use a minimum of 4 GB and a maximum of 40 GB for its ARC file cache, but leave 8 GB of memory free.
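To double-check how those byte values translate into sizes, here is a minimal Python sketch. The helper name `parse_zfs_options` and the constant `MODPROBE_LINES` are made up for illustration; the three option lines are copied from this post, and the sketch assumes the values are plain byte counts.

```python
# Module options copied from the post above.
MODPROBE_LINES = """\
options zfs zfs_arc_min=4000000000
options zfs zfs_arc_max=40000000000
options zfs zfs_arc_sys_free=8000000000
"""

def parse_zfs_options(text):
    """Map each zfs module parameter name to its integer byte value."""
    params = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: options zfs <name>=<bytes>
        if len(parts) == 3 and parts[:2] == ["options", "zfs"]:
            name, _, value = parts[2].partition("=")
            params[name] = int(value)
    return params

params = parse_zfs_options(MODPROBE_LINES)
for name, value in params.items():
    # Show both decimal gigabytes and binary gibibytes.
    print(f"{name}: {value / 10**9:.1f} GB = {value / 2**30:.2f} GiB")
```

In binary units the three values come out at roughly 3.73, 37.25 and 7.45 GiB, so the limits are slightly lower than the round decimal numbers suggest.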

I will start to play around with these parameters, but I wonder whether this is a bug of some sort or whether my system is otherwise misconfigured.

Is there any way I can recover from this without restarting? Like re-initializing the graphics card?

Sleep, suspend and hibernate are a pest - you never know when the memory written to disk becomes corrupted.

I usually avoid it - simply because you cannot trust it.

Thanks for your opinion. I would love to discuss my disagreement with you on that, but I do not want to distract from my issue at hand.

I don't wanna discuss personal preferences - that was not my intention - I was merely stating the typical cause - memory corruption.

As an example - until recently it was difficult to get btrfs to work with sleep or hibernation - and since you mention zfs - it is likely that something related to your filesystem is causing the memory corruption.

Do you have the necessary swap space?

On my desktop system - if I was using such functionality I would require 64G + the 5G for the graphics in allocated swap space.

Since you mention 64G of RAM, you would need a swap allocation of at least 64G plus your graphics card memory.
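Taking that rule of thumb at face value, the comparison can be sketched in Python with the figures from the log in the first post. The 4 kB page size is an assumption (the usual x86-64 default, matching the smallest block size in the log), and this only matters for hibernation, not for suspend-to-RAM.

```python
PAGE_SIZE_KB = 4  # assumption: 4 kB pages, the usual x86-64 default

ram_pages = 16759935    # "16759935 pages RAM" from the log
total_swap_kb = 524284  # "Total swap = 524284kB" from the log

ram_kb = ram_pages * PAGE_SIZE_KB

# Convert kB to GiB (1 GiB = 2**20 kB) for readability.
print(f"RAM:  {ram_kb / 2**20:.1f} GiB")
print(f"Swap: {total_swap_kb / 2**20:.2f} GiB")
print(f"Swap covers {100 * total_swap_kb / ram_kb:.1f}% of RAM")
```

With these numbers the configured swap is about half a GiB, which is under 1% of RAM.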

Ok, I guess I misunderstood.

I am not hibernating. (I have done that before, and it worked great apart from the loooooong time it took and the wear on my NVMes. But since I got my encrypted NVMes working with sleep, I now use sleep instead.)

I'm merely sleeping (S3), so memory should stay in RAM. Disk and swap should not matter much, IMHO. When my memory is not at the limit, everything works fine. Only when there is little unused memory does something seem to violate the minimum free memory the kernel wants to keep, and the graphics crash.

I doubt, but certainly can't prove, that memory corruption is involved here.

EDIT: and yes, ZFS is the most likely culprit as it somewhat circumvents kernel memory already.

Looking more closely now, it seems the graphics card is the one trying to allocate memory beyond the amount allowed by the kernel. So it's natural that the graphics crash when this is denied.

Question is, why is this happening?

Just to be clear. As I understand the situation:

There is still 346376kB of free memory when the graphics card tries to allocate. But the minimum free memory the kernel demands is 346620kB, so the kernel refuses the allocation.
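The comparison above can be reproduced with a small Python sketch that pulls the watermarks out of the "Node 0 Normal" line. The names `ZONE_LINE` and `parse_zone_watermarks` are hypothetical, and the line is shortened to the fields that matter here.

```python
import re

# Zone line copied from the log (remaining fields omitted).
ZONE_LINE = "Node 0 Normal free:346376kB min:346620kB low:408864kB high:471108kB"

def parse_zone_watermarks(line):
    """Return the free/min/low/high values (in kB) found in a zone line."""
    fields = dict(re.findall(r"\b(free|min|low|high):(\d+)kB", line))
    return {name: int(value) for name, value in fields.items()}

marks = parse_zone_watermarks(ZONE_LINE)

# How far free memory sits below the zone's min watermark.
deficit_kb = marks["min"] - marks["free"]
print(f"free={marks['free']} kB, min={marks['min']} kB, deficit={deficit_kb} kB")
```

So the zone is only 244 kB short of its min watermark at the moment the allocation fails.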

So if I get it right, the Graphics card tries to allocate from user space.

Seems like some kind of race condition or order-of-execution problem when returning from sleep.

You are way out of my league 🙂 when it comes to this kind of troubleshooting 👍
