Network stops working after a while, sometimes bringing SATA (and other stuff) down with it

I'm posting here because I don't know what else to try, and I have lost track of everything I've already tried.

The main observable problem is that if I leave my computer (a desktop with all power-saving options turned off) running downloads for a while, the network is sometimes down when I come back. Restarting the network manager does not bring it back; the only way to get it working again is to reboot.

Relevant lspci for the network card:

07:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
        Subsystem: ASUSTeK Computer Inc. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
        Flags: bus master, fast devsel, latency 0, IRQ 37
        I/O ports at f000 [size=256]
        Memory at fcc04000 (64-bit, non-prefetchable) [size=4K]
        Memory at fcc00000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [70] Express Endpoint, MSI 01
        Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Virtual Channel
        Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
        Capabilities: [170] Latency Tolerance Reporting
        Capabilities: [178] L1 PM Substates
        Kernel driver in use: r8169
        Kernel modules: r8169

I tried using the "hardware configuration" software to install "network-r8168", but the problem happens with that driver as well.

I'm currently running kernel 5.5.2-1 but I have tried 5.4.18-1 and 5.3.18-1 as well.

Sometimes, an unknown amount of time after the network drops, more of the system seems to go offline. I don't know whether this is related to the network issue.

When the network goes down, nothing different shows up in dmesg: no error messages, no attempts to reconnect, nothing. It just stops pinging locally. Other devices on the same network work fine.

When more parts of the system go down, my btrfs filesystem usually goes read-only and, sometimes, the entire system starts freezing for 2-3 seconds and then running for 4-8 seconds, like a cycle of hiccups.

When more of the system goes down I get way more stuff on dmesg. I don't know what of it is relevant, so I'm posting the first stuff that shows up that is not just the usual audit messages:

[ 3611.548615] audit: type=1130 audit(1582413211.591:127): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 3621.799999] audit: type=1131 audit(1582413221.845:128): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 3930.215468] pcieport 0000:02:06.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 3930.215504] pcieport 0000:02:05.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 3930.215519] pcieport 0000:02:04.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 3930.215534] pcieport 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 3930.508779] enp7s0: cmd = 0xff, should be 0x07 
               .
[ 3930.508785] enp7s0: io_base_l = 0xffff, should be 0xf001 
               .
[ 3930.508789] enp7s0: mem_base_l = 0xffff, should be 0x4004 
               .
[ 3930.508793] enp7s0: mem_base_h = 0xffff, should be 0xfcc0 
               .
[ 3930.508797] enp7s0: resv_0x1c_l = 0xffff, should be 0x0000 
               .
[ 3930.508800] enp7s0: resv_0x1c_h = 0xffff, should be 0x0000 
               .
[ 3930.508804] enp7s0: resv_0x20_l = 0xffff, should be 0x0004 
               .
[ 3930.508807] enp7s0: resv_0x20_h = 0xffff, should be 0xfcc0 
               .
[ 3930.508811] enp7s0: resv_0x24_l = 0xffff, should be 0x0000 
               .
[ 3930.508815] enp7s0: resv_0x24_h = 0xffff, should be 0x0000
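
One thing that stands out in the lines above is that every value read back is all ones (0xff, 0xffff). As far as I know, that is what a read usually returns when a PCIe device has stopped responding, which would fit the "config space inaccessible" messages just before. A minimal Python sketch for checking this over a saved dmesg dump follows; the regular expression is modelled only on the lines quoted here, and the function name is made up for illustration:

```python
import re

# Pattern modelled on the driver lines quoted above, e.g.
# "[ 3930.508785] enp7s0: io_base_l = 0xffff, should be 0xf001"
MISMATCH = re.compile(
    r"\]\s+(?P<dev>\S+): (?P<reg>\w+) = 0x(?P<got>[0-9a-fA-F]+), "
    r"should be 0x(?P<want>[0-9a-fA-F]+)"
)


def all_ones_reads(dmesg_text):
    """Return (register, expected) for every mismatch whose read value is all 0xf digits."""
    hits = []
    for m in MISMATCH.finditer(dmesg_text):
        if set(m.group("got").lower()) == {"f"}:
            hits.append((m.group("reg"), m.group("want")))
    return hits


sample = """\
[ 3930.508779] enp7s0: cmd = 0xff, should be 0x07
[ 3930.508785] enp7s0: io_base_l = 0xffff, should be 0xf001
[ 3930.508789] enp7s0: mem_base_l = 0xffff, should be 0x4004
"""
print(all_ones_reads(sample))
```

If every mismatch in a dump turns out to be an all-ones read, the registers were probably not corrupted one by one; the device itself was most likely unreachable at that moment.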

I also got this:

[ 3930.632849] ------------[ cut here ]------------
[ 3930.632861] WARNING: CPU: 7 PID: 63 at /storage/manjaro/makepkg/linux55-r8168/src/r8168-8.048.00/src/r8168_n.c:6843 rtl8168_wait_phy_ups_resume+0x52/0x60 [r8168]
[ 3930.632862] Modules linked in: snd_seq_dummy snd_seq snd_seq_device fuse squashfs loop mousedev joydev input_leds edac_mce_amd nls_iso8859_1 nls_cp437 vfat fat ccp rng_core amdgpu kvm irqbypass btrfs blake2b_generic xor gpu_sched i2c_algo_bit ttm snd_hda_codec_realtek snd_hda_codec_generic drm_kms_helper ledtrig_audio snd_hda_codec_hdmi drm snd_hda_intel snd_intel_dspcfg snd_hda_codec eeepc_wmi snd_hda_core asus_wmi agpgart battery snd_hwdep crct10dif_pclmul syscopyarea crc32_pclmul sparse_keymap sysfillrect ghash_clmulni_intel snd_pcm rfkill wmi_bmof snd_timer raid6_pq snd aesni_intel sp5100_tco libcrc32c crypto_simd sysimgblt cryptd r8168(OE) glue_helper fb_sys_fops soundcore i2c_piix4 k10temp pcspkr wmi gpio_amdpt pinctrl_amd evdev mac_hid acpi_cpufreq uinput crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 sd_mod hid_microsoft ff_memless hid_generic usbhid hid ahci libahci crc32c_intel libata xhci_pci xhci_hcd scsi_mod
[ 3930.632911] CPU: 7 PID: 63 Comm: ksoftirqd/7 Tainted: G        W  OE     5.5.2-1-MANJARO #1
[ 3930.632913] Hardware name: System manufacturer System Product Name/PRIME B450M-GAMING/BR, BIOS 2006 11/13/2019
[ 3930.632919] RIP: 0010:rtl8168_wait_phy_ups_resume+0x52/0x60 [r8168]
[ 3930.632923] Code: a4 ff ff ff bf 58 89 41 00 89 c3 e8 38 68 a9 d4 83 e3 07 66 44 39 eb 74 05 83 fd 63 7e d1 83 fd 64 74 07 5b 5d 41 5c 41 5d c3 <0f> 0b 5b 5d 41 5c 41 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 0f b6 87
[ 3930.632925] RSP: 0018:ffffa12fc03c3d28 EFLAGS: 00010046
[ 3930.632928] RAX: 00000d4532da1d05 RBX: 0000000000000007 RCX: 0000000000000007
[ 3930.632929] RDX: 0000000000386412 RSI: 00000d4532a1b8f3 RDI: 0000000000385ae1
[ 3930.632931] RBP: 0000000000000064 R08: 00000000ffffffff R09: 0000000000000000
[ 3930.632932] R10: 0000000000000002 R11: 00000000000000f0 R12: ffff91a5791a88c0
[ 3930.632933] R13: 0000000000000003 R14: ffff91a5791a8b18 R15: ffff91a5791a88c0
[ 3930.632936] FS:  0000000000000000(0000) GS:ffff91a5909c0000(0000) knlGS:0000000000000000
[ 3930.632937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3930.632939] CR2: 0000557811de9712 CR3: 00000002ab924000 CR4: 00000000003406e0
[ 3930.632940] Call Trace:
[ 3930.632952]  rtl8168_esd_timer.cold+0x1db/0x3a9 [r8168]
[ 3930.632963]  ? rtl8168_open+0x430/0x430 [r8168]
[ 3930.632967]  call_timer_fn+0x2d/0x160
[ 3930.632971]  run_timer_softirq+0x1ad/0x510
[ 3930.632977]  ? rtl8168_open+0x430/0x430 [r8168]
[ 3930.632983]  __do_softirq+0x111/0x34d
[ 3930.632989]  run_ksoftirqd+0x32/0x40
[ 3930.632992]  smpboot_thread_fn+0x19a/0x230
[ 3930.632996]  kthread+0xfb/0x130
[ 3930.632998]  ? sort_range+0x20/0x20
[ 3930.633000]  ? kthread_park+0x90/0x90
[ 3930.633004]  ret_from_fork+0x22/0x40
[ 3930.633009] ---[ end trace 7894c6017069a6f9 ]---

And then the system comes crashing down:

[ 3959.949053] enp7s0: resv_0x2c_l = 0xffff, should be 0x1043 
               .
[ 3959.949057] enp7s0: resv_0x2c_h = 0xffff, should be 0x8677 
               .
[ 3959.949184] enp7s0: pci_sn_l = 0xffffffff, should be 0x684ce000 
               .
[ 3959.950348] enp7s0: pci_sn_h = 0xffffffff, should be 0x01000000 
               .
[ 3959.951388] enp7s0: esd_flag = 0x7fff
               .
[ 3962.788040] r8168: enp7s0: link up
[ 3962.806956] ata5.00: exception Emask 0x52 SAct 0x80fff841 SErr 0xffffffff action 0x6 frozen
[ 3962.806962] ata5: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
[ 3962.806967] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.806976] ata5.00: cmd 61/00:00:80:9d:43/0a:00:38:00:00/40 tag 0 ncq dma 1310720 ou
                        res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.806979] ata5.00: status: { DRDY }
[ 3962.806983] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.806993] ata5.00: cmd 61/00:30:80:07:43/0a:00:38:00:00/40 tag 6 ncq dma 1310720 ou
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.806995] ata5.00: status: { DRDY }
[ 3962.806998] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807005] ata5.00: cmd 61/00:58:80:11:43/0a:00:38:00:00/40 tag 11 ncq dma 1310720 ou
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807006] ata5.00: status: { DRDY }
[ 3962.807008] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807014] ata5.00: cmd 61/00:60:80:1b:43/0a:00:38:00:00/40 tag 12 ncq dma 1310720 ou
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807016] ata5.00: status: { DRDY }
[ 3962.807018] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807024] ata5.00: cmd 61/00:68:80:25:43/0a:00:38:00:00/40 tag 13 ncq dma 1310720 ou
                        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807026] ata5.00: status: { DRDY }
[ 3962.807028] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807034] ata5.00: cmd 61/00:70:80:2f:43/0a:00:38:00:00/40 tag 14 ncq dma 1310720 ou
                        res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807035] ata5.00: status: { DRDY }
[ 3962.807037] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807043] ata5.00: cmd 61/00:78:80:39:43/0a:00:38:00:00/40 tag 15 ncq dma 1310720 ou
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807044] ata5.00: status: { DRDY }
[ 3962.807046] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807052] ata5.00: cmd 61/00:80:80:43:43/0a:00:38:00:00/40 tag 16 ncq dma 1310720 ou
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807054] ata5.00: status: { DRDY }
[ 3962.807055] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807061] ata5.00: cmd 61/00:88:80:4d:43/0a:00:38:00:00/40 tag 17 ncq dma 1310720 ou
                        res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807063] ata5.00: status: { DRDY }
[ 3962.807064] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807070] ata5.00: cmd 61/00:90:80:57:43/0a:00:38:00:00/40 tag 18 ncq dma 1310720 ou
                        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807071] ata5.00: status: { DRDY }
[ 3962.807073] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807079] ata5.00: cmd 61/00:98:80:61:43/0a:00:38:00:00/40 tag 19 ncq dma 1310720 ou
                        res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807080] ata5.00: status: { DRDY }
[ 3962.807082] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807087] ata5.00: cmd 61/00:a0:80:6b:43/0a:00:38:00:00/40 tag 20 ncq dma 1310720 ou
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807089] ata5.00: status: { DRDY }
[ 3962.807091] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807096] ata5.00: cmd 61/00:a8:80:75:43/0a:00:38:00:00/40 tag 21 ncq dma 1310720 ou
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807098] ata5.00: status: { DRDY }
[ 3962.807100] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807105] ata5.00: cmd 61/00:b0:80:7f:43/0a:00:38:00:00/40 tag 22 ncq dma 1310720 ou
                        res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807107] ata5.00: status: { DRDY }
[ 3962.807108] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807114] ata5.00: cmd 61/00:b8:80:89:43/0a:00:38:00:00/40 tag 23 ncq dma 1310720 ou
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807115] ata5.00: status: { DRDY }
[ 3962.807117] ata5.00: failed command: WRITE FPDMA QUEUED
[ 3962.807123] ata5.00: cmd 61/00:f8:80:93:43/0a:00:38:00:00/40 tag 31 ncq dma 1310720 ou
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3962.807124] ata5.00: status: { DRDY }
[ 3962.807128] ata5: hard resetting link
[ 3962.807133] ahci 0000:01:00.1: AHCI controller unavailable!
[ 3963.839057] ata5: failed to resume link (SControl FFFFFFFF)
[ 3963.839072] ata5: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
[ 3964.855824] enp7s0: cmd = 0xff, should be 0x07 
               .
[ 3964.855829] enp7s0: io_base_l = 0xffff, should be 0xf001 
               .
[ 3964.855834] enp7s0: mem_base_l = 0xffff, should be 0x4004 
               .

I can do IO-heavy tasks like moving stuff around on my btrfs filesystem with no problems, and I have been running games for testing purposes with no issues either. At least for now, these problems only seem to crop up when using the network card heavily.

Now, I do NOT have a lot of experience with Linux in general. But here's a list of things I tried:

  • Different kernel versions between 5.3 and 5.5
  • Kernel parameters on grub like iommu=off, pci_aspm=off, libata.force=noncq
  • Unplugging any peripherals that are not absolutely necessary (including disconnecting mouse and keyboard)
  • With and without the driver from "hardware configuration" - the problem seemed to occur more quickly with it installed, but I can't be sure of that.
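
Since parameters added in grub only count if they actually reach the running kernel, here is a small Python sketch for checking that. The function name and the sample command line are illustrative assumptions; on a live system the string would come from /proc/cmdline. Note that the kernel documentation spells the ASPM switch pcie_aspm=off, so that is the spelling the sketch looks for:

```python
def missing_params(cmdline, wanted):
    """Return the wanted settings that do not appear as tokens on the kernel command line."""
    active = set(cmdline.split())
    return [setting for setting in wanted if setting not in active]


# Illustrative sample; on a live system use open("/proc/cmdline").read() instead.
cmdline = "BOOT_IMAGE=/boot/vmlinuz-5.5-x86_64 rw quiet libata.force=noncq iommu=off"
wanted = ["iommu=off", "pcie_aspm=off", "libata.force=noncq"]
print(missing_params(cmdline, wanted))
```

Anything the function returns was either never added to the boot entry or was spelled differently from what was expected.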

Now, most of these solutions might seem completely irrelevant to the issue, but keep in mind that I don't know what else to try, which is why I'm desperate for any kind of help I could get. Any suggestions would be greatly appreciated.

It just happened again. I had a few downloads going with 25 connections, limited to 512kb/s. The network went down in less than half an hour, and the SATA drives went down with it (my NVMe drive kept working).

Here is a snippet of dmesg from the disaster area:

[ 3317.502623] pcieport 0000:02:06.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 3317.502651] pcieport 0000:02:05.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 3317.502659] pcieport 0000:02:04.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 3317.502667] pcieport 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 3329.476097] ------------[ cut here ]------------
[ 3329.476101] NETDEV WATCHDOG: enp7s0 (r8169): transmit queue 0 timed out
[ 3329.476124] WARNING: CPU: 7 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x26a/0x280
[ 3329.476125] Modules linked in: fuse squashfs loop nls_iso8859_1 nls_cp437 vfat joydev mousedev input_leds fat edac_mce_amd btrfs blake2b_generic xor ccp rng_core kvm irqbypass amdgpu snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel gpu_sched eeepc_wmi asus_wmi i2c_algo_bit snd_intel_dspcfg crct10dif_pclmul battery ttm crc32_pclmul snd_hda_codec sparse_keymap rfkill wmi_bmof ghash_clmulni_intel snd_hda_core r8169 drm_kms_helper raid6_pq aesni_intel snd_hwdep libcrc32c realtek sp5100_tco drm crypto_simd pcspkr k10temp i2c_piix4 libphy snd_pcm cryptd glue_helper snd_timer agpgart snd syscopyarea sysfillrect sysimgblt fb_sys_fops soundcore wmi gpio_amdpt pinctrl_amd evdev mac_hid acpi_cpufreq uinput crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 sd_mod hid_microsoft ff_memless hid_generic usbhid hid ahci libahci libata crc32c_intel xhci_pci xhci_hcd scsi_mod
[ 3329.476181] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G        W         5.5.2-1-MANJARO #1
[ 3329.476182] Hardware name: System manufacturer System Product Name/PRIME B450M-GAMING/BR, BIOS 2006 11/13/2019
[ 3329.476186] RIP: 0010:dev_watchdog+0x26a/0x280
[ 3329.476189] Code: 8a e2 7f ff eb 88 4c 89 f7 c6 05 8c 4d d1 00 01 e8 fb ad fa ff 44 89 e9 4c 89 f6 48 c7 c7 70 e1 fa 9d 48 89 c2 e8 c8 ae 88 ff <0f> 0b e9 66 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
[ 3329.476191] RSP: 0018:ffffb3c9c0354e60 EFLAGS: 00010286
[ 3329.476194] RAX: 0000000000000000 RBX: ffff94e445d87400 RCX: 0000000000000000
[ 3329.476195] RDX: 0000000000000103 RSI: 0000000000000096 RDI: 00000000ffffffff
[ 3329.476196] RBP: ffff94e44e46045c R08: 00000000000004d9 R09: 0000000000000001
[ 3329.476198] R10: 0000000000000000 R11: 0000000000000001 R12: ffff94e44e460480
[ 3329.476199] R13: 0000000000000000 R14: ffff94e44e460000 R15: ffff94e445d87480
[ 3329.476201] FS:  0000000000000000(0000) GS:ffff94e4509c0000(0000) knlGS:0000000000000000
[ 3329.476203] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3329.476204] CR2: 00007f5d2da487e0 CR3: 00000003f3a3e000 CR4: 00000000003406e0
[ 3329.476206] Call Trace:
[ 3329.476208]  <IRQ>
[ 3329.476215]  ? qdisc_put_unlocked+0x30/0x30
[ 3329.476219]  call_timer_fn+0x2d/0x160
[ 3329.476222]  run_timer_softirq+0x1ad/0x510
[ 3329.476225]  ? qdisc_put_unlocked+0x30/0x30
[ 3329.476231]  __do_softirq+0x111/0x34d
[ 3329.476237]  irq_exit+0xac/0xd0
[ 3329.476240]  smp_apic_timer_interrupt+0xa6/0x1b0
[ 3329.476243]  apic_timer_interrupt+0xf/0x20
[ 3329.476245]  </IRQ>
[ 3329.476250] RIP: 0010:cpuidle_enter_state+0xc9/0x410
[ 3329.476253] Code: e8 7c 09 98 ff 80 7c 24 0f 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 1c 03 00 00 31 ff e8 ee 74 9e ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 72 02 00 00 49 63 d5 4c 2b 64 24 10 48 8d 04 52 48
[ 3329.476255] RSP: 0018:ffffb3c9c00e7e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 3329.476257] RAX: ffff94e4509c0000 RBX: ffff94e44e0d1000 RCX: 000000000000001f
[ 3329.476258] RDX: 0000000000000000 RSI: 000000001f383273 RDI: 0000000000000000
[ 3329.476259] RBP: ffffffff9e2c3d20 R08: 00000307343d554b R09: 0000000000001a04
[ 3329.476260] R10: 0000000000000397 R11: ffff94e4509ebbe4 R12: 00000307343d554b
[ 3329.476261] R13: 0000000000000002 R14: 0000000000000002 R15: ffff94e44ef0bc80
[ 3329.476270]  ? cpuidle_enter_state+0xa4/0x410
[ 3329.476274]  cpuidle_enter+0x29/0x40
[ 3329.476278]  do_idle+0x1e6/0x270
[ 3329.476282]  cpu_startup_entry+0x19/0x20
[ 3329.476286]  start_secondary+0x186/0x1d0
[ 3329.476290]  secondary_startup_64+0xb6/0xc0
[ 3329.476295] ---[ end trace e59409d38d427eac ]---
[ 3329.526705] r8169 0000:07:00.0 enp7s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[ 3329.527907] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3329.529077] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3329.530256] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3329.531426] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3329.532597] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3329.533774] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3329.543943] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.554129] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.564328] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.574525] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.584704] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.594836] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.604962] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.615087] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.625216] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.635340] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.645464] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.655591] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3329.665722] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.539825] r8169 0000:07:00.0 enp7s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[ 3339.541075] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3339.542288] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3339.543513] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3339.544726] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3339.545937] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3339.547151] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3339.557464] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.567666] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.577847] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.588031] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.598227] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.608423] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.618569] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.628700] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.638823] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.648948] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.659072] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.669198] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3339.679327] r8169 0000:07:00.0 enp7s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[ 3345.647369] audit: type=1130 audit(1582515772.346:107): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 3348.293392] ata5.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0x6 frozen
[ 3348.293397] ata5: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
[ 3348.293400] ata5.00: failed command: WRITE DMA EXT
[ 3348.293407] ata5.00: cmd 35/00:70:00:52:68/00:00:65:00:00/e0 tag 5 dma 57344 out
                        res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3348.293409] ata5.00: status: { DRDY }
[ 3348.293413] ata5: hard resetting link
[ 3348.293419] ahci 0000:01:00.1: AHCI controller unavailable!
[ 3348.293430] ata6.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0x6 frozen
[ 3348.293434] ata6: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
[ 3348.293436] ata6.00: failed command: WRITE DMA EXT
[ 3348.293442] ata6.00: cmd 35/00:70:00:b2:67/00:00:65:00:00/e0 tag 12 dma 57344 out
                        res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
[ 3348.293443] ata6.00: status: { DRDY }
[ 3348.293445] ata6: hard resetting link
[ 3348.293448] ahci 0000:01:00.1: AHCI controller unavailable!
[ 3349.326411] ata5: failed to resume link (SControl FFFFFFFF)
[ 3349.326426] ata5: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
[ 3354.436493] ata5: hard resetting link
[ 3354.436499] ahci 0000:01:00.1: AHCI controller unavailable!
[ 3354.436532] ata6: failed to resume link (SControl FFFFFFFF)
[ 3354.436547] ata6: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
[ 3355.988075] audit: type=1131 audit(1582515782.685:108): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 3359.556970] ata6: hard resetting link
[ 3359.556975] ahci 0000:01:00.1: AHCI controller unavailable!
[ 3360.589915] ata6: failed to resume link (SControl FFFFFFFF)
[ 3360.589930] ata6: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
[ 3360.589939] ata6: limiting SATA link speed to <unknown>
[ 3365.743625] ata6: hard resetting link
[ 3365.743630] ahci 0000:01:00.1: AHCI controller unavailable!
[ 3365.743651] ata5: failed to resume link (SControl FFFFFFFF)
[ 3365.743666] ata5: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
[ 3365.743674] ata5: limiting SATA link speed to <unknown>
[ 3370.863410] ata5: hard resetting link
[ 3370.863416] ahci 0000:01:00.1: AHCI controller unavailable!
[ 3371.897063] ata6: failed to resume link (SControl FFFFFFFF)
[ 3371.897078] ata6: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
[ 3371.897085] ata6.00: disabled
[ 3371.897100] ahci 0000:01:00.1: AHCI controller unavailable!
[ 3371.897105] ata5: failed to resume link (SControl FFFFFFFF)
[ 3371.897114] sd 5:0:0:0: [sda] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=55s
[ 3371.897118] sd 5:0:0:0: [sda] tag#12 Sense Key : Not Ready [current] 
[ 3371.897119] ata5: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
[ 3371.897123] sd 5:0:0:0: [sda] tag#12 Add. Sense: Logical unit not ready, hard reset required
[ 3371.897126] ata5.00: disabled
[ 3371.897127] sd 5:0:0:0: [sda] tag#12 CDB: Write(16) 8a 00 00 00 00 00 65 67 b2 00 00 00 00 70 00 00
[ 3371.897132] blk_update_request: I/O error, dev sda, sector 1701294592 op 0x1:(WRITE) flags 0x100000 phys_seg 14 prio class 0
[ 3371.897138] BTRFS error (device sdb): bdev /dev/sda errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 3371.897139] ahci 0000:01:00.1: AHCI controller unavailable!
[ 3371.897148] sd 4:0:0:0: [sdb] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=55s
[ 3371.897151] sd 4:0:0:0: [sdb] tag#5 Sense Key : Not Ready [current] 
[ 3371.897154] sd 4:0:0:0: [sdb] tag#5 Add. Sense: Logical unit not ready, hard reset required
[ 3371.897157] sd 4:0:0:0: [sdb] tag#5 CDB: Write(16) 8a 00 00 00 00 00 65 68 52 00 00 00 00 70 00 00
[ 3371.897160] blk_update_request: I/O error, dev sdb, sector 1701335552 op 0x1:(WRITE) flags 0x100000 phys_seg 14 prio class 0
[ 3371.897165] BTRFS error (device sdb): bdev /dev/sdb errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 3371.897167] ata6: EH complete
[ 3371.897176] sd 5:0:0:0: rejecting I/O to offline device
[ 3371.897180] blk_update_request: I/O error, dev sda, sector 1808064 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[ 3371.897183] BTRFS error (device sdb): bdev /dev/sda errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
[ 3371.897193] sd 5:0:0:0: rejecting I/O to offline device
[ 3371.897196] blk_update_request: I/O error, dev sda, sector 1701294704 op 0x1:(WRITE) flags 0x100000 phys_seg 2 prio class 0
[ 3371.897197] ata5: EH complete
[ 3371.897201] BTRFS error (device sdb): bdev /dev/sda errs: wr 2, rd 1, flush 0, corrupt 0, gen 0
[ 3371.897203] sd 4:0:0:0: rejecting I/O to offline device
[ 3371.897206] blk_update_request: I/O error, dev sdb, sector 1701335664 op 0x1:(WRITE) flags 0x100000 phys_seg 2 prio class 0
[ 3371.897209] BTRFS error (device sdb): bdev /dev/sdb errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[ 3371.897210] ata6.00: detaching (SCSI 5:0:0:0)
[ 3371.897225] ata5.00: detaching (SCSI 4:0:0:0)
[ 3371.897326] blk_update_request: I/O error, dev sda, sector 1701294720 op 0x1:(WRITE) flags 0x100000 phys_seg 21 prio class 0
[ 3371.897331] BTRFS error (device sdb): bdev /dev/sda errs: wr 3, rd 1, flush 0, corrupt 0, gen 0
[ 3371.897334] BTRFS error (device sdb): bdev /dev/sda errs: wr 4, rd 1, flush 0, corrupt 0, gen 0
[ 3371.897344] blk_update_request: I/O error, dev sdb, sector 1849024 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[ 3371.897346] BTRFS error (device sdb): bdev /dev/sdb errs: wr 2, rd 1, flush 0, corrupt 0, gen 0
[ 3371.897354] blk_update_request: I/O error, dev sdb, sector 1701335680 op 0x1:(WRITE) flags 0x100000 phys_seg 21 prio class 0
[ 3371.897355] blk_update_request: I/O error, dev sda, sector 1701294512 op 0x1:(WRITE) flags 0x100000 phys_seg 10 prio class 0
[ 3371.897357] BTRFS error (device sdb): bdev /dev/sda errs: wr 5, rd 1, flush 0, corrupt 0, gen 0
[ 3371.897359] BTRFS error (device sdb): bdev /dev/sdb errs: wr 3, rd 1, flush 0, corrupt 0, gen 0
[ 3371.897374] blk_update_request: I/O error, dev sdb, sector 1701335472 op 0x1:(WRITE) flags 0x100000 phys_seg 10 prio class 0
[ 3371.897795] BTRFS: error (device sdb) in btrfs_run_delayed_refs:2209: errno=-5 IO failure
[ 3371.897799] BTRFS info (device sdb): forced readonly
[ 3371.901468] sd 5:0:0:0: [sda] Synchronizing SCSI cache
[ 3371.901526] sd 5:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3371.901528] sd 5:0:0:0: [sda] Stopping disk
[ 3371.901540] sd 5:0:0:0: [sda] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3371.903051] sd 4:0:0:0: [sdb] Synchronizing SCSI cache
[ 3371.903106] sd 4:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3371.903109] sd 4:0:0:0: [sdb] Stopping disk
[ 3371.903124] sd 4:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3384.554126] r8169 0000:07:00.0 enp7s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[ 3384.555351] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3384.556537] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3384.557721] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[ 3384.558892] r8169 0000:07:00.0 enp7s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
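
To make the order of events easier to see in a long dump like the one above, a rough Python sketch can pull out the first timestamp for a handful of marker strings. The marker strings are copied from the log; the labels and the function name are my own shorthand, not anything the kernel defines:

```python
import re

# (label, substring) pairs; substrings are taken verbatim from the log above.
MARKERS = [
    ("PCIe ports unreachable", "config space inaccessible"),
    ("NIC transmit timeout", "NETDEV WATCHDOG"),
    ("AHCI controller gone", "AHCI controller unavailable"),
    ("btrfs forced read-only", "forced readonly"),
]
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def first_seen(dmesg_text):
    """Map each label to the timestamp (seconds since boot) of its first matching line."""
    seen = {}
    for line in dmesg_text.splitlines():
        m = STAMP.match(line)
        if not m:
            continue  # continuation lines carry no timestamp
        for label, needle in MARKERS:
            if needle in line and label not in seen:
                seen[label] = float(m.group(1))
    return seen


sample = """\
[ 3317.502623] pcieport 0000:02:06.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 3329.476101] NETDEV WATCHDOG: enp7s0 (r8169): transmit queue 0 timed out
[ 3348.293419] ahci 0000:01:00.1: AHCI controller unavailable!
[ 3371.897799] BTRFS info (device sdb): forced readonly
"""
for label, t in sorted(first_seen(sample).items(), key=lambda kv: kv[1]):
    print(f"{t:12.6f}  {label}")
```

In both failures quoted in this thread the pcieport messages appear first, then the network card, then the AHCI controller. That ordering is only an observation from these two logs, not proof of which component is at fault.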

Edit: I'm running more tests to confirm that network load, and not disk load, is what triggers the issues (since the SATA connections seem to go poof as well). I started a game that keeps looping and streaming videos from my btrfs partition. Meanwhile, a btrfs scrub is running and I started a dd from /dev/urandom to the same btrfs filesystem with a few gigs of data. On top of that there's a watch on dmesg, some light web browsing and a ping running.

Everything is working perfectly; the game is running, data is constantly being read and written, pings are all coming through, dmesg is clean, web browsing is fine and the btrfs scrub is not finding any errors. This has been going for an hour.

I'll keep this running overnight just to make sure, but for now it seems pretty clear that the cause of the problems is heavy usage of the NIC (if you can call 512kb/s spread over 27 connections "heavy usage").
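
To put that number in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes the nominal 1000 Mbit/s of a gigabit link and, because "kb/s" is ambiguous, computes both readings of the unit; the function name is just for illustration:

```python
def percent_of_gigabit(kilo_units_per_s, bits_per_unit):
    """Share (in percent) of a nominal 1000 Mbit/s link used by a rate in kilo-units per second."""
    bits_per_s = kilo_units_per_s * 1000 * bits_per_unit
    return 100 * bits_per_s / 1_000_000_000


print(percent_of_gigabit(512, 8))  # reading "kb/s" as kilobytes per second
print(percent_of_gigabit(512, 1))  # reading "kb/s" as kilobits per second
```

Either way the download cap works out to well under half a percent of what the link can nominally carry, so raw throughput alone is unlikely to be what stresses the card.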

If this is on a Gigabyte motherboard you might benefit from the iommu=soft kernel boot parameter.

My best guess is that this is a kernel issue. I would test at least 4 alternate kernels, starting with 4.19 and 4.14, and the real-time kernels.

Writing a suspend service for your adapter may also help.

See here:

Please post:

inxi -Fxxxza

Hi, thanks for the reply. I'm heading to bed now. My "stress test" has run for about two hours and is still going strong. I'll leave it running overnight and check again tomorrow, but so far it really looks like a problem with the load on the network card.

I have tried both the r8168 and r8169 kernel drivers, both show the same issues. Currently on the r8169 drivers. I'll take a closer look at the thread you linked tomorrow. But it mentions a service/script that runs on suspend/sleep, and I do not use suspend/sleep on this computer. It stays on all the time, all I do is lock the screen (Super+L). Would the suspend service you linked still have any effect?

Regarding the "iommu" parameter: I have currently disabled IOMMU in the BIOS, so if I set the kernel parameter to anything other than "off", dmesg prints a message at boot saying that it is off. Do you think I should turn it back on? Does it have any effect if I don't use VMs?

My motherboard is not a Gigabyte one, it is from ASUS. The model is listed on the results of the inxi command you suggested.

Regarding the kernel, I did not want to go below 5.* because I'm using btrfs and it benefits from more up-to-date kernels. I also read something about newer Ryzen CPUs (mine is a 3400G) not being fully compatible with older kernels, but if push comes to shove I'll definitely try that.

Results of inxi -Fxxxza:

System:
  Host: retro Kernel: 5.5.2-1-MANJARO x86_64 bits: 64 
  compiler: gcc v: 9.2.0 
  parameters: BOOT_IMAGE=/boot/vmlinuz-5.5-x86_64 
  root=UUID=4f2094e3-ed71-4c70-a907-26d4aba12a07 rw quiet 
  apparmor=1 security=apparmor udev.log_priority=3 
  libata.force=noncq iommu=off 
  Desktop: KDE Plasma 5.17.5 tk: Qt 5.14.1 wm: kwin_x11 
  dm: SDDM Distro: Manjaro Linux 
Machine:
  Type: Desktop Mobo: ASUSTeK model: PRIME B450M-GAMING/BR 
  v: Rev X.0x serial: <filter> UEFI: American Megatrends 
  v: 2006 date: 11/13/2019 
CPU:
  Topology: Quad Core 
  model: AMD Ryzen 5 3400G with Radeon Vega Graphics bits: 64 
  type: MT MCP arch: Zen+ family: 17 (23) model-id: 18 (24) 
  stepping: 1 microcode: 8108109 L2 cache: 2048 KiB 
  flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a 
  ssse3 svm 
  bogomips: 65624 
  Speed: 1258 MHz min/max: 1400/4100 MHz boost: enabled 
  Core speeds (MHz): 1: 1259 2: 1260 3: 1259 4: 1259 5: 1257 
  6: 1257 7: 1257 8: 1257 
  Vulnerabilities: Type: itlb_multihit status: Not affected 
  Type: l1tf status: Not affected 
  Type: mds status: Not affected 
  Type: meltdown status: Not affected 
  Type: spec_store_bypass mitigation: Speculative Store 
  Bypass disabled via prctl and seccomp 
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and 
  __user pointer sanitization 
  Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: 
  conditional, STIBP: disabled, RSB filling 
  Type: tsx_async_abort status: Not affected 
Graphics:
  Device-1: AMD Picasso vendor: ASUSTeK driver: amdgpu 
  v: kernel bus ID: 09:00.0 chip ID: 1002:15d8 
  Display: x11 server: X.Org 1.20.7 driver: amdgpu 
  FAILED: ati unloaded: modesetting alternate: fbdev,vesa 
  compositor: kwin_x11 
  resolution: 1920x1080~60Hz, 700x480_59.94~60Hz 
  OpenGL: 
  renderer: AMD RAVEN (DRM 3.36.0 5.5.2-1-MANJARO LLVM 9.0.1) 
  v: 4.5 Mesa 19.3.4 direct render: Yes 
Audio:
  Device-1: AMD Raven/Raven2/Fenghuang HDMI/DP Audio 
  vendor: ASUSTeK driver: snd_hda_intel v: kernel 
  bus ID: 09:00.1 chip ID: 1002:15de 
  Device-2: AMD Family 17h HD Audio vendor: ASUSTeK 
  driver: snd_hda_intel v: kernel bus ID: 09:00.6 
  chip ID: 1022:15e3 
  Sound Server: ALSA v: k5.5.2-1-MANJARO 
Network:
  Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit 
  Ethernet 
  vendor: ASUSTeK driver: r8169 v: kernel port: f000 
  bus ID: 07:00.0 chip ID: 10ec:8168 
  IF: enp7s0 state: up speed: 1000 Mbps duplex: full 
  mac: <filter> 
Drives:
  Local Storage: total: 14.78 TiB used: 874.54 GiB (5.8%) 
  ID-1: /dev/nvme0n1 vendor: Lexar model: 250GB SSD 
  size: 232.89 GiB block size: physical: 512 B logical: 512 B 
  speed: 31.6 Gb/s lanes: 4 serial: <filter> rev: S0614B0 
  scheme: GPT 
  ID-2: /dev/sda vendor: Western Digital 
  model: WD82PURZ-85TEUY0 size: 7.28 TiB block size: 
  physical: 4096 B logical: 512 B speed: 6.0 Gb/s 
  rotation: 7200 rpm serial: <filter> rev: 0A82 
  ID-3: /dev/sdb vendor: Western Digital 
  model: WD82PURZ-85TEUY0 size: 7.28 TiB block size: 
  physical: 4096 B logical: 512 B speed: 6.0 Gb/s 
  rotation: 7200 rpm serial: <filter> rev: 0A82 
Partition:
  ID-1: / raw size: 146.48 GiB size: 143.19 GiB (97.75%) 
  used: 12.60 GiB (8.8%) fs: ext4 dev: /dev/nvme0n1p2 
Sensors:
  System Temperatures: cpu: 37.9 C mobo: N/A gpu: amdgpu 
  temp: 37 C 
  Fan Speeds (RPM): N/A 
Info:
  Processes: 256 Uptime: 3h 54m Memory: 13.66 GiB 
  used: 1.58 GiB (11.6%) Init: systemd v: 242 Compilers: 
  gcc: 9.2.0 Shell: bash v: 5.0.11 running in: konsole 
  inxi: 3.0.37

Again, thank you so much for the help you offered. I'll get back to working on this issue tomorrow and leave my "stress test" running for now.

I would uninstall tlp, as it could be causing problems with its power saving settings. As this is a desktop computer, tlp is not required.

I also have a service that will keep your connection alive:

It was intended for wifi, but it works well with Ethernet by simply changing the connection details in the script.

I uninstalled tlp and copied the contents of the script you linked ("writing systemd service units"). As a test, I kept an eye on my connection so I could run the script manually as soon as the link went down.

I started my downloads with 1 single connection. It took a while for the network to go down; running the script brought the network back up.

Then I started my downloads with 27 connections. The network went down in a couple of minutes. Again, running the script brought the network back up.

But then, after another minute, the network went down again and took down my SATA controllers and my AHCI controller ("AHCI controller unavailable!") the same way as in the dmesg I linked before. Running the script again did not fix the network but, at this point, the whole thing is broken and I would need to restart anyway.
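
The manual "watch the link, then run the script" step described above could be automated along these lines. This is only a rough sketch: the function names are made up for illustration, the gateway address in the usage note is hypothetical, and `ping_once` assumes the usual Linux `ping` with `-c` and `-W` options.

```python
import subprocess
import time
from typing import Callable, Optional


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo to host gets a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_drop(
    probe: Callable[[], bool],
    interval_s: float = 5.0,
    max_checks: Optional[int] = None,
) -> Optional[int]:
    """Poll probe() until it fails.

    Returns the zero-based index of the first failing check, or None if
    max_checks successful checks passed without a failure.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe():
            return checks
        checks += 1
        time.sleep(interval_s)
    return None
```

For example, `wait_for_drop(lambda: ping_once("192.168.1.1"))` would block until the (hypothetical) gateway stops answering, at which point the recovery script could be started and the time noted for comparison with dmesg.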

I'm starting to consider purchasing a dedicated PCIe Ethernet card (would an Intel 82574L be compatible with Linux and modern kernels?), but that would be a last resort, both for financial reasons and because purchasing one of these and not having the problem go away would be extremely frustrating. But more and more it looks like the onboard NIC is to blame here.

Right now the only thing I have left to test would be using older kernels, but I'm worried those might not play well with the rest of my hardware, and being stuck on an older kernel is probably not safe/secure? But might be interesting to see if we can isolate the issue further by testing that. I'm open to suggestions.

Unfortunately I really have to head to bed now so I'll come back and work on this tomorrow. Thanks again for all the assistance so far.

Be sure to check if your BIOS is current.

Losing the network seems to me to be the first symptom, not the root cause.
This message is quite suspicious: "can't change power state from D3cold to D0"
I don't see why devices should be in D3cold, especially not an active network card. D3cold means the PCI device isn't visible on the bus. Most likely your problem is caused by a BIOS bug, or by some incompatibility between your hw and the PCI(e) core.
You could try to play with the PCI power saving options in the BIOS.
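
On the Linux side, the runtime power management state of the NIC can be inspected through sysfs. The following is a minimal sketch, assuming the card sits at PCI address 0000:07:00.0 (taken from the lspci output at the top of the thread) and that the kernel exposes the standard `power/control` and `power/runtime_status` attributes; the helper names are invented for this example.

```python
from pathlib import Path

# Assumption: PCI address from the lspci output earlier in the thread.
NIC_SYSFS = Path("/sys/bus/pci/devices/0000:07:00.0")


def read_runtime_pm(device_dir: Path) -> dict:
    """Return the runtime PM attributes of a device directory in sysfs.

    Attributes that do not exist are reported as None instead of raising.
    """
    result = {}
    for name in ("control", "runtime_status"):
        attr = device_dir / "power" / name
        result[name] = attr.read_text().strip() if attr.exists() else None
    return result


def disable_runtime_pm(device_dir: Path) -> None:
    """Write "on" to power/control so the device is not runtime-suspended.

    Needs root when used on a real sysfs path.
    """
    (device_dir / "power" / "control").write_text("on\n")


if __name__ == "__main__":
    print(read_runtime_pm(NIC_SYSFS))
```

A `control` value of `auto` means the kernel may runtime-suspend the device, while `on` keeps it powered. Whether that has any bearing on the D3cold message here is untested.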

So here's the new information I have:

  • I left that "stress test" running with no network activity for several hours and came back to the computer working just fine. Writing and reading lots of data to the HDDs caused no issues.
  • The problem occurs with a single connection limited to 512kb/s, but happens a lot sooner with a larger load (27 connections with a cap of 10mb/s).
  • There's only one BIOS available at the manufacturer's website: 2006, released on 2019/11/25. It is already installed on my MB.

I decided to run a different test. My SATA devices get disconnected in the process (which kicks my btrfs raid into read-only) when the load is too heavy. What I did was remove the fstab entry for my btrfs filesystem (which spans both of my SATA drives) and run a 27-connection 10mb/s load straight into my NVMe SSD instead. I have downloaded 30 GB of data so far with zero problems... it has been stable for about an hour. So there's definitely some kind of conflict or something like that involving my NIC and my SATA.

Do any of you have any suggestions of kernel parameters or similar stuff I can try to change the way the SATA is accessed to try and isolate this? I already tried (and have currently set) the parameters "libata.force=noncq" and "iommu=off". I also tried "pcie_aspm=off" in the past with no success.

I'll re-check the bios to see if there's anything else I've missed.
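
When juggling several boot parameters like the ones listed above, it can help to confirm what the running kernel was actually booted with by parsing /proc/cmdline. A small sketch; `parse_cmdline` is an illustrative name, not an existing tool.

```python
import shlex
from pathlib import Path


def parse_cmdline(text: str) -> dict:
    """Split a kernel command line into {name: value}.

    Bare flags such as "quiet" map to None. If a parameter is repeated,
    the last occurrence wins in this simple sketch.
    """
    params = {}
    for token in shlex.split(text):
        # Split on the first "=" only, so root=UUID=... keeps its value intact.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline")
    if cmdline.exists():
        params = parse_cmdline(cmdline.read_text())
        for key in ("libata.force", "iommu", "pcie_aspm"):
            print(key, "=", params.get(key, "<not set>"))
```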

That to me indicates a kernel or btrfs file system support bug rather than hardware. Try a few different kernels from the options marked LTS in Manjaro Settings Manager or, if you feel like it, 5.6rc2, which is working quite well already.

However, as well as checking for motherboard firmware updates, you also need to keep the firmware of SSD and NVMe drives up to date. Vendors often patch for power management, I/O bugs, SMART incompatibilities and overall performance tweaks.

You may want to read the Arch wiki page re network tuning:

https://wiki.archlinux.org/index.php/Improving_performance

I am doing a few more tests and will try to use kernel 4.19.102-1 later to see if it makes any difference. I wouldn't try 5.6 because it is not in the kernel list in Manjaro's GUI and I'm not comfortable with trying to do that by hand with my level of knowledge.

And while the topic is my lack of knowledge :sweat_smile: how would I go about updating the firmware? As far as storage goes, I have a Lexar NM510 M.2 PCIe SSD (mounted on my root with ext4) and two 8 TB WD Purple HDDs (WD82PURZ) running a btrfs filesystem on raid1. How can I check and update the firmware on these?

Additional note: on my current heavy connection test I just got a bunch of "ping: sendmsg: No buffer space available" messages on my ping command. Is that relevant?

Might want to reinstall TLP to bring the setup back into line with the default. If there are issues with the TLP defaults then we need to know about them so we can adjust them - hiding an issue doesn't fix it.

Just before you go too far with other things, did you try removing this and using r8169 instead?

I tried kernel 4.19.102-1 and my system would boot to a black screen. I switched to tty2, installed video-modesetting, removed video-linux and rebooted, but I went straight to a black screen again. No out-of-place messages in dmesg that I could find. Reverted to kernel 5.5.2-1.

Might want to reinstall TLP to bring the setup back into line with the default. If there are issues with the TLP defaults then we need to know about them so we can adjust them - hiding an issue doesn't fix it.

I followed your advice. The presence of TLP does not change the behavior, issues happen the same way.

Just before you go too far with other things, did you try removing this and using r8169 instead?

Yes, I have been back to r8169 as the default for my current testing.

Ah, of course, you are using the stable branch. Kernel 5.6 hasn't been pushed to stable in any form yet.

Seemingly none of the drives in your system have firmware updates available at the moment anyway. If they did, you'd get them from the support section of the appropriate manufacturer's website. There is no NM510 listed though, only an NM500 and an NM610. Either you got an OEM-specific drive or it's one of those two options.

from https://askubuntu.com/questions/210451/what-does-ping-sendmsg-no-buffer-space-available-mean

It means you reached a maximum value for a system parameter. Probably /proc/sys/net/core/wmem_max (but this might need some investigating on a system that shows this error). This setting is the maximum size of the socket send buffer (rmem_max is its receive-side counterpart).

It is likely that the cause is a broken NIC

Broken NIC, a.k.a. your onboard Realtek gigabit LAN controller.
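
For anyone who wants to look at the socket buffer limits mentioned in the quoted answer, they can be read straight from /proc. A minimal sketch, assuming the usual `net.core` entries are present; the function name is illustrative, and missing entries are returned as None.

```python
from pathlib import Path

SYSCTL_DIR = Path("/proc/sys/net/core")
KEYS = ("wmem_max", "wmem_default", "rmem_max", "rmem_default")


def read_socket_limits(base: Path = SYSCTL_DIR) -> dict:
    """Return the socket buffer limits (in bytes) found under base."""
    limits = {}
    for key in KEYS:
        path = base / key
        # int() tolerates the trailing newline the kernel appends.
        limits[key] = int(path.read_text()) if path.exists() else None
    return limits


if __name__ == "__main__":
    for name, value in read_socket_limits().items():
        print(f"net.core.{name} = {value}")
```

Comparing the values before and after the "No buffer space available" messages appear would at least show whether the limits themselves changed; it does not by itself tell whether the NIC is at fault.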

Is there any possibility at all that my NIC card is broken in the sense that it can't function when the SATA bus is also in use? Like if its connection to the motherboard maybe, or some DMA conflict between the NIC and the SATA?

At this point I just wanted to be sure that purchasing a new PCI-Express NIC wouldn't be too hasty.

The NIC concerned is a chip soldered to the motherboard, usually just below the I/O ports on the rear of the board. If its power connection pins are not soldered properly, it could feasibly be shorting out under load. Don't be concerned about DMA conflicts too much unless you have filled every PCI-E slot, SATA port and M.2 slot, plus overloaded all the USB headers by using a couple of hubs.

Take this problem up with ASUS if the board is still under warranty

I'm going to take a chance and switch to the testing branch and see if I can try a 5.6 kernel in hopes it makes any difference. Will report back.

I switched to the testing branch... tested it with a 5.5 and a 5.6 kernel, problem persists. No idea what to try next other than maybe ordering a new network card and disabling the one on the motherboard.

Any new ideas would be very welcome. I'll try to reproduce the issue on a live usb media and see how it goes.
