/usr/lib/modules getting deleted on boot

This is a fresh installation, installed using Manjaro Architect on an existing LVM partition inside a LUKS-encrypted partition and using a separate unencrypted boot partition. I've reinstalled several times, with the same result every time.

The problem manifests itself on the second and subsequent boots, in which systemd-modules-load.service will fail, although the boot will continue for a bit before it hangs without any further error messages.

Fortunately after a while I can switch to another tty to examine the issue, and what I've discovered so far is this:

  • The boot hangs because Xorg cannot start: the nvidia driver fails to load because the nvidia kernel module was never loaded.
  • systemd-modules-load.service seems to fail because /usr/lib/modules does not exist. It succeeds on first boot, but then after the first boot the directory is gone and subsequent boots will fail.
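
For anyone who wants to run the same check, here is a minimal sketch of a helper that tests whether the module tree for a given kernel release is present. The function name check_modules_dir and its two optional parameters are my own invention for illustration; by default it looks at /usr/lib/modules and the running kernel as reported by uname -r.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): report whether the module
# tree for a kernel release exists.
#   $1 = modules root, defaults to /usr/lib/modules
#   $2 = kernel release, defaults to the running kernel (uname -r)
check_modules_dir() {
    modules_root=${1:-/usr/lib/modules}
    release=${2:-$(uname -r)}

    if [ -d "$modules_root/$release" ]; then
        echo "ok: $modules_root/$release exists"
        return 0
    fi

    echo "missing: $modules_root/$release" >&2
    return 1
}
```

Calling check_modules_dir with no arguments from another tty after a failed boot should print the "missing" line if the directory is gone. Combining it with systemctl --failed and journalctl -b -u systemd-modules-load.service shows which units failed and why.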

I can recover by reinstalling the kernel (linux417) and the nvidia driver (linux417-nvidia), but that only works for exactly one boot before /usr/lib/modules disappears again.

So my questions are:

  • What could possibly be causing this during the boot process?
  • How can I proceed to find more clues?
System:    Kernel: 4.17.0-2-MANJARO x86_64 bits: 64 compiler: gcc v: 8.1.1 
           Desktop: Gnome 3.28.2 Distro: Manjaro Linux 17.1.10 Hakoila 
Machine:   Type: Desktop Mobo: ASUSTeK model: P8Z77-V v: Rev 1.xx serial: <filter> BIOS: American Megatrends 
           v: 0906 date: 03/26/2012 
CPU:       Topology: Quad Core model: Intel Core i5-3570K bits: 64 type: MCP arch: Ivy Bridge rev: 9 
           L2 cache: 6144 KiB 
           flags: lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 27288 
           Speed: 1605 MHz min/max: 1600/3800 MHz Core speeds (MHz): 1: 1605 2: 1605 3: 1605 4: 1605 
Graphics:  Card-1: NVIDIA GK104 [GeForce GTX 670] driver: nvidia v: 396.24 bus ID: 01:00.0 
           Display: x11 server: N/A driver: nvidia resolution: <xdpyinfo missing> 
           OpenGL: renderer: GeForce GTX 670/PCIe/SSE2 v: 4.6.0 NVIDIA 396.24 direct render: Yes 
Audio:     Card-1: Intel 7 Series/C216 Family High Definition Audio driver: snd_hda_intel v: kernel 
           bus ID: 00:1b.0 
           Card-2: NVIDIA GK104 HDMI Audio driver: snd_hda_intel v: kernel bus ID: 01:00.1 
           Sound Server: ALSA v: k4.17.0-2-MANJARO 
Network:   Card-1: Intel 82579V Gigabit Network Connection driver: e1000e v: 3.2.6-k port: f040 bus ID: 00:19.0 
           IF: eno1 state: up speed: 1000 Mbps duplex: full mac: <filter> 
           Card-2: Qualcomm Atheros AR9485 Wireless Network Adapter driver: ath9k v: kernel bus ID: 06:00.0 
           IF: wlp6s0 state: down mac: <filter> 
Drives:    HDD Total Size: 588.83 GiB used: 306.90 GiB (52.1%) 
           ID-1: /dev/sda vendor: Samsung model: SSD 830 Series size: 119.24 GiB 
           ID-2: /dev/sdb vendor: Samsung model: SSD 850 EVO 500GB size: 465.76 GiB 
Partition: ID-1: / size: 31.25 GiB used: 10.00 GiB (32.0%) fs: ext4 dev: /dev/dm-2 
           ID-2: /boot size: 487.9 MiB used: 66.7 MiB (13.7%) fs: ext4 dev: /dev/sdb1 
           ID-3: /home size: 410.58 GiB used: 296.83 GiB (72.3%) fs: ext4 dev: /dev/dm-3 
           ID-4: swap-1 size: 16.00 GiB used: 0 KiB (0.0%) fs: swap dev: /dev/dm-1 
Info:      Processes: 201 Uptime: 54m Memory: 15.62 GiB used: 2.46 GiB (15.7%) Init: systemd Compilers: 
           gcc: 8.1.1 clang: 6.0.0 Shell: zsh v: 5.5.1 inxi: 3.0.10 

Note: Also asked on unix.stackexchange. Make sure to get your juicy internet points over there :slight_smile:

This sounds like the same issue that some others have found, namely they have to keep reinstalling drivers to get them to work. I don't think I've seen a cause or solution yet but this is the best report I've seen so far.

Starter for ten: Do you have kernel-alive installed and enabled?

Would you mind pointing me to one of them, please? I missed this crazy issue. :scream:

Oh, that's interesting. I do have kernel-alive installed, and just now noticed linux-module-cleanup.service failed during the last successful boot with this error:

juni 16 21:37:50 blackbox-manjaro linux-module-cleanup[543]: rm: cannot remove '/usr/lib/modules/.old': No such file or directory
juni 16 21:37:50 blackbox-manjaro systemd[1]: linux-module-cleanup.service: Main process exited, code=exited, status=1/FAILURE
juni 16 21:37:50 blackbox-manjaro systemd[1]: linux-module-cleanup.service: Failed with result 'exit-code'.

Not sure why I didn't see that before, but that's definitely suspect. Let me try to disable it, reinstall the kernel and drivers, and do another reboot.

Alright! Post-reboot, systemctl --failed shows 0 failed units and /usr/lib/modules is still there!

I'm going to try another reboot just to be sure (edit: Success!), but could you elaborate on how you came to suspect kernel-alive?

I've submitted an issue to the kernel-alive repository: https://gitlab.manjaro.org/ste74/kernel-alive/issues/1

Years of spotting the root cause of issues. :wink:

More seriously, it's the only thing I know that could affect kernel modules. I don't have any issues and don't have it installed, hence I asked the question.

Well spotted then! And thank you :slight_smile:

The way kernel-alive works can make you think it's related

It may be causing a whole class of issues.

The way kernel-alive works can make you think it’s related

Definitely. Looking at the actual script:

if [ -e /usr/lib/modules/.old ]; then

	oldkern=$(cat /usr/lib/modules/.old)
	currentkern=$(uname -r)
	
	#check if old is no current and in this case remove the modules
	if [ "$oldkern" != "$currentkern" ]; then
		rm -r /usr/lib/modules/"$oldkern"
	fi
	#remove only the hidden file
	rm /usr/lib/modules/.old
fi

If $oldkern is an empty string, which it will be if /usr/lib/modules/.old is empty, the path passed to rm -r collapses to /usr/lib/modules/ and the entire directory is deleted. That also explains the journal error above: by the time the final rm runs, .old has already been removed along with everything else. So I guess the question now is why the file is empty. There are some other scripts that set up the .old file and are probably to blame, but I've at least suggested that it would be a good idea to check whether the string is empty before deleting important system files.
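
To make the suggestion concrete, here is a rough sketch of what a guarded version could look like. This is my own illustration, not the change that was made in kernel-alive: the function name cleanup_old_modules and its two optional parameters are hypothetical, and they exist mainly so the logic can be tried against a scratch directory instead of the real /usr/lib/modules.

```shell
#!/bin/sh
# Sketch of a guarded cleanup (hypothetical, not the upstream fix).
#   $1 = modules root, defaults to /usr/lib/modules
#   $2 = current kernel release, defaults to uname -r
cleanup_old_modules() {
    modules_root=${1:-/usr/lib/modules}
    currentkern=${2:-$(uname -r)}
    marker="$modules_root/.old"

    # Nothing to do if no marker file was left behind.
    [ -e "$marker" ] || return 0

    oldkern=$(cat "$marker")

    # Only delete when the marker names a plain subdirectory:
    # reject empty strings, anything containing a slash, "." and "..".
    case "$oldkern" in
        ''|*/*|.|..)
            echo "ignoring invalid .old contents: '$oldkern'" >&2
            ;;
        *)
            if [ "$oldkern" != "$currentkern" ] && [ -d "$modules_root/$oldkern" ]; then
                rm -r -- "$modules_root/$oldkern"
            fi
            ;;
    esac

    # Remove only the marker file; -f avoids an error if it is already gone.
    rm -f -- "$marker"
}
```

With an empty .old file this version prints a warning, leaves every module directory in place, and still removes the marker. The check for slashes and dot entries is an extra precaution of mine, so that a malformed marker cannot point outside the modules directory.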

Hmm, does this happen only with kernel 4.17 (which is not a stable kernel)? I'm not at home today to investigate, but I presume that if the kernel is not stable, uname -r does not print a value that my check can parse correctly...

4.17 is stable. uname -r prints 4.17.0-2-MANJARO. I'm pretty sure I tried with 4.14 too though.

I've modified the code.
