Entire system hangs when writing to SSD

Whenever I write to a disk, the system freezes for a moment. I can't move the mouse or do anything else. The larger the write, the longer the hang. This command can make my PC basically unusable for a whole minute:

$ dd if=/dev/zero of=/path/to/SSD/tempfile bs=1M count=1024 conv=fdatasync,notrunc status=progress

(line inspired by https://wiki.archlinux.org/index.php/Benchmarking)

Here's my system info:

System:    Host: laptop Kernel: 5.2.8-1-MANJARO x86_64 bits: 64 compiler: gcc v: 9.1.0 Desktop: Xfce 4.14.1git-23545d 
           Distro: Manjaro Linux 
Machine:   Type: Laptop System: SAMSUNG product: 900X3C/900X4C/900X4D v: 0.1 serial: <filter> 
           Mobo: SAMSUNG model: SAMSUNG_NP1234567890 v: FAB1 serial: <filter> UEFI [Legacy]: Phoenix v: P02AAC 
           date: 06/01/2012 
Battery:   ID-1: BAT1 charge: 33.3 Wh condition: 33.3/40.3 Wh (83%) model: SAMSUNG Electronics status: Full 
CPU:       Topology: Dual Core model: Intel Core i5-3317U bits: 64 type: MT MCP arch: Ivy Bridge rev: 9 L2 cache: 3072 KiB 
           flags: avx lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 13573 
           Speed: 2460 MHz min/max: 800/2600 MHz Core speeds (MHz): 1: 2428 2: 2394 3: 2459 4: 2395 
Graphics:  Device-1: Intel 3rd Gen Core processor Graphics vendor: Samsung Co driver: i915 v: kernel bus ID: 00:02.0 
           Display: x11 server: X.Org 1.20.5 driver: intel unloaded: modesetting resolution: 1600x900~60Hz 
           OpenGL: renderer: Mesa DRI Intel Ivybridge Mobile v: 4.2 Mesa 19.1.4 direct render: Yes 
Audio:     Device-1: Intel 7 Series/C216 Family High Definition Audio vendor: Samsung Co driver: snd_hda_intel v: kernel 
           bus ID: 00:1b.0 
           Sound Server: ALSA v: k5.2.8-1-MANJARO 
Network:   Device-1: Intel Centrino Advanced-N 6235 driver: iwlwifi v: kernel port: efa0 bus ID: 01:00.0 
           IF: wlp1s0 state: up mac: <filter> 
           Device-2: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: Samsung Co driver: r8168 v: 8.047.02-NAPI 
           port: 2000 bus ID: 02:00.0 
           IF: enp2s0 state: down mac: <filter> 
Drives:    Local Storage: total: 223.57 GiB used: 62.95 GiB (28.2%) 
           ID-1: /dev/sda vendor: Kingston model: SMS200S3240G size: 223.57 GiB 
Partition: ID-1: / size: 218.57 GiB used: 62.70 GiB (28.7%) fs: ext4 dev: /dev/dm-0 
           ID-2: /boot size: 487.9 MiB used: 250.0 MiB (51.2%) fs: ext4 dev: /dev/sda1 
Sensors:   System Temperatures: cpu: 59.0 C mobo: N/A 
           Fan Speeds (RPM): N/A 
Info:      Processes: 219 Uptime: 1d 1h 55m Memory: 3.56 GiB used: 2.31 GiB (64.8%) Init: systemd Compilers: gcc: 9.1.0 
           Shell: zsh v: 5.7.1 inxi: 3.0.35 

In a second terminal, start top or htop and sort by RAM or CPU usage.
Then run dd ...

Memory: 3.56 GiB used: 2.31 GiB (64.8%)

I don't know how fast a dual core from 06/01/2012 is.

The problem is that the screen freezes as well, so I can't see the changes in top output while dd is working :<
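
Since the screen freezes while dd runs, one workaround is to log the numbers to a file and read them afterwards. A minimal sketch, assuming /proc/meminfo is readable and that the log lives on a tmpfs such as /dev/shm so the logger does not itself wait on the SSD. The function names meminfo_kb and log_writeback are made up for this example:

```shell
# meminfo_kb FIELD [FILE]
# Print the kB value of one /proc/meminfo field (e.g. Dirty, Writeback).
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# log_writeback OUTFILE [INTERVAL_SECONDS] [SAMPLES]
# Append one timestamped sample per interval so it can be read after the hang.
log_writeback() {
    out=$1
    interval=${2:-1}
    samples=${3:-60}
    i=0
    while [ "$i" -lt "$samples" ]; do
        printf '%s Dirty=%skB Writeback=%skB\n' \
            "$(date +%T)" "$(meminfo_kb Dirty)" "$(meminfo_kb Writeback)" >> "$out"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

Start it in the background with `log_writeback /dev/shm/writeback.log 1 90 &`, run the dd command, and read the log once the system responds again. If the logger itself stalls, the gaps between timestamps give a rough idea of how long the freeze lasted.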

Why the -git version of Xfce (4.14.1git-23545d)?

That's what Manjaro Architect installed :thinking:

Should I change it?

SB will know :wink:

Please post the output of
sudo parted -l
and
cat /etc/fstab

You might also check whether AHCI is enabled for your drive in the BIOS.

Lastly, check the S.M.A.R.T. data for the drive.

https://wiki.archlinux.org/index.php/S.M.A.R.T.#smartctl
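
For reference, the checks on that wiki page boil down to a few smartctl calls (`-H`, `-A` and `-t short` are standard smartmontools options; /dev/sda is the device in this thread). The helper below is only an illustration, with a made-up name smart_raw, for pulling a single raw value out of the attribute table:

```shell
# Typical invocations (they need root and a real device, so they stay commented out):
#   sudo smartctl -H /dev/sda        # overall health verdict
#   sudo smartctl -A /dev/sda        # vendor attribute table
#   sudo smartctl -t short /dev/sda  # start a short self-test

# smart_raw ID [FILE]
# Print the RAW_VALUE column (10th field) for attribute ID from `smartctl -A`
# output, read from FILE or from standard input.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }' "${2:--}"
}
```

For example, `sudo smartctl -A /dev/sda | smart_raw 5` would print the raw reallocated sector count.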

@kuba-orlik
Since it's a Kingston, I'm pretty sure that if it's not dead yet, it will be soon. So the first thing I would advise is to back up any sensitive data you have on it, because SSDs usually die very quickly once they start to fail.

And then proceed safely with @Sinister's debug routine to determine the cause :slight_smile:

Install and test alternate kernels. I would start with 4.14.

Which scheduler are you using? cat /sys/block/sda/queue/scheduler

Anyway, I have a similar experience with bulk write operations on my laptop (Lenovo X1 Carbon). When I run my fio benchmark, which I do every once in a while on all my PCs, only the laptop freezes while the bulk of the fio test file is being written or read. I always attributed this to power-saving measures by the board. The SSD in the laptop is pretty fast, with several hundred MB/s write and read. And so is yours.
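
The scheduler file lists every scheduler the kernel offers for the device and puts the active one in square brackets. A small sketch for extracting just the active name; active_scheduler is a hypothetical helper, not an existing tool, and the default path assumes the disk is sda:

```shell
# active_scheduler [FILE]
# Print the bracketed (currently active) entry of a queue/scheduler file.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "${1:-/sys/block/sda/queue/scheduler}"
}
```

To check every block device at once, something like `for f in /sys/block/*/queue/scheduler; do echo "$f: $(active_scheduler "$f")"; done` should work.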

It is very possible it is a combination of the scheduler and kernel in use.

Also run a memtest to see if your RAM is OK.

Strange that an SSD should have this kind of problem. You can try lowering the dirty memory settings. It has worked for me in the past for this type of problem on an HDD.

First write down your current settings from this output:

sysctl {vm.dirty_background_bytes,vm.dirty_bytes,vm.dirty_background_ratio,vm.dirty_ratio}

Then run (as root):

sysctl -w vm.dirty_background_bytes=67108864
sysctl -w vm.dirty_bytes=67108864

And run the benchmark again.
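
The value 67108864 used above is 64 MiB expressed in bytes. As far as I know, writing a *_bytes setting makes the matching *_ratio read back as 0, which is why the old values are worth writing down first. A sketch of the arithmetic, with mib_to_bytes as a made-up helper name:

```shell
# mib_to_bytes N
# Convert mebibytes to bytes for the vm.dirty_*_bytes sysctls.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Example (needs root, so it stays commented out):
#   sysctl -w vm.dirty_bytes="$(mib_to_bytes 64)"
```

Settings changed with sysctl -w only last until the next reboot, so a restart also brings back the previous values.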


Also, do you use discard in fstab or fstrim service?
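
One way to answer this from a terminal, as a sketch: `systemctl status fstrim.timer` shows whether periodic trim is scheduled, and the hypothetical helper below (fstab_has_discard is a name invented here) reports whether any fstab entry carries the discard mount option. On an encrypted root, discards also have to be allowed at the dm-crypt layer, which this check does not cover.

```shell
# fstab_has_discard [FILE]
# Exit status 0 if any non-comment fstab entry lists "discard" among its
# mount options (4th field), non-zero otherwise.
fstab_has_discard() {
    awk '$1 !~ /^#/ && NF >= 4 {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "discard") found = 1
    }
    END { exit !found }' "${1:-/etc/fstab}"
}
```

Usage: `fstab_has_discard && echo "discard is set" || echo "no discard option found"`.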

Maybe your M.2 SSD is not able to stay cool. If it gets too hot, the system can freeze. So why not get a heat sink for it?

Well, since we're throwing stuff against the wall to see if anything sticks, here's one of my favorites...

@kuba-orlik, that's a 7-year-old BIOS date. Is there anything newer than 2012?

regards

Sorry for the delay.

Model: ATA KINGSTON SMS200S (scsi)
Disk /dev/sda: 240GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags: 

Number  Start   End    Size   Type     File system  Flags
 1      1049kB  538MB  537MB  primary  ext4         boot
 2      538MB   240GB  240GB  primary

# /dev/mapper/cryptroot
UUID=3b63c4cb-f930-4190-a167-7f1abf017874	/         	ext4      	rw,relatime	0 0

# /dev/sda1
UUID=31d47118-f018-43cd-ba84-3cf7884b283e	/boot     	ext4      	rw,noatime	0 0

/swapfile       	none      	swap      	defaults,pri=-2	0 0

Here's the SMART info:

smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.78-1-MANJARO] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     KINGSTON SMS200S3240G
Serial Number:    50026B727BC57DB9
LU WWN Device Id: 5 0026b7 27bc57db9
Firmware Version: 60AABBF0
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Oct 13 17:48:06 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x7d) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Abort Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  48) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x0025)	SCT Status supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   120   120   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5277 (123 210 0)
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       3044
171 Unknown_Attribute       0x000a   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       947
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       1
181 Program_Fail_Cnt_Total  0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
189 Unknown_SSD_Attribute   0x0000   038   064   000    Old_age   Offline      -       34363932710
194 Temperature_Celsius     0x0022   038   064   000    Old_age   Always       -       38 (Min/Max 8/64)
195 Hardware_ECC_Recovered  0x001c   120   120   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unknown_SSD_Attribute   0x001c   120   120   000    Old_age   Offline      -       0
204 Soft_ECC_Correction     0x001c   120   120   000    Old_age   Offline      -       0
230 Unknown_SSD_Attribute   0x0013   100   100   000    Pre-fail  Always       -       100
231 Temperature_Celsius     0x0000   088   088   011    Old_age   Offline      -       90194313216
233 Media_Wearout_Indicator 0x0032   000   000   000    Old_age   Always       -       115567
234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       13939
241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       13939
242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       11291
244 Unknown_Attribute       0x0000   085   085   010    Old_age   Offline      -       33096153

SMART Error Log not supported

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I did perform the "Short self-test routine" with smartctl.

I do have AHCI enabled.

The scheduler output is:

[mq-deadline] kyber bfq bfq-mq none

Switching to 4.14 did indeed fix the issue :open_mouth: Now why would that be? It's kind of against the intuition that newer is better :wink: Perhaps it's some kind of mismatch between i/o scheduler and the kernel version?

Yes, I definitely feel the scheduler is part of it as well.

I use:

noop deadline cfq [bfq]

I use older kernels on my older hardware. My 10-year-old machines seem to like 4.9 or 4.14 better.

4.14 uses non-multiqueue (or single queue, sq) schedulers, and Manjaro chooses bfq-sq by default for all drives.

4.19 and higher use multiqueue (blk-mq) schedulers and mq-deadline by default. But there were a few kernels recently that used bfq by default.

It is possible to use sq schedulers on 4.19 with the boot option scsi_mod.use_blk_mq=0. But this boot option was deprecated in 5.2.

But there should not be this much difference on an SSD. bfq is better for rotational disks or for responsiveness during high load. On an SSD this is not such a problem, unless the SSD is full or has write issues that make it perform more like a rotational disk.

The performance of bfq-sq and bfq (mq) is the same now in my testing. But you can tune the scheduler on any kernel.
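
For anyone wanting to try a different scheduler without rebooting: writing a name into the queue/scheduler file switches it at runtime, until the next boot. The sketch below wraps that in a check against the offered list. The function set_scheduler and its optional third argument (an alternative sysfs root, handy for dry runs) are assumptions of this example, not an existing tool:

```shell
# set_scheduler DEV SCHED [SYSROOT]
# Switch the I/O scheduler of DEV (e.g. sda) to SCHED if the kernel offers it.
# Must run as root when SYSROOT is the real /sys.
set_scheduler() {
    dev=$1
    sched=$2
    root=${3:-/sys}
    file="$root/block/$dev/queue/scheduler"
    # The file marks the active entry with brackets; strip them to get the list.
    avail=$(tr -d '[]' < "$file")
    case " $avail " in
        *" $sched "*)
            printf '%s\n' "$sched" > "$file"
            ;;
        *)
            echo "scheduler '$sched' not offered for $dev (available: $avail)" >&2
            return 1
            ;;
    esac
}
```

Run as root, `set_scheduler sda bfq` would switch sda to bfq. Re-running the dd benchmark under each offered scheduler is one way to see whether the scheduler really is part of the problem.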
