SSD problem (with kernel 5.6 maybe?)

Switched to kernel 5.6 about a week ago and, coincidentally, started having problems with my storage SSD at the same time.

The problem is that at some point the drive becomes inaccessible: can't read, can't write.

So either the SSD is failing or the new kernel is the problem. I have not yet had time to switch back to kernel 5.4 to see whether the problem appears there too (will do eventually).

But maybe some of you bigger Linux experts can already see the root of this problem in the logs:

journalctl -r -p3
mai   04 19:49:26 Zen kernel: blk_update_request: I/O error, dev sda, sector 494996184 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
mai   04 19:49:26 Zen kernel: blk_update_request: I/O error, dev sda, sector 494996184 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
mai   04 19:49:26 Zen kernel: blk_update_request: I/O error, dev sda, sector 494996184 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
mai   04 19:49:25 Zen kernel: blk_update_request: I/O error, dev sda, sector 520161616 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
mai   04 19:49:23 Zen kernel: EXT4-fs (sda1): I/O error while writing superblock
mai   04 19:49:23 Zen kernel: Buffer I/O error on dev sda1, logical block 0, lost sync page write
mai   04 19:49:23 Zen kernel: EXT4-fs error (device sda1) in ext4_reserve_inode_write:5618: IO failure
mai   04 19:49:23 Zen kernel: EXT4-fs (sda1): I/O error while writing superblock
mai   04 19:49:23 Zen kernel: Buffer I/O error on dev sda1, logical block 0, lost sync page write
mai   04 19:49:23 Zen kernel: EXT4-fs error (device sda1): __ext4_get_inode_loc:4368: inode #75629156: block 302514246: comm doublecmd: unable to read itable block
mai   04 19:49:23 Zen kernel: EXT4-fs (sda1): I/O error while writing superblock
mai   04 19:49:23 Zen kernel: Buffer I/O error on dev sda1, logical block 0, lost sync page write
mai   04 19:49:23 Zen kernel: EXT4-fs error (device sda1) in ext4_reserve_inode_write:5618: IO failure
mai   04 19:49:23 Zen kernel: EXT4-fs (sda1): I/O error while writing superblock
mai   04 19:49:23 Zen kernel: Buffer I/O error on dev sda1, logical block 0, lost sync page write
mai   04 19:49:23 Zen kernel: EXT4-fs error (device sda1): __ext4_get_inode_loc:4368: inode #75629156: block 302514246: comm doublecmd: unable to read itable block
mai   04 19:49:23 Zen kernel: EXT4-fs (sda1): I/O error while writing superblock
mai   04 19:49:23 Zen kernel: Buffer I/O error on dev sda1, logical block 0, lost sync page write
mai   04 19:49:23 Zen kernel: EXT4-fs error (device sda1) in ext4_orphan_add:3000: IO failure
mai   04 19:49:23 Zen kernel: EXT4-fs (sda1): I/O error while writing superblock
mai   04 19:49:23 Zen kernel: Buffer I/O error on dev sda1, logical block 0, lost sync page write
mai   04 19:49:23 Zen kernel: EXT4-fs error (device sda1) in ext4_reserve_inode_write:5618: IO failure
mai   04 19:49:23 Zen kernel: EXT4-fs (sda1): I/O error while writing superblock
mai   04 19:49:23 Zen kernel: Buffer I/O error on dev sda1, logical block 0, lost sync page write
mai   04 19:49:23 Zen kernel: blk_update_request: I/O error, dev sda, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
mai   04 19:49:23 Zen kernel: EXT4-fs error (device sda1): __ext4_get_inode_loc:4368: inode #75629156: block 302514246: comm doublecmd: unable to read itable block
mai   04 19:49:23 Zen kernel: blk_update_request: I/O error, dev sda, sector 2420115976 op 0x0:(READ) flags 0x80000 phys_seg 6 prio class 0
mai   04 19:49:23 Zen kernel: blk_update_request: I/O error, dev sda, sector 2420115968 op 0x0:(READ) flags 0x80700 phys_seg 7 prio class 0
mai   04 19:49:23 Zen kernel: EXT4-fs (sda1): I/O error while writing superblock
mai   04 19:49:23 Zen kernel: Buffer I/O error on dev sda1, logical block 0, lost sync page write
mai   04 19:49:23 Zen kernel: blk_update_request: I/O error, dev sda, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
mai   04 19:49:23 Zen kernel: blk_update_request: I/O error, dev sda, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
mai   04 19:49:23 Zen kernel: EXT4-fs error (device sda1) in ext4_reserve_inode_write:5618: IO failure
mai   04 19:49:23 Zen kernel: EXT4-fs (sda1): I/O error while writing superblock
mai   04 19:49:23 Zen kernel: Buffer I/O error on dev sda1, logical block 0, lost sync page write
mai   04 19:49:23 Zen kernel: blk_update_request: I/O error, dev sda, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
mai   04 19:49:23 Zen kernel: blk_update_request: I/O error, dev sda, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
mai   04 19:49:23 Zen kernel: EXT4-fs error (device sda1): __ext4_get_inode_loc:4368: inode #16302082: block 65014816: comm doublecmd: unable to read itable block
mai   04 19:49:23 Zen kernel: blk_update_request: I/O error, dev sda, sector 520120576 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
mai   04 19:49:21 Zen kernel: blk_update_request: I/O error, dev sda, sector 520161616 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
mai   04 19:49:19 Zen kernel: blk_update_request: I/O error, dev sda, sector 520161616 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
mai   04 19:28:30 Zen kernel: ata2: reset failed, giving up
mai   04 19:28:30 Zen kernel: ata2: softreset failed (device not ready)
mai   04 19:28:25 Zen kernel: ata2: softreset failed (device not ready)
mai   04 19:27:50 Zen kernel: ata2: softreset failed (device not ready)
mai   04 19:27:40 Zen kernel: ata2: softreset failed (device not ready)
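
One way to read the ext4 and block-layer errors together is to convert a filesystem block number on sda1 into an absolute sector on sda. The sketch below rests on two assumptions that the post does not confirm: that the partition starts at sector 2048 (suggested by the failed superblock write to sector 2048 next to "logical block 0" on sda1) and that ext4 uses 4 KiB blocks, i.e. 8 sectors of 512 bytes. `block_to_sector` is only an illustrative helper name.

```shell
# Convert an ext4 block number on a partition to an absolute 512-byte sector
# on the whole disk. Defaults are assumptions: partition start = sector 2048,
# 8 sectors per filesystem block (4 KiB blocks).
block_to_sector() {
    local block=$1 part_start=${2:-2048} sectors_per_block=${3:-8}
    echo $(( part_start + block * sectors_per_block ))
}

# itable block from the "inode #16302082" error:
block_to_sector 65014816     # prints 520120576
# itable block from the "inode #75629156" error:
block_to_sector 302514246    # prints 2420116016
```

Under those assumptions the first result is exactly the sector of the failed READ logged at 19:49:23 (sector 520120576), and the second lies 40 sectors past the start of the failed read at sector 2420115976. The failing sectors are spread across the disk and include the superblock, which fits the whole drive dropping off the bus better than one bad region.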
SMART seems alright at first glance...
$ sudo smartctl -i -a /dev/sda1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.8-1-MANJARO] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Silicon Motion based SSDs
Device Model:     Patriot P200 2TB
Serial Number:    MX_<snip>
LU WWN Device Id: 5 000000 00000000e
Firmware Version: H190117K
User Capacity:    2,048,408,248,320 bytes [2.04 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May  4 20:15:00 2020 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
without error or no self-test has ever 
been run.
Total time to complete Offline 
data collection:                (   33) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (   2) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 19
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct   0x0013   100   100   050    Pre-fail  Always       -       0
9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3817
12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       13
167 Average_Erase_Count     0x0022   100   100   000    Old_age   Always       -       0
168 Max_Erase_Count_of_Spec 0x0012   100   100   000    Old_age   Always       -       1
169 Remaining_Lifetime_Perc 0x0013   100   100   010    Pre-fail  Always       -       20971530
171 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       8590983174
175 Program_Fail_Count_Chip 0x0022   070   100   010    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0012   100   100   000    Old_age   Always       -       3547146
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   000    Pre-fail  Always       -       13568
187 Reported_Uncorrect      0x0032   100   000   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       13
194 Temperature_Celsius     0x0022   040   040   000    Old_age   Always       -       40 (Min/Max 40/40)
199 UDMA_CRC_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
206 Unknown_SSD_Attribute   0x0032   200   200   000    Old_age   Always       -       2
207 Unknown_SSD_Attribute   0x0032   200   200   000    Old_age   Always       -       16
208 Unknown_SSD_Attribute   0x0032   200   200   000    Old_age   Always       -       6
209 Unknown_SSD_Attribute   0x0032   200   200   000    Old_age   Always       -       33
210 Unknown_Attribute       0x0032   200   200   000    Old_age   Always       -       60
211 Unknown_Attribute       0x0032   200   200   000    Old_age   Always       -       49
231 Temperature_Celsius     0x0023   100   100   005    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       23097854336
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       153754169
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       7666683947
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       7090100536
245 TLC_Writes_32MiB        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
1        0        0  Not_testing
2        0        0  Not_testing
3        0        0  Not_testing
4        0        0  Not_testing
5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
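
To skim a long attribute table like the one above, a small filter can list only a few commonly watched attributes whose raw value is non-zero. This is a rough sketch: the ID list (5, 187, 197, 198, 199) is a rule of thumb and not a vendor specification, and `smart_nonzero` is a made-up helper name.

```shell
# Print ID, name and raw value of selected SMART attributes whose raw value
# is non-zero. Reads `smartctl -A` style attribute rows on stdin.
# The ID list is an assumption (a common rule of thumb), not a vendor spec.
smart_nonzero() {
    awk '$1 ~ /^(5|187|197|198|199)$/ && $NF ~ /^[0-9]+$/ && $NF + 0 != 0 { print $1, $2, $NF }'
}

# Typical use:
#   sudo smartctl -A /dev/sda | smart_nonzero
```

Fed the table above, this prints only `187 Reported_Uncorrect 1` (that attribute also shows a WORST value of 000). Whether a single reported uncorrectable error means anything on this controller is unclear.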

Just in case, here is the fstab mount line too... maybe there is something wrong with it?
UUID=c28eb095-f4a7-4952-9683-0c952da97068 /media/patriot ext4 defaults,noatime 0 2

Any suggestions on what I can do to troubleshoot this problem further?

2020-05-11 edit:
(
Chronological order of things:
2020-04-26 - stable update, which I applied a few days later, around 2020-04-30. It upgraded the kernels and amd-ucode. Switched kernel from 5.4 to 5.6.
2020-05-01 - kernel loses the SSD for the first time.
2020-05-04 - kernel loses the SSD a second time.
2020-05-04 - I apply the "stable update 2020-05-03", which also upgraded the kernels and amd-ucode.
2020-05-05 - kernel loses the SSD.
Trying a different kernel, 5.6 => 5.4.
2020-05-06 - kernel loses the SSD.
Tried cleaning the SSD pins, replaced the SATA cable etc. Switched kernel back to 5.6. Downgraded amd-ucode from 20200424 to 20200421.
2020-05-08 - new stable update. Kernels updated.
2020-05-10 - kernel loses the SSD.
2020-05-10 - I downgrade amd-ucode to 20200417 to try something. (Downgrading kernel 5.6 to an earlier 5.6 version seems trickier and I'd rather not try it.)
2020-05-11 - kernel loses the SSD. Sadly starting to think this might be a hardware problem after all :frowning:
2020-05-11 - another stable update, with new kernels. Also went back to the newest amd-ucode, as it doesn't seem to change a thing either. Looking at new 2TB+ SSDs in online stores now :frowning:
)

I have no idea what's the cause of this, hopefully someone smarter will give you some good advice.

But in the meantime, if I were you, I'd make sure my backup is rock solid; it does seem like a potential hardware failure.

If you think it might be kernel related, try another kernel. Maybe 5.4 or 4.19. This only takes a minute or two.

  1. What @kresimir said:

    How to make a crash-proof backup for Manjaro

  2. SMART is indeed fine, but:

    mai   04 19:49:26 Zen kernel: blk_update_request: I/O error, dev sda, sector 494996184 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
    

    is a hardware error, so I'm suspecting the cable or the contacts of the disk itself are faulty / dirty as SMART is looking very healthy.

    • Replace the SATA cable if a desktop / have your machine serviced if it's a laptop
    • Clean the disk contacts with a toothbrush (without toothpaste! :scream:)

Note 1: Sorry for mentioning "without toothpaste", but I've had someone else try that in the past...


Thank you for trying to help me :slight_smile:

First of all, it's not my "system drive"; it's used solely for torrents. So luckily there is no real need to back up my torrent drive :blush:

... but those are the end-of-the-line errors, I am afraid. The journalctl output is in reverse order because of -r, so the real beginning is this (also in reverse order):

mai   04 19:28:30 Zen kernel: ata2: reset failed, giving up
mai   04 19:28:30 Zen kernel: ata2: softreset failed (device not ready)
mai   04 19:28:25 Zen kernel: ata2: softreset failed (device not ready)
mai   04 19:27:50 Zen kernel: ata2: softreset failed (device not ready)
mai   04 19:27:40 Zen kernel: ata2: softreset failed (device not ready)
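
Since the reverse order makes the sequence hard to follow, here is a minimal sketch for pulling only the libata port messages out of a kernel log in the order they arrive. `ata_events` is a made-up helper name, and the `journalctl` call in the comment assumes a persistent journal that still holds the previous boot.

```shell
# Keep only libata port/device messages (e.g. "ata2:" or "ata2.00:") from
# kernel log lines on stdin, preserving the input order.
ata_events() {
    grep -E '(^|[^[:alnum:]])ata[0-9]+(\.[0-9]+)?:'
}

# Typical use (chronological kernel messages of the previous boot):
#   journalctl -k -b -1 --no-pager | ata_events
```

Leaving out -r keeps the journal in chronological order, so the first ata2 line shown is where the trouble actually starts.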

Googling kernel: ata2: softreset failed (device not ready) led me to a 2011 solution: "recompile the kernel with CONFIG_SATA_PMP=n (default is CONFIG_SATA_PMP=y)". Now I am a bit confused about how to do that, whether I really should, and why kernel 5.6 would behave differently from 5.4 in this regard. I will go back to kernel 5.4 and see if the problem recurs there (then again, with kernel 5.6 it has so far happened only 2 times in one week, so I can't make it drop the drive on command). Judging by the Google results I found, it seems to happen quite often with AMD CPUs (which I have), and kernel 5.6 / the recent stable updates did also introduce new AMD microcode... hmmm....

Another observation: when this happens, a soft reboot does not cure it either. The machine is unable to boot, even though this is not even the boot drive. After I reboot from Manjaro, it freezes on the motherboard logo screen. I then need to power it off with the PSU switch and turn it back on after a few seconds, and then everything works perfectly again, as if nothing had happened. Can some kind of faulty AMD microcode do such a thing? Can a hard power-off reset buggy microcode to its default (which would explain why I need the hard power-off)?
Maybe it has nothing to do with kernel 5.6 but with the latest stable update of amd-ucode, 20200316.8eb0b28-1 => 20200421.78c0348-1, and/or the combination of the two?

grrr... something is amiss here... stable update 2020-04-26 gave us

amd-ucode 20200421.r1628.78c0348-1 20200424.r1632.b2cad6a-1

and the next stable update 2020-05-03 gave us:

amd-ucode   20200316.8eb0b28-1   20200421.78c0348-1

and my pamac says I still have the file from the previous update, 20200424.r1632.b2cad6a-1, which even Arch Linux doesn't currently have active (https://www.archlinux.org/packages/core/any/amd-ucode/); they also have 20200421.78c0348-1. So the package was downgraded, yet 20200424 is still the newest one available in the Manjaro repos?

A pencil eraser works.

  • Try to see if there is a BIOS update (can't verify as you didn't post an inxi)
  • Different kernels are easy under Manjaro...

That's all I can think of...

:frowning:


For the record, I experienced this same problem 4 months ago, and I had to replace the SSD.

2020-05-07 Situation update.
Downgraded amd-ucode to 20200421.r1628.78c0348 --> the problem appeared again;
Switched to kernel 5.4 --> the problem appeared again;
Went physical: removed the drive, cleaned the pins with isopropyl alcohol, replaced the SATA cable with another one, and moved the drive's power to the connector closest to the PSU (the PSU has a long SATA power cable with 4 connectors) --> waiting... one day without incidents so far.

Also ran smartctl -t short, which took at least half an hour (although guides suggested it would take 2 minutes), but it came out clean:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3848         -

Regarding the BIOS update: I don't think it is necessary, as the drive had no problems with this BIOS for a year, and I don't see how an update could suddenly fix something. There is a BIOS update available, but the manufacturer advises against using it with first-gen Ryzen CPUs and reserves it for gen 2 only.

UPDATE 2020-05-10: sadly it lost the drive again. I have now rolled amd-ucode back one more step, to 20200417.r1623.6314fa0. I'm not even sure that it is the problem rather than the kernel itself, or the hardware. Just trying "something" in between incidents (which don't happen too often, but sadly do happen).

$ inxi -Fxxxz
System:    Host: Zen Kernel: 5.6.10-3-MANJARO x86_64 bits: 64 compiler: gcc v: 9.3.0 Desktop: Xfce 4.14.2 tk: Gtk 3.24.13 
           info: xfce4-panel wm: xfwm4 dm: LightDM 1.30.0 Distro: Manjaro Linux 
Machine:   Type: Desktop Mobo: ASRock model: AB350M Pro4 serial: <filter> UEFI: American Megatrends v: P5.80 date: 04/19/2019 
CPU:       Topology: 8-Core model: AMD Ryzen 7 1700 bits: 64 type: MT MCP arch: Zen rev: 1 L2 cache: 4096 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 95848 
           Speed: 3194 MHz min/max: N/A Core speeds (MHz): 1: 3194 2: 3194 3: 3194 4: 3193 5: 3191 6: 3193 7: 3193 8: 3193 
           9: 3191 10: 3193 11: 3193 12: 3192 13: 3193 14: 3193 15: 3194 16: 3192 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] driver: amdgpu v: kernel 
           bus ID: 1e:00.0 chip ID: 1002:687f 
           Display: x11 server: X.Org 1.20.8 driver: amdgpu,ati unloaded: modesetting alternate: fbdev,vesa 
           resolution: 1920x1080~60Hz, 1920x1200~60Hz, 1920x1080~60Hz 
           OpenGL: renderer: Radeon RX Vega (VEGA10 DRM 3.36.0 5.6.10-3-MANJARO LLVM 10.0.0) v: 4.6 Mesa 20.0.6 
           direct render: Yes 
Audio:     Device-1: Advanced Micro Devices [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] driver: snd_hda_intel v: kernel 
           bus ID: 1e:00.1 chip ID: 1002:aaf8 
           Device-2: Advanced Micro Devices [AMD] Family 17h HD Audio vendor: ASRock driver: snd_hda_intel v: kernel 
           bus ID: 20:00.3 chip ID: 1022:1457 
           Sound Server: ALSA v: k5.6.10-3-MANJARO 
Network:   Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: ASRock driver: r8168 v: 8.048.02-NAPI 
           port: f000 bus ID: 18:00.0 chip ID: 10ec:8168 
           IF: enp24s0 state: up speed: 1000 Mbps duplex: full mac: <filter> 
Drives:    Local Storage: total: 3.73 TiB used: 3.10 TiB (83.3%) 
           ID-1: /dev/nvme0n1 vendor: Intel model: SSDPEKNW020T8 size: 1.86 TiB speed: 31.6 Gb/s lanes: 4 serial: <filter> 
           rev: 002C scheme: GPT 
           ID-2: /dev/sda vendor: Patriot model: P200 2TB MX size: 1.86 TiB speed: 6.0 Gb/s serial: <filter> rev: 117K 
           scheme: GPT 
Partition: ID-1: / size: 1.83 TiB used: 1.27 TiB (69.4%) fs: ext4 dev: /dev/nvme0n1p2 
Sensors:   System Temperatures: cpu: 46.8 C mobo: 45.0 C gpu: amdgpu temp: 77 C 
           Fan Speeds (RPM): fan-1: 0 fan-2: 1746 fan-3: 0 fan-4: 0 fan-5: 0 gpu: amdgpu fan: 676 
           Voltages: 12v: N/A 5v: N/A 3.3v: 3.34 vbat: 3.26 
Info:      Processes: 359 Uptime: 11h 29m Memory: 15.57 GiB used: 5.54 GiB (35.6%) Init: systemd v: 244 Compilers: gcc: 9.3.0 
           clang: 10.0.0 Shell: bash v: 5.0.16 running in: xfce4-terminal inxi: 3.0.37 

When this happened to me it simply meant that the drive was about to completely die.

And smartctl didn't see any issue, as SMART only detects some malfunctions, not all of them.

What I would do is immediately back up any data and buy a new drive. I had the best results with SanDisk.


Starting to think that you might be right, sadly. It's getting more frequent now. The drive itself is a relatively new Patriot P200, released just last summer and bought in August last year. (Then again, I have no clue how good the kernel drivers are for this SMI 2258XT controller and its low-power states etc. Maybe the drive just "goes to sleep" and can't wake up? No idea how to force it never to enter any low-power states or whatever gimmicks it might be trying.)
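
On the low-power question: the kernel exposes the SATA link power management policy per host through sysfs, so it can at least be inspected and switched off as a test. The sketch below only reads the current policies. `lpm_policies` is a made-up helper name, and nothing in this thread establishes that link power management is the cause here.

```shell
# Print the SATA link power management policy of every host that exposes one.
# The optional argument (a sysfs directory) exists only to make this testable.
lpm_policies() {
    local root=${1:-/sys/class/scsi_host} f
    for f in "$root"/host*/link_power_management_policy; do
        [ -r "$f" ] || continue
        printf '%s: %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
    done
}

# To rule out link power saving on one host (as root, lasts until reboot;
# hostN is a placeholder for the host the SSD is attached to):
#   echo max_performance > /sys/class/scsi_host/hostN/link_power_management_policy
```

If the drive still drops out with the policy set to max_performance, link power saving is unlikely to be the culprit.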

If the cable and contact cleaning did not bring any help, your product is defective:

Your drive comes with a 3-year warranty according to the Patriot website, so it's time to raise a ticket with Patriot instead of trying to work around a HW error from the Manjaro side...

Include a link to this conversation in the Patriot ticket.

:sob:


in process already :smiley:


However... why is it trying "softreset" only?

[ 2460.073652] ata2: softreset failed (device not ready)
[ 2470.120126] ata2: softreset failed (device not ready)
[ 2480.673260] ata2: link is slow to respond, please be patient (ready=0)
[ 2505.156134] ata2: softreset failed (device not ready)
[ 2505.156137] ata2: limiting SATA link speed to 3.0 Gbps
[ 2510.326041] ata2: softreset failed (device not ready)
[ 2510.326044] ata2: reset failed, giving up
[ 2510.326045] ata2.00: disabled

can I somehow force HARDreset? :smiley:
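
On forcing a hard reset: libata has a `libata.force=` kernel parameter, and its documented values include `nosrst`, which suppresses soft resets for the given port. For ata2 that would be `libata.force=2:nosrst`. This is an untested assumption for this machine; a device that has stopped responding may fail a hard reset just the same. The helper below (`libata_force_setting`, a made-up name) only checks whether the running kernel was booted with such an option.

```shell
# Kernel command-line fragment (untested assumption for this machine):
#   libata.force=2:nosrst
# On a GRUB system it would typically be appended to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, followed by regenerating the GRUB config and rebooting.

# Print any libata.force= option the running kernel was booted with.
# The optional file argument exists only to make this testable.
libata_force_setting() {
    local cmdline=${1:-/proc/cmdline}
    tr ' ' '\n' < "$cmdline" | grep -E '^libata\.force='
}
```

After a reboot, `libata_force_setting` should echo the option back if it took effect; empty output means the kernel never saw it.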

My dear,

I come back to what I already told you:

:stuck_out_tongue_winking_eye:

P.S.: mai 04: :fr: :exclamation: :grin: :innocent:
