Another SSD died on me?


[252837.140962] sd 1:0:0:0: [sdb] Unhandled error code
[252837.140965] sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[252837.140968] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 0e 84 8b d8 00 00 08 00
[252837.140976] end_request: I/O error, dev sdb, sector 243567576

Somehow I doubt that both of my SSDs, in different machines, went bad within a couple of weeks of each other. Both machines are running Ubuntu 11.10 with kernel 3.0.0-16-generic, and the SSD's fstab entry looks like this:

UUID=65da1033-347d-4b9a-a660-312cb2f33ac0 / ext4 discard,noatime,errors=remount-ro 0 1
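
(A quick sanity check before trusting that discard option: confirm the drive actually advertises TRIM support. A minimal sketch, assuming hdparm is installed and the SSD is /dev/sdb:)

    # Ask the drive whether it supports TRIM; look for the
    # "Data Set Management TRIM supported" line in the output
    sudo hdparm -I /dev/sdb | grep -i trim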

The drive is a Crucial M4 256G.

Did I miss some critical Linux bug related to SSDs?
 
Rasmus: why do you doubt they failed at the same time? Were they from different manufacturers, acquired at different times? If not, proximal temporal failure seems quite plausible to me. Driver/kernel problems are also possible, of course.
 
Yes, they were bought about 4 months apart; one was actually a 128G and the other a 256G.
 
Did you try re-plugging the SATA cables? I've had a recurring error that goes away for a while with a re-plug.
 
But both were Crucial m4's?

I seriously doubt that error has anything to do with the OS. That's a failing drive. It's most likely dead.

You might be lucky and able to continue with it by just powering down the machine, turning it around five times widdershins, and sacrificing a small goat. Don't get any blood on the electronics, though. When you power it back up, maybe things just magically work again for a while.

But Crucial is one of those SSDs that do bad garbage collection and really want TRIM support. The disk manufacturers will tell you that is a good thing - but they are full of sh*t. It's a big blinking sign that says "incompetent firmware", and that's a really, really bad thing in an SSD.
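
(Side note for anyone following along: if the continuous discard mount option is a suspect, periodic TRIM is an alternative. A minimal sketch, assuming util-linux's fstrim is available:)

    # Trim the free space on the root filesystem in one pass, e.g. from a
    # nightly cron job, instead of mounting with the discard option
    sudo fstrim -v /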

Another big red warning sign is "can be slow on random writes". The Crucials are very high performance on average, but have horrible max random latencies when under stress.

See for example

http://www.anandtech.com/show/4253/the-crucial-m4-micron-c400-ssd-review/2

which shows really great write latencies from the benchmarks at the top of the page. Scroll down a bit, though, to after the 20-minute stress test, and read some more: "max write random latencies can reach as high as 1.4 seconds".

At that point, I don't want to touch the hardware any more. It's broken shit.
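
(For anyone who wants to reproduce that kind of worst-case number on their own drive, a sustained random-write run will surface the max latencies. A sketch using fio; the file location, size, and 20-minute runtime are my own arbitrary choices:)

    # 20 minutes of 4K random writes to a file on the SSD's filesystem;
    # check the "clat" max in fio's output for the worst-case latency
    fio --name=randwrite-stress --filename=./fio-testfile --size=1G \
        --rw=randwrite --bs=4k --direct=1 --ioengine=libaio \
        --runtime=1200 --time_based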

Personally, I do like Intel SSDs. I've had several of them since the very first ones, and while they've wanted firmware updates too, on the whole they seem to be pretty good. They are no longer the fastest ones around, and they were never the cheapest, but they do seem to be reliable.

But other drives really do seem to be getting pretty good too. I'm not sure I really like the compression SandForce does (it smells a bit like benchmark tuning to me), but outside of my worries about tuning for benchmarks, it does have a rather good reputation, and a lot of data really is very compressible - and lowering write factors to SSDs is a fundamentally good idea.

Anyway, bottom line: SSDs have improved tremendously in the last few years, and the utter garbage that you could dismiss immediately is largely gone. But firmware bugs are still a huge issue, and if you do a bad job at garbage collection, you almost certainly do a bad job at wear leveling and avoiding excessive write amplification too.

And if you don't do wear leveling right, or you don't avoid write amplification sufficiently, the drives will fail. Not immediately, but you can't expect years out of them. And I suspect that is what you're seeing.
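
(If you want a rough read on how much your drive has actually been written to, some SSDs expose a host-write counter over SMART. A sketch, with the caveat that attribute 241's name, unit, and availability vary by vendor:)

    # Raw value of SMART attribute 241 (Total_LBAs_Written), where exposed;
    # needs smartmontools, and the unit is vendor-specific
    sudo smartctl -A /dev/sdb | awk '$2 == "Total_LBAs_Written" { print $10 }'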
 
"...sacrificing a small goat. Don't get any blood on the electronics, though..." Ok, note to myself "Never buy an SSD ask Linus Torvalds what todo with my possibly bad SSD"
 
Dear Sir +Linus Torvalds, yes sir, true. Great. Sir, I searched for you on Google, but there is no personal contact for you on any webpage! You are the co-inventor of Linux. I read your news, wiki, articles, etc. Great. Look, your friend +Rasmus Lerdorf has not given me feedback on my question. :) Sir, what do you say about this? Bittu Gandhi (Researcher, Author)
 
+Wilhelm Babernits You can always use the goat to make a "Seco de cabrito" (delicious goat stew, typical from the North of Peru). So all is not lost :-)
 
+Rasmus Lerdorf, looking at the errors, I would also tend to go for a hw failure. I have an Asus 1000 netbook that came with 2 SSDs (ASUS-PHISON), one 8GB and the other 32GB (got this machine in 2007), and I have been running Ubuntu on this little thing since day one (running 11.10 now).
I have not been kind to it, not even optimizing my fstab for the SSDs (for historical reasons the 8GB is ext4 and the 32GB is ext3) and using the default config. My 3yr old son has dropped it more times than I could care to count, and still the damn thing keeps ticking.

Long shot: did you see if the same SSDs give errors when put in another box? Once at work we had SSDs behaving wonky (they were hooked to an extension card), and it turned out to be the freaking card. Just an idea.
 
+Linus Torvalds I have 3 SSDs across 3 different machines:

HTPC: (still working)
Crucial RealSSD C300 CTFDDAC064MAG-1G1 2.5" 64GB SATA III

Desktop: (broke today)
Crucial M4 CT128M4SSD2 2.5" 128GB SATA III

Thinkpad: (broke 3 weeks ago)
Crucial M4 CT256M4SSD2 2.5" 256GB SATA III
 
I made the mistake of relying on an SSD once. And lost a few private keys because of it. Woe.
 
+Rasmus Lerdorf, did you check whether Crucial has put out new firmware? I've experienced lots of trouble with the Corsair series, and without the latest firmware it's not a question of whether it will fail, but when.
This could explain why the HTPC is still working.
 
So I upgraded the 128GB M4 to firmware version 0309 and have been recompiling and make testing PHP for the past 6 hours without a hiccup. Not ready to declare it fixed yet, but it looks promising. If it really does fix it, I wonder what in the world caused it to run for 10 months on firmware 0001 without problems and then suddenly start failing.
 
+Linus Torvalds So yes, it turned out to be bad (extremely bad) firmware which made the SSD stop responding after 5184 hours of power-on time. The Changelog entry from http://www.crucial.com/support/firmware.aspx (choose M4 9mm in the dropdown):

Correct a condition where an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time. The drive will recover after a power cycle, however, this failure will repeat once per hour after reaching this point. The condition will allow the end user to successfully update firmware, and poses no risk to user or system data stored on the drive.

That explains why all my drive tests passed after rebooting but it would then fail again shortly after. I have upgraded the firmware in both and they both seem to be working well again.
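
(For anyone wondering whether their own M4 is close to the threshold, both the firmware version and the hour count are easy to check. A minimal sketch, assuming smartmontools is installed and the drive is /dev/sda:)

    # Firmware version and power-on hours; the M4 bug triggers at 5184 hours
    sudo smartctl -i /dev/sda | grep -i firmware
    sudo smartctl -A /dev/sda | awk '$2 == "Power_On_Hours" { print $2, $10 }'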
 
Wish I knew this before I sent the drive back to the manufacturer. This might be a good time to re-key. Le sigh.
 
Excellent! I had the same problem; I upgraded the firmware and everything is OK, not a bit lost. Thanks, Google, for finding this conversation, and thanks to all of you! By the way, my disk was recognized by Ubuntu, where I was using it day to day, but when I put it in a Windows 7 machine for the firmware update application, it was not there. The DOS ISO image update method worked without problems. So Linux recognized the drive; Windows did not.
 
To clarify: I ran the DOS ISO image update over a direct SATA connection, as in normal usage. Before the update, Ubuntu could see the disk through a USB-SATA converter, but Windows 7 could not (I have not tested that after the update).
 
My question exactly: did I miss something? That was not hard to find... guess the answer is to be found in farming. Anyway:

In use since Nov 14 2011 as a server, always on. Not bad, 7.5 months for this SSD!

"disk/drive"
    Crucial M4 64GB SSD

kern.log
    Jun 27 14:01:05 banaan kernel: [ 7204.405275] sd 1:0:0:0: [sda] Unhandled error code
    Jun 27 14:01:05 banaan kernel: [ 7204.405280] sd 1:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    Jun 27 14:01:05 banaan kernel: [ 7204.405285] sd 1:0:0:0: [sda] CDB: Read(10): 28 00 03 c0 99 c1 00 00 08 00
    Jun 27 14:01:05 banaan kernel: [ 7204.405295] end_request: I/O error, dev sda, sector 62953921

kernel
    Linux banaan 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

The same happens with the Ubuntu -24 and -25 kernel builds.

    UUID=3d183dc5-e418-4b02-93c1-469de0b2d023 /               ext4    discard,noatime,errors=remount-ro 0       1

It's an EFI boot machine; any hints for adding a second/alternative EFI boot partition?
 
I found this post after some googling and it saved me from tossing my Crucial m4.  My SSD recently started causing my system to lock up and sure enough, it's around 5200+ hours of on-time and updating the firmware stopped the lockups.  I never would have guessed. Thanks!
 
Sorry for the necromancy, but I just hit the same problem immediately after replacing an HDD with an SSD, and it seems I have to blame the Linux kernel this time :( (sorry, +Linus Torvalds, SSDs might not be that bad).

All the I/O errors appear right after waking from suspend. The bug has been described and confirmed in the Ubuntu bug tracker: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/819096
The latest 3.9 kernel still has this bug.
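
(A quick way to check whether the errors really do correlate with resume; a trivial sketch:)

    # Right after waking the machine, look for fresh I/O errors in the kernel log
    dmesg | grep -E 'DID_BAD_TARGET|I/O error' | tail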
 
One of mine died yesterday, but I don't know why; it wasn't even connected, lol. It was my spare disk.