 
Update: Looks like this is an xHCI specific issue, and probably not the cause of the USB device disconnects under EHCI.  To everyone who commented with other USB issues (none of which really sounded related), please email the linux-usb mailing list with a description of your issue.

Facepalm.  Linux may have been causing USB disconnects on resume from device suspend all along.

Pretty much for my entire career in Linux USB (eight years now?), we've been complaining about how USB device power management just sucks.  We enable auto-suspend for a USB device driver, and find dozens of different USB devices that simply disconnect from the bus when auto-suspend is enabled.

For years, we've blamed those devices for being cheap, crappy, and broken.  We talked about blacklists in the kernel, and ripped those out when they got too big.  We've talked about whitelists in userspace, but not many distros have time to cultivate such lists.

It turns out it's not always the device's fault.

There's a 10 ms timeout in the USB core, with a simple comment over it "TRSMRCY = 10 msec".  And indeed, in the USB 2.0 spec, section 7.1.7.7, the spec states "The USB System Software must provide a 10 ms resume recovery time (TRSMRCY) during which it will not attempt to access any device connected to the affected (just-activated) bus segment."

A software developer would say, "Ok, that means I can access the device after I wait 10 ms."  However, a hardware developer would take a careful look at Table 7-14 and notice TRSMRCY is listed as a minimum value.

That means the USB ports can be in resume for longer than TRSMRCY.  If the USB core attempts to access those ports while the device is still coming out of resume, such as issuing transfers to the device, or resetting the port, the device will disconnect, or transfer errors will occur.  This causes the USB core to mark the device as disconnected.
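
To make the difference concrete, here is a minimal sketch (not the kernel patch) of treating TRSMRCY as a floor rather than a fixed delay: wait the mandatory 10 ms, then keep checking until the port reports active or a deadline passes. `port_is_active`, the 100 ms deadline, and the 1 ms poll interval are all assumptions made up for illustration.

```python
import time

TRSMRCY_MIN_S = 0.010  # 10 ms minimum, per USB 2.0 Table 7-14


def wait_for_resume(port_is_active, timeout_s=0.1, poll_s=0.001,
                    sleep=time.sleep, clock=time.monotonic):
    """Wait at least TRSMRCY, then poll until the port is active.

    port_is_active is a hypothetical callable that returns True once the
    port has fully left the resume state. timeout_s and poll_s are
    illustrative values, not numbers from the spec.
    Returns the elapsed time in seconds, or raises TimeoutError.
    """
    start = clock()
    sleep(TRSMRCY_MIN_S)  # the spec-mandated minimum wait
    while not port_is_active():
        if clock() - start >= timeout_s:
            raise TimeoutError("port did not become active after resume")
        sleep(poll_s)  # port still resuming, so do not touch the device yet
    return clock() - start
```

A caller would only issue transfers or a port reset after `wait_for_resume` returns; on `TimeoutError` it could then treat the device as genuinely gone.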

It's easy to see that TRSMRCY varies on an Intel xHCI host controller by adding some extra debugging.  The Intel xHCI host, unlike the EHCI host, actually gives an interrupt when the port fully transitions to the active state.  The maximum time I've seen is around 17 ms, and the time is above 10 ms in about 8% of the remote wakeup events I've tested.
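
For anyone who wants to repeat this kind of measurement, a rough sketch of the bookkeeping follows. It assumes you have already collected resume-to-active times in milliseconds from your own debug output; the sample list is made up purely to show the function working and is not my data.

```python
TRSMRCY_MS = 10.0  # the 10 ms value the USB core currently waits


def summarize_resume_times(samples_ms, threshold_ms=TRSMRCY_MS):
    """Summarize how often resume took longer than the threshold."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    over = [t for t in samples_ms if t > threshold_ms]
    return {
        "count": len(samples_ms),
        "max_ms": max(samples_ms),
        "over_threshold": len(over),
        "fraction_over": len(over) / len(samples_ms),
    }


# Made-up numbers for illustration only; not measurements from real hardware.
example = [8.2, 9.1, 9.7, 10.4, 9.9, 8.8, 9.3, 16.9, 9.0, 9.5]
summary = summarize_resume_times(example)
print(summary)
```

The useful outputs are the maximum and the fraction over threshold, since those are the two numbers that say how far past 10 ms a port can go and how often it happens.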

This patch is not the "real fix" for solving the issues with the USB core, and I despise fixing things by tweaking timeout values, so I'll have to work on a real fix tomorrow.  But at least there's a light at the end of the tunnel for USB device power management.
76 comments
 
Well crap :-) But congrats on finding it. And isn't it cool that we all learn something from this? Rather than it going silently ignored and then suddenly magically start working... Kudos +Sarah Sharp !
 
Props to the Intel system validation engineers for pointing out that TRSMRCY was a minimum value when I complained the host was "violating" the USB 2.0 spec by not resuming the port within 10ms.
 
+Darren Hart Nope!  Table 7-14 doesn't list a maximum.  We'll have to find it by trial and error...
 
I had an iMac that would do that... too bad I recycled it months ago.
 
erm, guess this fixes my usb drive issues on suspend.
 
hmm interesting, i can see how it happens. though i have rarely had it happen (mostly because i don't use many usb devices)
 
+Sarah Sharp : Oh dear, this has been bothering me soooooo much! Thank you very much for looking into the issue! Looking forward to running 3.12!
 
Ouch!

Well, on the bright side we are gonna have working hibernation in a short time now that this bug has been identified :) It was about time!

What puzzles me is that I always thought you (kernel developers) worked with (some) vendors to produce those drivers. Did nobody ever tell you to wait longer?
 
Blimey, that's a useful tidbit to pin on the mental noteboard. How much must it have frustrated you to make you dig down and find the root cause!
 
Well done! Great to see progress in this area :-)
 
There is no "maximum" for a reason. Because it should be evaluated as "hey hardware developer, you will have guaranteed 10 ms from System Software  to resume". If you don't wake up in 10ms, you are clearly violating the spec. 

9.2.6.2 states: 
After a port is reset or resumed, the USB System Software is expected to provide a “recovery” interval of 10 ms before the device attached to the port is expected to respond to data transfers. The device may ignore any data transfers during the recovery interval.
After the end of the recovery interval (measured from the end of the reset or the end of the EOP at the end of the resume signaling), the device must accept data transfers at any time.
 
This is about usb autosuspend, where the host suspends a device when not using it (a web camera when not using it), not about computer suspending or hibernating. Although I suppose it could also be triggered when resuming a computer from sleep.

I hope this adds reliability to usb communications, because I always view usb as the most unreliable method of communicating with a device, with XactErr, disconnects, debounce, status change, link, unlink...

Good work.
 
If the USB spec doesn't specify a maximum time, then the problem is the USB spec is broken. Or USB developers (including the intel ones) have been misreading it. A spec that depends on "you have to play with a bunch of devices and see how long they take to get ready" is not a spec, it's a suggestion.
 
Changing 10 to 20 isn't a fix. I would not even test this. A proper patch is needed.
 
Don't you hate standards that aren't clear or just leave too much open for interpretation?
 
I just love it when I see women who solve complicated problems such as this one, you are showing the world we are moving into brighter times (Not that women aren't fit to work on complex things like this, but more that some people "think" they aren't.)! :)  Well done!
 
I've seen this behavior on Chrome OS with USB docks.  I'm guessing this could be the culprit?
 
It's interesting how software and hardware / systems engineers look at things differently! 
 
Hrm.... +Deniz Mert Edincik got me thinking. Min/max is relative based on your point of reference, host or device. It sounds like it is indeed a max for the device, and a small % of devices may have misinterpreted it as a min for the device. If so, your fix here is correct as it is still within spec for the host, and devices that do something silly like wait 10ms will still work within the next 10ms (total of 20ms) if they got the rest of it right. I suppose I should go read the LKML thread, someone smarter than me has more than likely already said this or refuted it :-)
 
Great find! And I'm getting stuck in reading micro controller datasheets ;D I really love your work +Sarah Sharp 
 
Guessing it's impossible to tell to what extent this is a problem?
 
Sarah Sharp is pro-level! Living the dream; I would treasure finding this.
 
Speaking of USB and power management, I have an Intel DZ77BH-55K motherboard with USB 3.0 ports. If I connect then disconnect physically an external hard-disk and I suspend to RAM the computer, it wakes up immediately. If I don't connect the disk at all, suspend works fine. Do you happen to have any idea what might be causing this?
 
Wow!  I wrote a patch for my (the new) motherboard chipset's USB about 6-ish years ago to prevent it from exhibiting a very similar behavior with a USB KVM switch resuming. Other OS's never had issues with the device/hardware combo.  I never really submitted it and assumed it was a kludge.  I assumed I was being stupid and not understanding the problem.  Turns out I might have been onto something!  Great catch!  Nice to see I wasn't crazy the whole time.  
 
The more I think about it, the more I think it's a badly written spec and +Deniz Mert Edincik and +Darren Hart have done a bang-up job of explaining just what the problem is. "Min/max is relative based on your point of reference, host or device. It sounds like it is indeed a max for the device, and a small % of devices may have misinterpreted it as a min for the device."
 
+Peter da Silva The problem here is that host controller engineers also interpreted the value as a minimum. Both the host and device have to participate in link training after resume, so the fault could be on either side, or perhaps a combination of factors.
 
I think this is an acceptable fix even if no better permanent fix happens. 20ms is small enough that no human will notice the difference (it's not like it's being changed from 1 to 2 seconds).
 
Yeh, I get it that it's too late to fix the hardware, but it's completely clear to me that the blame belongs on the broken spec.
 
Ah, the good old USB spec - "it's more what you'd call 'guidelines' than actual rules"... :-)
 
If it's up to interpretation as to whether it's a min or a max, then that's not a decent hardware spec.

Not 8 years too soon, either! I love how getting exasperated and complaining to experts tends to solve huge problems.
 
+Sarah Sharp Nice catch, indeed. I have a couple of devices which obviously work fine on Windows systems (since they shipped at all). Mainly my Unicomp keyboard is a bit horrible when it comes to waking up.
 
I cannot see how the spec is broken unless you isolate table 7-14 and remove the "conditions" column.  Sure, table 7-14 lists min/max values, but it quite clearly says this is provided by system software and links to section 7.1.7.7 which clarifies it.  Logically I can see no defense for the interpretation as a minimum value for device hardware and I would still consider any hardware with >10ms requirement in violation of the spec.
 
Perhaps the real solution is disassembling the "gold standard" of USB implementations, the one that everyone probably tests against, and seeing what timeout they use? :)

P.S. usbhub.sys!UsbhPortResumeComplete waits for 10ms, just sayin'.
 
Wow, nice work nailing the bug down like that - it sounds like a frustrating one to identify! And thanks for all your work on the innards of the code that I use every day :-)
 
Awesome write up! thanks for sharing!
 
I always suspected something like this, ever since 2006 when I tried to set up a backup machine on Linux with a USB multi-drive unit (yeah, yeah, I was on a tight budget of exactly $0, and those were what I had to hand). I also tried firewire, but the linux support for that was even worse. I finally ended up using a five-year-old Macintosh I had lying around, because it, you know, worked.
 
+Sarah Sharp Just read the spec in detail. Section 7.1.7.7 clearly says "The USB System Software must provide a 10 ms resume recovery time (TRSMRCY) during which it will not attempt to access any device connected to the affected (just-activated) bus segment."

And your aforementioned table 7-14 references that passage, showing that the software is required to provide AT LEAST 10ms. The software can also wait as long as it wants. Technically, this means the hardware should be ready at 10ms, or risk the software expecting it to be ready before it is.

The spec should probably have clarified further on this I suppose.

I think the hardware guys are in the wrong on this, not that it really matters. The kernel still has to deal with it, no matter who is right.
 
Seems to me that the Linux kernel documentation needs work. Of course, I'm a tech writer, so that's my answer to every problem...
 
Kudos!

I hope this means that I can re-enable xHCI on my new Haswell chip, and get rid of the USB 3.0 PCI card that I installed as a workaround. With xHCI enabled, all of my USB devices (even USB 2.0 devices) were being dropped seconds after trying to use them.

I'm not sure if the disconnects were being caused by device suspends, but it happened with 3 different USB 3.0 external drives, and a USB 2.0 SD card reader seconds after trying to read data from them.
 
+Dang Ren Bo  True, I'm not familiar enough to be entirely sure whether that was what the hardware wasn't doing, so I didn't want to use that as an example of clarification. Assuming you are right, it seems clear to me that the hardware is not following spec.
 
The problem is for some reason they read it as "After the end of the recovery interval, which is at a minimum 10ms, ...". Specs are difficult because natural language is inherently ambiguous. 
 
You're awesome Sarah! Keep up the great (and much needed) work!
 
Well spotted.
 
This is the one problem that has annoyed me the most for years, especially when using various docking stations. Thanks for finding it. Has a bug report been opened for this?
 
It's mind blowing thinking how many Linux USB hardware issues you may have fixed with this single catch! Congrats SS!
 
"Update: Looks like this is an xHCI specific issue, and probably not the cause of the USB device disconnects under EHCI."
So, it's Not a Misinterpretation of Standard Causing USB Disconnects On Resume In Linux.
 
Nice, It's hard to report on USB issues as for years I myself always leaned towards buggy implementations on cheapo devices as the primary cause. Does this mean that the spec needs some clarification or the kernel just handles it by giving the timeout a generous ceiling?
 
Well holy crap! Some good news on the USB front. (Well... bad... but good ultimately).
 
I am sorry to have to say - I changed to Ubuntu and it is a much smoother experience (Gnome and KDE). I was with SuSe for many years but it never released a distro that was free of some serious bugs.
 
+Sarah Sharp  Could the MEI_ME resume from S3 be related to this? On a side note my tomato patch looks great this year.
 
I used to work on PC Cards (PCMCIA) on a proprietary, embedded device (anyone remember Magic Cap?) and I observed a similar problem. Devices were supposed to come ready within 100ms, but I used a logic analyzer to watch that line while inserting every card I could get my hands on. Many modems, network adapters, etc. would take longer than the spec allowed. I learned the hard way, like you, that timings in a specification aren't to be taken too literally!
 
TRSMRCY now renamed to SHRPMRCY in your honor
 
Interesting.  Are you sure?  Table 7-14 states Resume Recovery Time is "Provided by USB System Software; Section 7.1.7.7" with a minimum wait of 10ms.  This is the same thing as spec'd earlier "The USB System Software must provide a 10 ms resume..."  Doesn't look like a "bug" more like an unclear to everyone but the writers spec problem.

What I'm not seeing is a clear specification as to when Trsmrcy starts?  From reading the spec, a critical timing element appears to be signal resume (Tdrsmdn):
 "The host may signal resume (TDRSMDN) at any time. It must send the resume signaling for at least 20 ms and then end the resume signaling in one of two ways, depending on the speed at which its port was operating when it was suspended."
and
"The 20 ms of resume signaling ensures that all devices in the network that are enabled to see the resume are awakened."

Looking at the spec, I would guess that the USB System Software, aka Linux in this case, starts Trsmrcy after Tdrsmdn.  I'm guessing that Trsmrcy is providing the time for the resume signal to be ended as per the spec (either a low-speed EOP or transition to high-speed idle state) as either should be completed within Trsmrcy and the devices ready.

Of course I could be completely wrong.  It would be interesting to look at the relationship between Tdrsmdn, resume signal ended, and Trsmrcy.
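
As a back-of-the-envelope check on that relationship, here is a small sketch using only the two figures quoted from the spec in this thread: at least 20 ms of resume signaling (Tdrsmdn) followed by at least 10 ms of recovery (Trsmrcy). It assumes the recovery interval starts when resume signaling ends and ignores the duration of the EOP itself, which is a simplification.

```python
TDRSMDN_MIN_MS = 20  # minimum resume signaling time quoted from the spec
TRSMRCY_MIN_MS = 10  # minimum resume recovery time quoted from the spec


def earliest_access_ms(resume_signal_ms=TDRSMDN_MIN_MS,
                       recovery_ms=TRSMRCY_MIN_MS):
    """Earliest time after the start of resume signaling at which the
    host may touch the device, under the simplified timeline above."""
    if resume_signal_ms < TDRSMDN_MIN_MS:
        raise ValueError("resume signaling shorter than the 20 ms minimum")
    if recovery_ms < TRSMRCY_MIN_MS:
        raise ValueError("recovery interval shorter than the 10 ms minimum")
    return resume_signal_ms + recovery_ms


print(earliest_access_ms())                # spec minimums
print(earliest_access_ms(recovery_ms=20))  # host waiting 20 ms instead of 10
```

Both values are floors, so a host that waits longer than 10 ms of recovery is still inside what the spec asks of system software.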
 
Interesting, I have a bug with an SD disk that I use with a partition for swap.
Every time it wakes from hibernate/suspend it simply does not wait for the SD swap partition to be ready, and it corrupts the Windows session.
Maybe it is related to this USB bug.
 
This is awesome. It's amazing to see such blaring problems fixed!
 
That is very nice detective work. How do you time the signals when testing?
But anyways: Good work !!
 
+Peter da Silva good point.  You'd think that being a mature version 3.0 spec, the specification would call for minimum AND maximum values for response times.