Update: Looks like this is an xHCI-specific issue, and probably not the cause of the USB device disconnects under EHCI.  To everyone who commented with other USB issues (none of which really sounded related), please email the linux-usb mailing list with a description of your issue.

Facepalm.  Linux may have been causing USB disconnects on resume from device suspend all along.

Pretty much for my entire career in Linux USB (eight years now?), we've been complaining about how USB device power management just sucks.  We enable auto-suspend for a USB device driver, and find dozens of different USB devices that simply disconnect from the bus when auto-suspend is enabled.

For years, we've blamed those devices for being cheap, crappy, and broken.  We talked about blacklists in the kernel, and ripped those out when they got too big.  We've talked about whitelists in userspace, but not many distros have time to cultivate such lists.

It turns out it's not always the device's fault.

There's a 10 ms timeout in the USB core, with a simple comment over it: "TRSMRCY = 10 msec".  And indeed, section 7.1.7.7 of the USB 2.0 spec states, "The USB System Software must provide a 10 ms resume recovery time (TRSMRCY) during which it will not attempt to access any device connected to the affected (just-activated) bus segment."

A software developer would say, "Ok, that means I can access the device after I wait 10 ms."  However, a hardware developer would take a careful look at Table 7-14 and notice TRSMRCY is listed as a minimum value.

That means the USB ports can take longer than 10 ms to come out of resume.  If the USB core accesses those ports while the device is still coming out of resume, for example by issuing transfers to the device or resetting the port, the device will disconnect or transfer errors will occur.  Either way, the USB core ends up marking the device as disconnected.
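To make the difference between "wait exactly 10 ms" and "wait at least 10 ms" concrete, here is a minimal sketch in plain userspace C.  It is not kernel code and not the fix itself.  `sim_port`, `fixed_wait_is_safe()` and `poll_wait_ms()` are hypothetical names made up for illustration, and the 1 ms polling step and the caller-supplied upper bound are assumptions, not values from the spec.

```c
#include <stdbool.h>

/* Minimum resume recovery time from USB 2.0 Table 7-14, in milliseconds. */
#define TRSMRCY_MIN_MS 10

/* Hypothetical simulated port: it becomes active ready_at_ms after resume. */
struct sim_port {
	int ready_at_ms;
};

/* True once the simulated port has fully transitioned to the active state. */
static bool sim_port_is_active(const struct sim_port *port, int now_ms)
{
	return now_ms >= port->ready_at_ms;
}

/*
 * "Software developer" reading: wait exactly TRSMRCY_MIN_MS, then touch the
 * device.  Returns true only if the port happens to be active by then.
 */
static bool fixed_wait_is_safe(const struct sim_port *port)
{
	return sim_port_is_active(port, TRSMRCY_MIN_MS);
}

/*
 * "Hardware developer" reading: treat TRSMRCY as a minimum.  Wait that long,
 * then check once per millisecond (assumed step) until the port is active or
 * max_wait_ms has elapsed.  Returns the elapsed time in ms at which the port
 * was first seen active, or -1 if it never became active in time.
 */
static int poll_wait_ms(const struct sim_port *port, int max_wait_ms)
{
	int now_ms = TRSMRCY_MIN_MS;

	while (now_ms <= max_wait_ms) {
		if (sim_port_is_active(port, now_ms))
			return now_ms;
		now_ms += 1;
	}
	return -1;
}
```

For a simulated port that needs 17 ms, `fixed_wait_is_safe()` reports that the first access lands too early, while `poll_wait_ms()` with a 50 ms budget returns 17.  A port that is ready in under 10 ms still waits the full minimum in both cases.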

It's easy to see that TRSMRCY varies on an Intel xHCI host controller by adding some extra debugging.  The Intel xHCI host, unlike the EHCI host, actually gives an interrupt when the port fully transitions to the active state.  The maximum time I've seen is around 17 ms, and the time is above 10 ms in about 8% of the remote wakeup events I've tested.
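The kind of bookkeeping behind those two numbers (worst case, and how often the minimum is exceeded) can be sketched as below.  This is a standalone illustration, not the debugging code used on the host controller.  `resume_stats` and `summarize_resume_times()` are hypothetical names, and the samples are assumed to be resume-to-active times in whole milliseconds collected some other way.

```c
#include <stddef.h>

/* Minimum resume recovery time from USB 2.0 Table 7-14, in milliseconds. */
#define TRSMRCY_MIN_MS 10

/* Hypothetical summary of a set of measured resume-to-active times. */
struct resume_stats {
	int max_ms;       /* longest time seen (0 if there are no samples) */
	size_t over_min;  /* number of samples strictly above TRSMRCY_MIN_MS */
	size_t total;     /* number of samples examined */
};

/* Walk the samples once, tracking the maximum and the count above 10 ms. */
static struct resume_stats summarize_resume_times(const int *samples_ms,
						  size_t n)
{
	struct resume_stats s = { 0, 0, n };

	for (size_t i = 0; i < n; i++) {
		if (samples_ms[i] > s.max_ms)
			s.max_ms = samples_ms[i];
		if (samples_ms[i] > TRSMRCY_MIN_MS)
			s.over_min++;
	}
	return s;
}
```

Dividing `over_min` by `total` gives the fraction of wakeups where a fixed 10 ms wait would have been too short.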

This patch is not the "real fix" for solving the issues with the USB core, and I despise fixing things by tweaking timeout values, so I'll have to work on a real fix tomorrow.  But at least there's a light at the end of the tunnel for USB device power management.