Some interesting numbers for both client devices and host operating systems with regard to USB-to-serial devices.

I think I need to look into why Linux is so slow on small packet sizes. I think I know why; it's just that no one has ever really cared in the past. Now I have a test case, and a reason to buy a Teensy++ device...
 
Hope you figure it out then ;) The USB subsystem is your beach :)
 
+Greg Kroah-Hartman - you say you have an idea - care to elaborate? Is it stack overhead, locking, libusb, or something else?
 
+Auke Kok it had better not be libusb; is that what the code is using here? If so, ick. I was thinking about some of our ldisc issues where we grab loads of locks per byte going through the system. Small packets also don't really use the USB hardware very well at all, so I want to run some apples-to-apples numbers on the same MacBook once I get a Teensy++ device.
 
Perhaps it is the driver that is not queuing URBs, i.e. working synchronously. FTDI also has docs about optimizing their hardware, but sending small packets at irregular intervals is not that great for USB. Interesting question!
 
+Greg Kroah-Hartman I haven't seen the code and am just throwing random thoughts. It seems that nobody has ever looked at small packet performance on USB at all. +Sarah Sharp something for an intern?
 
+Auke Kok small packets are always going to be bad, as you waste hardware cycles. That's what is so odd about the OS X speeds reported here; it's as if the OS is ignoring the request for small packets and bunching them all together somehow. Or the test is incorrect, which I'll check next week when I get the device.
 
+Greg Kroah-Hartman I remember looking at small packet performance for e1000 years ago - but Ethernet is easy ;^)
 
It may not be an issue with the USB stack at all.  The Portland State Aerospace Society (PSAS) found that small isochronous USB packets encountered a 1ms latency from interrupt to libusb userspace receiving the packet.  Most of that was in kernel scheduling overhead, not the USB core: http://psas.pdx.edu/news/2012-09-11/usb-timing-breakout.png  PSAS wasn't using real-time Linux, so that may be part of the issue.

More notes here:
http://psas.pdx.edu/news/2012-07-25/
http://psas.pdx.edu/news/2012-08-032/
http://psas.pdx.edu/news/2012-09-11/

The crazy rocket people have started looking at running sensor data over ethernet rather than USB to avoid the issue, so I haven't looked into it deeply.

+Auke Kok Not sure this is an intern-level task. USB performance was next on my to-do list, but that list has been side-tracked for months by small bug fixes, being an architectural contact within Intel, ramping up new people, and OPW.
 
+Sarah Sharp this benchmark isn't using libusb at all, it's writing directly to the tty device the kernel provides, so it's probably a different issue.
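For context, the write path being exercised looks roughly like the sketch below. This is only an illustration, not the actual benchmark source; the device node name (/dev/ttyACM0) and the loop count are assumptions.

/* Minimal sketch of a small-write tty benchmark loop.  Assumes the
 * device enumerates as /dev/ttyACM0; this is NOT the real benchmark
 * source, just an illustration of the pattern under discussion. */
#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/ttyACM0", O_RDWR | O_NOCTTY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct termios tio;
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);                 /* raw mode: bypass line editing */
    tcsetattr(fd, TCSANOW, &tio);

    char byte = 'x';
    for (int i = 0; i < 100000; i++) {
        /* one write() per byte: each call may become its own URB */
        if (write(fd, &byte, 1) != 1) {
            perror("write");
            break;
        }
    }

    close(fd);
    return 0;
}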
 
A 1ms latency delivering interrupts also possibly indicates hardware latencies or even some power saving interference - +Sarah Sharp perhaps a student won't be able to fix this, but it sure sounds like an intern could learn from debugging this complex issue - it's how I learned how to write kernel code in the first place.
 
I'm the guy who wrote and published this benchmark.  I use Linux as my primary desktop system, so of course I'm excited to see there's interest in these performance issues.

Though I didn't mention it on that benchmark page, I have looked at the data stream with a USB protocol analyzer (a Beagle from Total Phase).  It appears the cdc_acm driver on Linux is creating a URB for each call to write(), which results in a USB transaction for each individual byte.  The Beagle shows every USB packet holding only 1 byte of data, which is a tremendous amount of protocol overhead.

OS X is somehow able to merge the many pending single-byte writes into large USB transactions.  When I watch the bus with the Beagle, nearly all of the USB packets are the maximum 64-byte size.  As you can see from the benchmark, it makes a tremendous difference.

Another issue I've seen with the cdc_acm driver is the behavior of read() after the device is disconnected.  On OS X, read() returns -1 when the device is no longer present.  On Linux, read() returns 0, and many programs that don't treat that as a disconnect spin at 100% CPU, rapidly calling read().
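As an illustration only (not taken from any particular program), a read loop that copes with both behaviors might look like this:

/* Sketch of a read loop that survives device removal on both Linux and
 * OS X: read() == 0 (Linux after unplug) and read() < 0 (OS X) are both
 * treated as "device gone" instead of being retried in a tight loop.
 * Illustration only; fd is an already-opened tty file descriptor. */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

void read_loop(int fd)
{
    char buf[256];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* process n bytes of serial data here */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                /* interrupted, just retry */
        /* n == 0 or a real error: assume the device has gone away */
        fprintf(stderr, "device disconnected\n");
        break;
    }
}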

If someone intends to really work on the cdc_acm driver, I'd be happy to send a Teensy 3.0 board pre-programmed with this benchmark.
 
+Paul Stoffregen I ordered a Teensy 3.0 board yesterday already, so when it comes I'll program it with the benchmark code and take a look at it.

I haven't looked at the cdc_acm driver in a long time; there are probably areas where it could be sped up.  Although if userspace only gives it one byte to send, it isn't going to wait around to see if something else is going to come in before sending it off to the device, so there might not be much we can do there.  But I will check; merging packets does utilize the hardware much better.

The read() issue is interesting; userspace should have gotten a HANGUP signal telling it that the device is now gone, and it shouldn't keep trying to write to it.  The read() return of 0 is correct as far as I can tell from the TTY POSIX layer; interesting that OS X differs there.
 
OK, I got the Teensy 3.0 board (it really is tiny, nice job), got the sketch installed, and have the benchmark running on it.  I get the same speeds as you do on Linux, so I'll play with perf and see what is going on here.
 
And I've fixed one problem already, so now the 1-byte packet case is a tiny bit faster (still limited by hardware buffer issues, but we reduced the host CPU time somewhat).  More work to come...
 
+Heyer Alex a patch is queued up for 3.11-rc1 to resolve some of the 1-byte speed issues, but that's just a tiny incremental improvement.  It's really a pathological case, rarely seen in the real world, so optimizing for it is almost not worth it.  But I have ideas for how to do so; it's a matter of adding more complexity to resolve the issue, so I don't know how well it will work out.

Is this an issue you are seeing in a real-world device that you need resolved?
 
1-byte writes are probably not the norm, but I wouldn't dismiss the 1-byte case as "pathological".  One popular application that tends to write 1 byte at a time is Puredata.  Perhaps Puredata's event-driven data flow model isn't a very efficient way to design applications, but it is a very widely used application.  There are widely used binary protocols modeled after MIDI's short 2 and 3 byte messages, such as Firmata, where writes would be only 2 or 3 bytes rather than 1.  Many applications use text-based protocols over USB serial links, where writes can be expected in the 10 to 80 byte range, which is still far less than the many-kilobyte size needed for the transfer to go out as all (or mostly) 64-byte packets.

Also fairly common are protocols designed for unreliable serial links, which include a small header at the beginning and a checksum or CRC word at the end of each message.  Often those applications will perform a 1 or 2 byte write for the header, then a larger write for the data, and another 1 or 2 byte write for the checksum.  With the cdc_acm driver, those programs could be much more efficient if they composed the entire message into a single buffer and did one write (a sketch of that follows below).  But the authors of those applications almost certainly have the mental model of a UART with a FIFO, where writing small chunks will simply go into the hardware FIFO, or a buffer in the kernel that feeds the FIFO, so the data will stream continuously.

My point in this long-winded message is that real applications can be expected to perform relatively small writes, probably more than just 1 byte in most cases, but probably not the many hundreds or thousands of bytes that result in efficient use of the USB host controller.  I realize it's probably unrealistic to do what's done for TCP networking, where writes are combined into maximum-size packets.  But on the other hand, that's obviously what Apple is doing.  For small writes feeding into a packet-based medium, it really makes a tremendous difference.
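As a rough illustration of that last point (the framing and checksum below are made up, not taken from Firmata or any of the protocols mentioned), writev() lets an application hand the header, payload, and checksum to the driver as one write:

/* Sketch: composing header + payload + checksum into one write with
 * writev(), so the tty layer sees a single larger chunk instead of
 * three tiny writes.  The message layout here is hypothetical. */
#include <stdint.h>
#include <sys/uio.h>
#include <unistd.h>

ssize_t send_message(int fd, const uint8_t *payload, uint8_t len)
{
    uint8_t header[2] = { 0x7e, len };   /* made-up start byte + length */
    uint8_t checksum = 0;

    for (uint8_t i = 0; i < len; i++)
        checksum ^= payload[i];          /* simple XOR check, not a CRC */

    struct iovec iov[3] = {
        { .iov_base = header,          .iov_len = sizeof(header) },
        { .iov_base = (void *)payload, .iov_len = len },
        { .iov_base = &checksum,       .iov_len = 1 },
    };

    /* one syscall, one chunk handed to the tty/cdc_acm write path */
    return writev(fd, iov, 3);
}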
 
+Auke Kok mice don't accept 1-byte messages thrown at them from the host as fast as the host can send them; that's not their data pattern at all.  Nor do mice use the cdc-acm driver; even serial mice don't do that.

+Paul Stoffregen for "real" protocols, throwing 1-byte messages at the hardware as fast as the CPU will allow is not a normal model at all.  Going to 2 bytes increases your throughput immensely, and usually you need the device on the other end to do something with your data, so you have to wait anyway.  Do you know of any real-world applications that are affected by the current cdc-acm buffering model (i.e. for which it really is too slow for the hardware/application)?

Benchmarks are fine, but finding something that's actually affected by the current implementation seems to be difficult.