Wow... amazon.com is completely disregarding the initial TCP congestion window: it looks like their servers are simply pumping as much data as they can, with an aggressive retransmit timeout as well...

See the Wireshark capture below: no client ACKs have been sent following the initial GET request for the homepage, and Amazon's servers are just pumping bytes: ~76KB outstanding, and growing!
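(If you want to check a capture of your own, here's a rough way to eyeball the unacked data with tshark; "amazon.pcap" is just a placeholder filename:

  $ tshark -r amazon.pcap -Y "tcp.analysis.bytes_in_flight" \
      -T fields -e frame.time_relative -e tcp.analysis.bytes_in_flight

Each line prints the capture timestamp plus Wireshark's bytes-in-flight estimate for that segment, so a server ignoring the initial cwnd shows up as the second column blowing well past the ~15KB you'd expect from IW10 before any client ACKs appear.)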

I guess that's one way to make your site faster... =/
 
Someone should talk to Amazon about the tragedy of the commons.
 
+Chris Heald indeed, try... you'll notice that neither of the sites cited in that post does it. :)

The cwnd=8 was the experiment Google was running back in 2010 to motivate the new setting (over cwnd=4), whereas Microsoft was just blasting away (no longer).
 
ilya, can you provide the pcap trace in addition to the screenshot?

Eyeballing "/" I don't see that.. I see more like IW=4 with a really quick retransmission scheme. Looking at a css and an image I see an IW=10, but then a definite rtt sized pause before more sending.

I've seen this happen before thanks to middleboxes that are actually doing the acking for the client.. it's not always the server. Typically they are meant to be run in pairs across slow links (like satellite), but sometimes their configuration escapes.

(is an optimistic retransmission that much different than crude FEC?)
 
+Patrick McManus different run from the one in the screenshot, but same results (unless I'm not reading it right): 

- http://www.cloudshark.org/captures/c5cdc8328aa5
https://www.evernote.com/shard/s1/sh/8d294a32-223f-4b11-bc99-40ac79993c20/70fdafd38230dbae4abfa5907bb6f9c2

~1.4s into the trace the first burst of reply packets arrives (all 29 of them), and then they continue to pile up on the client. Note: yes, I'm injecting an intentional delay on the client (ipfw)... otherwise you're racing with the actual RTT and it's hard to see this in action.
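(For reference, the delay injection is just dummynet; a minimal sketch on OS X / FreeBSD, assuming you want an extra 150ms on inbound traffic from port 80, with the rule number and match being illustrative:

  $ sudo ipfw pipe 1 config delay 150ms
  $ sudo ipfw add 100 pipe 1 tcp from any 80 to me in
  $ sudo ipfw delete 100    # remove the rule when done

Bump the delay value up or down to simulate slower or faster RTTs.)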

Middleboxes: interesting! In this case I'm inclined to point fingers at Amazon though, since this is the only site that's showing this behavior... and I went through a whole bunch yesterday.

Optimistic retransmission: I guess, kinda/sorta :) ... Hopefully a proper FEC implementation would incur less overhead though. In the capture above they basically double the overhead.
 
interesting ilya. amz is always interesting and unsatisfying :)

I can't reproduce your results from the cable modem where I'm spending Thanksgiving  - see http://cloudshark.org/captures/cd2b3b7befee which is more or less what I always get. It looks like a more normal iw=10 behavior with a few weird things from my vantage:

1] the 3whs has a 28ms rtt but the ack of the GET is 67ms. (the rtt from the GET to first byte of response is 102ms, but that could just be application queueing). This is totally reproducible, and smells strongly of some kind of tcp splitting game going on. (sort of like a CDN, but the tcp stream is handled e2e after the termination..)

2] in my trace amz sends 10 packets in 10ms and then waits about 60ms.. this is what you'd expect if the rtt is ~67. At that point 9 of those 10 packets have been acked by the client, the last is held up in an ack coalescing scheme. AMZ sends a dup for that last packet, triggering a dupack and ~50ms later the sending continues. That's totally reproducible for me and totally bizarre and suboptimal.
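(if you want to pull those numbers out of the pcap without eyeballing it, something like this tshark sketch should work; trace.pcap is a stand-in name and the fields come from wireshark's tcp analysis:

  $ tshark -r trace.pcap -Y "tcp.analysis.ack_rtt" \
      -T fields -e frame.number -e tcp.analysis.ack_rtt
  $ tshark -r trace.pcap -Y "tcp.analysis.retransmission" \
      -T fields -e frame.number -e tcp.seq

the first command shows how long each segment sat before its ack arrived, which makes the 28ms handshake vs 67ms GET-ack gap easy to spot; the second lists every segment wireshark flags as a retransmission, including that lone dup.)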

Even your trace doesn't quite fit the "dump an infinite cwnd" model - it has a large number of stops and starts (e.g. after 30KB there is a 50ms pause and another one 2 packets later) and fits of activity. I wonder if they are trying to do some kind of rate-paced congestion control that just doesn't work well at all for the huge delays you have added.

how different does it look without the injected delay from your pov?

it's worth noting that amz was also always a huge pipelining troublemaker. Sometimes when I would send requests that spanned multiple packets, amz would freak out and tcp reset the streams.

anyhow - I still lean more towards amz having a buggy system that is overcomplicated by half rather than just trying to pump as much data as fast as possible. It certainly isn't traditional though :)

fun.
 
Perhaps an ignorant question, but why is this a bad idea (i.e. why is this not how the spec works)? 
 
+Patrick McManus here's another run from my local computer: https://gist.github.com/igrigorik/7739728 (one with no delay, and the other with a 150ms delay)

- Server response time lines up with your numbers: ~100ms
- Amz seems to push ~21 KB of data at me from the start
- With 150ms delay they end up retransmitting all of it before ACKs arrive

In other words, if you're on a fast RTT, you get the content once.. but if you're on a slow RTT, then tough luck, you're getting all the bytes twice ;)
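(A quick way to quantify the "twice" part is to count the segments Wireshark flags as retransmissions against the total data segments; "150ms-delay.pcap" is just a placeholder for the capture behind the gist:

  $ tshark -r 150ms-delay.pcap -Y "tcp.analysis.retransmission" | wc -l
  $ tshark -r 150ms-delay.pcap -Y "tcp.srcport == 80 && tcp.len > 0" | wc -l

If the first count is close to the second, then effectively every data segment crossed the wire twice.)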

Now, here's the real kicker.. If I push my delay to 250 ms, check this out: https://gist.github.com/igrigorik/7739728#file-250ms-delay-txt.. As far as I can tell, it's interleaving retransmits of earlier data with new data -- by the time the first client ACK is received, it's pushed ~90KB of data (plus retransmit overhead).

My pet theory is that the stall after the first ~21KB is simply a pause in their processing: the header is flushed from one service, but the remainder takes some extra time, hence the difference?

+Tony Gentilcore yep! That said, certainly one way to make your site load faster.. :)
 
The fastest way is to send everything at once and retry every so often. That's how TCP used to work before the 1988 congestion collapse. The only problem is that it won't scale if everyone does it. Google HTTP uses IW10 as spec'd in the RFC (http://tools.ietf.org/html/rfc6928); this has been carefully experimented with, with lots of analysis and discussion on the IETF tcpm list. Of course that doesn't forbid experimenting with a faster protocol, which maybe Amazon is doing, but it would be nice to share those results so we know it's scalable.
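(For anyone curious how to get the RFC 6928 behavior on their own Linux box: the initial window is a per-route setting, so a sketch looks like the following, with the gateway and device obviously being placeholders for your own route:

  $ ip route show
  default via 192.168.1.1 dev eth0
  $ sudo ip route change default via 192.168.1.1 dev eth0 initcwnd 10

Newer kernels already ship with IW10 as the default, so this mostly matters on older machines.)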
 
Also, the other problem is that window scaling is not working very well on Windows machines...
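(A quick way to check what a given Windows box is doing, assuming Vista or later, is to look at the receive window auto-tuning setting:

  C:\> netsh interface tcp show global

If "Receive Window Auto-Tuning Level" comes back disabled, the receive window stays pinned around 64KB regardless of what the server's cwnd is doing, which caps throughput on any high-latency path.)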
 
> Sometimes when I would send requests that spanned multiple packets, amz would freak out and tcp reset the streams.

that's probably a Citrix LB bug. There seem to be quite a few of them (SSL offload being another bad one). It doesn't surprise me in the slightest that between the Citrix and webserver configurations Amz might be playing fast and loose.