A coworker was experiencing a strange problem with their Internet connection at home. Large downloads from most sites worked fine. The exception was that downloads from Amazon S3 would get up to a good speed (500 Mbps), stall completely for a few seconds, restart for a while, stall again, and eventually hang completely. The problem seemed to be specific to S3; downloads from generic AWS VMs were ok.

What could be going on? It shouldn't be a problem with the ISP, or anything south of that: after all, connections to other sites were working. It should not be a problem between the ISP and Amazon, or there would have been problems with AWS too. But it also seems very unlikely that S3 would have a trivially reproducible problem causing large downloads to hang. It's not like this is some minor use case of the service.

If it had been a problem with e.g. viewing Netflix, one might suspect some kind of targeted traffic shaping. But an ISP throttling or forcibly closing connections to S3 but not to AWS in general? That's just silly talk.

The normal troubleshooting tips like reducing the MTU didn't help either. This sounded like a fascinating networking whodunit, so I couldn't resist butting in after hearing about it through the grapevine.

The packet captures

The first step of debugging pretty much any networking problem is getting a packet capture from as many points in the network as possible. In this case we only had one capture point: the client machine. The problem could not be reproduced on anything but S3, and obviously taking a capture from S3 was not an option. Nor did we have access to any devices elsewhere on the traffic path. [0]

A superficial check of the ACK stream showed the following pattern. The traffic would be humming along nicely; from the sequence numbers we can see that about 57MB have already been downloaded in the first 2.5 seconds.

00:00:02.543596 client > server: Flags [.], ack 57657817
00:00:02.543623 client > server: Flags [.], ack 57661318
00:00:02.543682 client > server: Flags [.], ack 57667046

Then, a single packet loss occurs. We can tell from the SACK block that 1432 bytes of payload are missing. That's almost certainly a single packet.

00:00:02.543734 client > server: Flags [.], ack 57667046,
    options [sack 1 {57668478:57669910}]
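
To spell out the arithmetic behind that claim: the missing bytes sit between the cumulative ACK and the left edge of the SACK block. A trivial Python snippet, using the numbers from the capture above:

# The hole is the gap between the cumulative ACK and the SACK left edge.
ack = 57667046
sack_left, sack_right = 57668478, 57669910
print(sack_left - ack)         # 1432 bytes missing -> almost certainly one segment
print(sack_right - sack_left)  # 1432 bytes received out of order beyond the hole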

After the single packet loss, more data continues to be delivered with no problems. In the next 100ms a further 6MB gets delivered. But the missing data never arrives.

...
00:00:02.648316 client > server: Flags [.], ack 57667046,
    options [sack 1 {57668478:63829515}]
00:00:02.648371 client > server: Flags [.], ack 57667046,
    options [sack 1 {57668478:63830947}]

In fact, no further ACKs are sent for 4 seconds. And even then the hole isn't filled by one 1432-byte packet like we expected, but by two 512-byte packets and one 408-byte one. There's also an RTT-sized delay between the first and second packets.

00:00:06.751691 client > server: Flags [.], ack 57667558,
    options [sack 1 {57668478:63830947}]
00:00:06.792592 client > server: Flags [.], ack 57668070,
    options [sack 1 {57668478:63830947}]
00:00:06.796277 client > server: Flags [.], ack 63830947

After that, the connection continues merrily along, but the exact same thing happens 3 seconds later.

What can we tell from this? Clearly the actual server would have been retransmitting the lost packet much more quickly than with a 4-second delay, and it would not normally re-packetize a 1432-byte packet into three pieces. Instead what must be happening is that each retransmitted copy is getting lost. After a few seconds RFC 4821-style path MTU probing kicks in, and a smaller packet gets retransmitted. For some reason this retransmission makes it through; this makes the sender believe that the path MTU has been reduced, and it starts sending smaller packets.

Again this suggests there's something dodgy going on with MTUs, but as mentioned in the beginning, reducing the MTU did not help.

But it also suggests a mechanism for why the connection eventually hangs completely, rather than alternating between stalling and recovering. There's a limit to how far the MSS can be reduced. If nothing else, the segments will need to have at least one byte of payload. In practice most operating systems have a much higher limit on the MSS (something in the 80-160 byte range is typical). If even packets of the minimum size aren't making it through, the server can't react by sending smaller packets.

With the information from the ACK stream exhausted, it's time to look at the packets in both directions. And what do you know? We actually see the earlier retransmissions at the client, with beautiful exponential backoff. The packets were not lost in the network, but were silently rejected by the client for some reason.

00:00:02.685557 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:02.960249 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:03.500500 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:04.580168 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:06.751657 server > client: Flags [.], seq 57667046:57667558, ack 4257, length 512
00:00:06.751691 client > server: Flags [.], ack 57667558, win 65528,
    options [sack 1 {57668478:63830947}]
00:00:06.792565 server > client: Flags [.], seq 57667558:57668070, ack 4257, length 512
00:00:06.792567 server > client: Flags [.], seq 57668070:57668478, ack 4257, length 408
00:00:06.792592 client > server: Flags [.], ack 57668070,
    options [sack 1 {57668478:63830947}]
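
Just to make the "beautiful exponential backoff" concrete, here's a throwaway check of the spacing of the server's retransmissions above (timestamps copied by hand from the trace; the last one is the first 512-byte probe rather than another full-sized copy):

times = [2.685557, 2.960249, 3.500500, 4.580168, 6.751657]
gaps = [round(b - a, 3) for a, b in zip(times, times[1:])]
print(gaps)   # [0.275, 0.54, 1.08, 2.171] -- each wait roughly double the previous one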

There are really just two reasons this would happen. The IP or TCP checksum could be wrong. But how could it be wrong for the same packet six times in a row? That's crazy talk; the expected packet corruption rate is more like one in a million. Alternatively, the packet could be too large. But damn it, we know that's not the problem, no matter how well this case matches the common pattern. Let's just have a look at the checksums, to rule it out...

server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
...

Oh... Every single copy of that packet had a checksum of 0 instead of the expected checksum of 0xd7a7. (Checksums of 0 are often not real errors, but just artifacts of checksum offload: the packets get captured by software before the checksum is computed by the hardware. That's not the case here; these are packets we're receiving rather than transmitting.) And it gets crazier when we look at the next instance of the problem a few seconds later.

server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
...

It's the exact same problem, all the way down to the expected TCP checksum being exactly 0xd7a7. Further analysis of the captures verified that this was a systematic problem and not a coincidence. Packets with an expected checksum of 0xd7a7 would always have the checksum replaced with 0. Packets with any other expected checksum would work just fine. [1]

This explains why the path MTU probing temporarily fixes the problem: the repacketized segments have different checksums, and make it through unharmed.

TCP Timestamps

So, a problem internal to S3 is causing this very specific kind of packet corruption then?

Not so fast! It turns out that most TCP implementations would work around this kind of corruption by accident. The reason for that is TCP Timestamps. And while you don't need to actually know much about TCP Timestamps to understand this story, I have been looking for an excuse to rant about them.

With TCP Timestamps, every TCP packet will contain a TCP option with two extra values. One of them is the sender's latest timestamp. The other is an echo of the latest timestamp the sender received from the other party. For example, here the client is sending the timestamp 805, and the server is echoing it back:

client > server: Flags [.], ack 89,
    options [TS val 805 ecr 10087]
server > client: Flags [P.], seq 89:450, ack 569,
    options [TS val 10112 ecr 805]
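
On the wire the option is easy to pick out: kind 8, length 10, then the two 32-bit values, in practice preceded by two NOP bytes of padding (that's where the 12 bytes of overhead mentioned below come from). A rough decoding sketch, using the values from the trace above:

import struct

def parse_tcp_timestamps(options: bytes):
    # Pick the Timestamps option (kind 8, length 10) out of a raw TCP options
    # field and return (TSval, TSecr), or None if it's absent.
    # A bare-bones sketch, not a complete options parser.
    i = 0
    while i < len(options) - 1:
        kind = options[i]
        if kind == 0:                          # end of option list
            break
        if kind == 1:                          # NOP padding
            i += 1
            continue
        length = options[i + 1]
        if length < 2:                         # malformed option, give up
            break
        if kind == 8 and length == 10 and i + 10 <= len(options):
            return struct.unpack("!II", options[i + 2:i + 10])
        i += length
    return None

# The option bytes behind "TS val 805 ecr 10087", with the usual two NOPs:
example = bytes([1, 1, 8, 10]) + struct.pack("!II", 805, 10087)
print(parse_tcp_timestamps(example))           # (805, 10087)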

TCP Timestamps were added to TCP very early on, for two reasons, neither of which was very compelling in retrospect.

Reason number one was PAWS, Protection Against Wrapped Sequence Numbers. The idea was that very fast connections might require huge TCP window sizes, and minor packet reordering/duplication might cause an old packet to be interpreted as a new packet, due to the 32-bit sequence number having wrapped around. I don't think that world ever really arrived, and PAWS is irrelevant to practically all TCP use cases.

The other original reason for timestamps was to enable TCP senders to measure RTTs in the presence of packet loss. But this can also be done with TCP Selective ACKs, a feature that's much more useful in general (and thus was widely deployed a lot sooner, despite being standardized later).

In exchange for these dubious benefits, every TCP packet (both data segments and pure control packets) is bloated by 12 bytes. This is in contrast to something like selective ACKs, where most packets don't grow in size. You only pay for selective ACKs when packets are lost or reordered. I think that the debuggability of network protocols is important, but with TCP you get basically everything you need from other sources. TCP timestamps have a high fixed cost, but give very little additional power.

If TCP Timestamps suck so much, why does everyone use them? I don't know anyone else's reasons for sure. I ended up implementing them purely due to an interoperability issue with the FreeBSD TCP stack. Basically FreeBSD uses a small static receive window for connections without TCP timestamps, while with TCP timestamps on it'd scale the receive window up as necessary. On connections with even a bit of latency, you needed TCP timestamps to avoid the receive window becoming a bottleneck. (This was fixed in FreeBSD a few months ago. Yay!)

Now, performance of FreeBSD clients isn't a big deal for me as long as the connections work. But you know who else uses a FreeBSD-derived TCP stack? Apple. And when it comes to mobile networks, performance of iOS devices is about as important as it gets. Anyone who cares about large transfers to iOS or OS X clients must use TCP Timestamps, no matter how distasteful they find the feature.

"But Juho, what does any of this have to do with S3?", you ask. Well, S3 is one of those rare services that disable timestamps. And that actually makes for a big difference in this case. With timestamps, each retransmitted copy of a packet would use a different timestamp value [2]. And when any part of the TCP header changes, odds are that the checksum changes as well. Even if some packets are lost due to the having the magic checksum, at least the retransmissions will make it through promptly.

To check this theory, I asked for a test with TCP timestamps disabled on the client. And immediately large downloads from anywhere - even the ISP's own speedtest server - started hanging. Success!
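
For the record, on a Linux client that switch is just the net.ipv4.tcp_timestamps sysctl; a minimal sketch of toggling it, assuming Linux and root - I don't actually know what OS the client was running:

# Equivalent to "sysctl -w net.ipv4.tcp_timestamps=0"; affects new connections only.
def set_tcp_timestamps(enabled: bool) -> None:
    with open("/proc/sys/net/ipv4/tcp_timestamps", "w") as f:
        f.write("1\n" if enabled else "0\n")

set_tcp_timestamps(False)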

Conclusion

With this information I suggested my coworker call his ISP and report the problem. He was smarter than that, and ran one more test: switching the cable modem from router mode to bridging mode. Bam, the problem was gone. In retrospect this makes sense: in router mode the cable modem is doing NAT, and needs to update the TCP checksum of every packet that passes through the device. In bridging mode there's no NAT, so no checksum update is needed.
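
Specifically, NAT rewrites addresses and ports, and the TCP checksum covers both (the addresses via the pseudo-header), so the modem has to adjust the checksum of every forwarded segment. A sketch of the standard RFC 1624-style incremental update, just to show the kind of arithmetic involved - what the modem actually got wrong for 0xd7a7 I can only guess at:

def ones_add(a: int, b: int) -> int:
    # 16-bit one's complement addition (with end-around carry).
    s = a + b
    return (s & 0xffff) + (s >> 16)

def adjust_checksum(cksum: int, old_word: int, new_word: int) -> int:
    # RFC 1624 incremental update: HC' = ~(~HC + ~m + m'). This is what a NAT
    # does to the TCP checksum when it rewrites one 16-bit word of the packet
    # (a port, or half of an IP address).
    s = ones_add(~cksum & 0xffff, ~old_word & 0xffff)
    s = ones_add(s, new_word & 0xffff)
    return ~s & 0xffff

# e.g. rewriting one half of the destination address from a made-up public
# value to the 192.168 half of a private address:
print(hex(adjust_checksum(0xd7a7, 0x5db8, 0xc0a8)))   # 0x74b7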

And that's how a dodgy cable modem caused downloads to fail with one service, but one service only. I've seen many kinds of packet corruption before, but never anything that was so absurdly specific.

Footnotes

[0] There are techniques around for routing the traffic such that we would have had a measurement point. One would have been using something like a VPN or a SOCKS proxy. But that's such a fundamental change to the traffic pattern that it doesn't make for a very interesting test. Odds are that the problem would just go away when you do that. The other option would be to use a fully transparent generic TCP proxy on some server with a public IP: have the client connect to the TCP proxy, and the proxy connect to the actual server. But setting that up is tedious; certainly not worth doing as a first step.
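
The byte-shuffling core of such a proxy is simple enough - something like the Python sketch below, with made-up addresses. The tedious part is everything around it: running it on a host with a public IP, arranging for the client's traffic to actually go through it, and capturing at the relay.

import socket
import threading

LISTEN = ("0.0.0.0", 8443)                     # made-up listening port
TARGET = ("the-real-server.example.net", 443)  # made-up origin server

def pump(src: socket.socket, dst: socket.socket) -> None:
    # Copy bytes from src to dst until src closes.
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def handle(client: socket.socket) -> None:
    upstream = socket.create_connection(TARGET)
    threading.Thread(target=pump, args=(upstream, client), daemon=True).start()
    pump(client, upstream)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(LISTEN)
listener.listen(16)
while True:
    conn, _ = listener.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()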

It's also pretty common to only have one trace point to start with. For the kind of analysis I do for actual work, we pretty often have just a trace from somewhere in the middle of the path, but nothing from the client or the server. Getting traces from multiple points is so much trouble that we usually need to roughly pinpoint the problem first with a single-point packet capture, and only then ask for more trace points.

[1] As far as I can tell 0xd7a7 has no interesting special properties. The bytes are not printable ASCII characters. 0xd7a7 isn't a value with any special significance in another TCP header field either. There are ways to screw up TCP checksum computations, but I think they're mostly to do with the way 0x0 and 0xffff are both zero values in a one's complement system.
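
For completeness, the checksum in question is the standard RFC 1071 Internet checksum, computed over the TCP header, the payload, and a pseudo-header containing the IP addresses. A minimal sketch of the computation:

import struct

def internet_checksum(data: bytes) -> int:
    # RFC 1071: the one's complement of the one's complement sum of the data
    # taken as big-endian 16-bit words.
    if len(data) % 2:
        data += b"\x00"                        # pad odd-length data
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total > 0xffff:                      # fold the carries back in
        total = (total & 0xffff) + (total >> 16)
    return ~total & 0xffff

# Note the quirk mentioned above: 0x0000 and 0xffff are both representations
# of zero in this arithmetic.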

[2] Assuming a sensible timestamp resolution. Not the rather impractical 500ms tick that e.g. OpenBSD uses.