A coworker was experiencing a strange problem with their Internet connection at home. Large downloads from most sites worked fine. The exception was that downloads from Amazon S3 would get up to a good speed (500 Mbps), stall completely for a few seconds, restart for a while, stall again, and eventually hang completely. The problem seemed to be specific to S3; downloads from generic AWS VMs were ok.

What could be going on? It shouldn't be a problem with the ISP, or anything south of that: after all, connections to other sites were working. It should not be a problem between the ISP and Amazon, or there would have been problems with AWS too. But it also seems very unlikely that S3 would have a trivially reproducible problem causing large downloads to hang. It's not like this is some minor use case of the service.

If it had been a problem with e.g. viewing Netflix, one might suspect some kind of targeted traffic shaping. But an ISP throttling or forcibly closing connections to S3 but not to AWS in general? That's just silly talk.

The normal troubleshooting tips like reducing the MTU didn't help either. This sounded like a fascinating networking whodunit, so I couldn't resist butting in after hearing about it through the grapevine.

The packet captures

The first step of debugging pretty much any networking problem is getting a packet capture from as many points in the network as possible. In this case we only had one capture point: the client machine. The problem could not be reproduced on anything but S3, and obviously taking a capture from S3 was not an option. Nor did we have access to any devices elsewhere on the traffic path. [0]

A superficial check of the ACK stream showed the following pattern. The traffic would be humming along nicely; from the sequence numbers we can see that about 57MB have already been downloaded in the first 2.5 seconds.

00:00:02.543596 client > server: Flags [.], ack 57657817
00:00:02.543623 client > server: Flags [.], ack 57661318
00:00:02.543682 client > server: Flags [.], ack 57667046

Then, a single packet loss occurs. We can tell from the SACK block that 1432 bytes of payload are missing. That's almost certainly a single packet.

00:00:02.543734 client > server: Flags [.], ack 57667046,
    options [sack 1 {57668478:57669910}]
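
To spell out the arithmetic behind that claim: the missing bytes sit between the cumulative ACK and the left edge of the SACK block. A trivial Python snippet, using the numbers from the capture above:

# The hole is the gap between the cumulative ACK and the SACK left edge.
ack = 57667046
sack_left, sack_right = 57668478, 57669910
print(sack_left - ack)         # 1432 bytes missing -> almost certainly one segment
print(sack_right - sack_left)  # 1432 bytes received out of order beyond the hole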

After the single packet loss, more data continues to be delivered with no problems. In the next 100ms a further 6MB gets delivered. But the missing data never arrives.

...
00:00:02.648316 client > server: Flags [.], ack 57667046,
    options [sack 1 {57668478:63829515}]
00:00:02.648371 client > server: Flags [.], ack 57667046,
    options [sack 1 {57668478:63830947}]

In fact, no further ACKs are sent for 4 seconds. And even then the hole isn't filled by one 1432-byte packet like we expected, but by two 512-byte packets and one 408-byte one. There's also an RTT-sized delay between the first and second packets.

00:00:06.751691 client > server: Flags [.], ack 57667558,
    options [sack 1 {57668478:63830947}]
00:00:06.792592 client > server: Flags [.], ack 57668070,
    options [sack 1 {57668478:63830947}]
00:00:06.796277 client > server: Flags [.], ack 63830947

After that, the connection continues merrily along, but the exact same thing happens 3 seconds later.

What can we tell from this? Clearly the actual server would have been retransmitting the lost packet much more quickly than with a 4-second delay, and it would not normally re-packetize a 1432-byte packet into three pieces. Instead what must be happening is that each retransmitted copy is getting lost. After a few seconds RFC 4821-style path MTU probing kicks in, and a smaller packet gets retransmitted. For some reason this retransmission makes it through; this makes the sender believe that the path MTU has been reduced, and it starts sending smaller packets.

Again this suggests there's something dodgy going on with MTUs, but as mentioned in the beginning, reducing the MTU did not help.

But it also suggests a mechanism for why the connection eventually hangs completely, rather than alternating between stalling and recovering. There's a limit to how far the MSS can be reduced. If nothing else, the segments will need to have at least one byte of payload. In practice most operating systems have a much higher limit on the MSS (something in the 80-160 byte range is typical). If even packets of the minimum size aren't making it through, the server can't react by sending smaller packets.

With the information from the ACK stream exhausted, it's time to look at the packets in both directions. And what do you know? We actually see the earlier retransmissions at the client, with beautiful exponential backoff. The packets were not lost in the network, but were silently rejected by the client for some reason.

00:00:02.685557 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:02.960249 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:03.500500 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:04.580168 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:06.751657 server > client: Flags [.], seq 57667046:57667558, ack 4257, length 512
00:00:06.751691 client > server: Flags [.], ack 57667558, win 65528,
    options [sack 1 {57668478:63830947}]
00:00:06.792565 server > client: Flags [.], seq 57667558:57668070, ack 4257, length 512
00:00:06.792567 server > client: Flags [.], seq 57668070:57668478, ack 4257, length 408
00:00:06.792592 client > server: Flags [.], ack 57668070,
    options [sack 1 {57668478:63830947}]
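
Just to make the "beautiful exponential backoff" concrete, here's a throwaway check of the spacing of the server's retransmissions above (timestamps copied by hand from the trace; the last one is the first 512-byte probe rather than another full-sized copy):

times = [2.685557, 2.960249, 3.500500, 4.580168, 6.751657]
gaps = [round(b - a, 3) for a, b in zip(times, times[1:])]
print(gaps)   # [0.275, 0.54, 1.08, 2.171] -- each wait roughly double the previous one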

There are really just two reasons this would happen. The IP or TCP checksum could be wrong. But how could it be wrong for the same packet six times in a row? That's crazy talk; the expected packet corruption rate is more like one in a million. Alternatively, the packet could be too large. But damn it, we know that's not the problem, no matter how well this case matches the common pattern. Let's just have a look at the checksums, to rule it out...

server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
...

Oh... Every single copy of that packet had a checksum of 0 instead of the expected checksum of 0xd7a7. (Checksums of 0 are often not real errors, but just artifacts of checksum offload: the packets get captured by software before the checksum is computed by the hardware. That's not the case here; these are packets we're receiving rather than transmitting.) And it gets crazier when we look at the next instance of the problem a few seconds later.

server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
...

It's the exact same problem, all the way down to the expected TCP checksum being exactly 0xd7a7. Further analysis of the captures verified that this was a systematic problem and not a coincidence. Packets with an expected checksum of 0xd7a7 would always have the checksum replaced with 0. Packets with any other expected checksum would work just fine. [1]

This explains why the path MTU probing temporarily fixes the problem: the repacketized segments have different checksums, and make it through unharmed.

TCP Timestamps

So, a problem internal to S3 is causing this very specific kind of packet corruption then?

Not so fast! It turns out that most TCP implementations would work around this kind of corruption by accident. The reason for that is TCP Timestamps. And while you don't need to actually know much about TCP Timestamps to understand this story, I have been looking for an excuse to rant about them.

With TCP Timestamps, every TCP packet will contain a TCP option with two extra values. One of them is the sender's latest timestamp. The other is an echo of the latest timestamp the sender received from the other party. For example, here the client is sending the timestamp 805, and the server is echoing it back:

client > server: Flags [.], ack 89,
    options [TS val 805 ecr 10087]
server > client: Flags [P.], seq 89:450, ack 569,
    options [TS val 10112 ecr 805]
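
On the wire the option is easy to pick out: kind 8, length 10, then the two 32-bit values, in practice preceded by two NOP bytes of padding (that's where the 12 bytes of overhead mentioned below come from). A rough decoding sketch, using the values from the trace above:

import struct

def parse_tcp_timestamps(options: bytes):
    # Pick the Timestamps option (kind 8, length 10) out of a raw TCP options
    # field and return (TSval, TSecr), or None if it's absent.
    # A bare-bones sketch, not a complete options parser.
    i = 0
    while i < len(options) - 1:
        kind = options[i]
        if kind == 0:                          # end of option list
            break
        if kind == 1:                          # NOP padding
            i += 1
            continue
        length = options[i + 1]
        if length < 2:                         # malformed option, give up
            break
        if kind == 8 and length == 10 and i + 10 <= len(options):
            return struct.unpack("!II", options[i + 2:i + 10])
        i += length
    return None

# The option bytes behind "TS val 805 ecr 10087", with the usual two NOPs:
example = bytes([1, 1, 8, 10]) + struct.pack("!II", 805, 10087)
print(parse_tcp_timestamps(example))           # (805, 10087)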

TCP Timestamps were added to TCP very early on, for two reasons, neither of which was very compelling in retrospect.

Reason number one was PAWS, Protection Against Wrapped Sequence Numbers. The idea was that very fast connections might require huge TCP window sizes, and minor packet reordering/duplication might cause an old packet to be interpreted as a new packet, due to the 32-bit sequence number having wrapped around. I don't think that world ever really arrived, and PAWS is irrelevant to practically all TCP use cases.

The other original reason for timestamps was to enable TCP senders to measure RTTs in the presence of packet loss. But this can also be done with TCP Selective ACKs, a feature that's much more useful in general (and thus was widely deployed a lot sooner, despite being standardized later).

In exchange for these dubious benefits, every TCP packet (both data segments and pure control packets) is bloated by 12 bytes. This is in contrast to something like selective ACKs, where most packets don't grow in size. You only pay for selective ACKs when packets are lost or reordered. I think that the debuggability of network protocols is important, but with TCP you get basically everything you need from other sources. TCP timestamps have a high fixed cost, but give very little additional power.

If TCP Timestamps suck so much, why does everyone use them? I don't know anyone else's reasons for sure. I ended up implementing them purely due to an interoperability issue with the FreeBSD TCP stack. Basically FreeBSD uses a small static receive window for connections without TCP timestamps, while with TCP timestamps on it'd scale the receive window up as necessary. On connections with even a bit of latency, you needed TCP timestamps to avoid the receive window becoming a bottleneck. (This was fixed in FreeBSD a few months ago. Yay!)

Now, performance of FreeBSD clients isn't a big deal for me as long as the connections work. But you know who else uses a FreeBSD-derived TCP stack? Apple. And when it comes to mobile networks, performance of iOS devices is about as important as it gets. Anyone who cares about large transfers to iOS or OS X clients must use TCP Timestamps, no matter how distasteful they find the feature.

"But Juho, what does any of this have to do with S3?", you ask. Well, S3 is one of those rare services that disable timestamps. And that actually makes for a big difference in this case. With timestamps, each retransmitted copy of a packet would use a different timestamp value [2]. And when any part of the TCP header changes, odds are that the checksum changes as well. Even if some packets are lost due to the having the magic checksum, at least the retransmissions will make it through promptly.

To check this theory, I asked for a test with TCP timestamps disabled on the client. And immediately large downloads from anywhere - even the ISP's own speedtest server - started hanging. Success!
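
For the record, on a Linux client that switch is just the net.ipv4.tcp_timestamps sysctl; a minimal sketch of toggling it, assuming Linux and root - I don't actually know what OS the client was running:

# Equivalent to "sysctl -w net.ipv4.tcp_timestamps=0"; affects new connections only.
def set_tcp_timestamps(enabled: bool) -> None:
    with open("/proc/sys/net/ipv4/tcp_timestamps", "w") as f:
        f.write("1\n" if enabled else "0\n")

set_tcp_timestamps(False)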

Conclusion

With this information I suggested my coworker call his ISP and report the problem. He was smarter than that, and ran one more test: switching the cable modem from router mode to bridging mode. Bam, the problem was gone. In retrospect this makes sense: in router mode the cable modem is doing NAT, and needs to update the TCP checksum of every packet that passes through the device. In bridging mode there's no NAT, so no checksum update is needed.
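
Specifically, NAT rewrites addresses and ports, and the TCP checksum covers both (the addresses via the pseudo-header), so the modem has to adjust the checksum of every forwarded segment. A sketch of the standard RFC 1624-style incremental update, just to show the kind of arithmetic involved - what the modem actually got wrong for 0xd7a7 I can only guess at:

def ones_add(a: int, b: int) -> int:
    # 16-bit one's complement addition (with end-around carry).
    s = a + b
    return (s & 0xffff) + (s >> 16)

def adjust_checksum(cksum: int, old_word: int, new_word: int) -> int:
    # RFC 1624 incremental update: HC' = ~(~HC + ~m + m'). This is what a NAT
    # does to the TCP checksum when it rewrites one 16-bit word of the packet
    # (a port, or half of an IP address).
    s = ones_add(~cksum & 0xffff, ~old_word & 0xffff)
    s = ones_add(s, new_word & 0xffff)
    return ~s & 0xffff

# e.g. rewriting one half of the destination address from a made-up public
# value to the 192.168 half of a private address:
print(hex(adjust_checksum(0xd7a7, 0x5db8, 0xc0a8)))   # 0x74b7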

And that's how a dodgy cable modem caused downloads to fail with one service, but one service only. I've seen many kinds of packet corruption before, but never anything that was so absurdly specific.

Footnotes

[0] There are techniques around for routing the traffic such that we would have had a measurement point. One would have been using something like a VPN or a SOCKS proxy. But that's such a fundamental change to the traffic pattern that it doesn't make for a very interesting test. Odds are that the problem would just go away when you do that. The other option would be to use a fully transparent generic TCP proxy on some server with a public IP: have the client connect to the TCP proxy, and the proxy connect to the actual server. But setting that up is tedious; certainly not worth doing as a first step.
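
The byte-shuffling core of such a proxy is simple enough - something like the Python sketch below, with made-up addresses. The tedious part is everything around it: running it on a host with a public IP, arranging for the client's traffic to actually go through it, and capturing at the relay.

import socket
import threading

LISTEN = ("0.0.0.0", 8443)                     # made-up listening port
TARGET = ("the-real-server.example.net", 443)  # made-up origin server

def pump(src: socket.socket, dst: socket.socket) -> None:
    # Copy bytes from src to dst until src closes.
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def handle(client: socket.socket) -> None:
    upstream = socket.create_connection(TARGET)
    threading.Thread(target=pump, args=(upstream, client), daemon=True).start()
    pump(client, upstream)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(LISTEN)
listener.listen(16)
while True:
    conn, _ = listener.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()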

It's also pretty common to only have one trace point to start with. For the kind of analysis I do for actual work, we pretty often have just a trace from somewhere in the middle of the path, but nothing from the client or the server. Getting traces from multiple points is so much trouble that we usually need to roughly pinpoint the problem first with a single-point packet capture, and only then ask for more trace points.

[1] As far as I can tell 0xd7a7 has no interesting special properties. The bytes are not printable ASCII characters. 0xd7a7 isn't a value with any special significance in another TCP header field either. There are ways to screw up TCP checksum computations, but I think they're mostly to do with the way 0x0 and 0xffff are both zero values in a one's complement system.
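
For completeness, the checksum in question is the standard RFC 1071 Internet checksum, computed over the TCP header, the payload, and a pseudo-header containing the IP addresses. A minimal sketch of the computation:

import struct

def internet_checksum(data: bytes) -> int:
    # RFC 1071: the one's complement of the one's complement sum of the data
    # taken as big-endian 16-bit words.
    if len(data) % 2:
        data += b"\x00"                        # pad odd-length data
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total > 0xffff:                      # fold the carries back in
        total = (total & 0xffff) + (total >> 16)
    return ~total & 0xffff

# Note the quirk mentioned above: 0x0000 and 0xffff are both representations
# of zero in this arithmetic.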

[2] Assuming a sensible timestamp resolution. Not the rather impractical 500ms tick that e.g. OpenBSD uses.