TCP is harder than it looks

Posted on 2014-11-11 in Networking

In theory it's quite easy to write software that deals with TCP at the packet level, whether it's a full TCP stack or just a packet handling application that needs to be TCP-aware. The main RFCs aren't horribly long or complicated, and specify things in great detail. And while new extensions appear regularly, the option negotiation during the initial handshake ensures that you can pick and choose which extensions to support. (Of course some of the extensions are in effect mandatory -- a TCP stack without support for selective acknowledgements or window scaling will be pretty miserable.)

If all you want to do is to e.g. load the frontpage of Google, writing a TCP stack can be a one afternoon hack. And even with a larger scope, if all you need to do is to connect to arbitrary machines running vanilla Linux, BSD, or Windows on the same internal network, you can fairly quickly validate both interoperability and performance.

But since TCP looks so easy to implement, there are a lot of implementations around. Some full TCP stacks, some TCP mangling middleboxes, and some that are simply trying to track the state of TCP connections such as firewalls.

At work our TCP stack doesn't just need to interoperate with just a limited number of top operating systems. We handle hundreds of terabytes of traffic every day, with a traffic mix that's not really under our control. In practice it's completely arbitrary traffic, dealing with any device that could possibly get connected to a cellular network or any device that might have a public IP. Under those circumstances you basically have to be bug-compatible with everything.

There's some well established folklore about which areas tend to be buggy in other systems, and that you thus need to be particularly careful with. For example TCP option ordering and alignment is a common source of such problems, to the extent that at some point you might as well just use the same exact option placement as Linux or Windows, on the assumption that even the sloppiest firewall vendor will have tested at least against those systems!

Zero windows are another frequent source of grief, to such lengths that at multiple mobile operators the technical staff have quizzed us extensively on our use of zero windows. I don't quite know why zero windows have that reputation, but we have definitely seen that class of problems in the wild occasionally (for example the FreeBSD problem from a few years back was very annoying).

But here's a new one we saw recently, which was good for an afternoon of puzzling and is a case that I hadn't heard any scary stories about. A customer reported failures for a certain website when using our TCP implementation, but a success when using a standard one. Not consistently though, there were multiple different failure / succcess cases.

Sometimes we were seeing connections hanging right after the handshake; the SYNACK would have no options at all set (a big red flag), advertised a zero window, and the server would never reply to any zero window probes or otherwise open any window space:

19:53:40.384444 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [S], seq 2054608140, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
19:53:40.779236 IP x.x.x.x.443 > 10.0.1.110.34098: Flags [S.], seq 3403190647, ack 2054608141, win 0, length 0
19:53:40.885177 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [S], seq 2054608140, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
19:53:41.189576 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [.], ack 1, win 29200, length 0
19:53:41.189576 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [.], ack 1, win 29200, length 0
19:53:42.189892 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [.], ack 1, win 64000, length 0
19:53:43.391186 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [.], ack 1, win 64000, length 0
19:53:44.832112 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [.], ack 1, win 64000, length 0

Other times the SYNACK would be a lot more reasonable looking, and the connection would work fine:

19:29:16.457114 IP 10.0.1.110.33842 > x.x.x.x.443: Flags [S], seq 1336309505, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
19:29:17.264497 IP x.x.x.x.443 > 10.0.1.110.33842: Flags [S.], seq 2619514903, ack 1336309506, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
19:29:17.264556 IP 10.0.1.110.33842 > x.x.x.x.443: Flags [.], ack 1, win 229, length 0
19:29:17.265665 IP 10.0.1.110.33842 > x.x.x.x.443: Flags [P.], seq 1:305, ack 1, win 229, length 304
19:29:18.059278 IP x.x.x.x.443 > 10.0.1.110.33842: Flags [.], ack 305, win 995, length 0
19:29:18.087425 IP x.x.x.x.443 > 10.0.1.110.33842: Flags [.], seq 1:1461, ack 305, win 1000, length 1460

And there were also occasions where we'd get back two SYNACKs with different sequence numbers, which of course didn't always work too well:

19:37:41.677890 IP 10.0.1.110.33933 > x.x.x.x.443: Flags [S], seq 2689636737, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
19:37:41.877046 IP 10.0.1.110.33933 > x.x.x.x.443: Flags [S], seq 2689636737, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
19:37:42.076611 IP x.x.x.x.443 > 10.0.1.110.33933: Flags [S.], seq 3107565270, ack 2689636738, win 0, length 0
19:37:42.275471 IP x.x.x.x.443 > 10.0.1.110.33933: Flags [S.], seq 3109157454, ack 2689636738, win 0, length 0

You might be able to guess the problem just from the above traces, but actually verifying it required quite a few attempts with slightly tweaked parameters to find the boundary conditions. Who knew that there are systems around that can't handle receiving a duplicate SYN? The three different behaviors seem to correspond with no SYN being retransmitted, the retransmission arriving to the middlebox before it emits a SYNACK, and the retransmission arriving after the middlebox has emitted a SYNACK.

The middlebox was located in Australia, but most likely that IP was just a loadbalancer, transparent reverse proxy, or some similar form of traffic redirection with a real final destination somewhere in the US. When being accessed from Europe, this resulted in an aggregate RTT of something like 450-550ms. Our TCP implementation has a variable base SYN retransmit timout, and in this case it was roughly 500ms. So most of the time the page load would fail with our TCP stack, but succeed with an off the shelf one that had a SYN retransmit timeout of 1 second.

(I said above that I had not heard any scary stories on this, which of course does not mean those scary stories don't exist. After figuring out the root cause, it was easy enough to find more reports of connection breakage due to SYN retransmits, for example this one involving satellites).

It's easy to see how the developers of that unidentified piece of traffic redirecting kit missed a bug in this bit of functionality. Outside of 2G cellular connections, satellites communications, or networks suffering from extreme buffer bloat, it's rare to see RTTs that are long enough to trigger a SYN retransmit. Heck, in this case we were most likely talking of the packets going round the world the long way around.

But from my point of view this is a particularly annoying bug. It is by definition triggered before we have any information at all on the other end of the connection, so there's no possible heuristic we could use to conditionally disable the offending feature just for hosts that are at risk. The options are either to tell a customer that some traffic won't work (which is normally unacceptable, even if the root cause is undeniably on the other end), or to water down a useful feature a little bit to at least fail no more often than the "competition" does.

And it's the slowly aggregating pile of cases like this that makes dealing with TCP hard in practice, no matter how simple it looks to start with.

Name
Message
	As an antispam measure, you need to write a super-secret password below. Today's password is "xyzzy" (without the quotes).
Password

TCP is harder than it looks

Comments