<rss version='2.0'><channel><title>Juho Snellman's Weblog</title><link>https://www.snellman.net/blog/</link><description>Lisp, Perl Golf</description><item><title>Why PS4 downloads are so slow</title><link>https://www.snellman.net/blog/archive/2017-08-19-slow-ps4-downloads/</link><description>
  &lt;p&gt;
    Game downloads on PS4 have a reputation of being very slow, with many people
    reporting downloads being an order of magnitude faster on Steam or
    Xbox. This had long been on my list of things to look into, but at
    a pretty low priority.  After all, the PS4 operating system is
    based on a reasonably modern FreeBSD (9.0), so there should not be
    any crippling issues in the TCP stack. The implication is that the
    problem is something boring, like an inadequately dimensioned CDN.
  &lt;/p&gt;

  &lt;p&gt;
    But then I heard that people were successfully using local HTTP
    proxies as a workaround. It should be pretty rare for that to
    actually help with download speeds, which made this sound like a
    much more interesting problem.
  &lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

  &lt;p&gt;
    This is going to be a long-winded technical post.  If you&#039;re not
    interested in the details of the investigation but just want a
    recommendation on speeding up PS4 downloads, skip straight to the
    &lt;a href=&#039;#conclusions&#039;&gt;conclusions&lt;/a&gt;.
  &lt;/p&gt;

  &lt;h3&gt;Background&lt;/h3&gt;

  &lt;p&gt;
    Before running any experiments, it&#039;s good to have a mental model
    of how the thing we&#039;re testing works, and where the problems might
    be. If nothing else, it will guide the initial experiment design.
  &lt;/p&gt;

  &lt;p&gt;
    The speed of a steady-state TCP connection is basically defined by
    three numbers: the amount of data the client is willing to receive on
    a single round-trip (TCP receive window), the amount of data the
    server is willing to send on a single round-trip (TCP congestion
    window), and the round-trip latency between the client and the server (RTT).
    To a first approximation, the connection speed will be:
  &lt;/p&gt;

  &lt;pre&gt;
    speed = min(rwin, cwin) / RTT
&lt;/pre&gt;
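
  &lt;p&gt;
    To get a feel for the magnitudes involved, here&#039;s the formula as a
    quick Python calculation. The numbers are made up purely for
    illustration:
  &lt;/p&gt;

&lt;pre&gt;
# Hypothetical numbers, purely for illustration.
rwin = 128 * 1024      # bytes the client will accept per round trip
cwin = 1024 * 1024     # bytes the server is willing to send per round trip
rtt  = 0.030           # round-trip time in seconds (30 ms)

speed = min(rwin, cwin) / rtt      # bytes per second
print(speed * 8 / 1e6)             # =&amp;gt; roughly 35 Mbps
&lt;/pre&gt;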

  &lt;p&gt;
    With this model, how could a proxy speed up the connection?  Well,
    with a proxy the original connection will be split into two mostly
    independent parts; one connection between the client and the
    proxy, and another between the proxy and the server. The speed of
    the end-to-end connection will be determined by the slower of
    those two independent connections:
  &lt;/p&gt;

  &lt;pre&gt;
    speed_proxy_client = min(client rwin, proxy cwin) / client-proxy RTT
    speed_server_proxy = min(proxy rwin, server cwin) / proxy-server RTT
    speed = min(speed_proxy_client, speed_server_proxy)
&lt;/pre&gt;

  &lt;p&gt;
    With a local proxy the client-proxy RTT will be very low; that
    connection is almost guaranteed to be the faster one. The
    improvement will have to be from the server-proxy connection being
    somehow better than the direct client-server one. The RTT will not
    change, so there are just two options: either the client has a
    much smaller receive window than the proxy, or the client is
    somehow causing the server&#039;s congestion window to
    decrease. (E.g. the client is randomly dropping received packets,
    while the proxy isn&#039;t).
  &lt;/p&gt;
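
  &lt;p&gt;
    To put numbers on that: with the client&#039;s receive window as the
    bottleneck and a much larger window on the proxy, the split
    connection comes out well ahead even though the server-side RTT is
    unchanged. All of the numbers below are again invented for
    illustration:
  &lt;/p&gt;

&lt;pre&gt;
client_rwin = 128 * 1024
proxy_rwin = proxy_cwin = server_cwin = 1024 * 1024
server_rtt, local_rtt = 0.030, 0.001    # proxy-server RTT vs client-proxy RTT

direct    = min(client_rwin, server_cwin) / server_rtt
via_proxy = min(min(client_rwin, proxy_cwin) / local_rtt,    # client-proxy leg
                min(proxy_rwin, server_cwin) / server_rtt)   # proxy-server leg
print(direct * 8 / 1e6, via_proxy * 8 / 1e6)    # =&amp;gt; ~35 Mbps direct, ~280 Mbps via proxy
&lt;/pre&gt;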

  &lt;p&gt;
    Out of these two theories, the receive window one should be much
    more likely, so we should concentrate on it first. But that just
    replaces our original question with a new one: why would the
    client&#039;s receive window be so low that it becomes a noticeable
    bottleneck? There&#039;s a fairly limited number of causes for low
    receive windows that I&#039;ve seen in the wild, and they don&#039;t really
    seem to fit here.
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; Maybe the client doesn&#039;t support the TCP window scaling option,
      while the proxy does. Without window scaling, the receive window
      will be limited to 64kB. But since we know Sony started with a
      TCP stack that supports window scaling, they would have had to
      go out of their way to disable it. Slow downloads, for no benefit.
    &lt;li&gt; Maybe the actual downloader application is very slow. The operating
      system is supposed to have a certain amount of buffer space available
      for each connection. If the network is delivering data to the OS
      faster than the application is reading it, the buffer will start to
      fill up, and the OS will reduce the receive window as a form
      of back-pressure. But this can&#039;t be the reason; if the application
      is the bottleneck, it&#039;ll be a bottleneck with or without the
      proxy.
    &lt;li&gt; The operating system is trying to dynamically scale the
      receive window to match the actual network conditions, but
      something is going wrong. This would be interesting, so it&#039;s
      what we&#039;re hoping to find.
  &lt;/ul&gt;

  &lt;p&gt;
    The initial theories are in place; let&#039;s get digging.
  &lt;/p&gt;

  &lt;h3&gt;Experiment #1&lt;/h3&gt;

  &lt;p&gt;
    For our first experiment, we&#039;ll start a PSN download on a baseline
    non-Slim PS4, firmware 4.73. The network connection of the PS4 is
    bridged through a Linux machine, where we can add latency to the
    network using &lt;code&gt;tc netem&lt;/code&gt;. By varying the added latency,
    we should be able to find out two things: whether the receive
    window really is the bottleneck, and whether the receive window
    is being automatically scaled by the operating system.
  &lt;/p&gt;

  &lt;p&gt;
    This is what the client-server RTTs (measured from a packet
    capture using TCP timestamps) look like for the experimental
    period. Each dot represents 10 seconds of time for a single
    connection, with the Y axis showing the minimum RTT seen for that
    connection in those 10 seconds.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-rtt-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-rtt-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    The next graph shows the amount of data sent by the server in one
    round trip in red, and the receive windows advertised by the
    client in blue.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-win-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-win-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    First, since the blue dots are staying constantly at about 128kB,
    the operating system doesn&#039;t appear to be doing any kind of
    receive window scaling based on the RTT. (So much for that
    theory). Though at the very right end of the graph the receive
    window shoots up to 650kB, so it isn&#039;t totally
    fixed either.
  &lt;/p&gt;

  &lt;p&gt;
    Second, is the receive window the bottleneck here? If so, the
    blue dots would be close to the red dots. This is the case
    until about 10:50. And then mysteriously the bottleneck moves to
    the server.
  &lt;/p&gt;

  &lt;p&gt;
    So we didn&#039;t find quite what we were looking for, but there are a
    couple of very interesting things that are correlated with events
    on the PS4.
  &lt;/p&gt;

  &lt;p&gt;The download was in the foreground for the whole duration of the
    test. But that doesn&#039;t mean it was the only thing running on the
    machine. The Netflix app was still running in the background,
    completely idle &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;]. When the background app was
    closed at 11:00, the receive window increased dramatically. This
    suggests a second experiment, where different applications are
    opened / closed / left running in the background.
  &lt;/p&gt;

  &lt;p&gt;
    The time where the receive window stops being the bottleneck is
    very close to the PS4 entering rest mode. That looks like another
    thing worth investigating. Unfortunately it turns out not to be;
    rest mode is a red herring here. &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]
  &lt;/p&gt;

  &lt;h3&gt;Experiment #2&lt;/h3&gt;

  &lt;p&gt;
    Below is a graph of the receive windows for a second download,
    annotated with the timing of various noteworthy events.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl2-rwin-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl2-rwin-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    The differences in receive windows at different times are
    striking. And more important, the changes in the receive
    windows correspond very well to specific things I did on
    the PS4.
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; When the download was started, the game Styx: Shards of
      Darkness was running in the background (just idling in the title
      screen). The download was limited by a receive window of under
      7kB. This is an incredibly low value; it&#039;s basically going to
      cause the downloads to take &lt;b&gt;100 times longer than they should&lt;/b&gt;.
      And this was not a coincidence: whenever that game
      was running, the receive window would be that low.
    &lt;li&gt; Having an app running (e.g. Netflix, Spotify) limited the
      receive window to 128kB, for about a 5x reduction in potential
      download speed.
    &lt;li&gt; Moving apps, games, or the download window to the foreground
      or background didn&#039;t have any effect on the receive window.
    &lt;li&gt; Launching some other games (Horizon: Zero Dawn, Uncharted 4,
      Dreadnought) seemed to have the same effect as running an app.
    &lt;li&gt; Playing an online match in a networked game (Dreadnought) caused the
      receive window to be artificially limited to 7kB.
    &lt;li&gt; Playing around in a non-networked game (Horizon: Zero Dawn)
      had a very inconsistent effect on the receive window, with the
      effect seemingly depending on the intensity of gameplay. This
      looks like a genuine resource restriction (download process
      getting variable amounts of CPU), rather than an artificial
      limit.
    &lt;li&gt; I ran a speedtest at a time when downloads were limited to
      a 7kB receive window. It got a decent receive window of over
      400kB; the conclusion is that the artificial receive window
      limit appears to only apply to PSN downloads.
    &lt;li&gt; Putting the PS4 into rest mode had no effect.
    &lt;li&gt; Built-in features of the PS4 UI, like the web browser,
      do not count as apps.
    &lt;li&gt; When a game was started (causing the previously running game
      to be stopped automatically), the receive window could increase
      to 650kB for a very brief period of time. Basically it appears
      that the receive window gets unclamped when the old game stops,
      and then clamped again a few seconds later when the new game
      actually starts up.
  &lt;/ul&gt;

  &lt;p&gt;
    I did a few more test runs, and all of them seemed
    to support the above findings. The only additional information
    from that testing is that the rest mode behavior was dependent
    on the PS4 settings. Originally I had it set up to suspend apps
    when in rest mode. If that setting was disabled, the apps would
      be closed when entering rest mode, and the downloads would
    proceed at full speed.
  &lt;/p&gt;

  &lt;p&gt;A 7kB receive window will be absolutely crippling for any user.
     A 128kB window might be ok for users who have CDN servers very
     close by, or who don&#039;t have a particularly fast internet connection. For
     example at my location, a 128kB receive window would cap the downloads at about
     35Mbps to 75Mbps depending on which CDN the DNS RNG happens to give me.
     The lowest two speed tiers for my ISP are 50Mbps and 200Mbps.
     So either the 128kB would not be a noticeable problem (50Mbps), or it&#039;d mean
     that downloads are artificially limited to 25% speed (200Mbps).
  &lt;/p&gt;
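
  &lt;p&gt;
    To make the arithmetic explicit, here&#039;s the same calculation again,
    with RTTs picked to roughly reproduce that 35Mbps to 75Mbps range
    (the exact RTTs to the CDNs will obviously vary):
  &lt;/p&gt;

&lt;pre&gt;
rwin = 128 * 1024                   # bytes
for rtt in (0.014, 0.030):          # assumed RTTs to a nearby and a more distant CDN
    print(rwin * 8 / rtt / 1e6)     # =&amp;gt; about 75 Mbps and 35 Mbps
&lt;/pre&gt;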

  &lt;a name=&#039;conclusions&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;Conclusions&lt;/h3&gt;

  &lt;p&gt;If any applications are running, the PS4 appears to change the
    settings for PSN store downloads, artificially restricting their
    speed. Closing the other applications will remove the limit. There
    are a few important details:
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; Just leaving the other applications running in the background will
      &lt;b&gt;not help&lt;/b&gt;. The exact same limit is applied whether the download
      progress bar is in the foreground or not.
    &lt;li&gt; Putting the PS4 into rest mode might or might not help,
      depending on your system settings.
    &lt;li&gt;The artificial limit applies only to the PSN store downloads.
      It does &lt;b&gt;not&lt;/b&gt; affect e.g. the built-in speedtest. This
      is why the speedtest might report much higher speeds than the
      actual downloads, even though both are delivered from the same
      CDN servers.
    &lt;li&gt; Not all applications are equal; most of them will cause the
      connections to slow down by up to a factor of 5. Some
      games will cause a difference of about a factor of 100. Some
      games will start off with the factor of 5, and then migrate to
      the factor of 100 once you leave the start menu and start playing.
    &lt;li&gt; The above limits are artificial. In addition to that,
      actively playing a game can cause game downloads to slow down.
      This appears to be due to a genuine lack of CPU resources (with
      the game understandably having top priority).
  &lt;/ul&gt;

  &lt;p&gt;
    So if you&#039;re seeing slow downloads, just closing all the running
    applications might be worth a shot. (But it&#039;s obviously not
    guaranteed to help. There are other causes for slow downloads as
    well, this will just remove one potential bottleneck).
    To close the running applications, you&#039;ll need to
    long-press the PS button on the controller, and then select &amp;quot;Close
    applications&amp;quot; from the menu.
  &lt;/p&gt;

  &lt;p&gt;
    The PS4 doesn&#039;t make it very obvious exactly what programs are
    running. For games, the interaction model is that opening a new
    game closes the previously running one. This is not how other apps
    work; they remain in the background indefinitely until you
    explicitly close them.
  &lt;/p&gt;

  &lt;p&gt;
    And it gets worse than that. If your PS4 is configured to
    suspend any running apps when put to rest mode, you can seemingly
    power on the machine into a clean state, and still have a hidden
    background app that&#039;s causing the OS to limit your PSN download
    speeds.
  &lt;/p&gt;

  &lt;p&gt;
    This might explain some of the superstitions about this on the
    Internet. There are people who swear that putting the machine to
    rest mode helps with speeds, others who say it does nothing. Or
    how after every firmware update people will report increased
    download speeds. Odds are that nothing actually changed in the
    firmware; it&#039;s just that those people had done their first full
    reboot in a while, and finally had a system without a background
    app running.
  &lt;/p&gt;

  &lt;h3&gt;Speculation&lt;/h3&gt;

  &lt;p&gt;
    Those were the facts as I see them. Unfortunately this raises some
    new questions, which can&#039;t be answered experimentally. With no
    facts, there&#039;s no option except to speculate wildly!
  &lt;/p&gt;

  &lt;p&gt;&lt;b&gt;Q: Is this an intentional feature? If so, what is its purpose?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    Yes, it must be intentional. The receive window changes very
    rapidly when applications or games are opened/closed, but not for
    any other reason. It&#039;s not any kind of subtle operating system
    level behavior; it&#039;s most likely the PS4 UI explicitly
    manipulating the socket receive buffers.
  &lt;/p&gt;

  &lt;p&gt;
    But why? I think the idea here must be to not allow the network
    traffic of background downloads to take resources away from the
    foreground use of the PS4. For example if I&#039;m playing an online
    shooter, it makes sense to harshly limit the background download
    speeds to make sure the game is getting ping times that are
    both low and predictable. So there&#039;s at least some point in that
    7kB receive window limit in some circumstances.
  &lt;/p&gt;

  &lt;p&gt;
    It&#039;s harder to see the point of the 128kB receive window
    limit that applies whenever any app is running. A single game download from some
    random CDN isn&#039;t going to muscle out Netflix or Youtube... The
    only thing I can think of is that they&#039;re afraid that multiple
    simultaneous downloads, e.g. due to automatic updates, might cause
    problems for playing video. But even that seems like a stretch.
  &lt;/p&gt;

  &lt;p&gt;
    There&#039;s an alternate theory that this is due to some non-network
    resource constraints (e.g. CPU, memory, disk). I don&#039;t think that
    works. If the CPU or disk were the constraint, just having the
    appropriate priorities in place would automatically take care of
    this. If the download process gets starved of CPU or disk
    bandwidth due to a low priority, the receive buffer would fill up
    and the receive window would scale down dynamically, exactly when
    needed. And the amounts of RAM we&#039;re talking about here are
    minuscule on a machine with 8GB of RAM; less than a megabyte.
  &lt;/p&gt;
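
  &lt;p&gt;
    To spell out the dynamic behavior being described: the advertised
    receive window is roughly whatever buffer space is left after the
    application&#039;s unread data, so a starved reader throttles itself
    automatically. A simplified model, not actual PS4 or FreeBSD code:
  &lt;/p&gt;

&lt;pre&gt;
SOCKET_BUFFER = 650 * 1024

def advertised_window(unread_bytes):
    # Whatever the application has not read yet is still sitting in the
    # socket buffer; only the remaining space can be advertised.
    return max(0, SOCKET_BUFFER - unread_bytes)

print(advertised_window(0))            # fast reader:    full 650kB window
print(advertised_window(640 * 1024))   # starved reader: about 10kB left
&lt;/pre&gt;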

  &lt;p&gt;&lt;b&gt;Q: Is this feature implemented well?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    Oh dear God, no. It&#039;s hard to believe just how sloppy this
    implementation is.
  &lt;/p&gt;

  &lt;p&gt;
    The biggest problem is that the limits get applied based just on
    what games/applications are currently running.  That&#039;s just
    insane; what matters should be which games/applications someone is
    currently using. Especially in a console UI, it&#039;s a totally
    reasonable expectation that the foreground application gets
    priority. If I&#039;ve got the download progress bar in the foreground,
    the system had damn well better give that download priority. Not some
    application that was started a month ago, and hasn&#039;t been used
    since. Applying these limits in rest mode with suspended
    apps is beyond insane.
  &lt;/p&gt;

  &lt;p&gt;
    Second, these limits get applied per-connection.  So if you&#039;ve got
    a single download going, it&#039;ll get limited to 128kB of receive
    window. If you&#039;ve got five downloads, they&#039;ll all get 128kB, for a
    total of 640kB. That means the efficiency of the &amp;quot;make sure downloads
    don&#039;t clog the network&amp;quot; policy depends purely on how many downloads
    are active. That&#039;s rubbish. This is all controlled on the
    application level, and the application knows how many downloads
    are active. If there really were an optimal static receive window
    X, it should just be split evenly across all the downloads.
  &lt;/p&gt;
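
  &lt;p&gt;
    If there really were such an optimal aggregate window, splitting it
    would be a one-liner in the downloader. A purely hypothetical sketch:
  &lt;/p&gt;

&lt;pre&gt;
OPTIMAL_TOTAL_WINDOW = 128 * 1024    # hypothetical optimal aggregate budget

def per_download_window(active_downloads):
    # Divide the fixed budget across however many transfers are running,
    # instead of handing the full budget to every connection separately.
    return OPTIMAL_TOTAL_WINDOW // max(1, active_downloads)
&lt;/pre&gt;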

  &lt;p&gt;
    Third, the core idea of applying a static receive window as a
    means of fighting bufferbloat is just fundamentally broken.
    Using the receive window as the rate limiting mechanism just
    means that the actual transfer rate will depend on the RTT
    (this is why a local proxy helps). For this kind of thing to
    work well, you can&#039;t have the rate limit depend on the
    RTT. You also can&#039;t just have somebody come up with a number
    once, and apply that limit to everyone. The limit needs to
    depend on the actual network conditions.
  &lt;/p&gt;

   &lt;p&gt;
    There are ways to detect how congested the downlink is in the
    client-side TCP stack. The proper fix would be to implement them,
    and adjust the receive window of low-priority background downloads
    if and only if congestion becomes an issue. That would actually be
    a pretty valuable feature for this kind of appliance. But I can
    kind of forgive this one; it&#039;s not an off the shelf feature, and
    maybe Sony doesn&#039;t employ any TCP kernel hackers.
  &lt;/p&gt;
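
  &lt;p&gt;
    As a rough illustration of the idea, such a policy could use queuing
    delay as the congestion signal: clamp the background download&#039;s
    receive window only once the measured RTT has grown well past the
    connection&#039;s minimum. This is purely my own sketch, not anything the
    PS4 or any real TCP stack actually implements:
  &lt;/p&gt;

&lt;pre&gt;
UNCLAMPED = 650 * 1024    # window for background downloads when the link is idle
CLAMPED   = 16 * 1024     # window once the downlink looks congested

def background_download_window(min_rtt, current_rtt, threshold=1.5):
    # A large RTT increase over the minimum means queues are building up
    # somewhere on the path; only then throttle the background download.
    if current_rtt &amp;gt; min_rtt * threshold:
        return CLAMPED
    return UNCLAMPED
&lt;/pre&gt;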

  &lt;p&gt;
    Fourth, whatever method is being used to decide on whether a game
    is network-latency sensitive is broken. It&#039;s absurd that a demo of
    a single-player game idling in the initial title screen would
    cause the download speeds to be totally crippled. This really
    should be limited to actual multiplayer titles, and ideally just
    to periods where someone is actually playing the game online.
    Just having the game running should not be enough.
  &lt;/p&gt;

  &lt;p&gt;&lt;b&gt;Q: How can this still be a problem, 4 years after launch?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    I have no idea. Sony must know that the PSN download speeds have
    been the butt of jokes for years. It&#039;s probably the biggest
    complaint people have with the system. So it&#039;s hard to believe
    that nobody was ever given the task of figuring out why it&#039;s
    slow. And this is not rocket science; anyone bothering to look
    into it would find these problems in a day.&lt;/p&gt;

  &lt;p&gt;
    But it seems equally impossible that they know of the cause, but
    decided not to apply any of the trivial fixes to it. (Hell, it
    wouldn&#039;t even need to be a proper technical fix. It could just be
    a piece of text saying that downloads will work faster with all
    other apps closed).
  &lt;/p&gt;

  &lt;p&gt;
    So while it&#039;s possible to speculate in an informed manner about
    other things, this particular question will remain an open
    mystery.  Big companies don&#039;t always get things done very
    efficiently, eh?
  &lt;/p&gt;

  &lt;h3&gt;Footnotes&lt;/h3&gt;

  &lt;div class=footnotes&gt;

    &lt;p&gt; &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;]
      How idle? So idle that I hadn&#039;t even logged in; the app
        was still at the login screen.
    &lt;/p&gt;

    &lt;p&gt;
      &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] To be specific, the
        slowdown is caused by the artificial latency changes. The PS4
        downloads files in chunks, and each chunk can be served from a
        different CDN. The CDN that was being used from 10:51 to 11:00
        was using a delay-based congestion control algorithm, and
        reacting to the extra latency by reducing the amount of data
        sent. The CDN used earlier in the connection was using a
        packet-loss based congestion control algorithm, and did not
        slow down despite seeing the latency change in exactly the same
        pattern.
    &lt;/p&gt;
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><category>GAMES</category><pubDate>Sat, 19 Aug 2017 19:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2017-08-19-slow-ps4-downloads/</guid></item><item><title>The mystery of the hanging S3 downloads</title><link>https://www.snellman.net/blog/archive/2017-07-20-s3-mystery/</link><description>
  &lt;p&gt;
    A coworker was experiencing a strange problem with their Internet
    connection at home. Large downloads from most sites worked
    fine. The exception was that downloads from Amazon S3 would get
    up to a good speed (500Mbps), stall completely for a few seconds,
    restart for a while, stall again, and eventually hang
    completely. The problem seemed to be specific to S3;
    downloads from generic AWS VMs were ok.
  &lt;/p&gt;

  &lt;p&gt;
    What could be going on? It shouldn&#039;t be a problem with
    the ISP, or anything south of that: after all, connections to other
    sites were working. It should not be a problem between the ISP
    and Amazon, or there would have been problems with AWS too.
    But it also seems very unlikely that S3 would
    have a trivially reproducible problem causing large downloads to hang.
    It&#039;s not like this is some minor use case of the service.
  &lt;/p&gt;

  &lt;p&gt;
    If it had been a problem with e.g. viewing Netflix, one might
    suspect some kind of targeted traffic shaping. But an ISP
    throttling or forcibly closing connections to S3 but not to AWS in
    general? That&#039;s just silly talk.
  &lt;/p&gt;

  &lt;p&gt;
    The normal troubleshooting tips like reducing the MTU didn&#039;t help
    either. This sounded like a fascinating networking whodunit,
    so I couldn&#039;t resist butting in after hearing about it through
    the grapevine.
  &lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

  &lt;h3&gt;The packet captures&lt;/h3&gt;

  &lt;p&gt;
    The first step of debugging pretty much any networking problem is getting
    a packet capture from as many points in the network as possible. In this
    case we only had one capture point: the client machine. The problem
    could not be reproduced on anything but S3, and obviously taking a capture
    from S3 was not an option. Nor did we have access to any devices elsewhere
    on the traffic path. &lt;a id=&#039;fnref0&#039;&gt;[&lt;a href=&#039;#fn0&#039;&gt;0&lt;/a&gt;]
  &lt;/p&gt;

  &lt;p&gt;
    A superficial check of the ACK stream showed the following pattern.
    The traffic would be humming along nicely; from the sequence numbers
    we can see that about 57MB have already been downloaded in the first
    2.5 seconds.
  &lt;/p&gt;

&lt;pre&gt;00:00:02.543596 client &amp;gt; server: Flags [.], ack &lt;b&gt;57657817&lt;/b&gt;
00:00:02.543623 client &amp;gt; server: Flags [.], ack 57661318
00:00:02.543682 client &amp;gt; server: Flags [.], ack 57667046
&lt;/pre&gt;

  &lt;p&gt;Then, a single packet loss occurs. We can tell from the SACK block that 1432 bytes of payload are missing (the gap between the cumulative ACK of 57667046 and the left edge of the SACK block at 57668478). That&#039;s almost certainly a single packet.&lt;/p&gt;
&lt;pre&gt;&lt;b&gt;00:00:02.543734&lt;/b&gt; client &amp;gt; server: Flags [.], ack &lt;b&gt;57667046&lt;/b&gt;,
    options [sack 1 {&lt;b&gt;57668478&lt;/b&gt;:57669910}]
&lt;/pre&gt;

  &lt;p&gt;After the single packet loss, more data continues to be delivered
    with no problems. In the next 100ms a further 6MB gets delivered. But the
    missing data never arrives.&lt;/p&gt;
&lt;pre&gt;...
00:00:02.648316 client &amp;gt; server: Flags [.], ack 57667046,
    options [sack 1 {57668478:63829515}]
&lt;b&gt;00:00:02.648371&lt;/b&gt; client &amp;gt; server: Flags [.], ack 57667046,
    options [sack 1 {57668478:&lt;b&gt;63830947&lt;/b&gt;}]
&lt;/pre&gt;

  &lt;p&gt;In fact, no further ACKs are sent for 4 seconds. And even then it&#039;s not done
    by one 1432 byte packet like we expected, but by two 512 byte packets and one
    408 byte one. There&#039;s also an RTT-sized delay between the first and second
    packets.
  &lt;/p&gt;

  &lt;pre&gt;00:00:&lt;b&gt;06.751691&lt;/b&gt; client &amp;gt; server: Flags [.], ack &lt;b&gt;57667558&lt;/b&gt;,
    options [sack 1 {57668478:63830947}]
00:00:&lt;b&gt;06.792592&lt;/b&gt; client &amp;gt; server: Flags [.], ack &lt;b&gt;57668070&lt;/b&gt;,
    options [sack 1 {57668478:63830947}]
00:00:06.796277 client &amp;gt; server: Flags [.], ack &lt;b&gt;63830947&lt;/b&gt;
&lt;/pre&gt;

  &lt;p&gt;
    After that, the connection continues merrily along, but the exact same thing
    happens 3 seconds later.
  &lt;/p&gt;

  &lt;p&gt;What can we tell from this? Clearly the actual server would be
    retransmitting the lost packet much more quickly than with a 4 second
    delay. It also would not be re-packetizing the 1432 byte packet into three
    pieces. Instead what must be happening is that each retransmitted copy
    is getting lost. After a few seconds RFC 4821-style path MTU probing kicks in,
    and a smaller packet gets retransmitted. For some reason this retransmission
    makes it through; this makes the sender believe that the path MTU has been
    reduced, and it starts sending smaller packets.&lt;/p&gt;

  &lt;p&gt;Again this suggests there&#039;s something dodgy going on with MTUs, but as
    mentioned in the beginning, reducing the MTU did not help.&lt;/p&gt;

  &lt;p&gt;But it also suggests a mechanism for why the connection eventually hangs
    completely, rather than alternating between stalling and recovering.
    There&#039;s a limit to how far
    the MSS can be reduced. If nothing else, the segments will need to
    have at least one byte of payload. In practice most operating systems have
    a much higher limit on the MSS (something in the 80-160 byte range is
    typical). If even packets of the minimum size aren&#039;t making it through,
    the server can&#039;t react by sending smaller packets.&lt;/p&gt;

  &lt;p&gt;With the information from the ACK stream exhausted, it&#039;s time
    to look at the packets in both directions. And what do you know?
    We actually see the earlier retransmissions at the client, with
    beautiful exponential backoff.
    The packets were not lost in the network, but were silently rejected by
    the client for some reason.&lt;/p&gt;

 &lt;pre&gt;00:00:02.685557 server &amp;gt; client: Flags [.], seq 57667046:&lt;b&gt;57668478&lt;/b&gt;, ack 4257, length 1432
00:00:02.960249 server &amp;gt; client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:03.500500 server &amp;gt; client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:04.580168 server &amp;gt; client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:06.751657 server &amp;gt; client: Flags [.], seq 57667046:&lt;b&gt;57667558&lt;/b&gt;, ack 4257, length 512
00:00:06.751691 client &amp;gt; server: Flags [.], ack 57667558, win 65528,
    options [sack 1 {57668478:63830947}]
00:00:06.792565 server &amp;gt; client: Flags [.], seq &lt;b&gt;57667558:57668070&lt;/b&gt;, ack 4257, length 512
00:00:06.792567 server &amp;gt; client: Flags [.], seq &lt;b&gt;57668070:57668478&lt;/b&gt;, ack 4257, length 408
00:00:06.792592 client &amp;gt; server: Flags [.], ack 57668070,
    options [sack 1 {57668478:63830947}]
&lt;/pre&gt;

  &lt;p&gt;There are really just two reasons this would happen. The IP or
    TCP checksum could be wrong. But how could it be wrong for the
    same packet six times in a row? That&#039;s crazy talk; the expected packet
    corruption rate is more like one in a million. Alternatively
    the packet is too large. But damn it, we know that&#039;s not the
    problem, no matter how well this case is matching the common pattern.
    Let&#039;s just have a look at the checksums, to rule it out...&lt;/p&gt;

&lt;pre&gt;server &amp;gt; client: Flags [.], cksum &lt;b&gt;0x0000&lt;/b&gt; (incorrect -&amp;gt; &lt;b&gt;0xd7a7&lt;/b&gt;), seq 57667046:57668478, ack 4257, length 1432
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
...
&lt;/pre&gt;

  &lt;p&gt;Oh... Every single copy of that packet had a checksum of 0 instead of the
    expected checksum of 0xd7a7. (Checksums of 0 are often not real errors,
    but just artifacts of checksum offload: the packets get captured by software
    before the checksum has been computed by the hardware.
    That&#039;s not the case here; these are packets we&#039;re receiving rather than
    transmitting.) And it gets crazier when we look at the next
    instance of the problem a few seconds later.&lt;/p&gt;

&lt;pre&gt;server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
...
&lt;/pre&gt;

  &lt;p&gt;It&#039;s the exact same problem, all the way down to the problem
    appearing specifically with a TCP checksum of 0xd7a7. Further
    analysis of the captures verified that this was a systematic
    problem and not a coincidence. &lt;b&gt;Packets with an expected checksum of
    0xd7a7 would always have the checksum replaced with
    0. Packets with any other expected checksum would work just fine.&lt;/b&gt;
    &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;].&lt;/p&gt;

  &lt;p&gt;This explains why the path MTU probing temporarily fixes the problem:
    the repacketized segments have different checksums, and make it through
    unharmed.&lt;/p&gt;
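
  &lt;p&gt;
    For reference, the TCP checksum is the ones&#039;-complement Internet
    checksum (RFC 1071) computed over the header, the payload and a
    pseudo-header. A minimal sketch of the algorithm, mostly to show that
    re-packetizing (different lengths, sequence numbers and payload
    boundaries) almost always produces a different checksum:
  &lt;/p&gt;

&lt;pre&gt;
def inet_checksum(data):
    # RFC 1071: ones&#039;-complement sum of 16-bit big-endian words, complemented.
    if len(data) % 2:
        data = data + b&#039;\x00&#039;             # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += data[i] * 256 + data[i + 1]
        carry, total = divmod(total, 0x10000)
        total += carry                    # fold any carry back into the low 16 bits
    return 0xFFFF - total

print(hex(inet_checksum(b&#039;some segment&#039;)))
print(hex(inet_checksum(b&#039;some segment, repacketized&#039;)))   # different bytes, different sum
&lt;/pre&gt;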

  &lt;h3&gt;TCP Timestamps&lt;/h3&gt;

  &lt;p&gt;So, a problem internal to S3 is causing this very specific kind
    of packet corruption then?&lt;/p&gt;

  &lt;p&gt;Not so fast! It turns out that most TCP implementations would
    work around this kind of corruption by accident. The reason for
    that is TCP Timestamps. And while you don&#039;t need to actually know
    much about TCP Timestamps to understand this story, I have been
    looking for an excuse to rant about them.
  &lt;/p&gt;

  &lt;p&gt;With TCP Timestamps, every TCP packet will contain a TCP option
    with two extra values. One of them is the sender&#039;s latest
    timestamp. The other is an echo of the latest timestamp the sender
    received from the other party. For example here the client is
    sending the timestamp 805, and the server is echoing it back:&lt;/p&gt;

  &lt;pre&gt;client &amp;gt; server: Flags [.], ack 89,
    options [TS val &lt;b&gt;805&lt;/b&gt; ecr 10087]
server &amp;gt; client: Flags [P.], seq 89:450, ack 569,
    options [TS val 10112 ecr &lt;b&gt;805&lt;/b&gt;]
&lt;/pre&gt;
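
  &lt;p&gt;
    This echoing is also what makes timestamps convenient for RTT
    measurement, including passive measurement from a packet capture:
    record when each timestamp value was first sent, and match it against
    the first packet that echoes it back. A rough sketch, assuming the
    relevant fields have already been parsed out of a capture:
  &lt;/p&gt;

&lt;pre&gt;
def rtt_samples(packets):
    # packets: (capture_time, direction, tsval, tsecr) tuples in capture order,
    # with direction being either &#039;client&#039; or &#039;server&#039;.
    sent_at = {}
    for t, direction, tsval, tsecr in packets:
        if direction == &#039;client&#039;:
            sent_at.setdefault(tsval, t)      # first time we saw this value sent
        elif tsecr in sent_at:
            yield t - sent_at.pop(tsecr)      # echoed back: one RTT sample

print(list(rtt_samples([(0.000, &#039;client&#039;, 805, 10087),
                        (0.042, &#039;server&#039;, 10112, 805)])))   # =&amp;gt; [0.042]
&lt;/pre&gt;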

  &lt;p&gt;
    TCP Timestamps were added to TCP very early on, for two
    reasons, neither of which was very compelling in retrospect.&lt;/p&gt;

  &lt;p&gt;Reason number one was PAWS, Protection Against Wrapped-Around
    Sequence-Numbers. The idea was that very fast connections might
    require huge TCP window sizes, and minor packet reordering/duplication
    might cause an old packet to be interpreted as a new packet, due to the
    32 bit sequence number having wrapped around. I don&#039;t think that
    world ever really arrived, and PAWS is irrelevant to practically
    all TCP use cases.&lt;/p&gt;

  &lt;p&gt;The other original reason for timestamps was to enable TCP
    senders to measure RTTs in the presence of packet loss. But this
    can also be done with TCP Selective ACKs, a feature that&#039;s much
    more useful in general (and thus was widely deployed a lot sooner,
    despite being standardized later).
  &lt;/p&gt;

  &lt;p&gt;In exchange for these dubious benefits, every TCP packet (both
    data segments and pure control packets) is bloated by 12 bytes.
    This is in contrast to something like selective ACKs, where most
    packets don&#039;t grow in size. You only pay for selective ACKs when
    packets are lost or reordered. I &lt;a href=&#039;https://www.snellman.net/blog/archive/2016-12-01-quic-tou/&#039;&gt;think that the debuggability
    of network protocols is important&lt;/a&gt;, but with TCP you get basically
    everything you need from other sources. TCP timestamps have a high
    fixed cost, but give very little additional power.
  &lt;/p&gt;
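
  &lt;p&gt;
    To put those 12 bytes into perspective (using the usual 20-byte IPv4
    and TCP headers):
  &lt;/p&gt;

&lt;pre&gt;
pure_ack     = 20 + 20      # IPv4 header + TCP header, no payload, no other options
full_segment = 1500         # an MTU-sized data packet
ts_option    = 12           # timestamp option plus padding

print(ts_option / pure_ack)       # =&amp;gt; 0.3: every pure ACK grows by 30%
print(ts_option / full_segment)   # =&amp;gt; 0.008: well under 1% for full-sized segments
&lt;/pre&gt;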

  &lt;p&gt;If TCP Timestamps suck so much, why does everyone use
    them? I don&#039;t know anyone else&#039;s reasons for sure. I ended up
    implementing them purely due to an interoperability issue with the
    FreeBSD TCP stack. Basically FreeBSD uses a small static receive
    window for connections without TCP timestamps, while with TCP
    timestamps on it&#039;d scale the receive window up as necessary.
    For connections with even a bit of latency, you needed
    TCP timestamps to avoid the receive window becoming a bottleneck.
    (This was &lt;a href=&#039;https://svnweb.freebsd.org/base?view=revision&amp;revision=316676&#039;&gt;fixed in FreeBSD a few months ago&lt;/a&gt;. Yay!).&lt;/p&gt;

  &lt;p&gt;Now, performance of FreeBSD clients isn&#039;t a big deal for me as long as
    the connections work. But you know who else uses a FreeBSD-derived
    TCP stack? Apple. And when it comes to mobile networks, performance
    of iOS devices is about as important as it gets. Anyone who cares about
    large transfers to iOS or OS X clients must use TCP Timestamps,
    no matter how distasteful they find the feature.&lt;/p&gt;

  &lt;p&gt;&lt;i&gt;&quot;But Juho, what does any of this have to do with S3?&quot;&lt;/i&gt;, you ask.
    Well, S3 is one of those rare services that disable
    timestamps. And that actually makes for a big difference
    in this case. With timestamps, each retransmitted copy of a packet would use a
    different timestamp value &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;].
    And when any part of the TCP header changes, odds are that the
      checksum changes as well. Even if some packets are lost due to
      having the magic checksum, at least the retransmissions will
      make it through promptly.
  &lt;/p&gt;

  &lt;p&gt;To check this theory, I asked for a test with TCP timestamps
    disabled on the client. And immediately large downloads from
    anywhere - even the ISP&#039;s own speedtest server - started hanging.
    Success!&lt;/p&gt;

  &lt;h3&gt;Conclusion&lt;/h3&gt;

  &lt;p&gt;With this information I suggested my coworker call his ISP, and
    report the problem.
    He was smarter than that, and ran one more test: switching the
    cable modem from router mode to bridging mode. Bam, the problem
    was gone. In retrospect this makes sense: in router mode the cable
    modem needs to update the checksums for each packet that passes
    through the device. In bridging mode there&#039;s no NAT, so no
    checksum update is needed.
  &lt;/p&gt;

  &lt;p&gt;And that&#039;s how a dodgy cable modem caused downloads to fail with
    one service, but one service only. I&#039;ve seen many kinds of packet
    corruption before, but never anything that was so absurdly specific.
  &lt;/p&gt;

  &lt;h3&gt;Footnotes&lt;/h3&gt;

  &lt;div class=footnotes&gt;
    &lt;p&gt;
      &lt;a id=&#039;fn0&#039;&gt;[&lt;a href=&#039;#fnref0&#039;&gt;0&lt;/a&gt;] There are techniques
  around for routing the traffic such that we would have had a
  measurement point. One would have been using something like a VPN or
  a SOCKS proxy. But that&#039;s such a fundamental change to the traffic
  pattern that it doesn&#039;t make for a very interesting test. Odds are
  that the problem would just go away when you do that. The other
  option would be to use a fully transparent generic TCP proxy on some
  server with a public IP, have the client connect to the TCP
  proxy and the proxy connect to the actual server. But setting that
        up is tedious; certainly not worth doing as a first step.
    &lt;/p&gt;

    &lt;p&gt;It&#039;s also pretty common to only have one trace point to start
      with.  For analysis I&#039;d do for actual work purposes, we pretty
      often have just a trace from somewhere in the middle of the
      path, but nothing from the client or the server. Getting traces
      from multiple points is so much trouble that we usually need
      to roughly pinpoint the problem first with single-point packet
      capture, and only then ask for more trace points.&lt;/p&gt;

    &lt;p&gt;
  &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] As far as I can tell 0xd7a7 has no interesting special
    properties. The bytes are not printable ASCII characters. 0xd7a7
    isn&#039;t a value with any special significance in another TCP header
    field either. There are ways to screw up TCP checksum computations, but
    I think they&#039;re mostly to do with the way 0x0 and 0xffff are both
    zero values in a one&#039;s complement system.

    &lt;p&gt;
  &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] Assuming a sensible timestamp resolution. Not the rather impractical 500ms tick that e.g. OpenBSD uses.
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Thu, 20 Jul 2017 16:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2017-07-20-s3-mystery/</guid></item><item><title>The hidden cost of QUIC and TOU</title><link>https://www.snellman.net/blog/archive/2016-12-01-quic-tou/</link><description>
&lt;p&gt;
  Application specific UDP-based protocols have always been around,
  but with traffic volumes that are largely rounding errors. Recently the
  idea of using UDP has become a lot more respectable.
  IETF has started the ball rolling on &lt;a
  href=&#039;https://tools.ietf.org/html/draft-tsvwg-quic-protocol-00&#039;&gt;standardizing
  QUIC&lt;/a&gt;, Google&#039;s UDP-based combination of TCP+TLS+HTTP/2. And
  Facebook published Linux kernel patches to add an encrypted UDP
  encapsulation of TCP, &lt;a
  href=&#039;https://tools.ietf.org/html/draft-herbert-transports-over-udp-00&#039;&gt;TOU (Transports
  over UDP)&lt;/a&gt;. On a very high level, the approaches are dramatically
  different.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
  QUIC is a totally new design that can really experiment on the
  protocol level, but requires implementations to start from
  scratch. Some of the new features are compelling (e.g. proper
  multiplexing of multiple data streams), a few I have my doubts
  on (e.g. forward error correction). TOU is a conservative evolution,
  and pretty much just includes one actual new feature. But it can
  fully leverage the host TCP stack on the server. The client would
  still require a user space TCP stack and user space TOU
  encapsulation.

&lt;p&gt; But despite the difference in designs, the goals are very
  similar. Both proposals attempt to speed up protocol evolution by
  decoupling the protocol from the client OS, and moving it to the
  application. (The companies that designed these protocols happen to
  control the servers and the client application program, but not
  really the client OS). They&#039;d also both add support for connection
  migration in a way that should be more deployable than multipath
  TCP. It&#039;s hard to argue against either of these ideas.

&lt;p&gt;
  And then there&#039;s the third big commonality. Both proposals encrypt
  and authenticate the layer 4 headers. This is the bit that I&#039;m
  uneasy about.

&lt;p&gt;
  The recent movement to get all traffic encrypted has of course been
  great for the Internet. But the use of encryption in these protocols
  is different than in TLS. In TLS, the goal was to ensure the privacy
  and integrity of the payload. It&#039;s almost axiomatic that third
  parties should not be able to read or modify the web page you&#039;re
  loading over HTTPS. QUIC and TOU go further. They encrypt the
  control information, not just the payload. This provides no meaningful
  privacy or security benefits.

&lt;p&gt;
  Instead the apparent goal is to break the back of &lt;a href=&#039;https://en.wikipedia.org/wiki/Middlebox&#039;&gt;middleboxes&lt;/a&gt; &lt;a id=&#039;fnref0&#039;&gt;[&lt;a href=&#039;#fn0&#039;&gt;0&lt;/a&gt;].
  The idea is that TCP can&#039;t evolve due to middleboxes and is pretty
  much fully ossified. They interfere with connections in all kinds of
  ways, like stripping away unknown TCP options or dropping packets
  with unknown TCP options or with specific rare TCP flags set. The
  possibilities for breakage are endless, and any protocol extensions
  have to jump through a lot of hoops to try to minimize the damage.

&lt;p&gt;
  It&#039;s almost an extension of the &lt;a href=&#039;https://en.wikipedia.org/wiki/End-to-end_principle&#039;&gt;end-to-end principle&lt;/a&gt;. Not only
  should protocols be defined such that functionality that can&#039;t be
  implemented correctly in the network is defined in the application.
  Protocols should in addition be defined such that it&#039;s not possible
  for the network to know anything about the traffic, lest somebody
  try to add any features at that level. Dumb pipes all the way!

&lt;p&gt;
  It&#039;s a compelling story. I&#039;m even pretty sympathetic to it, since in
  my line of work I see a lot of cases where obsolete or badly
  configured middleboxes cause major performance degradation. (See
  this &lt;a href=&#039;https://news.ycombinator.com/item?id=11766875&#039;&gt;HN
  comment&lt;/a&gt; for an example).

&lt;p&gt;
  But let&#039;s take the recent findings about the &lt;a href=&#039;https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf&#039;&gt;deployability of TCP
  Fast Open&lt;/a&gt; as an example. The headline number is absolutely
  horrific: 20% failure rate! But actually that appears to be 20%
  where TCP Fast Open can&#039;t be successfully negotiated, not 20%
  where connections fail. And this is for the absolute worst case;
  Fast Open doesn&#039;t just add new TCP options, it effectively modifies the TCP
  state machine for the handshake. I&#039;ve implemented a bunch of TCP
  extensions over the years. TCP Fast Open was by far the hardest
  to get right.

&lt;p&gt;
  Compared to the reported 8% failure rates to negotiate a QUIC
  connection, that number looks totally reasonable. (In both cases
  there is a fallback to negotiate a different type of connection, and
  blacklists will be used to directly go to the fallback method the
  next time around). But somehow one of these is deemed acceptable,
  while the other is a sign of terminal ossification. &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;].

&lt;h3&gt;What you lose with encrypted headers&lt;/h3&gt;

&lt;p&gt;
  What&#039;s wrong with encrypted transport headers? One possible argument
  is that middleboxes actually serve a critical function in the network,
  and crippling them isn&#039;t a great idea. Do
  you really want a world where firewalls are unviable?  But I work on
  middleboxes, so of course I&#039;d say that. (Disclaimer: these are my
  own opinions, not my employer&#039;s). So let&#039;s ignore that. Even so,
  readable headers have one killer feature: troubleshooting.

&lt;p&gt;
  The typical network problem that my team gets to
  troubleshoot is some kind of traffic either not working at
  all, or working slower than it should be. So something like
  the following &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]:

&lt;ul&gt;
  &lt;li&gt; Users are complaining that Youtube videos only play in SD, but are
    choppy in HD.
  &lt;li&gt; Speedtest is showing 10Mbps on an LTE connection
    that should be able to do 50Mbps.
  &lt;li&gt;Large FTP transfers between machines in Germany and Singapore
    are only getting speeds of 2Mbps.
  &lt;li&gt; Uploads over a satellite link are so slow that they stall and
    get terminated rather than ever finish.
&lt;/ul&gt;

&lt;p&gt;
  To debug issues like this I start with a packet capture from the
  points in the network I have access to. Most of the time that&#039;s just
  a point in the middle (e.g. a mobile operator&#039;s core network). From
  just one trace, we can determine things such as the following:

&lt;ul&gt;
&lt;li&gt;Determine packet loss rates (on both sides, i.e. packets lost on
  the server -&amp;gt; core hop, and on the core -&amp;gt; client hop).
&lt;li&gt;Correlate packet loss with other events.
&lt;li&gt;Detect packet reordering rates (on both sides).
&lt;li&gt;Detect packet corruption rates (on both sides).
&lt;li&gt;Determine RTTs continuously over the lifetime of a connection, not
  just during a connection handshake (e.g. to use queuing as a
  congestion signal to establish the downlink as the bottleneck).
&lt;li&gt;Estimate sender congestion windows from observed delivery rates
  (to determine whether congestion control is the bottleneck).
&lt;li&gt;Inspect the TCP options (e.g. window scaling, mss) and the receive
  windows to determine whether the software on the client or the server
  is the bottleneck.
&lt;li&gt;Distinguish between pure control packets and data packets (e.g. to
  distinguish multiple separate HTTPS requests within a single TCP
  connection).
&lt;li&gt;Detect the presence of middleboxes that are interfering with the
  connection. (But only occasionally; more often you&#039;ll need multiple
  traces for this).
&lt;/ul&gt;

&lt;p&gt;
  We do most of this with some specialized tools. But it&#039;s essentially no
  different from opening up the trace in Wireshark, following a
  connection with disappointing performance, and figuring out what
  happened. That&#039;s something that every network engineer probably
  does on a regular basis.

&lt;p&gt;
  With encrypted control information you can&#039;t figure out any of this. The
  only solid data you get is the throughput (not even the goodput). For anything
  more, you need traces from multiple points in the network. Those are
  hard to get; sometimes it&#039;s even outright impossible.
  And to do the analysis, you need to correlate those
  multiple traces with each other. That&#039;s a significantly higher barrier
  than just opening up Wireshark.
  In practice the network becomes a total black box, even to the
  people who are supposed to keep it running. That&#039;s not going to be
  a great place to be in.

&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt; To conclude, I think encrypting the L4 headers is a step too
  far. If these protocols get deployed widely enough (a distinct
  possibility with standardization), the operational pain will be
  significant.
&lt;/p&gt;


&lt;p&gt; There would be a reasonable middle ground where the headers are
  authenticated but not encrypted. That prevents spoofing and
  modifying packets, but still leaves open the possibility of
  understanding what&#039;s actually happening to the traffic.

&lt;h3&gt;Footnotes&lt;/h3&gt;

&lt;div class=&#039;footnotes&#039;&gt;

&lt;p&gt;
&lt;a id=&#039;fn0&#039;&gt;[&lt;a href=&#039;#fnref0&#039;&gt;0&lt;/a&gt;] Whoops, that&#039;s not quite accurate. There is one specific kind of
  middlebox that a company like Google or Facebook needs: a load
  balancer. And very conveniently, both protocols introduce a new
  field containing the information a load balancer needs, and give it
  special treatment as the one field that gets to live outside the
  encryption envelope.

&lt;p&gt;
&lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] There seems to be a bit of a difference in how this is rolled
  out. If you want TCP Fast Open in Chrome, you&#039;ll need to enable it
  via a flag. Meanwhile my understanding is that QUIC is effectively
  rolled out by geographical regions; the switch that&#039;s getting flipped is
  server side. Presumably the latter rollout procedure includes
  working with the main service providers in that area to make sure
  any problems get fixed in advance. A process like that would be a
  lot more tractable than trying to fix the whole world at once.

&lt;p&gt;
&lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;]
  In order, the bottlenecks were:
&lt;ul&gt;
&lt;li&gt; The Google cache not sending data quickly enough. That&#039;s where
     our visibility ended, but it was still enough to say that the
     actual mobile network was fine and things were out of the
     operator&#039;s hands.
&lt;li&gt; Large amounts of packet loss in the access network, correlated
  with burstiness of traffic. That insight was sufficient to allow
  the customer to locate some specific switches with insufficient
  buffer space. (And we could add a feature that mitigated the problem
  centrally, rather than require upgrading thousands of network nodes).
&lt;li&gt; We found no indications of a protocol or network level bottleneck,
  so the problem had to be either with the application
  programs or OS configuration. Switching to a different FTP server
  did in fact solve the problem.
&lt;li&gt; A massive proportion (more than 5%) of packets from a subset of
  satellite endpoints had TCP checksum errors. This was a specific enough
  diagnosis to enable a binary search through the network path for the
  problematic link or device.
&lt;/ul&gt;
&lt;p&gt;
  What I&#039;m getting at here is that there&#039;s a seemingly unending supply
  of different potential network problems. So many that you need to have
  an idea of the nature of the underlying issue before you can try to
  pinpoint it exactly.
&lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Thu, 01 Dec 2016 16:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2016-12-01-quic-tou/</guid></item><item><title>The many ways of handling TCP RST packets</title><link>https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/</link><description>
      &lt;p&gt;
        What could be a simpler networking concept than TCP&#039;s RST
        packet? It just crudely closes down a connection, nothing
        subtle about it.  Due to some odd RST behavior we saw
        at &lt;a href=&#039;https://www.teclo.net/&#039;&gt;work&lt;/a&gt;,
        I went digging in the RFCs to check what the technically correct
        behavior is, and in different TCP implementations to see what&#039;s
        actually done in practice.
      &lt;/p&gt;

      &lt;read-more&gt;&lt;/read-more&gt;

      &lt;h2&gt;Background&lt;/h2&gt;

      &lt;p&gt;
        In the original TCP
        specification, &lt;a href=&#039;https://tools.ietf.org/html/rfc793&#039;&gt;RFC
        793&lt;/a&gt;, RSTs are defined in terms of the following TCP state
        variables:
      &lt;/p&gt;

      &lt;ul&gt;
        &lt;li&gt;RCV.NXT - The sequence number of the next byte of data the receiver is expecting from the sender
        &lt;li&gt;RCV.WND - The amount of receive window space the receiver is advertising
        &lt;li&gt;RCV.NXT + RCV.WND - The sequence number of the last byte of data the receiver is willing to accept at the moment.
        &lt;li&gt;SND.UNA - The first sequence number that the sender has not yet seen the receiver acknowledge.
        &lt;li&gt;SND.NXT - The sequence number of the next byte of payload data the sender would transmit.
      &lt;/ul&gt;

      &lt;p&gt;
        An RST is accepted if the sequence number is in the receiver&#039;s
        window (i.e. RCV.NXT &amp;lt;= SEG.SEQ &amp;lt; RCV.NXT+RCV.WND). The effect
        of the RST is to immediately close the connection. This is
        slightly different from a FIN, which just says that the other
        endpoint will no longer be transmitting any new data but can
        still receive some.
      &lt;/p&gt;
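
      &lt;p&gt;
        In code, that acceptance check looks roughly like the following.
        Sequence numbers are 32-bit and wrap around, so the comparison has
        to be done on the offset from RCV.NXT modulo 2^32; this is a
        sketch of the rule as stated in the RFC, not any particular
        stack&#039;s implementation:
      &lt;/p&gt;

&lt;pre&gt;
MOD = 2 ** 32

def rst_in_window(seg_seq, rcv_nxt, rcv_wnd):
    # Accept if RCV.NXT &amp;lt;= SEG.SEQ &amp;lt; RCV.NXT + RCV.WND, with sequence
    # number wrap-around handled by comparing offsets modulo 2^32.
    return (seg_seq - rcv_nxt) % MOD &amp;lt; rcv_wnd
&lt;/pre&gt;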

      &lt;p&gt;
        There are two types of event that cause a RST to be
        emitted. A) the connection is explicitly aborted by the
        endpoint, e.g. the process holding the socket being killed
        (just closing the socket normally is not grounds for RST, even
        if there is still unreceived data). B) the TCP stack receiving
        certain kinds of invalid packets, e.g. a non-RST packet for a
        connection that doesn&#039;t exist or has already been closed.

      &lt;p&gt;
        The RST packet that should be generated is slightly
        different for these two cases. For case A the sequence number
        for the RST packet should be SND.NXT of the connection. For
        case B the sequence number should be set to the sequence
        number ACKed by the received packet. In the latter case the
        ACK bit will not be set in the RST. (Where the distinction
        matters, I&#039;ll call the first one an RST-ABORT and the second an
        RST-REPLY.)  &lt;/p&gt;

      &lt;h2&gt;Receiving a RST&lt;/h2&gt;

      &lt;p&gt;
        The RCV.NXT state variable is doing double duty in the
        original RFC 793. It&#039;s defined as &amp;quot;next sequence number
        expected on an incoming connection&amp;quot;, but it&#039;s also
        implied that it&#039;s the most recent acknowledgement number sent
        out. This was true early on, but not after the introduction of
        delayed ACKs (&lt;a href=&#039;https://tools.ietf.org/html/rfc1122&#039;&gt;RFC
        1122&lt;/a&gt;). Which of these two interpretations should be used
        for checking whether the RST is in window?
      &lt;/p&gt;

      &lt;p&gt;
        Linux goes with the latter one, and splits the two roles
        out. RCV.NXT is strictly defined as the next expected sequence
        number, and RCV.WUP is the highest sequence number for which
        an ACK has actually been sent. RST handling is done using
        RCV.WUP, and in fact the following comment implies that RSTs
        are the main reason for this mechanism.
      &lt;/p&gt;

&lt;pre&gt;
 * Also, controls (RST is main one) are accepted using RCV.WUP instead
 * of RCV.NXT. Peer still did not advance his SND.UNA when we
 * delayed ACK, so that hisSND.UNA&amp;lt;=ourRCV.WUP.
 * (borrowed from freebsd)
&lt;/pre&gt;

With code like this:

&lt;pre&gt;
return !before(end_seq, tp-&amp;gt;rcv_wup) &amp;&amp;
       !after(seq, tp-&amp;gt;rcv_nxt + tcp_receive_window(tp));
&lt;/pre&gt;

&lt;p&gt;
The use of RCV.NXT instead of RCV.WUP for the second subexpression is a
clever ruse; tcp_receive_window() is roughly
max(0, rcv_wup + rcv_wnd - rcv_nxt), so the upper bound works out to
rcv_wup + rcv_wnd and the final result is relative to RCV.WUP instead of
RCV.NXT. But it is a slightly mysterious piece of code. Why is SND.UNA
relevant? It should be SND.NXT that matters.

&lt;p&gt;
  Sure, you can construct cases where a RST would be emitted with a
  sequence number where this makes a difference (but only for
  RST-REPLY, not for RST-ABORTs). In these particular cases you&#039;d save one
  roundtrip. An example would be something like: &lt;/p&gt;

&lt;table&gt;
  &lt;tr&gt;&lt;td&gt;Client&lt;td&gt;Server&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sends 1000:1100&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sends 1100:1200&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;receives 1000:1100&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;ACKs 1100&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;receives 1100:1200&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;delays ACK&lt;/tr&gt;

&lt;tr&gt;&lt;td&gt;aborts the connection&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;does not send a FIN&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sends RST seqnr 1200&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;RST is lost&lt;/tr&gt;

&lt;tr&gt;&lt;td&gt;receives ACK for 1100&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sends RST 1100 (not 1200)&lt;td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td&gt;&lt;td&gt;receives RST 1100&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;&lt;b&gt;closes connection iff using RCV.WUP&lt;/b&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td&gt;&lt;td&gt;ACKs 1200&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;receives ACK for 1200&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sends RST 1200&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;receives RST 1200&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;
But that&#039;s extremely contrived, and would be easily broken by
delays (all that needs to happen is for that delayed ACK to actually
get emitted before the second RST is received, and it would get
rejected).

&lt;p&gt;
I tried to do a bit of archaeology to see if historical context would
help figure this out, but since it all happened in the dark pre-git
ages, it&#039;s a bit tricky. The commit is
in &lt;a href=&#039;https://kernel.googlesource.com/pub/scm/linux/kernel/git/davem/netdev-vger-cvs/+/585d5180%5E%21/&#039;&gt;585d5180
in netdev-vger-cvs&lt;/a&gt;, and while I couldn&#039;t track down any discussion
from 2001/2002 when the change was made, there is a mailing list post
from this decade
(&lt;a href=&#039;http://lists.openwall.net/netdev/2010/01/06/35&#039;&gt;here&lt;/a&gt;).
After reading these, I was still not really enlightened. Was/is there
kit around that generates RSTs using SND.UNA rather than SND.NXT when
closing a connection?

&lt;p&gt;
(I mean genuine endpoints. There are plenty
of &lt;a href=&#039;https://www.isoc.org/isoc/conferences/ndss/09/slides/08.pdf&#039;&gt;middleboxes&lt;/a&gt;
around that generate RSTs with all kinds of sequence numbers, but
those might not be the RSTs you want to be reacting to anyway.)

&lt;p&gt;
The
original &lt;a href=&#039;https://github.com/freebsd/freebsd/commit/bc0a68481777596d794e4114b6e17059938c1c16&#039;&gt;FreeBSD
commit&lt;/a&gt; where this concept was introduced has some additional
detail:

&lt;pre&gt;
* Note: this does not take into account delayed ACKs, so
*   we should test against last_ack_sent instead of rcv_nxt.
*   &lt;b&gt;Also, it does not make sense to allow reset segments with
*   sequence numbers greater than last_ack_sent to be processed
*   since these sequence numbers are just the acknowledgement
*   numbers in our outgoing packets being echoed back at us,&lt;/b&gt;
*   and these acknowledgement numbers are monotonically
*   increasing.
&lt;/pre&gt;

&lt;p&gt;
The bit I bolded does make it appear that the motivation for the
change was specifically RST-REPLYs, not RST-ABORTs. If you&#039;re
looking for an exact match with some ACK you&#039;ve sent, it&#039;d be insane
to try to match an ACK that had never been sent. But if
the other kind of RST had been considered, it would have been obvious
that in the presence of delayed ACKs a valid RST could have a
sequence number higher than last_ack_sent.

&lt;p&gt;
This rule of exact match was indeed later changed in various ways;
first reverted to accepting any RST in the window (i.e. just the plain
in-window check), then re-enabled while the socket is in the ESTABLISHED
state but with less strict checks in other states, then loosened to
accepting either RCV.NXT or last_ack_sent,
or accepting &lt;a href=&#039;https://github.com/freebsd/freebsd/commit/fb4a7a64bd2806ed348b750074a59b56589da9a&#039;&gt;either of those &amp;plusmn; 1&lt;/a&gt;, etc.

&lt;p&gt;
If you&#039;re already tracking last_ack_sent / RCV.WUP, doing the window
checks on that instead of RCV.NXT seems sensible. But it does feel
like an optional thing, rather than one of those pieces of necessary
TCP behavior that were only standardized in folklore.

&lt;p&gt;
But that&#039;s ancient history. In
&lt;a href=&#039;https://tools.ietf.org/html/rfc5961&#039;&gt;RFC 5961&lt;/a&gt; the
suggested rules for accepting a RST were changed to make RST spoofing
attacks harder. There are now three possible reactions:

&lt;ul&gt;
&lt;li&gt; RST completely out of window, do nothing
&lt;li&gt; RST matches RCV.NXT exactly, close connection
&lt;li&gt; RST in-window but not RCV.NXT, send an ACK (if the RST was genuine, the other endpoint will send a new one but this time with the correct sequence number).
&lt;/ul&gt;
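
&lt;p&gt;
As a sketch of that logic (names invented, ESTABLISHED state only, using
the same kind of serial arithmetic helper as in the earlier sketch):

&lt;pre&gt;
#include &amp;lt;cstdint&amp;gt;

enum class RstAction { Ignore, CloseConnection, SendChallengeAck };

static bool seq_lt(uint32_t a, uint32_t b) { return (int32_t)(a - b) &amp;lt; 0; }

// RFC 5961 style reaction to a RST received in the ESTABLISHED state.
RstAction handle_rst(uint32_t seg_seq, uint32_t rcv_nxt, uint32_t rcv_wnd) {
    if (seq_lt(seg_seq, rcv_nxt) || !seq_lt(seg_seq, rcv_nxt + rcv_wnd))
        return RstAction::Ignore;            // completely out of window
    if (seg_seq == rcv_nxt)
        return RstAction::CloseConnection;   // exact match with RCV.NXT
    return RstAction::SendChallengeAck;      // in window, but not RCV.NXT
}
&lt;/pre&gt;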

&lt;p&gt;
This means that even for the above contrived case, using RCV.WUP as
the start of the window doesn&#039;t make a substantial difference. One
extra round-trip is required no matter what; whether the ACK is the
delayed ACK or the challenge ACK to a semi-bogus RST is immaterial.

&lt;p&gt;
That&#039;s pretty fertile ground for different implementations. One
standard, one ambiguity that appears to have never been officially
clarified, and one proposed standard. I could go into more detail on
the archaeology or on how to perform the experiments for if the
source code isn&#039;t available. But that&#039;d get tedious, so here&#039;s a
summary of my best understanding of how various operating systems
work for RSTs received in the ESTABLISHED state (the picture assumes
a last ACK sent of 12, last sequence number received of 17, and a
window of 10).
&lt;/p&gt;

&lt;img src=&#039;/blog/stc/images/rst-spans.png&#039;&gt;&lt;/img&gt;

&lt;p&gt;
Yes, seven operating systems and no two even agree on what an
acceptable RST looks like! And there&#039;s still some scope for other
reasonable but different implementations beyond those. RFC 5961
handling but with the window based on RCV.NXT, and RFC 5961 handling
that accepts both RCV.WUP and RCV.NXT as the sequence number would be
obvious examples. (To be fair, some of these differences will
collapse away if no ACKs are currently being delayed when the RST is
received).

&lt;p&gt;
The one data point in that table that seems particularly strange is
RCV.NXT + 1 in OpenBSD. If you&#039;re going to widen the window of
acceptable sequence numbers by 1, I would have thought that adjusting
it downwards rather than up would be the obvious
choice. The &lt;a href=&#039;http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/netinet/tcp_input.c#rev1.200&#039;&gt;
commit message&lt;/a&gt; is pretty spartan, just blaming Windows.

&lt;p&gt;
You could get into some further complications with the window handling
for RSTs that include data (sigh, the awesome idea in RFC 1122 to
embed the reason for the RST as ASCII payload), but AFAIK those don&#039;t
exist in the wild so maybe this is enough for now.


&lt;h2&gt;Interlude on RST handling in TCP aware middleboxes&lt;/h2&gt;

&lt;p&gt;
The above was just considering RST handling from the point of view of
the endpoints, where the main question is whether to react to an RST,
not how you react to it. Things get different yet again for a
middlebox. The middlebox might not actually have the same
understanding of the state variables as either of the endpoints. If
the reaction of the middlebox to the RST is somehow destructive, the
middlebox has to be careful. As a trivial example, imagine a firewall
that blocks all packets for unknown connections. It would be quite bad
for that firewall to react to a RST packet that the actual endpoint
rejects.


&lt;p&gt;
What if the middlebox is not just TCP aware, but an active
participant in the connection? At work we do a fully transparent
non-terminating &lt;a href=&#039;http://teclo.net/mobile-operator-solutions/products/&#039;&gt;TCP optimizer for mobile networks&lt;/a&gt;
(see &lt;a href=&#039;https://www.snellman.net/blog/archive/2015-08-25-tcp-optimization-in-mobile-networks/&#039;&gt;previous
post for more technical details&lt;/a&gt; if that&#039;s your kind of thing). Despite not
terminating the connections, we still ACK data and thus take
responsibility for delivering it. How should a device like that
react to a RST if we still have undelivered but ACKed data? It
actually depends on which directions we&#039;ve seen a valid RST in. To a
first approximation we should stop sending new data toward the endpoint
we&#039;ve gotten RSTs from, but not close the connection or forward
the RST packet until all data that we&#039;ve acknowledged has been delivered
to the destination of the RST. (With the edge case of valid RSTs in both
directions forcing connection closure even with undelivered data).

&lt;h2&gt;Sending RSTs&lt;/h2&gt;

&lt;p&gt;
So there are a bunch of different ways to process a RST, but surely
sending RSTs is trivial? The rules there are simple and unambiguous,
and unlike for processing an incoming RST there should be no
motivation to deviate from them. There are good reasons to be
stricter about accepting RSTs, but there should be no reason to send
RSTs that will get rejected.

&lt;p&gt;
Ha-ha, just kidding.

&lt;p&gt;
First, what does SND.NXT really mean? RFC 793 only describes it as the
&amp;quot;next sequence number to be sent&amp;quot;. On the face of it this
could mean two things: either it&#039;s the lowest sequence number that
has never been sent, or it could be the next sequence number to be
transmitted, including retransmissions. The latter interpretation seems
morally bankrupt. The RFC describes a mechanism by which SND.NXT
increases, but not one by which it decreases. The glossary also
explicitly mentions retransmissions as falling between SND.UNA and
SND.NXT, which implies SND.NXT should not decrease.

&lt;p&gt;
But in reality you see a lot of clients reduce SND.NXT if they believe
there&#039;s been a full tail loss. Something like this:

&lt;pre&gt;
00:00:53.693026 client &gt; server: Flags [P.], seq 973096734:&lt;b&gt;973096882&lt;/b&gt;, ack 2107240749, win 4200, length 148
00:00:53.693043 server &gt; client: Flags [.], ack &lt;b&gt;973095334&lt;/b&gt;, win 4000, options [sack 1 {973096734:&lt;b&gt;973096882&lt;/b&gt;},eol], length 0
00:00:53.789417 client &gt; server: Flags [.], seq 973095334:&lt;b&gt;973096734&lt;/b&gt;, ack 2107240749, win 4200, length 1400
00:00:53.789434 server &gt; client: Flags [.], ack 973096882, win 4000, length 0
00:00:53.789455 client &gt; server: Flags [.], seq 973095334:973096734, ack 2107240749, win 4200, length 1400
00:00:53.789458 server &gt; client: Flags [.], ack 973096882, win 4000, length 0
00:00:53.789475 client &gt; server: Flags [R.], seq &lt;b&gt;973096734&lt;/b&gt;, ack 2107240749, win 0, length 0
&lt;/pre&gt;

&lt;p&gt;
In this example when the client sends the RST, SND.UNA should be
973095334 and SND.NXT should be 973096882 - highest sequence number
sent, and in fact selectively acked. The RST is not generated with
either of those, but the in-between sequence number of 973096734.

&lt;p&gt;
There&#039;s another problem, which will matter especially for FIN+RST
connection teardowns. Here&#039;s what closing down a connection
prematurely looks like on OS X:
&lt;/p&gt;

&lt;pre&gt;
00:43:09.299678 client &amp;gt; server: Flags [.], ack 3343152432, win 32722, length 0
00:43:09.300615 client &amp;gt; server: Flags [F.], seq &lt;b&gt;773912652&lt;/b&gt;, ack 3343155284, win 32768, length 0
00:43:09.300617 server &amp;gt; client: Flags [.], ack 773912653, win 6000, length 0
00:43:09.300981 client &amp;gt; server: Flags [R], seq &lt;b&gt;773912652&lt;/b&gt;, win 0, length 0
00:43:09.300983 client &amp;gt; server: Flags [R], seq 773912652, win 0, length 0
00:43:09.300984 client &amp;gt; server: Flags [R], seq 773912652, win 0, length 0
00:43:09.301449 client &amp;gt; server: Flags [R], seq &lt;b&gt;773912653&lt;/b&gt;, win 0, length 0
&lt;/pre&gt;

&lt;p&gt;
The first RST has the same sequence number as the FIN. That&#039;s the
correct behavior, since that RST was sent in reply to a packet that
was already in flight to the client by the time the server received
the packet. Apparently OS X had the RST get triggered by an incoming
packet before it got around to sending an RST with a useful sequence
number. So there&#039;s an extra round trip&#039;s delay, and possibly an
extra round trip&#039;s worth of packets. But that&#039;s not a huge deal.

&lt;p&gt;
What would really suck is if an operating system were to totally
ignore the rule to reply with an RST when receiving a packet on a
closed socket. Which is of course exactly what iOS does. I don&#039;t know
the exact parameters of when it happens (e.g. it might be something
that happens just when connected over cellular, not over Wifi), but
what you get is something like this:

&lt;pre&gt;
00:00:15.099703 client &amp;gt; server: Flags [.], ack 3593074068, win 8192, length 0
00:00:15.099705 client &amp;gt; server: Flags [F.], seq &lt;b&gt;80108342&lt;/b&gt;, ack 3593074068, win 8192, length 0
00:00:15.099706 server &amp;gt; client: Flags [.], ack 80108343, win 124, length 0
00:00:15.099706 client &amp;gt; server: Flags [R], seq &lt;b&gt;80108342&lt;/b&gt;, win 0, length 0
00:00:15.228490 server &amp;gt; client: Flags [.], seq 3593074068:3593075068, ack 80108343, win 125, length 1000
... [ crickets, while the same data gets regularly retransmitted to the client ] ...
00:02:11.646208 server &amp;gt; client: Flags [.], seq 3593074068:3593075068, ack 80108343, win 125, length 1000
&lt;/pre&gt;

&lt;p&gt;
We get a single RST, which for an active connection is almost guaranteed
to have a sequence number that will be rejected. Without at least one
valid RST, the server has to keep the connection open, and data gets
wastefully retransmitted over and over. This lasts until the server
times out the connection, which might take anything from ten seconds
to two minutes.

&lt;p&gt;
And just to be clear, the phone was still connected and functional for
the full duration of that trace. There were other active connections
to it, and those continued happily along. It&#039;s just the closed sockets
that become total black holes.

&lt;p&gt;
Note that the RCV.WUP trickery discussed in the earlier section alone
would not be sufficient to handle this kind of situation. In
addition you need to either delay ACKs to FINs (OS X), be more
forgiving about RSTs after receiving a FIN (OS X, FreeBSD 10.2), or
have a small amount of slack in the sequence number check (&amp;plusmn;1
in FreeBSD 10.2; OpenBSD has slack but in the wrong direction).

&lt;h2&gt;Some statistics&lt;/h2&gt;

&lt;p&gt;
To see how much of a difference these different variations have, I ran
some simulations against a trace from a varied real world traffic mix
and looked at what percentage of first RSTs received by the server
would have been accepted, versus dropped or answered with a challenge
ACK. (I grouped challenge ACKs together with dropping packets, since
the point here is to see which policy is the most efficient at
actually closing the connection as soon as possible. It&#039;s not to see
which ones manage to do it eventually. Either they&#039;ll all manage to do
it later since some packets will be sent eventually, or none of them
will do it since the other device isn&#039;t properly sending followup
RSTs. Also, an after the fact simulation can&#039;t possibly tell anything
about the efficiency of the challenge ACKs).

&lt;p&gt;
The data was filtered such that we only looked at connections matching
the following criteria. This was about 35k connections after the
filtering, so not a huge data set. This was almost exactly 1/3
RST-ABORTs and 2/3 RST-REPLYs.

&lt;ul&gt;
&lt;li&gt; The connection received an RST from the client at some point
&lt;li&gt; The RST arrived before the server had sent a FIN or a RST to the
  client
&lt;li&gt; The RST arrived after at least one data packet (not during
  handshake)
&lt;/ul&gt;

&lt;p&gt;
Also note that this is only comparing the different policies for RST
acceptance, not for example the effects of different delayed ACK
behavior in different TCP implementations.

&lt;p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td colspan=2&gt;% of first RSTs accepted&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;RST receive policy&lt;/td&gt;&lt;td&gt;RST-ABORT&lt;td&gt;RST-REPLY&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;FreeBSD 10.2&lt;td&gt;96.60&lt;td&gt;96.81&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;FreeBSD CURRENT&lt;td&gt;94.86&lt;td&gt;83.90&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Illumos&lt;td&gt;99.91&lt;td&gt;81.20&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Linux / Windows&lt;td&gt;96.45&lt;td&gt;81.20&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;OpenBSD&lt;td&gt;96.58&lt;td&gt;83.90&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;OS X&lt;td&gt;95.05&lt;td&gt;83.90&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;
Illumos is the only implementation here using the standard check
against the massive full window (tens or hundreds of kilobytes,
vs. 1-6 bytes for everything else). So it&#039;s not much of a surprise that
it&#039;s more effective than the others at closing the connection on just
about every RST-ABORT it receives. The tradeoff is that it&#039;s a lot more
susceptible to RST spoofing attacks.

&lt;p&gt;
When handling a RST-ABORT, we&#039;ll almost always have seen a FIN just
before. This means that RCV.WUP and RCV.NXT will be the same, so all
the other options are very close to each other. The minor differences
there come from:

&lt;ul&gt;
  &lt;li&gt; FreeBSD 10.2 accepts a superset of what OpenBSD does.
  &lt;li&gt; OpenBSD accepts a superset of what Linux and Windows do.
  &lt;li&gt; When there is no FIN, checking RCV.NXT is more predictive
    of the sequence number of a RST-ABORT than RCV.WUP would be,
    so anything not checking RCV.NXT loses out.
  &lt;li&gt; OS X relaxes the checks after receiving FIN, FreeBSD CURRENT
    no longer does.
&lt;/ul&gt;

&lt;p&gt;
For RST-REPLY, practically everyone gets the case of no FIN right.
But if there&#039;s a FIN (like in the iOS example above), FreeBSD 10.2 is
vastly more effective than anything else thanks to accepting RCV.NXT -
1. There&#039;s not much of a tradeoff. It&#039;s a good idea, and I&#039;m a bit
surprised that not only did it never spread outside of FreeBSD, but
that it has now been removed from there too.

&lt;p&gt;
The other results are boring, with just the RCV.WUP
vs. RCV.NXT difference (with the opposite results compared to RST-ABORTs).

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;
Almost every time I talk about TCP optimization to the general
development public, the reaction is something along the lines of
&amp;quot;ah, you&#039;re one of the guys who breaks the standard?&amp;quot;.
That&#039;s technically true but useless; nobody actually implements all the standards
exactly as written. Hopefully this dive into just one relatively
simple aspect shows that a) not everyone interprets the standards the
same way, b) there are reasons to intentionally deviate from the
standards, and c) everybody does it. You just can&#039;t be abusive when
deviating from the standards.

&lt;p&gt;
Oh, and I&#039;d already implemented accepting RSTs with sequence number
RCV.NXT - 1 before looking at these details (resetting the connections
promptly is kind of important for us due to reasons that I can&#039;t go into
here). But this investigation did make me feel a lot better about that
change.

</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Mon, 01 Feb 2016 16:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/</guid></item><item><title>Flow disruptor - a deterministic per-flow network condition simulator</title><link>https://www.snellman.net/blog/archive/2015-10-01-flow-disruptor/</link><description>
&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;
I finally got around to open sourcing
&lt;a href=&#039;https://github.com/teclo/flow-disruptor&#039;&gt;flow disruptor&lt;/a&gt;,
a tool I wrote at &lt;a href=&#039;https://www.teclo.net/&#039;&gt;work&lt;/a&gt; late last
year. What does it do? Glad you asked! Flow disruptor is a
deterministic per-flow network condition simulator.

&lt;p&gt;
To unpack that description a bit, per-flow means that the network
conditions are simulated separately for each TCP connection rather
than on the link layer. Deterministic means that we normalize as many
network conditions as possible (e.g. RTT, bandwidth), and any changes
in those conditions happen at preconfigured times rather than
randomly. For example the configuration could specify that the
connection experiences a packet loss exactly 5s after it was
initiated, and then a packet loss every 1s after that. Or that packet
loss happens at a specified level of bandwidth-limit-based queueing.

&lt;p&gt;
You can check the Github repo linked above for the code and for
documentation on e.g. configuration. This blog post is more on why
this tool exists and why it looks the way it does.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h2&gt;Motivation and an example&lt;/h2&gt;

&lt;p&gt;
Why write yet another network simulator? The use case here is fairly
specific; we want to compare the behavior of different TCP
implementations (our own and others&#039;) under controlled
conditions. Doing this with random packet loss, or with different
connections having different RTTs would be quite tricky. We also want
each TCP connection to be as isolated from the other traffic as possible.
With link level network simulation it&#039;s really easy to have test cases
bleed into each other or some background traffic to mess things up,
unless you&#039;re really careful.

&lt;p&gt;
Here&#039;s an example. This is looking at the behavior of half a dozen
large CDNs, and a few operating system installations with no tuning at
all, by downloading an equally large file from each of them.
This particular scenario sets up a 200ms base RTT (200ms no matter how
far away from the test client the server is; the amount of extra delay
added depends on the observed latencies from the proxy to the client
and the server), and alternates between a 2Mbps and a 4Mbps bandwidth
limit every 5 seconds.

&lt;pre&gt;
profile {
    id: &quot;200ms-variable-bandwidth&quot;
    filter: &quot;tcp and port 80 or 443&quot;
    target_rtt: 0.20
    dump_pcap: true

    downlink {
        throughput_kbps: 2000
    }
    uplink {
        throughput_kbps: 1000
    }

    timed_event {
        trigger_time: 5.0
        duration: 5.0
        repeat_interval: 10.0

        downlink {
            throughput_kbps_change: 2000
        }
    }
}
&lt;/pre&gt;

&lt;p&gt;
Below is the pattern of RTTs produced for a download of a 10MB
file. Time in seconds on X axis, time between segment sent and segment
acked on Y axis. It might look like a line graph, but actually it&#039;s a
scatterplot where every dot is a single segment. It&#039;s just that there
are a lot of RTT samples. I&#039;ve removed the legend from the graph,
since the data is pretty old and the point here isn&#039;t an argument
about what&#039;s reasonable TCP behavior and what isn&#039;t. But just so that
you have a frame of reference, the hot pink dots (4th from the top)
are Linux 3.16.

&lt;p&gt;
&lt;a href=&#039;/blog/stc/images/flow-disruptor/rtt-200ms-variable-bandwidth.png&#039; target=&#039;_blank&#039;&gt;&lt;img src=&quot;/blog/stc/images/flow-disruptor/rtt-200ms-variable-bandwidth.thumb.png&quot; /&gt;&lt;/a&gt;

&lt;p&gt;
What does that graph tell us? Well, when the available bandwidth
changes, it&#039;s the RTT that changes rather than the amount of data in
flight. This suggests that none of the tested servers uses a RTT-based
congestion control algorithm &lt;a href=&#039;#ftnt0&#039;
name=&#039;ftnt_ref0&#039;&gt;[0]&lt;/a&gt;. Which is of course not much of a
surprise given the challenges in deploying one of those, but you never
know. With no packet loss and no RTT feedback, it&#039;s not surprising
that there&#039;s some major bufferbloat happening.

&lt;p&gt;
For another point of view, a similar graph but this time looking at
the amount of data in flight (i.e. unacked) the moment each packet was
sent:

&lt;p&gt;
&lt;a href=&#039;/blog/stc/images/flow-disruptor/in-flight-200ms-variable-bandwidth.png&#039; target=&#039;_blank&#039;&gt;&lt;img src=&quot;/blog/stc/images/flow-disruptor/in-flight-200ms-variable-bandwidth.thumb.png&quot; /&gt;&lt;/a&gt;

&lt;p&gt;
You can see that a lot of CDNs clamp the congestion window
to reasonably sensible values (at around 256kB). Some others
appear to have neither any limit to the congestion window nor any
moral equivalent of a slow start threshold. Especially for a CDN that seems
a bit questionable; CDNs are supposed to have endpoints near to the user,
and there should be no need to keep such large amounts of data in flight
to a nearby user.

&lt;p&gt;
And does overstuffing the buffers like this help? At a first glance it
might appear to. After all, the &quot;lines&quot; with high amounts of data in
flight also end earlier. But that&#039;s just an artifact of them having
5-10s of data still undelivered when the last payload packet is
sent. If you look at the lower right corner of the graph, you can
see some orphan dots corresponding to the connection teardown. And
those are all happening at roughly the same time, as one would expect.

&lt;p&gt;
Another interesting thing is how some connections show up as one
smooth line, while others instead appear to consist of vertical
lines spaced out at fairly even intervals. This is probably the
difference between the server reacting to an ACK by sending out data
immediately, and the server reacting to an ACK only once sufficient
congestion window space has been opened up and then filling it all at
once. See for example how the olive green connection is extremely
bursty, sending out chunks of 256kB at a go.

&lt;p&gt;
So that&#039;s the kind of transport layer investigation this tool was
built for. You set up an interesting controlled scenario, run it
against a bunch of different implementations, and see if anything
crazy happens.  What happens with an early packet loss, what happens
with persistent packet losses, with different lengths of connection
freezes, and so on. That&#039;s also why the simulator has a facility for
generating trace files directly. It&#039;s rather useful for generating
graphs like this, and having them pre-split by connection.

&lt;h2&gt;Implementation&lt;/h2&gt;

&lt;p&gt;
This program was a bit of an experiment for me. On one hand it&#039;s
pretty similar to a lot of programs we write at work, and could have
shared a lot of code with the rest of the codebase. On the other hand
(as I&#039;ve written before) once a bit of code has been embedded in a
monorepo, it can be really hard to extract it out.

&lt;p&gt;
So I set out to write this in a separate repository, preferring open
source libraries instead of internal ones. In places where I wanted to
use our existing code I just copied it over. If it felt ugly or out of
place (as old but working code so often does), I rewrote it with no
regard for keeping compatibility with all the existing clients of that
code. You could say that the goal with this was to see whether life
would be more pleasant if some of that old baggage was replaced.

&lt;p&gt;
Some of those experiments were:

&lt;ul&gt;
  &lt;li&gt; Protocol buffers for configuration instead of our existing
    JSON infrastructure.
  &lt;li&gt; libev for the event loop, timers and signal handling instead
    of our in-house equivalents.
  &lt;li&gt; CMake for building.
  &lt;li&gt; More aggressive use of C++11 features than we can use in
    our real code, due to needing to support gcc 4.4.
&lt;/ul&gt;

&lt;h3&gt;Configuration&lt;/h3&gt;

&lt;p&gt;
Most of my time at Google was spent working on an in-house programming
language used for configuration &lt;a href=&#039;#ftnt1&#039;
name=&#039;ftnt_ref1&#039;&gt;[1]&lt;/a&gt;. It&#039;d take in 10KLOC programs
split across multiple files that described the configurations of
systems, compile them to 100KLOC protocol buffers, and those protobufs
would then be used as the actual configuration of record. So I&#039;m
pretty comfortable with using protocol buffers for configuration, but
less so for using the text protocol buffer format for it. There&#039;s a
reason configuration ended up requiring a separate language there. But
this is a simple task where configs are unlikely to be anywhere near
that size, so perhaps ascii protobufs would work just fine?

&lt;p&gt;
The protocol buffer text format actually turned out to be pretty
pleasant to use for this, which makes some sense given how similar it
is to JSON. One difference that didn&#039;t really matter is the lack of
maps. I never want them in my configs, so that&#039;s ok. The second big
difference was lists vs. repeated fields. Repeated fields of scalar
values are indeed kind of awkward. Compare this:

&lt;pre&gt;
&quot;a&quot;: [1, 2, 3]
&lt;/pre&gt;

Versus this:

&lt;pre&gt;
a: 1
a: 2
a: 3
&lt;/pre&gt;

&lt;p&gt;
But I rarely have lists of that sort in a configuration. It&#039;s always
a collection of some kind of complex compound structures. And there
the repeated field syntax feels more natural since it removes so much
of the noise.

&lt;pre&gt;
&quot;profiles&quot;: [{
  &quot;id&quot;: &quot;p80&quot;,
  &quot;filter&quot;: &quot;tcp and port 80&quot;,
  ...
}, {
  &quot;id&quot;: &quot;p443&quot;,
  &quot;filter&quot;: &quot;tcp and port 443&quot;,
  ...
}]
&lt;/pre&gt;

&lt;pre&gt;
profile {
  id: &quot;p80&quot;
  filter: &quot;tcp and port 80&quot;
}
profile {
  id: &quot;p443&quot;
  filter: &quot;tcp and port 443&quot;
  ...
}
&lt;/pre&gt;

&lt;p&gt;
What didn&#039;t end up working well was accessing the protocol buffers
using the generated code. It&#039;s of course great for accessing the
contents of the configuration tree in its raw form. But it turns out I
rather like for the reified configuration objects to have some smarts
built into them, not be just a pile of data. Here&#039;s what the protobuf
schema for the profiles looks like:

&lt;pre&gt;
message FlowDisruptorProfile {
    required string id = 1;
    optional string filter = 2;
    ...
}
&lt;/pre&gt;

&lt;p&gt;
This is what the same definition fragment would look like for our
existing JSON schemas:

&lt;pre&gt;
DECLARE_FIELD(string, id, &quot;&quot;)
DECLARE_FIELD(config_packet_filter_t, filter, config_packet_filter_t())
...
&lt;/pre&gt;

&lt;p&gt;
Yes, that&#039;s some ghetto syntax &lt;a href=&#039;#ftnt2&#039;
name=&#039;ftnt_ref2&#039;&gt;[2]&lt;/a&gt;. But the real difference is that
we&#039;ve been able to declare a more useful data type for
the &lt;code&gt;filter&lt;/code&gt; field than just string. When we parse a JSON
message with this field, the &lt;code&gt;filter&lt;/code&gt; field is
automatically parsed as a pcap filter, compiled to BPF, and the actual
compiled BPF program is then accessible through the config object. If
compilation fails, parsing of the JSON object as a whole fails. If a
configuration is copied or two configurations are merged together,
deep copies of the compiled filters will be done as appropriate.
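
&lt;p&gt;
To make that concrete, here is a rough sketch (not our actual code; all
names are invented) of a filter field type that compiles its pcap
expression to BPF when it is constructed, so that a bad filter string
fails the parse of the whole config:

&lt;pre&gt;
#include &amp;lt;pcap.h&amp;gt;
#include &amp;lt;stdexcept&amp;gt;
#include &amp;lt;string&amp;gt;

// Hypothetical config field type: the pcap expression is compiled to BPF
// at construction time, so a bad filter fails the whole config parse.
class PacketFilter {
public:
    explicit PacketFilter(const std::string&amp; expr) {
        // Compile against a dummy pcap handle; no capture device needed.
        pcap_t* dead = pcap_open_dead(DLT_EN10MB, 65535);
        if (pcap_compile(dead, &amp;prog_, expr.c_str(), 1,
                         PCAP_NETMASK_UNKNOWN) != 0) {
            std::string err = pcap_geterr(dead);
            pcap_close(dead);
            throw std::runtime_error(&quot;bad filter &#039;&quot; + expr + &quot;&#039;: &quot; + err);
        }
        pcap_close(dead);
    }
    ~PacketFilter() { pcap_freecode(&amp;prog_); }

    PacketFilter(const PacketFilter&amp;) = delete;
    PacketFilter&amp; operator=(const PacketFilter&amp;) = delete;

    const bpf_program* bpf() const { return &amp;prog_; }

private:
    bpf_program prog_;
};
&lt;/pre&gt;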

&lt;p&gt;
As far as I can tell this isn&#039;t really possible with the protocol
buffer solution. As soon as any data gets some slightly more complex
semantic meaning, you need a separate and manual validation layer. Or
even worse, as in this case, where we don&#039;t just want to validate the
data but actually want to generate a complex domain object from it at
parse time. This means there will actually be two parallel hierarchies of
objects, one for the raw protobuf Messages and another for the domain
objects. And these conversions aren&#039;t really amenable to any kind of
automation through e.g. reflection.

&lt;p&gt;
It&#039;s not just an issue with filters either; that&#039;s just an example. In
the real application we have custom field types for all kinds of
networking concepts. IP addresses, MAC addresses, subnets, and so
on. (And it&#039;s absolutely trivial to define new ones).

&lt;p&gt;
This isn&#039;t too big a deal in this program since it&#039;s just that one
field right now. It&#039;d be incredibly annoying in our real product that
has tens of config variables that require this kind of handling.
Manually maintaining the raw config vs. domain object mappings for all
of them would be painful and a source of lots of errors.

&lt;p&gt;
So this part didn&#039;t turn out at all like I expected. The part that I
was expecting to have some problems with was really nice at least
given my config sizes. The part that I thought protobufs would work
well with turned out to kind of suck. This is, of course, not a
problem with protocol buffers but with me trying to use them for the
wrong job. But it&#039;s strange that I never noticed this mismatch before.

&lt;a name=&#039;timers&#039;&gt;&lt;/a&gt;
&lt;h3&gt;Timers and other events&lt;/h3&gt;

&lt;p&gt;
An event loop that integrates all kinds of events is certainly easy to
program for. The event types in this program were timers, IO on fds,
and signal handlers (run synchronously as part of the event loop, even
if the signal was asynchronous). The bits of libev I didn&#039;t need right now
are things I could at least imagine needing at some point.

&lt;p&gt;
That&#039;s not how we write our traffic handling loops at work though. At
the innermost level you&#039;ll have poll-mode packet handling, running on
multiple interfaces and reading and handling at most a small number
(e.g. 10) packets from each interface in one go. Go one layer outward,
and we have a loop that interleaves the above packet handling with
updating our idea of time and running timers if necessary &lt;a href=&#039;#ftnt3&#039;
name=&#039;ftnt_ref3&#039;&gt;[3]&lt;/a&gt;. That loop
runs for a predetermined amount of time (e.g. 10ms) before yielding to
operational tasks for a short while. For example updating certain
kinds of counters that are sample-based rather than event based
(e.g. CPU utilization or NIC statistics), restarting child processes
that have died, or handling RPCs. But of course at most one RPC is
handled before returning back to traffic handling.
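
&lt;p&gt;
A heavily simplified sketch of that loop structure (every name and number
here is invented for illustration):

&lt;pre&gt;
#include &amp;lt;chrono&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;vector&amp;gt;

using Clock = std::chrono::steady_clock;

struct Interface {
    // Poll the NIC and handle at most max_packets frames; stubbed out here.
    void poll_and_handle(int max_packets) { (void)max_packets; }
};

void run_io_loop(std::vector&amp;lt;Interface&amp;gt;&amp; interfaces,
                 const std::function&amp;lt;void()&amp;gt;&amp; run_due_timers,
                 const std::function&amp;lt;void()&amp;gt;&amp; operational_work) {
    for (;;) {
        // Inner loop: interleave poll-mode packet handling with timer
        // processing for a fixed slice (e.g. 10ms)...
        auto deadline = Clock::now() + std::chrono::milliseconds(10);
        while (Clock::now() &amp;lt; deadline) {
            for (Interface&amp; iface : interfaces)
                iface.poll_and_handle(10);  // small batch per interface
            run_due_timers();
        }
        // ...then yield briefly to operational tasks (sampled counters,
        // restarting dead children, at most one RPC reply) before going
        // back to traffic handling.
        operational_work();
    }
}
&lt;/pre&gt;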

&lt;p&gt;
As far as I know, this kind of setup isn&#039;t uncommon for
single-threaded networking appliances. You need careful control of the
event loop to make sure that work is batched together in sensibly
sized chunks and that you&#039;re not away from the actual work for too
long.

&lt;p&gt;
I couldn&#039;t figure out whether there would be a way to coerce libev into
a structure like this. There are some options there for embedding
event loops inside other event loops, but it feels like it&#039;d be very
ugly if it worked at all. So it&#039;s a really nice library, but probably
not the right match for the application. (See also the digression
about memory allocation later).

&lt;h3&gt;Closures and auto&lt;/h3&gt;

&lt;p&gt;
The way I ended up using libev was with a couple of
&lt;a href=&#039;https://github.com/teclo/flow-disruptor/blob/master/src/state.h#L34&#039;&gt;
trivial wrapper classes&lt;/a&gt;, where the handler callback was passed in
as a C++11 closure. (Yes, yes, every other language has had this since
the &#039;70s. Doesn&#039;t mean it&#039;s any less nice to finally have it in this
environment too).

&lt;p&gt;
It&#039;s absolutely dreamy for defining timers, which need a callback
function and some kind of state for the callback to operate on,
usually passed in as an argument. So the difference is essentially
between declaring a timer in a class like this:

&lt;pre&gt;
    Timer tick_timer_;
&lt;/pre&gt;

&lt;p&gt;
And initializing it like this:

&lt;pre&gt;
      tick_timer_(state, [this] (Timer*) { tick(); }),
&lt;/pre&gt;

&lt;p&gt;
Versus the relative tedium of defining a new class with the callback +
the data as a poor man&#039;s closure, or by defining a separate trampoline
function for every kind of timer.
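
&lt;p&gt;
For reference, a minimal sketch of what such a wrapper might look like
(a simplification with invented details, not the code linked above):

&lt;pre&gt;
#include &amp;lt;ev.h&amp;gt;
#include &amp;lt;functional&amp;gt;

// A libev timer whose callback is a C++11 closure.
class Timer {
public:
    Timer(struct ev_loop* loop, std::function&amp;lt;void(Timer*)&amp;gt; callback)
        : loop_(loop), callback_(callback) {
        ev_init(&amp;watcher_, &amp;Timer::trampoline);
        watcher_.data = this;
    }
    ~Timer() { ev_timer_stop(loop_, &amp;watcher_); }

    void start_after(double seconds) {
        ev_timer_set(&amp;watcher_, seconds, 0.0);
        ev_timer_start(loop_, &amp;watcher_);
    }

private:
    // libev calls this C-style callback; it just bounces into the closure.
    static void trampoline(struct ev_loop*, ev_timer* w, int) {
        Timer* self = static_cast&amp;lt;Timer*&amp;gt;(w-&amp;gt;data);
        self-&amp;gt;callback_(self);
    }

    struct ev_loop* loop_;
    ev_timer watcher_;
    std::function&amp;lt;void(Timer*)&amp;gt; callback_;
};
&lt;/pre&gt;

&lt;p&gt;
With something like that in place, the initializer shown above just
passes in a lambda that captures &lt;code&gt;this&lt;/code&gt; and calls whatever
member function should run when the timer fires.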

&lt;p&gt;
And after having gone this route for event handlers, it was then very
easy to slip into doing it all over the place. Instead of maintaining
a vector of some sort of records that are later acted on, just
maintain a vector of closures instead that do the right thing when
called. For example the token bucket-style bandwidth throttler
implementation doesn&#039;t know anything about packets. It just gets costs
and opaque functions as input, and calls the function once the cost
can be covered.
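
&lt;p&gt;
A sketch of that shape of API (again with invented names, not the actual
implementation):

&lt;pre&gt;
#include &amp;lt;deque&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;utility&amp;gt;

// A token bucket that knows nothing about packets: callers queue a cost
// plus a closure, and the closure runs once enough tokens have
// accumulated to cover the cost.
class Throttler {
public:
    explicit Throttler(double tokens_per_second) : rate_(tokens_per_second) {}

    void enqueue(double cost, std::function&amp;lt;void()&amp;gt; action) {
        queue_.push_back(std::make_pair(cost, action));
    }

    // Called periodically from the event loop with the elapsed time.
    void advance(double elapsed_seconds) {
        tokens_ += rate_ * elapsed_seconds;  // a real one would cap this
        while (!queue_.empty() &amp;&amp; tokens_ &amp;gt;= queue_.front().first) {
            tokens_ -= queue_.front().first;
            std::function&amp;lt;void()&amp;gt; action = queue_.front().second;
            queue_.pop_front();
            action();  // e.g. &quot;send this packet now&quot;
        }
    }

private:
    double rate_;
    double tokens_ = 0.0;
    std::deque&amp;lt;std::pair&amp;lt;double, std::function&amp;lt;void()&amp;gt; &amp;gt; &amp;gt; queue_;
};
&lt;/pre&gt;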

&lt;p&gt;
There&#039;s a small problem with the event handler definitions, since you
need to allocate some memory for the closure in a separate block (the
value cells for the closed over variables need to live
somewhere). It&#039;s not obvious to me that any amount of template magic
would allow working around that. And we tend to be a bit paranoid
about lots of tiny separate allocations. So it&#039;s something I want, but
maybe not something I can actually justify using.

&lt;p&gt;
The new &lt;code&gt;for&lt;/code&gt;-loop and &lt;code&gt;auto&lt;/code&gt; are just pure win,
again not very surprising. I used both a lot, and even returning to
the code now a year later I didn&#039;t really see any places where the
shorthands make the code harder to understand.

&lt;h3&gt;Build system&lt;/h3&gt;

&lt;p&gt;
  I&#039;ve found that build systems sucking and being hated by all the
  users is a good baseline assumption to make. I didn&#039;t hate CMake
  though.  In fact it seems perfect for a C++ program of this size. It
  came with enough batteries included to do everything I wanted to,
  and the (horribly named) CMakeLists.txt file for this project has
  very little boilerplate. It even appears to do transitive
  propagation of compiler / linker flags through dependencies in a
  sane manner, which is a problem that usually drives me completely
  nuts.

&lt;p&gt;
  I wasn&#039;t able to build a mental model of how CMake works though.  At
  least there was never any hope of being able to correctly guess how
  to do something new, or change the behavior of something. It was
  always off to a web search for vaguely plausible keywords. So my
  first impression was that it&#039;s a collection of very specialized bits
  of functionality, and once you have to do anything that&#039;s not
  already supported, the complexity shoots up. But clearly it&#039;s
  worth giving CMake a good look the next time I want to burn our rat&#039;s
  nest of recursive makefiles to the ground and rebuild things.

&lt;h3&gt;Other experiments&lt;/h3&gt;

&lt;p&gt;
  I tried a couple of other funky things like using protocol buffers
  to represent the parsed packet headers rather than doing the usual
  trick of just casting the data to a packed C &lt;code&gt;struct&lt;/code&gt;.
  Not sure it really bought anything in the end; I can&#039;t figure out
  what the motivation was back when I wrote the code. Maybe to use
  them as the disk serialization format for traces?  But that&#039;s
  obviously stupid, since all of our analysis tools already work on
  pcap format. Or maybe I wanted to print some debug output, but
  didn&#039;t want to import our packet pretty-printer and all of its
  dependencies into this project?  (The curse of the monorepo, again).

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;
As I mentioned before, this tool was built for a specific purpose. But
when I described it to someone last week, they came up with a
completely different use that might be very interesting (though it&#039;ll
require a bit of extra work to automatically create some much more
intricate scenario definitions, so we&#039;ll see how practical it turns
out in practice). So maybe this has slightly wider applicability than
I initially thought. If you come up with a cool new use for the tool,
I&#039;d love to know about it.

&lt;h3&gt;Footnotes&lt;/h3&gt;

&lt;div class=&#039;footnotes&#039;&gt;
&lt;p&gt;
&lt;a href=&#039;#ftnt_ref0&#039; name=&#039;ftnt0&#039;&gt;[0]&lt;/a&gt;
Ok, it&#039;s plausible that the RTT sensitivity only kicks in after
a fairly high threshold. Say 2 seconds. So to be absolutely sure
you&#039;d need another test case with an even lower bandwidth limit,
to force all of the connections into the 3-4 second territory.

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref1&#039; name=&#039;ftnt1&#039;&gt;[1]&lt;/a&gt;
borgcfg / GCL

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref2&#039; name=&#039;ftnt2&#039;&gt;[2]&lt;/a&gt;
It gets translated to C++ using a couple of &lt;code&gt;cpp&lt;/code&gt;
invocations, which rather sharply limits the syntax. Who has time for
a real schema compiler when there&#039;s a product to be shipped?

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref3&#039; name=&#039;ftnt3&#039;&gt;[3]&lt;/a&gt;
I was once explaining the guts of our system to some people to
figure out how we might integrate their system and ours. And I could
just sense the disapproval when I admitted that we didn&#039;t have a bound
on the number of timers that could get triggered in a single timer
tick. How irresponsible, we have no idea of how long we might go
between calls to the packet processing layer of the IO loop!

&lt;/div&gt;

</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Thu, 01 Oct 2015 15:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-10-01-flow-disruptor/</guid></item><item><title>Mobile TCP optimization - lessons learned in production</title><link>https://www.snellman.net/blog/archive/2015-08-25-tcp-optimization-in-mobile-networks/</link><description>
&lt;p&gt;

I did a keynote presentation at the &lt;a
href=&#039;http://conferences.sigcomm.org/sigcomm/2015/hotmiddlebox.php&#039;&gt;SIGCOMM&#039;15
HotMiddlebox workshop&lt;/a&gt;, &amp;quot;Mobile TCP optimization -
Lessons Learned in Production&amp;quot;. The title was set before
I had any idea of what I&#039;d really be talking about, just that it&#039;d be
about some of the stuff we&#039;ve been working on at &lt;a
href=&#039;https://www.teclo.net/&#039;&gt;Teclo&lt;/a&gt;. So apologies if the content
isn&#039;t an exact match for the title.

&lt;p&gt;
This post contains my slides, interleaved with my speaker&#039;s notes for
that slide. It won&#039;t be an exact transcription of what I actually
ended up saying, they were just written to make sure that I had at
least something coherent to say re: each slide. We&#039;ve got an
endless supply of network horror story anecdotes, and I can&#039;t actually
remember which ones I ended up using in the talk :-/

&lt;p&gt;
I&#039;m particularly happy that my points
on &lt;a href=&#039;https://www.snellman.net/blog/archive/2015-08-25-tcp-optimization-in-mobile-networks/#slide-7&#039;&gt;transparency of optimization&lt;/a&gt; got a positive
reception. To us it&#039;s a key part of making optimization be a good
networking citizen, and has seemingly been getting short shrift so
far. Hilariously the other TCP optimization talk at the workshop
brought up a transparency issue we&#039;d never had to consider, lack of
MAC transparency causing a Wifi security gateway to think connections
were being spoofed.

&lt;p&gt;
Thanks to Teclo for letting me talk about some of this stuff publicly,
and to everyone who attended HotMiddlebox. It was a lot of fun, and I
got a bunch of useful information from the hallway discussions.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h2&gt;Table of Contents&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-1&#039;&gt;Introduction&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-2&#039;&gt;Background&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-3&#039;&gt;Implementation 1/2&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-4&#039;&gt;Implementation 2/2&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-5&#039;&gt;TCP optimization&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-6&#039;&gt;An optimized connection&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-7&#039;&gt;Transparency&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-8&#039;&gt;Simple optimizations&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-9&#039;&gt;Speedups&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-10&#039;&gt;Buffer management&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-11&#039;&gt;Effect on RTTs and packet loss&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-12&#039;&gt;Burst control&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-13&#039;&gt;Things we learned along the way&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-14&#039;&gt;Don&#039;t rely on hardware features&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-15&#039;&gt;Two mobile networks are never equal&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-16&#039;&gt;Reordering&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-17&#039;&gt;Strange packet loss patterns&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-18&#039;&gt;Bad or conflicting middleboxes&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-19&#039;&gt;O&amp;amp;M is a lot of work&lt;/a&gt;
&lt;/ul&gt;

&lt;h2&gt;Presentation&lt;/h2&gt;

&lt;a id=&#039;slide-0&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-0.png&#039; /&gt;

&lt;p&gt;
  Hi, good morning everyone. I&#039;m Juho, and I&#039;ll be talking about the
  mobile TCP optimization system we&#039;ve been working on at Teclo
  Networks.

&lt;a id=&#039;slide-1&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-1.png&#039; /&gt;

&lt;p&gt;
  I&#039;ll start with a tiny bit of background on the product, then show how
  it works, how we think about TCP optimization and some results. And
  finally I&#039;ll go through some of the things we learned while
  growing this from a prototype to a product that can be deployed in
  real operator networks.

&lt;a id=&#039;slide-2&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-2.png&#039; /&gt;

&lt;p&gt;
  We&#039;re a Zurich based startup that&#039;s been working on TCP optimization
  for about 5 years now, with the first production deployments over 4
  years ago so we&#039;ve got a bit of experience at this point. We&#039;ve been
  in live traffic in around 50 mobile networks with about 20 commercial
  deployments. This includes all kinds of radio technologies (2G, 3G,
  LTE, WiMAX, CDMA) and anything from small MVNOs with 100Mbps of
  traffic to multi-site installations at major operator groups with
  100Gbps of traffic.

&lt;a id=&#039;slide-3&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-3.png&#039; /&gt;

&lt;p&gt;
  This is all done with standard hardware, normal Xeon CPUs and Intel
  82580 or 82599 NICs. The only exotic components in our typical setup
  are NICs with optical bypass for failover. This scales up to 10
  million connections and 20Gbps of optimization in a single 2U box.

&lt;p&gt;
  Our normal method of integration is to function as a bump in the
  wire with no L2/L3 address, preferably on the Gi link next to the
  GGSN. This is the last point in the core network that deals with raw IP
  before it gets GTP encapsulated. So we have two network ports, one
  is connected to the GGSN and the other is connected to the next hop
  switch. From their point of view it&#039;s just a very smart piece of
  wire.

&lt;a id=&#039;slide-4&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-4.png&#039; /&gt;

&lt;p&gt;
  We&#039;ve got a completely custom user space TCP stack. We started from
  scratch rather than from an existing implementation since our method
  of splitting the connection into two separate parts without
  sacrificing transparency would be hard to retrofit into an existing
  stack. Packet IO is done with our own user space NIC drivers;
  basically map the PCI registers and a big chunk of physical memory for
  frame storage, and manipulate the NIC rx and tx descriptor rings
  directly. It&#039;s not a lot of code, less than 1000 lines, and has some
  really nice properties like a complete zero copy implementation even for
  packets that we buffer for arbitrary amounts of time.

&lt;p&gt;
  The operating system is only involved with the control plane, the
  data plane is all in user space. One reason for that is obviously
  performance, but we think it&#039;s also a big win all around. Everything
  is always so much easier when you&#039;re working in user space;
  programming, debugging, testing, deployment.

&lt;a id=&#039;slide-5&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-5.png&#039; /&gt;

&lt;p&gt;
  So what do I mean by TCP optimization?

&lt;a id=&#039;slide-6&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-6.png&#039; /&gt;

&lt;p&gt;
  An optimized connection looks something like this. We pass the
  initial handshake through unmodified; yellow SYN from client to
  server, yellow SYNACK, and the final ACK in blue. Up to this point
  we&#039;re just a totally transparent network element. If there&#039;s
  anything odd about the connection setup, we&#039;ll just leave that
  connection unoptimized and continue forwarding any packets straight
  through.

&lt;p&gt;
  But in this case everything went fine, so from that point
  on we&#039;ll ACK any data packets and take responsibility for delivering
  them. So here in green we have the request; we send an ACK to the
  client, and the segment toward the server. The server sends the
  first part of the response, which we ACK, and it then sends a new batch
  of data to us.

&lt;p&gt;
  So it&#039;s kind of a hybrid. Not a terminating split-TCP proxy, but
  thanks to acknowledging data is a lot more effective than a Snoop
  proxy.

&lt;a id=&#039;slide-7&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-7.png&#039; /&gt;

&lt;p&gt;
  Since we don&#039;t terminate the connection, we can be fully transparent
  in TCP options and sequence numbers. This provides us with some
  really nice advantages.

&lt;p&gt;
  First, we can stop optimizing connections at almost any time without
  breaking them. As long as we have no undelivered data buffered for
  the connection, the endpoints agree on the connection state and can
  just pick it up. This means we can have pretty short idle timeouts,
  a couple of minutes rather than 15 minutes. And if something odd
  happens? Just stop optimizing. This is also really nice for
  upgrades; we just stop optimizing connections, wait about a minute
  for all the buffers to drain, and take the system to bypass without
  interruption of service.

&lt;p&gt;
  It deals very nicely with asymmetric routing. Sometimes something
  goes wrong with the integration, and we only get the uplink or
  downlink packets for some of the traffic. If we see only the SYN and
  the ACK but not the SYNACK, we&#039;ll just skip optimizing the
  connection. The same if there&#039;s a loopback route where we
  see the SYN twice, once in each direction.

&lt;p&gt;
  One problem with middleboxes is that they make it hard to introduce
  new TCP options, say multipath TCP or TCP fast open. Terminating
  middleboxes will essentially eat the unknown options. In our
  design a SYN with unknown options will just be passed straight through
  and we&#039;ll let the endpoints take care of the rest. There&#039;s a closely
  related issue of protocols that claim in the IP header to be TCP in
  order to bypass firewalls, but actually aren&#039;t. Again these would be
  broken by termination.

&lt;p&gt;
  Finally, terminating proxies will often end up with a different MSS
  on the two sides of the connection. So it&#039;s supposed to send packets
  of at most 1380 bytes toward the client, but the server sends it
  data in chunks of 1460 bytes. This repacketization increases load,
  but also increases protocol overhead when packets are split
  suboptimally. In the worst case there will be a substantial amount
  of tiny segments in the middle of the TCP flow, which is problematic
  for a lot of mobile networks. In our design the segments can just be
  passed through as-is, with at most a bit of tweaking to the IP and
  TCP headers.

&lt;a id=&#039;slide-8&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-8.png&#039; /&gt;


&lt;p&gt;
  Some of the optimizations we can do are standard fare. Latency
  splitting speeds up the initial phase of the connection, especially
  if the server is old and still has an initial congestion window of
  2/3/4. Likewise if the connection is bottlenecked on the receive
  window, reducing the effective latency improves steady state
  throughput.

&lt;p&gt;
  If there&#039;s packet loss, having the retransmission happen nearer to
  the edge means we react faster to it. We can also make better
  decisions thanks to knowing it&#039;s a radio network and applying some
  heuristics to the packet loss patterns (more on that later). We
  don&#039;t have any fancy congestion control algorithm, our experience is
  that it&#039;s just not an area where you can gain a lot.

&lt;p&gt;
  One thing you have a lot in mobile networks is either the radio
  uplink or downlink freezing completely for very long times relative
  to normal RTT. This triggers a lot of bogus retransmit timeouts in
  vanilla TCP stacks. We never use retransmit timers, and detect tail
  losses using probing instead. This allows faster recovery in cases
  where the full window was really lost. There&#039;s also no risk of
  misinterpreting an ACK of the original data as an ACK of the
  retransmitted data, which might cause confusion otherwise.

&lt;a id=&#039;slide-9&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-9.png&#039; /&gt;

&lt;p&gt;
  How well does it work? Here&#039;s some results from a trial in a
  European LTE network last winter. These are average throughput
  numbers for downloads, bucketed by transfer size. Optimized results
  in blue, unoptimized in red. What we see is little if any
  acceleration for tiny files (there&#039;s not much scope for optimization
  when all the data fits in the initial window), and anywhere from
  10-40% speedups for larger transfers.

&lt;p&gt;
  Just to clarify, these are results from live traffic rather than any
  kind of synthetic testing, looking at the throughput of all TCP
  connections going through the operator&#039;s network with alternating days
  of optimizing 100% of the traffic with days of optimizing none of it.

&lt;p&gt;
  (N.B. Measurement seemed to be a somewhat sensitive issue, with
  doubts expressed regarding whether we were really doing it
  in a statistically robust manner, and whether averages are really
  a sensible way to represent the information.

&lt;p&gt;
  Performance measurements in mobile networks are indeed a harder
  problem than you&#039;d think. Unfortunately it&#039;s also a subject I could
  talk about for an hour and this was a 40 minute talk that needed to
  cover a lot more ground. I might need to write a separate post on
  this subject.

&lt;p&gt;
  But suffice to say that we do the live traffic measurements in a way
  that tries to minimize effects from weekday / day of month trends,
  and over a long enough period of time that the diurnal cycle is
  irrelevant. Averages are indeed not a good measure from an academic or
  even engineering point of view, but using anything else is miserable
  commercially. You don&#039;t want to spend the first 30 minutes of a
  meeting by explaining exactly how to interpret probability density
  function or CDF graphs.)

&lt;a id=&#039;slide-10&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-10.png&#039; /&gt;

&lt;p&gt;
  Here&#039;s a more interesting optimization: buffer management, which is
  our feature for mitigating buffer bloat. Mobile networks are usually
  tuned to prefer queueing over dropping packets. And internet servers
  in turn are always going to ignore queueing as a congestion signal,
  since RTT-based congestion control schemes will always lose out in
  practice due to the tragedy of the commons. So the queues will be
  filled to the brim even in pretty normal use. The most extreme case
  we&#039;ve seen had queues of 30 seconds. Obviously unusable for
  anything, but even a few hundred milliseconds of extra RTT will
  make interactive use painful.

&lt;p&gt;
  Now, in mobile networks these queues are almost always per-user
  rather than per-flow or global. So what we do is handle all of the
  TCP flows of a single user as a unit. We determine the amount of
  data in flight across all flows that both keeps RTTs at an
  acceptable level and doesn&#039;t starve the radio network of
  packets. When the conditions change, we adjust that estimate. This
  quota of in-flight bytes is then split between all the flows of
  the user, and we give all connections their fair share of it. So the
  batch download won&#039;t completely crowd out the web browsing.
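
&lt;p&gt;
  Here&#039;s a rough sketch of just the quota-splitting step. This isn&#039;t our
  actual code; the Flow structure and split_user_quota are made-up names,
  and estimating the per-user budget itself is a separate problem that is
  simply passed in as a number here:

&lt;pre&gt;
// Split a per-user in-flight byte budget fairly across the flows of that
// user: every flow gets at most an equal share, and whatever the light
// flows do not need is handed to the flows that could use more.
#include &lt;algorithm&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;

struct Flow {
    uint32_t demand;   // bytes the flow could usefully keep in flight
    uint32_t quota;    // bytes the flow is allowed to keep in flight
};

void split_user_quota(std::vector&lt;Flow&gt;&amp; flows, uint32_t user_budget) {
    if (flows.empty()) return;
    uint32_t fair_share = uint32_t(user_budget / flows.size());
    uint32_t leftover = 0;
    // First pass: nobody gets more than an equal share of the budget.
    for (Flow&amp; f : flows) {
        f.quota = std::min(f.demand, fair_share);
        leftover += fair_share - f.quota;
    }
    // Second pass: redistribute what the light flows did not need, so a
    // bulk download can use spare capacity without crowding out the rest.
    for (Flow&amp; f : flows) {
        if (leftover == 0) break;
        uint32_t extra = std::min(leftover, f.demand - f.quota);
        f.quota += extra;
        leftover -= extra;
    }
}
&lt;/pre&gt;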

&lt;p&gt;
  This is independent of per-flow congestion control; there are some
  practical reasons why you only want to do packet-loss-based
  congestion control on a flow level. It won&#039;t work when done
  per-subscriber.

&lt;a id=&#039;slide-11&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-11.png&#039; /&gt;

&lt;p&gt;
  How well does it work? Here are some results from the same network as
  the previous example. We have results from 4 days of testing,
  with time on the X axis. The samples in blue are from the days when
  all traffic was being optimized, the red samples from days with
  optimization turned off. The graph on the left has average RTT on
  the Y axis. On the optimized days it&#039;s pretty stable at around
  165ms, on the unoptimized days there&#039;s more variation and the
  averages are much higher, 320ms. So it&#039;s an almost 50% reduction in
  average RTT. On the right hand graph we have retransmission rates on
  the Y axis. For optimized days they are mostly under 1% (average
  0.8%) and for unoptimized mostly above 2% (average 2.6%). That&#039;s a
  roughly 70% reduction in retransmissions.

&lt;p&gt;
  This data is from the same set of testing as the results from a
  couple of slides back. So these RTT reductions aren&#039;t coming at the
  expense of performance. Instead we&#039;re getting big improvements in both
  RTT and in throughput.

&lt;a id=&#039;slide-12&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-12.png&#039; /&gt;

&lt;p&gt;
  Another thing we&#039;ve noticed is that even surprisingly modest bursts
  of traffic can cause packet loss, for example when traffic gets
  switched from 10G to 1G links. And there are all kinds of mechanisms
  in TCP that can cause the generation of such bursts: ACK bunching,
  losing a large number of consecutive ACKs, or losing a packet when
  the full receive window&#039;s worth of data is in flight (the delivery
  of the retransmitted packet will open up the full window in one go).

&lt;p&gt;
  Whatever the mechanism, it turns out that you don&#039;t want the TCP
  implementation to send out hundreds of kilobytes for a single
  connection in a few microseconds. Instead it&#039;s better to spread the
  transmits over a longer period of time. So not 200kB at once but
  instead 20kB at 1ms intervals. In one network this kind of pacing
  reduced the observed packet loss rate on large test transfers from
  over 1% to under 0.2%.
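
&lt;p&gt;
  As a sketch of the idea (not our actual implementation; Pacer,
  send_segment and schedule_tick are made-up stand-ins for the real
  transmit path and timer wheel):

&lt;pre&gt;
// Spread a backlog over time instead of emitting it as one burst: send at
// most chunk_bytes per tick, and re-arm a timer while data is still queued.
#include &lt;cstddef&gt;
#include &lt;deque&gt;

struct Segment { size_t len; /* headers + payload */ };

struct Pacer {
    std::deque&lt;Segment&gt; queue;                       // data waiting to go out
    static constexpr size_t chunk_bytes = 20 * 1024; // roughly 20kB per tick
    static constexpr int tick_ms = 1;                // one tick per millisecond

    void send_segment(const Segment&amp;) { /* hand off to the transmit path */ }
    void schedule_tick(int /*ms*/)    { /* arm a timer that calls on_tick() */ }

    void on_tick() {
        size_t sent = 0;
        while (!queue.empty() &amp;&amp; sent + queue.front().len &lt;= chunk_bytes) {
            sent += queue.front().len;
            send_segment(queue.front());
            queue.pop_front();
        }
        if (!queue.empty())
            schedule_tick(tick_ms);   // keep pacing until the backlog drains
    }
};
&lt;/pre&gt;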

&lt;a id=&#039;slide-13&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-13.png&#039; /&gt;

&lt;p&gt;
  Here are some of the things we learned the hard way.

&lt;a id=&#039;slide-14&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-14.png&#039; /&gt;

&lt;p&gt;
  Every time we depend on a hardware feature we end up regretting
  it. They can never be used to save on development effort, because
  next month there will be new requirements that the hardware feature
  isn&#039;t flexible enough to handle. You always need to implement a pure
  software fallback that&#039;s fast enough to handle production loads. And
  if you&#039;ve already got a good enough software implementation, why go
  through the bother of doing a parallel hardware implementation? The
  only thing that&#039;ll happen is that you&#039;ll get inconsistent
  performance between use cases that get handled in hardware vs. use
  cases that get handled in software. A few examples:

&lt;p&gt;
  The most common issue is needing to deal with more and more exotic forms
  of encapsulation. VLANs are fine, hardware will always support
  that. Double VLANs might or might not be fine. Some forms of fixed
  size encapsulation are fine. Multiple nested layers of MPLS, or GTP
  with its variable length header are a lot more problematic. A
  canonical example here is checksum offload for both rx and tx; there&#039;s
  hardware support that always eventually ends up being insufficient,
  and you can compute the checksums very fast with vector instructions
  on modern CPUs.
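
&lt;p&gt;
  The software fallback for checksumming really is simple; here&#039;s the
  textbook RFC 1071 one&#039;s complement sum in scalar form (a production
  version would do the same thing over wider chunks with vector
  instructions):

&lt;pre&gt;
// RFC 1071 style Internet checksum, usable when checksum offload is not
// available or not flexible enough. The folding at the end handles the
// carries from summing 16-bit words into a wider accumulator.
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

uint16_t internet_checksum(const uint8_t* data, size_t len) {
    uint64_t sum = 0;
    while (len &gt;= 2) {                              // sum 16-bit words
        sum += (uint64_t(data[0]) &lt;&lt; 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)                                        // odd trailing byte
        sum += uint64_t(data[0]) &lt;&lt; 8;
    while (sum &gt;&gt; 16)                               // fold the carries back in
        sum = (sum &amp; 0xffff) + (sum &gt;&gt; 16);
    return uint16_t(~sum);
}
&lt;/pre&gt;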

&lt;p&gt;
  We&#039;ve been using multiple RX queues for parallelization. So we
  assign one process to each core, with each one having a separate
  receive queue. We originally did this with basic RSS hashing, but
  that quickly became insufficient. Smart people would have stopped
  there. But we&#039;re idiots, and read the documentation on this wonderful
  semi-programmable traffic distribution engine built into the NICs
  we used, called the Flow Director. Up to 32k rules with various kinds
  of matching. Awesome.

&lt;p&gt;
  But then we needed to deal with
  virtualization. SR-IOV looks just perfect for our needs; the virtual
  machine gets its virtual slice of the network card that&#039;s for all
  intents and purposes identical to the real hardware. Except... It only
  has a maximum of 8 queues when physical hardware can do up to
  128. We need more parallelization than 8 queues. And of course
  encapsulation is an issue too; even the Flow Director isn&#039;t flexible
  enough for every use case we&#039;ve seen. So everything needs to be
  architected around just a single RX queue and doing the traffic
  distribution fully in software.

&lt;a id=&#039;slide-15&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-15.png&#039; /&gt;

&lt;p&gt;
  We initially thought that we&#039;d just write the system once, deploy it
  everywhere, and roll in money while doing no work. But it turns out
  that no two networks are quite the same. There are often new kinds of
  performance issues or the existing methods of integration don&#039;t work
  in a network. We love the first case from a commercial point of
  view, since every problem is an opportunity for a performance
  improvement. We hate the latter, since supporting new integration
  methods brings no real value; it&#039;s just a cost of doing
  business. But in both cases the end result is that there&#039;s a fair
  bit of code that&#039;s not exercised in normal use and is liable to code
  rot.

&lt;p&gt;
  Automated deterministic unit testing has been absolutely key in
  maintaining our sanity with this, and my only regret there is not
  planning for it from the first line of code and having to retrofit
  it in. Compared to the normal network programming workflow it&#039;s
  just such a huge productivity boost to be able to run hundreds of
  TCP behavior unit and regression tests in a few seconds. Things that
  get fixed stay fixed. We&#039;re also able to automatically generate IPv6
  test cases from our IPv4 tests. If we didn&#039;t do that, there&#039;s a good
  chance that our IPv6 support would rot away very quickly.

&lt;a id=&#039;slide-16&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-16.png&#039; /&gt;

&lt;p&gt;
  I&#039;ll give a couple of examples of performance issues I thought were
  interesting.

&lt;p&gt;
  The folklore around mobile networks is that there will never be any
  reordering. So our initial design was actually based on aggressively
  making use of that assumption. Turns out not to be true in
  practice. For equipment from one particular vendor we&#039;ve seen small
  packets get massively reordered ahead of large ones. We&#039;re talking about
  reordering by 30 segments or over 50ms. This is particularly bad if
  there&#039;s a terminating HTTP proxy in the mix, and MTU mismatches on the
  southbound and northbound connections cause the proxy to generate lots
  of small packets. And reordering is poison for TCP. So we had to
  develop special heuristics to detect and gracefully handle this case.

&lt;a id=&#039;slide-17&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-17.png&#039; /&gt;

&lt;p&gt;
  Then there&#039;s strange patterns of packet loss or packet corruption. I
  talked earlier about the burst control feature that mitigates
  problems caused by packet loss from 10G to 1G switching. But this
  one is maybe even more mysterious.

&lt;p&gt;
  One network was regularly losing some or all packets right at the
  start of the connection. So the handshake would go through, the request
  would go out, and then the response would get dropped somewhere in the
  RAN. Losing the initial window of packets is of course just about
  the worst thing you can do to TCP. And this only happened in one
  geographical region that was using a different radio vendor than the
  rest of the country. Our best guess was that it was somehow related
  to the 3G state machine transition from low power to high power
  mode.

&lt;p&gt;
  We never did find out exactly what the issue was, getting that
  sorted out was the operator&#039;s job. But again we needed some
  specialized code, this time to handle packet loss after a period of
  no activity differently from packet loss in the middle of a
  high-activity period.

&lt;a id=&#039;slide-18&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-18.png&#039; /&gt;

&lt;p&gt;
  Operator core networks can have absurd numbers of chained
  middleboxes. So you&#039;ll have a chain of a traffic shaper, a TCP
  optimizer, a video optimizer, an image and text compressor, caching
  proxies, a NAT and a firewall, all from different vendors. When
  something goes wrong, it can take a lot of effort to just locate
  which component is at fault. The default assumption always seems to
  be that it&#039;s the most recently added box, which to be fair is a
  pretty good heuristic to apply. Things are even worse if it&#039;s not
  strictly a problem with a single network element, but in the
  interactions of several nodes. And these middleboxes are configured
  once, probably not looked at again for a very long time unless someone
  notices a problem, and the combination is never tuned holistically.

&lt;p&gt;
  Maybe a canonical example of this is MTU clamping. In mobile
  networks you generally want a maximum MSS of 1380 to account for the
  GTP protocol overhead. Often this is done by making e.g. the
  firewall clamp the MTU. This works great until a terminating proxy
  is added south of the firewall, such that the clamping doesn&#039;t apply
  to the communication between the client and the proxy but does apply
  to the communication between the proxy and the server. This is
  exactly the wrong way around, and it&#039;s easy to miss since
  things will still work, just inefficiently.
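
&lt;p&gt;
  For reference, here&#039;s roughly what the clamping operation itself does,
  which is part of why it&#039;s so easy to apply it on the wrong leg and never
  notice. This is just a sketch; clamp_mss is a made-up name and the
  checksum fixup is left out:

&lt;pre&gt;
// Walk the TCP options of a SYN and lower any MSS option that exceeds the
// target value (e.g. 1380 for GTP encapsulated links).
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

void clamp_mss(uint8_t* tcp, size_t tcp_len, uint16_t max_mss) {
    size_t hdr_len = size_t(tcp[12] &gt;&gt; 4) * 4;       // data offset, in bytes
    if (!(tcp[13] &amp; 0x02) || hdr_len &gt; tcp_len)      // only SYNs carry MSS
        return;
    size_t off = 20;                                 // options follow the fixed header
    while (off + 1 &lt; hdr_len) {
        uint8_t kind = tcp[off];
        if (kind == 0) break;                        // end of option list
        if (kind == 1) { ++off; continue; }          // NOP
        uint8_t len = tcp[off + 1];
        if (len &lt; 2 || off + len &gt; hdr_len) break;   // malformed options
        if (kind == 2 &amp;&amp; len == 4) {                 // MSS option
            uint16_t mss = uint16_t((tcp[off + 2] &lt;&lt; 8) | tcp[off + 3]);
            if (mss &gt; max_mss) {
                tcp[off + 2] = uint8_t(max_mss &gt;&gt; 8);
                tcp[off + 3] = uint8_t(max_mss &amp; 0xff);
                // a real implementation also updates the TCP checksum here
            }
        }
        off += len;
    }
}
&lt;/pre&gt;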

&lt;p&gt;
  HTTP proxies are frequently really badly configured from a TCP
  standpoint, which makes some sense since that&#039;s not their core
  competence. A HTTP proxy is usually there to do a specific function
  like caching, compression or legal intercept. It&#039;s not there to
  optimize speed. But you&#039;ll see things like the proxy not using window
  scaling for the connection to the server, or still using an initial
  congestion window of 2. These are things that should not exist in
  2015.

&lt;p&gt;
  There are proxies that freeze the connection for a few seconds on
  receiving a zero window, which is normally not a big deal since zero
  windows are rare. But a bit of a problem if you have another device
  right next to the proxy ACKing the data - for example a TCP
  optimizer. (We had to develop a special mode that&#039;d never emit zero
  windows and instead do flow control through delaying ACKs
  progressively more and more as the receive buffers fill up). We even
  saw a HTTP proxy that had been configured to retransmit 15 segments
  instead of one segment on a retransmit timeout. Since RTOs caused by
  delays rather than full packet loss are of course really common in
  mobile, this proxy was spewing out amazing amounts of spurious
  retransmissions.
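
&lt;p&gt;
  The ACK delaying trick mentioned above is roughly this (a sketch only;
  ack_delay_ms is a made-up name, and the thresholds and the curve would
  be tuned per deployment):

&lt;pre&gt;
// Instead of advertising a zero window when the receive buffer is filling
// up, keep advertising a small window but delay the ACKs progressively, so
// the sender slows down without us ever emitting a zero window.
#include &lt;algorithm&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

uint32_t ack_delay_ms(size_t buffered, size_t buffer_size) {
    double fill = double(buffered) / double(buffer_size);   // 0.0 .. 1.0
    if (fill &lt; 0.5)
        return 0;                       // plenty of room, ACK immediately
    double over = (fill - 0.5) / 0.5;   // how far into the danger zone we are
    // Ramp up quadratically and cap the delay at 200 ms.
    return uint32_t(std::min(200.0, over * over * 200.0));
}
&lt;/pre&gt;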

&lt;p&gt;
  One funny interaction we see all the time is having a TCP optimizer
  right next to a traffic shaper. So you have one box whose job it is
  to speed things up, and another that tries to slow things down. It&#039;s
  an insane way of doing things - these two tasks should maybe be done
  by the same device.

&lt;a id=&#039;slide-19&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-19.png&#039; /&gt;

&lt;p&gt;
  I&#039;ve just been talking about the data plane in this presentation,
  since that&#039;s the part I&#039;m responsible for. But it&#039;s really important
  to note that the data plane alone is not a product you can sell.

&lt;p&gt;
  You need quite a lot of sophistication on the control plane too to
  get something that can be deployed in operator networks. It&#039;s a lot
  more work than one might think; in our case the management system
  probably took at least as much effort as the traffic handling, with
  three full rewrites and a fourth one ongoing. So there&#039;s a
  Juniper-style CLI for configuration, a web UI for simple
  configuration and statistics, a counter database, support for all
  kinds of protocols for getting operational data in and out of the
  management system, and so on.

&lt;p&gt;
  If anyone here is planning on turning research into a product, you
  need to budget a lot more time for this stuff than you think.

&lt;a id=&#039;slide-20&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-20.png&#039; /&gt;

&lt;p&gt;
  Thanks a lot for your attention, we should have some time
  for questions now. If you want to get in touch with me for some reason,
  here are my contact details.
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Tue, 25 Aug 2015 15:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-08-25-tcp-optimization-in-mobile-networks/</guid></item><item><title>Unit testing a TCP stack</title><link>https://www.snellman.net/blog/archive/2015-07-09-unit-testing-a-tcp-stack/</link><description>
&lt;p&gt;
Last year in an online discussion someone used in-kernel TCP stacks as
a canonical example of code that you can&amp;#39;t apply modern testing
practices to. Now, that might be true but if so the operative phrase
there is &amp;quot;in-kernel&amp;quot;, not &amp;quot;TCP stack&amp;quot;. When the
TCP implementation is just a normal user-space application,
there&amp;#39;s no particular reason it can&amp;#39;t be written in a way
that&amp;#39;s testable and amenable to a test driven development
approach.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
The first versions of &lt;a href=&#039;https://www.teclo.net&#039;&gt;Teclo&lt;/a&gt;&#039;s
TCP stack were written as a classic monolithic systems application
with lots of explicit and implicit global state, not as something that
could be treated as a library let alone something you could reasonably
mock out parts of. As such it was totally unsuited for any kind of
automated testing. The best you could do was run the system in various
kinds of simulated network environments and check that it was getting
roughly the same speeds from one release to the next. We&#039;ve also got
over 50 configuration parameters for tweaking the behavior of the TCP
algorithms, which would make for a hell of a test matrix. Repeated
manual testing of all these parameters would probably require a person
to do nothing but run those tests full time.

&lt;p&gt;
This was clearly not tenable in the long term, so getting some kind of
deterministic and automatable tests up was a pretty high priority. So
very soon after we were finished with the rush to ship a first
version, we refactored things a bit for better testability and
got at least some rudimentary tests up.

&lt;h3&gt;How we write tests&lt;/h3&gt;

&lt;p&gt;
What would make a TCP implementation particularly tricky to test?

&lt;p&gt;
Our TCP flow record has over 70 state variables and 10 timers (some of
which can interact with each other). And we need two of those records
for a single TCP connection, with the state of one flow potentially
affecting the behavior of the other &lt;a href=&#039;#ftnt1&#039;
name=&#039;ftnt_ref1&#039;&gt;[1]&lt;/a&gt;. With this much interlinked state it is hard
to feel confident about any testing that tries to artificially set up
only the relevant variables. Even if such setup is done correctly right
now, it would be very easy for those assumptions to break as the code
changes, invalidating the tests.

&lt;p&gt;
In general the appropriate unit of testing here is then the TCP stack
as a whole rather than e.g. somehow trying to test a feature like zero
window probing in isolation just by calling a method that implements
that feature. The latter would be an absurd idea, since interesting
TCP features end up being a lot more cross cutting than that. Of
course I don&amp;#39;t mean testing the application as a whole either. We
chop the application off at the core event loop, which would
normally handle polling the NICs for packets, update the system&amp;#39;s
idea of the current time between packets, run timers when appropriate,
and occasionally receive RPC messages from the management system. All
of this detail is luckily irrelevant for testing the core TCP
algorithms.

&lt;p&gt;
Instead for testing we create an instance of the TCP stack that
replaces the normal NIC-based IO backend with a callback based one. The
test driver will inject packets directly to the TCP stack by calling
the appropriate entry point. When the TCP stack wants to emit a
packet, that triggers a callback in the test driver and we can check
whether the contents of the packet are what were expected. Finally,
the other entry point to the TCP stack is implicit, through timer
callbacks. To handle this case, we need to replace the wall clock
based time source with a virtual one, and give the test driver the
responsibility for triggering it.
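
&lt;p&gt;
The rough shape of that arrangement might look something like the sketch
below. These are not our actual classes or names, just an illustration of
the seams involved: an IO interface the stack emits packets through, and a
time source the test driver can advance at will.

&lt;pre&gt;
#include &lt;cstdint&gt;
#include &lt;cstdlib&gt;
#include &lt;deque&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

struct Packet { std::vector&lt;uint8_t&gt; bytes; };

// The TCP stack only ever talks to these two interfaces.
struct IoBackend {
    virtual void emit(int interface, const Packet&amp; p) = 0;   // stack output
    virtual ~IoBackend() {}
};
struct TimeSource {
    virtual uint64_t now_us() const = 0;
    virtual ~TimeSource() {}
};

// Test driver implementations.
struct TestIo : IoBackend {
    std::deque&lt;std::string&gt; expected[2];        // per-interface expect queues

    void emit(int interface, const Packet&amp; p) override {
        // pretty_print() stands in for the packet-to-tcpdump-string code.
        std::string actual = pretty_print(p);
        if (expected[interface].empty() || expected[interface].front() != actual)
            std::abort();                       // the real driver reports file/line
        expected[interface].pop_front();
    }
    static std::string pretty_print(const Packet&amp;);
};

struct VirtualClock : TimeSource {
    uint64_t t_us = 0;
    uint64_t now_us() const override { return t_us; }
    void advance_us(uint64_t us) { t_us += us; } // the test driver owns time
};
&lt;/pre&gt;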

&lt;p&gt;
The second problem is expressing the test cases in a convenient
manner. The first thought here was expressing the test cases as pcap
format trace files &lt;a href=&#039;#ftnt2&#039;
name=&#039;ftnt_ref2&#039;&gt;[2]&lt;/a&gt;. The trace files would theoretically
have exactly right information: the exact packet contents and
microsecond accurate timing information for the test driver to work
by. This approach turns out not to be good for much. Artificial
test cases are very hard to create and update, debugging test failures
is painful, and testing for things other than the output packets
(e.g. counters) is impossible &lt;a href=&#039;#ftnt3&#039;
name=&#039;ftnt_ref3&#039;&gt;[3]&lt;/a&gt;. No, the test cases really need to be
expressed in code.

&lt;p&gt;
For that to work, there needs to be a simple way of describing packets
both for the purposes of generating packets as well as comparing
output packets to expectations. Now, we happened to have some code
around for pretty-printing many kinds of packets as JSON. That sounds
like a perfect tool for the job; we&#039;d just need a bit of code to
do the reverse operation. This is what the JSON looked like:

&lt;pre&gt;
{
    &amp;quot;ether&amp;quot;:{
        &amp;quot;source&amp;quot;:&amp;quot;31:32:33:34:35:36&amp;quot;,
        &amp;quot;dest&amp;quot;:&amp;quot;41:42:43:44:45:46&amp;quot;,
        &amp;quot;type&amp;quot;:&amp;quot;ip&amp;quot;
    },
    &amp;quot;ip&amp;quot;:{
        &amp;quot;version&amp;quot;:4,&amp;quot;hlen&amp;quot;:5,&amp;quot;tos.dscp&amp;quot;:0,&amp;quot;tos.ecn&amp;quot;:0,&amp;quot;len&amp;quot;:1040,&amp;quot;id&amp;quot;:0,
        &amp;quot;rf&amp;quot;:0,&amp;quot;df&amp;quot;:0,&amp;quot;mf&amp;quot;:0,&amp;quot;offs&amp;quot;:0,&amp;quot;ttl&amp;quot;:255,&amp;quot;proto&amp;quot;:&amp;quot;tcp&amp;quot;,&amp;quot;check&amp;quot;:0,
        &amp;quot;src&amp;quot;:&amp;quot;170.170.170.187&amp;quot;,&amp;quot;dst&amp;quot;:&amp;quot;8.0.0.8&amp;quot;
    },
    &amp;quot;tcp&amp;quot;:{
        &amp;quot;source&amp;quot;:80,&amp;quot;dest&amp;quot;:58999,&amp;quot;window&amp;quot;:48000,&amp;quot;check&amp;quot;:0,
        &amp;quot;seq&amp;quot;:1001000,&amp;quot;ack_seq&amp;quot;:1100,&amp;quot;urg_ptr&amp;quot;:0,
        &amp;quot;cwr&amp;quot;:0,&amp;quot;ece&amp;quot;:0,&amp;quot;urg&amp;quot;:0,&amp;quot;ack&amp;quot;:1,&amp;quot;psh&amp;quot;:0,&amp;quot;rst&amp;quot;:0,&amp;quot;syn&amp;quot;:0,&amp;quot;fin&amp;quot;:0,
        &amp;quot;options&amp;quot;: {},
        &amp;quot;data&amp;quot;:&amp;quot;...&amp;quot;
    }
}
&lt;/pre&gt;

&lt;p&gt;
Right... That just won&#039;t work. First, it&#039;s way too verbose since even a
simple test will involve tens of packets. You could eliminate some of
the verbosity in the packet generation by using lots of
defaulting. But that doesn&#039;t really work for checking outputs against
expects. This kind of 1:1 mapping to raw fields in the packet is also
not what&#039;s really needed. Things like advertised windows and sequence
ranges are pretty much the things I care most about when specifying a
test case. But the actual advertised window can&#039;t be determined from a
single packet; you need to know the window scaling factor that&#039;s in
use, which is only available in the SYN. Likewise the starting
sequence number of a segment is in the TCP header, but the ending
sequence number is implicit.

&lt;p&gt;
What we need is something designed for humans to read. The obvious
choice here was to pattern it after the &lt;code&gt;tcpdump&lt;/code&gt; output
format since we read packet dumps in that format every day.

&lt;pre&gt;
41:42:43:44:45:46 &amp;gt; 31:32:33:34:35:36, 8.0.0.8.58999 &amp;gt; 170.170.170.187.80: Flags [S], seq 999, win 32000 [mss 1460, sackOK, wscale 7, ts_val 123, ts_ecr 0]
&lt;/pre&gt;

&lt;p&gt;
That&#039;s still a bit chubby, but the way this works in practice is that we&#039;d make a PacketGenerator object that has defaults for the fields that are generally going to be constant over the lifetime of the connection (but just generally, if they need to change, no problem):

&lt;pre&gt;
// Set up a generator with defaults
PacketGenerator from_server(// MAC addresses
                            &amp;quot;123456&amp;quot;, &amp;quot;ABCDEF&amp;quot;,
                            // IP addresses
                            0xaaaaaabb, 0x8000008,
                            // Port numbers
                            80, 58999,
                            // Window scaling, traffic direction
                            4, true);

#define INJECT_FROM_SERVER(str) \
  tcp_test.inject(from_server.generate(str))

// Generate and inject a packet
INJECT_FROM_SERVER(&amp;quot;[A], seq 1001701:1003101, ack 161, win 192000&amp;quot;);
&lt;/pre&gt;

&lt;p&gt;
What about expects then? In the normal mode of operation we simply
push the string representation of expected packets to a per-interface
queue. When the TCP stack tries to emit a packet, it ends up instead
in a callback in the test driver. The callback pretty-prints the
packet, and compares it to the first string in the queue for the
output interface. If the strings don&#039;t match, or if the queue is
empty, the test fails (and thanks to having readable string
representations it&#039;s generally completely obvious where in the test
the problem was, and what the difference between expected and actual
results was). To eliminate the repetition in the pretty-printed
representation, we also typically have a small macro that fills in
the layer 2 / layer 3 information.

&lt;pre&gt;
#define EXPECT(M, E) tcp_test.expect(M, E, __FILE__, __LINE__)
#define EXPECT_TO_CLIENT(str) EXPECT(true, &amp;quot;31:32:33:34:35:36 &gt; 41:42:43:44:45:46, 170.170.170.187.80 &gt; 8.0.0.8.58999: Flags &amp;quot; str)

// Assert that the next packet to be output toward the client should look
// like this.
EXPECT_TO_CLIENT(&amp;quot;[A], seq 1001701:1003101, ack 161, win 48000&amp;quot;);

// N.B. packet generation is stateful when it comes to window scaling,
// pretty-printing is not. So this 48k corresponds to the unscaled 192k
// from the input.
&lt;/pre&gt;

&lt;a name=&#039;time&#039;&gt;&lt;/a&gt;
&lt;p&gt;
The last bit of basic functionality is manipulating time. This is very
simple, as long as it&#039;s easy to substitute some kind of a virtual
clock for a real clock. Just move the clock forward by the requested
amount, run any timers, and check that the expect queues are
empty. The only tricky bit here is advancing the clock one minimum
timer quantum at a time rather than all at once. This matters since a timer
getting run might cause a timer (either the same or different one) to
be (re)scheduled.

&lt;pre&gt;
void TcpTest::step(int ms, const char* file, int line) {
    int usec = TimerSet::USEC_PER_TICK;
    for (int i = 0; i &lt; (ms * 1000) / usec; ++i) {
        time_source_-&gt;step_ms(usec);
        timers()-&gt;run_expired();
    }
    assert_all_expects_satified(file, line);
}
&lt;/pre&gt;

&lt;p&gt;
There are a few cases where the textual representation is
insufficient. For example maybe some header field that needs to be
changed is too obscure to bother including in the parser and the
pretty-printer. For cases like this the packet structure returned by
generate() can be modified before being injected. Likewise there&#039;s
another version of expect that takes a callback function for doing
arbitrary checks on the packet, rather than just a string comparison.

&lt;p&gt;
Finally, it turns out that when testing edge cases of TCP behavior
it&#039;s often very convenient to run a bunch of alternate scenarios
starting from some particular socket state. A normal solution here
might be to make the state cloneable, but that&#039;s something we actively
don&#039;t want to do in the normal application, and maintaining the
copying code would be fragile and an unnecessary hassle. Instead for
testing we have little &lt;code&gt;BEGIN_FORK&lt;/code&gt;
and &lt;code&gt;END_FORK&lt;/code&gt; macros to run a block of code in a forked
process and quit the parent process if the child process errors
out, with the alternate scenarios each running in their own forked
process. It&#039;s not an ideal setup, since forking makes the experience
of using tools like gdb or valgrind a bit rough.

&lt;p&gt;
This also makes for pretty large tests. A typical test (containing
several subtests through the fork hack) is around 100-150 lines
long. Unsurprisingly the tests end up a lot longer than the code being
tested. Code coverage of the relevant files is at about 93% which is
good enough (most of the code that isn&#039;t covered is probably never
executed in production; it&#039;s old experimental features hidden behind
flags not enabled by default, paranoid error checking code for
situations that would be very hard or impossible to write a test
to trigger, etc).

&lt;h3&gt;What we can and can&#039;t test&lt;/h3&gt;

&lt;p&gt;
One implied objection to testing TCP implementations is that
you only really test completely trivial things, and most of the
trouble comes from the nature of TCP being a system with complex and
distributed state. So what kind of tests can you express using this
setup? Let&#039;s use the earlier example of zero windows. Cases you might
want to test for and which are easy enough to do (and some of which we
really want to test multiple times with different configuration
parameters; a sketch of one such test follows the list):

&lt;ul&gt;
&lt;li&gt; Receiving a zero window in the SYNACK, with the window getting opened by a separate ACK only once the 3WHS finishes.
&lt;li&gt; Receiving a zero window in the SYNACK, with the window never getting opened.
&lt;li&gt; The window starting at a reasonable value, but shrinking to zero during the connection, then opening up again (both naturally or as a reaction to
zero window probing)
&lt;li&gt; Probes getting sent at the expected timeouts if the zero window condition persists for too long.
&lt;li&gt; Advertising a zero window yourself to one of the endpoints when buffers
  are full. Check that anything sent in excess of the advertised window is
  properly dropped.
&lt;li&gt; Correctly reacting to zero window probes sent by that endpoint.
  (Both the &amp;quot;still zero&amp;quot; and &amp;quot;I have some space now&amp;quot; cases).
&lt;/ul&gt;
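
&lt;p&gt;
As a sketch of what one of these looks like in practice, here&#039;s roughly how
the &amp;quot;probes get sent at the expected timeouts&amp;quot; case might read. An
INJECT_FROM_CLIENT macro is assumed to exist as the mirror image of the
INJECT_FROM_SERVER macro shown earlier, and all the sequence numbers,
windows and timeouts below are made up for illustration:

&lt;pre&gt;
// The client ACKs everything it has received but advertises a zero window.
INJECT_FROM_CLIENT(&amp;quot;[A], seq 161, ack 1003101, win 0&amp;quot;);

// Nothing should be emitted toward the client until the probe timer fires;
// after that we expect a one-byte zero window probe. step() both advances
// the virtual clock and checks that every queued expect was satisfied.
EXPECT_TO_CLIENT(&amp;quot;[A], seq 1003100:1003101, ack 161, win 48000&amp;quot;);
tcp_test.step(5000, __FILE__, __LINE__);
&lt;/pre&gt;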

&lt;p&gt;
Are these kinds of tests interesting or useful? I&#039;d like to think
so. At least that list is a mix of things we got wrong at one time or
another, things we&#039;ve seen others get wrong, and tests done just in
case. Thinking about the test cases also gets you into an adversarial
mode of thought, where it&#039;s easier to see the cases that were left
unhandled.

&lt;p&gt;
Of course this kind of testing has its limits, and couldn&#039;t
possibly detect all failures. As I&#039;ve written
earlier, &lt;a href=&#039;https://www.snellman.net/blog/archive/2014-11-11-tcp-is-harder-than-it-looks.html&#039;&gt;TCP
is harder than it looks&lt;/a&gt; mainly because of the bizarre
interoperability failures. Unit testing can catch algorithm bugs
during development, but will at best act as a regression test for
problems encountered with endpoints that behave in completely unexpected
ways. Nor does this help at all with testing some other parts that are
on the critical traffic path like our custom device drivers.
But you can&#039;t let the perfect be the enemy of the good.

&lt;p&gt;
Even when there are corners of the system that you can&#039;t test,
I&#039;ve still found unit testing and a semi-TDD approach &lt;a href=&#039;#ftnt4&#039;
name=&#039;ftnt_ref4&#039;&gt;[4]&lt;/a&gt; to be hugely valuable in this
problem space, and I&#039;ve found myself leaning on writing the test cases
before code much more heavily than in any other project before. In
fact if we&#039;ve got a bug report and a theory about what could be going
on, the first step is writing a test case to verify or disprove the
theory. It&#039;s just an order of magnitude faster to set up a fully
controlled test case with this system than it would be to try to
recreate the hypothetical network conditions required for the bug to
manifest.

&lt;p&gt;
There are some nice side benefits too in addition to the typical gains
from testing. One is that we get IPv6 test coverage for essentially
free. We can run the same tests twice, once with the packet generator
making IPv4 packets and then with it generating IPv6 ones. It mostly
just requires a bit of finesse with the packet pretty-printing /
parsing to account for the different IP address
size.

&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;
Anyway, I&#039;m really happy with this setup for low level network
programming. If it truly is the case that in-kernel TCP stacks are
untestable, maybe that&#039;s just another reason to get networking
out of the OS and into the userspace.

&lt;h3&gt;Footnotes&lt;/h3&gt;

&lt;div class=&#039;footnotes&#039;&gt;
&lt;p&gt;
&lt;a href=&#039;#ftnt_ref1&#039; name=&#039;ftnt1&#039;&gt;[1]&lt;/a&gt; Our
TCP stack is part of a transparent performance
enhancing proxy. It splits every TCP connection in two parts without
terminating the connections. The TCP connection is only taken over by
the proxy after the initial handshake finishes, so both endpoints end
up having a compatible view of the TCP options and sequence numbers
used for the connection. This means that we essentially run a separate
and full TCP stack for both halves of the connection, but e.g. the
amount of data that has been acked on one half affects how much window
space we want to advertise on the other half.

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref2&#039;
name=&#039;ftnt2&#039;&gt;[2]&lt;/a&gt; One file per interface for
the inputs, one file per interface for expected outputs, have the test
driver compare actual outputs to expected
ones.

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref3&#039; name=&#039;ftnt3&#039;&gt;[3]&lt;/a&gt;
Not just guessing, I know it&#039;s basically useless since I later
implemented this model for creating regression tests for issues we
already had example traces for. Theoretically this allowed creating
new test cases with almost zero effort, but the tests were so annoying
to validate and maintain that we only ever made 4 of them. This is odd because in past lives this general form of testing
has been my tool of choice over lovingly handcrafted artisanal unit
tests.

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref4&#039; name=&#039;ftnt4&#039;&gt;[4]&lt;/a&gt;
Semi-TDD, since the diehards
wouldn&#039;t be happy with testing essentially a single static entry point for
klocs and klocs of code.
&lt;/div&gt;

</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Thu, 09 Jul 2015 15:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-07-09-unit-testing-a-tcp-stack/</guid></item><item><title>What&#039;s wrong with pcap filters?</title><link>https://www.snellman.net/blog/archive/2015-05-18-whats-wrong-with-pcap-filters/</link><description>
&lt;h3&gt;Introduction&lt;/h3&gt;

&lt;p&gt;
I recently watched a video of
a &lt;a href=&#039;https://www.youtube.com/watch?v=XHlqIqPvKw8&#039;&gt;great talk on
the early days of pcap&lt;/a&gt; by Steve McCanne. The bit on how
the filtering language was designed - around the 26 minute mark but
you might want to start at 20 minutes if you&#039;re unfamiliar with BPF -
was one of the best stories about creating a new &amp;quot;little
language&amp;quot; I&#039;ve heard.

&lt;p&gt;
But that got me thinking a bit. This language is a tool that I use
daily, that I&#039;m generally happy with, but that also drives me
absolutely crazy sometimes.  This post is an attempt to look at some
classes of problems that the pcap filtering language fails on, why
those deficiencies exist, and why I continue using it even despite the
flaws.

&lt;p&gt;
Just to be clear, libpcap is an amazing piece of software. It was
originally written for one purpose, and it really is my fault that I
end up too often using it for a different one. There are three very
different use cases that I have for a packet filtering language
(others may have more).

&lt;ul&gt;
&lt;li&gt; Small and simple filters to pick out a specific slice of traffic
  (single protocol, single flow, or single host). I believe it&#039;s fair
  to say that this is what the language was originally designed for.
&lt;li&gt; Potentially complex filters for classifying traffic with real-time
  constraints and with no state, usually when using the filters for
  configuration rather than as an exploratory tool. This is where pcap
  is clumsy even when it generally works.
&lt;li&gt; Offline analysis at higher protocol layers that&#039;d also benefit
  from tracking the high level protocol state between packets.
  You can sometimes coerce pcap to work for this use case, but it&#039;s
  super-awkward. It&#039;s also worth noting that features that are
  beneficial for this use case would not be welcome in the
  others. (Being able to run an arbitrary PCRE regexp on the packet
  payload? Great when doing offline analysis, unacceptable for
  real-time classification).
&lt;/ul&gt;

&lt;p&gt;
I try to do the third case with tools better suited for that, and only
have a couple of complaints (e.g. VLAN support) about the first
case. Mostly the pain comes from the middle case. So as we start the
tour of annoyances, keep in mind that I&#039;ll often complain about a tool
not doing a job it wasn&#039;t meant for.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h3&gt;VLANs&lt;/h3&gt;

&lt;p&gt;
The VLAN support might be the oddest part of pcap filters.

&lt;p&gt;
First, let&#039;s start off with the way filters require the presence of
VLANs to be explicitly specified. Forgetting to do that might be the
most common mistake I&#039;ve seen people make (or done myself). They do a
tcpdump on all traffic on an interface, and get a bunch of
packets. Then they try to specify a filter, and no matter how liberal
the filter is no traffic shows up. The first suggestion to any report
I get of a filter not matching properly is &amp;quot;did you remember to add
a &lt;code&gt;vlan&lt;/code&gt; directive&amp;quot;. Note the contrast to e.g. IPv6
support, where pcap will automatically generate both IPv4 and IPv6
matching code for a filter that just e.g. has &lt;code&gt;tcp&lt;/code&gt;
directive.

&lt;p&gt;
But let&#039;s say that your users have managed to internalize this
requirement, and generally remember to specify exactly the right
number of &lt;code&gt;vlan&lt;/code&gt; directives. The usage then looks pretty
straightforward.

&lt;p&gt;
Match all TCP traffic on VLAN 11.
&lt;pre&gt;vlan 11 and tcp&lt;/pre&gt;

&lt;p&gt;
Match all TCP traffic on any VLAN.
&lt;pre&gt;vlan and tcp&lt;/pre&gt;

&lt;p&gt;
But there&#039;s a dark secret (a well-documented secret, mind you). This
filter will never match any well-formed traffic:

&lt;pre&gt;tcp and vlan&lt;/pre&gt;

&lt;p&gt;
The problem is that the &lt;code&gt;vlan&lt;/code&gt; directive actually affects
the remaining expression, but not the already compiled parts. (It
adjusts the offset of all later memory lookups to account space taken
by the VLAN header). So this filter first requires a packet to be a
non-VLAN tagged TCP packet, and then requires it to be some kind of
VLAN tagged packet. This is a pretty unlikely combination...

&lt;p&gt;
Ok, what about this filter for matching all TCP traffic, whether it&#039;s
VLAN tagged or not:

&lt;pre&gt;(vlan and tcp) or tcp&lt;/pre&gt;

&lt;p&gt;
Again that doesn&#039;t work. From reading the &lt;code&gt;man&lt;/code&gt; page too
generously, one might think that the change in the lookup offset is
scoped just to the parenthesized sub-expression that the
&lt;code&gt;vlan&lt;/code&gt; directive appears in. That&#039;s not the case. It
really affects the rest of the full filter expression. Instead the
source must be rearranged such that all the non-&lt;code&gt;vlan&lt;/code&gt; options
come first:

&lt;pre&gt;tcp or (vlan and tcp)&lt;/pre&gt;

&lt;p&gt;
But wait, it gets worse!

&lt;pre&gt;vlan 11 or vlan 12&lt;/pre&gt;

&lt;p&gt;
This does not match a packet with either a VLAN tag of 11 or 12, as
one might expect. Actually it matches any packet with a tag of 11, or
a double-tagged VLAN packet with an outer tag of anything and an inner
tag of 12. This is because the offset tweaking is purely a compile-time
effect. So the &lt;code&gt;vlan 11&lt;/code&gt; directive will
advance the offset &lt;i&gt;whether it matched or not&lt;/i&gt;.

&lt;p&gt;
One might expect that the right way to do this is something like
&lt;code&gt;vlan 11 or 12&lt;/code&gt;. But that doesn&#039;t even parse despite
being analogous to other pcap filter
constructs. Instead you need to dig out the relevant bits from the
ethernet header manually.

&lt;pre&gt;vlan and (ether[14:2] &amp; 0xfff == 11 or ether[14:2] &amp; 0xfff == 12)&lt;/pre&gt;

&lt;p&gt;
The generated BPF code is actually good since the repetition gets
optimized away, it&#039;s just very awkward to write an expression like
that.

&lt;p&gt;
And finally, the cruelest trick of the pcap VLAN support:

&lt;pre&gt;(not vlan) and tcp&lt;/pre&gt;

&lt;p&gt;
Non-VLAN tagged TCP, right? But as should be obvious at this point,
that&#039;s not how things work in this corner of pcap. Again the compiler
advances the offset to account for the VLAN header even though we&#039;ve
explicitly specified no VLANs. And thus the lookups for
the IP ethertype and TCP IP proto will be from the wrong locations.

&lt;p&gt;
It&#039;s all a bit of a mess. But even if anyone was willing to break
compatibility by fixing this, it&#039;d be tricky to do given the basic
model of handling the VLANs statically at compile time. Instead it&#039;d
need to be done dynamically, either through lots of duplicate code or
by maintaining a dynamic offset that&#039;s added to all subsequent memory
lookups. (And this latter option would presumably make things a lot
harder for the pcap BPF optimizer).

&lt;p&gt;
Even so, the behavior of &lt;code&gt;vlan&lt;/code&gt; is in stark contrast to the
extreme user-friendliness of most of the basic pcap filter constructs,
as well as the design principles outlined in McCanne&#039;s talk.

&lt;h3&gt;No abstraction facilities&lt;/h3&gt;

&lt;p&gt;
At &lt;a href=&#039;https://www.teclo.net&#039;&gt;Teclo&lt;/a&gt; a substantial part of the
configuration for our mobile TCP accelerator consists of pcap
filters. There are all kinds of decisions that&#039;d be incredibly hard to
cover with a static set of configuration parameters, but can easily be
done with simple filters. The filters are safe, and might even be
familiar to the network administrators unlike our bespoke
configuration options. What&#039;s not to like?

&lt;p&gt;
The problem is that once a filter-based decision making mechanism
exists, the filters will soon stop being simple. The pcap language
provides basically no tools at all for managing this complexity,
beyond discouraging complex filters by making writing them painful.
In fact the lack of any control structures virtually ensures massive
code duplication and all the maintenance problems that this implies.

&lt;p&gt;
Let&#039;s take for example our &lt;code&gt;optimize-filter&lt;/code&gt;, which can be
used to select at SYN-handling time which TCP flows to optimize and
which to just pass through. One use for such a filter might be to
disable optimization for a few hosts that misbehave, are being tested,
or something along those lines. Those are simple behaviors, and require
only simple filters.

&lt;p&gt;
But needs change, and soon you&#039;ll get cases like wanting to optimize
the traffic of half the mobile subscribers, but not optimizing the
traffic of the other half, and doing that split in some deterministic
manner. The goal is to gather statistics for both groups of users, and
hopefully show the benefits of optimization. That&#039;s simple, right? So
let&#039;s see where that kind of a requirement could take us.

&lt;p&gt;
The filter gets only TCP SYNs as input from the application layer, so
we don&#039;t need to check for that again in the filter. Just pull out the last two
bytes of the source IP, mix them together, and check the lowest
bit. If it&#039;s 1, optimize. If it&#039;s 0, don&#039;t optimize:
&lt;pre&gt;
((ip[15] + ip[14]) &amp; 1) == 1
&lt;/pre&gt;

&lt;p&gt;
There&#039;s an obvious problem. This is using the src address of the SYN, but
it&#039;s possible that the mobile device is actually the recipient rather
than sender of the SYN. To fix this, we need to check the subnets
of the endpoints, and choose the bits from either the source or
destination address.

&lt;pre&gt;
((src net 10.0.0.0/16) and ((ip[15] + ip[14]) &amp; 1) == 1) or
((dst net 10.0.0.0/16) and ((ip[19] + ip[18]) &amp; 1) == 1)
&lt;/pre&gt;

&lt;p&gt;
But wait, what if both the sender and recipient are on the mobile
subscriber subnet, and one is eligible for optimization and the other
isn&#039;t? We&#039;ll end up optimizing the traffic regardless of the direction,
which will distort the statistics. Instead we need to make it symmetric,
based on either the sender or the receiver. And we should really be
prepared for the case where we weren&#039;t given accurate info, and
neither the sender nor the receiver is in the mobile address pool:

&lt;pre&gt;
((src net 10.0.0.0/16 and not dst net 10.0.0.0/16) and
 ((ip[15] + ip[14]) &amp; 1) == 1) or
((dst net 10.0.0.0/16 and not src net 10.0.0.0/16) and
 ((ip[19] + ip[18]) &amp; 1) == 1) or
((src net 10.0.0.0/16 and dst net 10.0.0.0/16) and
 ((ip[15] + ip[14]) &amp; 1) == 1) or
((not src net 10.0.0.0/16 and not dst net 10.0.0.0/16) and
 ((ip[15] + ip[14]) &amp; 1) == 1)
&lt;/pre&gt;

&lt;p&gt;
It&#039;s of course unrealistic for a mobile operator to have just one
mobile address pool. There will be a couple of pools for premium
users with public IPs, a private IP pool for the main APN, another
private IP pool for a performance testing APN, and so on. So really
you can&#039;t just be checking for this single subnet.

&lt;pre&gt;
(((src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8) and
 not (dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8)) and
 ((ip[15] + ip[14]) &amp; 1) == 1) or
(((dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8) and
 not (src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8)) and
 ((ip[19] + ip[18]) &amp; 1) == 1) or
(((src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8) and
 (dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8)) and
 ((ip[15] + ip[14]) &amp; 1) == 1) or
((not (src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8) and
 not (dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8)) and
 ((ip[15] + ip[14]) &amp; 1) == 1)
&lt;/pre&gt;

&lt;p&gt; You might be wondering about the ever increasing nesting level of
parentheses. It looks like the kind of Lisp joke that was already
stale in the &#039;70s. But that&#039;s just what these filters end up looking
like once they get complicated enough, since the users don&#039;t have
confidence in getting the precedence right. And I&#039;ll spare you from
the version with IPv6 support.

&lt;p&gt; Once you get into filters like this, the benefit of accessibility
and familiarity goes out the window. Anyone with some Unix or
networking background should be able to deal with simple pcap filters,
but you can&#039;t expect most people to do so with a 10 line filter with
parentheses nested 4 levels deep. The only way this would actually
get entered is by someone copy-pasting from our documentation. And even that
wouldn&#039;t necessarily work, since there&#039;s so much &lt;i&gt;stuff&lt;/i&gt;
in that expression that needs to be customized for the specific
network, and which needs to be updated in a consistent manner.

&lt;p&gt;
But the generated code isn&#039;t as horrible as the source, once
the optimizer has had its way with it. And this
filter could actually be really simple with just some way of
eliminating the redundancy. It&#039;s so close to still being a good tool.

&lt;p&gt;
But maybe that&#039;s a theoretical case. How about only (reliably)
matching packets that have the window scale TCP option set, another
kind of thing you might hypothetically want to base optimization
settings on in this use case? Well, assuming a maximum of 6 TCP options
I think you&#039;d need something like this:

&lt;pre&gt;
tcp and ((tcp[tcpflags] &amp; tcp-syn) != 0) and ((tcp[20] == 3) or ((tcp[20] != 1) and ((tcp[20 + tcp[21]] == 3) or ((tcp[20 + tcp[21]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] + 1]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] + 1] == 3))))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 2]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 2] == 3))))))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] + 1]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] + 1] == 3))))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 3]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] == 1) and ((tcp[20 + tcp[21] + tcp[20 + 
tcp[21] + 1] + 3] == 3))))))))) or ((tcp[20 + tcp[21]] == 1) and ((tcp[20 + tcp[21] + 1] == 3) or ((tcp[20 + tcp[21] + 1] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2]] == 3) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2]] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1]] == 3) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1]] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] + 1]] == 3))) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1]] == 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] + 1] == 3))))) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2]] == 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] == 3) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1 + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 2]] == 3))) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] == 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 2] == 3))))))) or ((tcp[20 + tcp[21] + 1] == 1) and ((tcp[20 + tcp[21] + 2] == 3) or ((tcp[20 + tcp[21] + 2] != 1) and ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3]] == 3) or ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3]] != 1) and ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3] + tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3] + 1]] == 3))) or ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3]] == 1) and ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3] + 1] == 3))))) or ((tcp[20 + tcp[21] + 2] == 1) and ((tcp[20 + tcp[21] + 3] == 3) or ((tcp[20 + tcp[21] + 3] != 1) and ((tcp[20 + tcp[21] + 3 + tcp[20 + tcp[21] + 4]] == 3))) or ((tcp[20 + tcp[21] + 3] == 1) and ((tcp[20 + tcp[21] + 4] == 3))))))))))) or ((tcp[20] == 1) and ((tcp[21] == 3) or ((tcp[21] != 1) and ((tcp[21 + tcp[22]] == 3) or ((tcp[21 + tcp[22]] != 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1]] == 3) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1]] != 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1]] == 3) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1]] != 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] + 1]] == 3))) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1]] == 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] + 1] == 3))))) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1]] == 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] == 3) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] != 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1 + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 2]] == 3))) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] == 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 2] == 3))))))) or ((tcp[21 + tcp[22]] == 1) and ((tcp[21 + tcp[22] + 1] == 3) or ((tcp[21 + tcp[22] + 1] != 1) and ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2]] == 3) or ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2]] != 1) and ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2] + tcp[21 + tcp[22] + 1 + tcp[21 + 
tcp[22] + 2] + 1]] == 3))) or ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2]] == 1) and ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2] + 1] == 3))))) or ((tcp[21 + tcp[22] + 1] == 1) and ((tcp[21 + tcp[22] + 2] == 3) or ((tcp[21 + tcp[22] + 2] != 1) and ((tcp[21 + tcp[22] + 2 + tcp[21 + tcp[22] + 3]] == 3))) or ((tcp[21 + tcp[22] + 2] == 1) and ((tcp[21 + tcp[22] + 3] == 3))))))))) or ((tcp[21] == 1) and ((tcp[22] == 3) or ((tcp[22] != 1) and ((tcp[22 + tcp[23]] == 3) or ((tcp[22 + tcp[23]] != 1) and ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1]] == 3) or ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1]] != 1) and ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1] + tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1] + 1]] == 3))) or ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1]] == 1) and ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1] + 1] == 3))))) or ((tcp[22 + tcp[23]] == 1) and ((tcp[22 + tcp[23] + 1] == 3) or ((tcp[22 + tcp[23] + 1] != 1) and ((tcp[22 + tcp[23] + 1 + tcp[22 + tcp[23] + 2]] == 3))) or ((tcp[22 + tcp[23] + 1] == 1) and ((tcp[22 + tcp[23] + 2] == 3))))))) or ((tcp[22] == 1) and ((tcp[23] == 3) or ((tcp[23] != 1) and ((tcp[23 + tcp[24]] == 3) or ((tcp[23 + tcp[24]] != 1) and ((tcp[23 + tcp[24] + tcp[23 + tcp[24] + 1]] == 3))) or ((tcp[23 + tcp[24]] == 1) and ((tcp[23 + tcp[24] + 1] == 3))))) or ((tcp[23] == 1) and ((tcp[24] == 3) or ((tcp[24] != 1) and ((tcp[24 + tcp[25]] == 3))) or ((tcp[24] == 1) and ((tcp[25] == 3))))))))))))
&lt;/pre&gt;

&lt;p&gt;
Does that 8kB monstrosity even work? It seems to be OK based on light
testing, and it all made sense when I wrote it, but I couldn&#039;t say for
sure. Note that it doesn&#039;t implement the EOL TCP option or check for
overflowing into the payload area of the packet, so a proper solution
would look even worse. This is a task that BPF itself is well suited
for, and one that shouldn&#039;t be unreasonable for a packet filter
language. But it is unreasonable in pcap, and so anything dealing
with TCP options must be done by means other than the filter
language.
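
&lt;p&gt;
For contrast, here is roughly what the same check looks like once you
step outside the filter language. This is a minimal C sketch, not code
from any real program: it walks the TCP options looking for option kind
3, which is what the filter above is searching for, and it also does the
overflow checking that the filter skips.

&lt;pre&gt;
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

/* tcp points at the start of the TCP header, tcp_len is the number of
 * bytes available.  Returns 1 if an option with the given kind (e.g. 3
 * for window scale) is present. */
static int tcp_has_option(const uint8_t *tcp, size_t tcp_len, uint8_t wanted)
{
    if (tcp_len &lt; 20)
        return 0;
    size_t hdr_len = (tcp[12] &gt;&gt; 4) * 4;       /* data offset, in bytes */
    if (hdr_len &lt; 20 || hdr_len &gt; tcp_len)
        return 0;

    size_t i = 20;                             /* options start here */
    while (i &lt; hdr_len) {
        uint8_t kind = tcp[i];
        if (kind == wanted)
            return 1;
        if (kind == 0)                         /* EOL: no more options */
            return 0;
        if (kind == 1) {                       /* NOP: single byte, no length */
            i += 1;
            continue;
        }
        if (i + 1 &gt;= hdr_len)                  /* truncated option */
            return 0;
        uint8_t len = tcp[i + 1];
        if (len &lt; 2)                           /* malformed length */
            return 0;
        i += len;
    }
    return 0;
}
&lt;/pre&gt;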

&lt;p&gt;
Due to the limits of the pcap language, filters become untenable as
a method of configuration surprisingly quickly, and the system
instead needs to trigger actions based on a set of static
configuration items. Unfortunately you&#039;ll never have thought of every
possibility in advance, and the next deployment could be one where
you&#039;re once again wading deep into an overly complex filter as a last
resort.

&lt;p&gt;
What kind of language changes could help, assuming the generally
declarative flavor needs to be maintained?  Just having some kind of
binding construct or alias would help. Consider the first case, with
the results of the &lt;code&gt;src net&lt;/code&gt; / &lt;code&gt;dst net&lt;/code&gt;
expressions evaluated once, the result bound to an identifier,
and then referred to through that identifier. It would be so much simpler
to read, let alone write.

&lt;p&gt;
An &lt;code&gt;if-else&lt;/code&gt;-expression or &lt;code&gt;case&lt;/code&gt;-expression would
remove the need to repeat the negation of a previous expression
to get properly disjoint subexpressions. Some kind of programmable
macro facility (expanding to pcap code, not doing arbitrary
computation) would allow extending the language a bit to cover use
cases like this.

&lt;p&gt;
Changing the language like that seems unrealistic, so people who need
to do these things while still using BPF as the filtering
mechanism have to compile from some other language or write raw
BPF assembler. (And then you&#039;re really not talking about the kind of
configuration that an end user could do, no matter what your definition
of end user is).
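
&lt;p&gt;
As a hedged sketch of what that looks like in practice: the embedding
program does the binding, generates the repetitive pcap source from a
single value, and only hands the finished string to the compiler. The
network, the shape of the expression and the surrounding program are
made up for illustration; only the libpcap calls are real.

&lt;pre&gt;
#include &lt;stdio.h&gt;
#include &lt;pcap/pcap.h&gt;

int main(void)
{
    const char *net = &quot;192.0.2.0/24&quot;;   /* the one binding */
    char filter[256];

    /* The repetition is generated, not maintained by hand. */
    snprintf(filter, sizeof filter,
             &quot;(src net %s and not dst net %s) or &quot;
             &quot;(dst net %s and not src net %s)&quot;,
             net, net, net, net);

    pcap_t *p = pcap_open_dead(DLT_EN10MB, 65535);
    struct bpf_program prog;
    if (pcap_compile(p, &amp;prog, filter, 1, PCAP_NETMASK_UNKNOWN) &lt; 0) {
        fprintf(stderr, &quot;bad filter: %s\n&quot;, pcap_geterr(p));
        return 1;
    }
    /* ... install with pcap_setfilter(), or hold on to prog ... */
    pcap_freecode(&amp;prog);
    pcap_close(p);
    return 0;
}
&lt;/pre&gt;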

&lt;h3&gt;Hostnames&lt;/h3&gt;

&lt;p&gt;
Allowing some kind of external interaction is the bane of all
programming languages that have intentionally limited power. Users
will almost instinctively attempt to use such facilities to do
something they&#039;re not supposed to. I &lt;i&gt;think&lt;/i&gt; the only such
facility in pcap is name resolution, and since the resolution happens
at compile time, it&#039;s hard to see how to abuse it. (But users are a
devious bunch, so don&#039;t quote me on that).

&lt;p&gt;
But it&#039;s not possible to cleanly disable the hostname resolution either.
That&#039;s problematic when there are real-time constraints on the
compilation phase too, not just the filter execution phase.

&lt;p&gt;
The first time I bumped into this, DNS was sufficiently flaky that
name resolution was taking several seconds. When a config change
including a filter with a hostname was propagated to the traffic
handling processes, they tried compiling the filter, hung while doing
the name lookup, a watchdog timer detected the system was unhealthy,
and everything was restarted.  Ok, not good.

&lt;p&gt;
But we&#039;re all responsible people, right? Let&#039;s just remember to always
use addresses rather than names. That worked for about a month. Then
somebody made a copy-paste error and ended up with some rubbish right
after a &lt;code&gt;host&lt;/code&gt; directive, something along the lines
of &lt;code&gt;host 10.0.0.1and port 80&lt;/code&gt; except much longer. Since the
name lookups happen during parsing, the name resolution was triggered
before the system decided that the expression was malformed
anyway. Time for the watchdog timer again!

&lt;p&gt;
What about doing the filter compilation in a separate thread from the
actual work? The problem is that, at least for my uses, it is not
acceptable for configuration changes to take tens of seconds or
minutes to propagate everywhere. It&#039;s even worse if configuration
change requests can queue up, and if the systems in charge of
propagating configuration changes don&#039;t have a good idea of when
a config change will make it through.

&lt;p&gt;
Fine, perhaps the filters need to be validated by an earlier phase in
the configuration pipeline, for example at the moment the
configuration change is submitted (the commit fails if the compilation
fails). Unfortunately if the filters get compiled again after the
validation for any reason, the lookup that worked originally can hang
now (or outright fail due to a DNS record having been removed). It&#039;s
an interesting question what kind of a fallback strategy you can use
if a configuration that used to be valid suddenly becomes invalid. But
it&#039;s better not to be in a situation where you need such a
strategy.

&lt;p&gt;
There are three workarounds that will actually work. Two of them
try to ensure that all name resolution fails immediately, which
ensures that no configuration using a host name will ever be
accepted in the first place. The third tries to make sure that
name resolution can&#039;t happen at a time when it&#039;d cause trouble.

&lt;ul&gt;
&lt;li&gt; Hack out the name resolution support completely from the library,
  or go the extra mile and make it an option.
&lt;li&gt; Make sure all hostname lookups always fail immediately, either
  through some global configuration or maybe by overriding the relevant
  library functions with LD_PRELOAD (a rough sketch of this follows
  the list).
&lt;li&gt; Always reify the filter source into a compiled program in a safe
  process and at a safe time (e.g. when the configuration change to
  use a new filter is made). Then pass the compiled program around,
  instead of the source. Or pass a source / binary pair, if it&#039;s
  important to be able to introspect the system and see what filters
  are in effect. This is actually a pretty decent solution, but requires
  a lot more changes.
&lt;/ul&gt;
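
&lt;p&gt;
To make the second workaround a bit more concrete, here is a rough
sketch of such an LD_PRELOAD shim, assuming a Linux/glibc-style
environment. The file names are made up, and exactly which resolver
functions need to be covered depends on how the libpcap in use was
built (newer versions go through &lt;code&gt;getaddrinfo&lt;/code&gt;, older ones
through &lt;code&gt;gethostbyname&lt;/code&gt;), so treat this as a starting point
rather than a drop-in solution.

&lt;pre&gt;
/* no_dns.c: build with   gcc -shared -fPIC -o no_dns.so no_dns.c -ldl
 * and run with           LD_PRELOAD=./no_dns.so your-packet-program
 * Numeric addresses still resolve; anything that would need actual DNS
 * fails immediately instead of hanging. */
#define _GNU_SOURCE
#include &lt;dlfcn.h&gt;
#include &lt;netdb.h&gt;
#include &lt;string.h&gt;

typedef int getaddrinfo_fn(const char *, const char *,
                           const struct addrinfo *, struct addrinfo **);

int getaddrinfo(const char *node, const char *service,
                const struct addrinfo *hints, struct addrinfo **res)
{
    static getaddrinfo_fn *real;
    struct addrinfo numeric;

    if (!real)
        real = (getaddrinfo_fn *)dlsym(RTLD_NEXT, &quot;getaddrinfo&quot;);

    memset(&amp;numeric, 0, sizeof numeric);
    if (hints)
        numeric = *hints;
    numeric.ai_flags |= AI_NUMERICHOST;   /* never go out to the network */
    return real(node, service, &amp;numeric, res);
}

struct hostent *gethostbyname(const char *name)
{
    (void)name;
    h_errno = HOST_NOT_FOUND;             /* fail fast, unconditionally */
    return NULL;
}
&lt;/pre&gt;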

&lt;p&gt;
So this is solvable by jumping through relatively minor hoops. But
it does at least break the golden rule of designing little sandboxed
languages, which is that no matter how insignificant an interface
to the external world is, it should be possible for the program that
embeds the language to disable the interface.

&lt;h3&gt;Just filtering, not classification&lt;/h3&gt;

&lt;p&gt;
Anyone dealing with packets will soon not be satisfied with simply
rejecting / accepting packets. Instead they&#039;ll have more than just two
categories, and want to assign a packet to one of them. While the BPF
instruction set can express this kind of operation through the use of
multiple distinct return values, the pcap filtering language can&#039;t.

&lt;p&gt;
A pcap-style solution is to have multiple separate filters, apply them
in order, and pick a category based on the first one to match. And it
is a pretty clean solution. But the annoying part is that by doing
that we&#039;re losing out on one of the main advantages of the language,
which is the optimizer. In situations with multiple filter categories
there&#039;s often a tremendous amount of overlap between the filters, all
of which would get optimized away by the compiler if separate filters
could be merged pre-optimization.
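
&lt;p&gt;
As a rough illustration of the ordered-filters approach, here is a
hedged C sketch. The categories and filter expressions are made up; the
point is the structure: each filter is compiled and optimized on its
own, so the overlap between them does not get optimized away the way it
would inside a single merged filter.

&lt;pre&gt;
#include &lt;stddef.h&gt;
#include &lt;pcap/pcap.h&gt;

/* Classification via an ordered list of separately compiled filters:
 * the first one that matches decides the category. */
static const char *filter_src[] = {
    &quot;tcp port 80 or tcp port 443&quot;,    /* category 0: web */
    &quot;udp port 53 or tcp port 53&quot;,     /* category 1: dns */
    &quot;ip&quot;,                             /* category 2: other ipv4 */
};
#define NCATEGORIES (sizeof filter_src / sizeof filter_src[0])

static struct bpf_program progs[NCATEGORIES];

/* Compile each filter once, up front. */
static int compile_categories(void)
{
    pcap_t *dead = pcap_open_dead(DLT_EN10MB, 65535);
    for (size_t i = 0; i &lt; NCATEGORIES; i++)
        if (pcap_compile(dead, &amp;progs[i], filter_src[i], 1,
                         PCAP_NETMASK_UNKNOWN) &lt; 0)
            return -1;
    pcap_close(dead);
    return 0;
}

/* Returns the index of the first matching filter, or -1 for no match. */
static int classify(const struct pcap_pkthdr *h, const unsigned char *pkt)
{
    for (size_t i = 0; i &lt; NCATEGORIES; i++)
        if (pcap_offline_filter(&amp;progs[i], h, pkt))
            return (int)i;
    return -1;
}
&lt;/pre&gt;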

&lt;p&gt;
It makes perfect sense for this to be the case. The language doesn&#039;t
really have a space for imperative constructs like a return directive.
Probably the best you could have is some kind of implicit toplevel
&lt;code&gt;or&lt;/code&gt; for full toplevel expressions, and somehow assigning a
distinct success code to each full expression. But that&#039;s straying
quite far from the current form of the language.

&lt;h3&gt;Why use pcap filters then?&lt;/h3&gt;

&lt;p&gt;
Given the above whining, why am I still writing pcap filters in
pretty much any networking program that I write? Why not find some
other filter language, write my own, or link in a jitted scripting
language and write the packet filters in that?  Well, pcap has a few
big advantages: running in the kernel, safety, ease of use for the
simple cases, and ubiquity.

&lt;p&gt;
If you want to run code in the kernel, it has to be through BPF. And
pcap is the best way I know of to generate good BPF code. (Note: this
might change with the recent introduction of eBPF to the Linux kernel,
which e.g. has an LLVM backend. This could have major effects on
language availability and ease of optimization).
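
&lt;p&gt;
For the kernel case the usual pattern is to compile the filter with
libpcap in userspace and hand the resulting classic BPF program to the
kernel. A hedged Linux-only sketch: it relies on
&lt;code&gt;struct bpf_insn&lt;/code&gt; and &lt;code&gt;struct sock_filter&lt;/code&gt;
having the same layout, which holds in practice but is not something
the headers promise, and the socket is assumed to have been opened
elsewhere.

&lt;pre&gt;
#include &lt;pcap/pcap.h&gt;
#include &lt;linux/filter.h&gt;
#include &lt;sys/socket.h&gt;

/* Compile a pcap expression and attach it to an already opened packet
 * socket, so non-matching packets are dropped in the kernel. */
static int attach_pcap_filter(int sock_fd, const char *expr)
{
    pcap_t *dead = pcap_open_dead(DLT_EN10MB, 65535);
    struct bpf_program prog;
    int err;

    if (pcap_compile(dead, &amp;prog, expr, 1, PCAP_NETMASK_UNKNOWN) &lt; 0) {
        pcap_close(dead);
        return -1;
    }

    struct sock_fprog fprog = {
        .len    = prog.bf_len,
        .filter = (struct sock_filter *)prog.bf_insns,
    };
    err = setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_FILTER,
                     &amp;fprog, sizeof fprog);

    pcap_freecode(&amp;prog);
    pcap_close(dead);
    return err;
}
&lt;/pre&gt;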

&lt;p&gt;
To me personally the kernel use case is irrelevant; my packet handling
happens in userspace.

&lt;p&gt;
Guaranteed termination, or even a guaranteed upper bound on execution
time, is of course a huge deal for packet filtering applications. It&#039;s
kind of bad if an accidental infinite loop in a packet capture filter
configuration field crashes a router. This probably excludes 99.9% of
non-bespoke languages; it&#039;s very hard to get people to give up their
Turing completeness. It can be hard to even get basic sandboxing
working for some scripting languages, so termination guarantees aren&#039;t
even on the radar. But the safety advantage doesn&#039;t matter if
the comparison is to other custom languages, which would certainly be
designed from the ground up to have the same properties.

&lt;p&gt;
No, for me it&#039;s the combination of the last two that does it. Even
when I&#039;m writing something to work on live network traffic, during the
early part of development it&#039;ll mainly be tested on already
existing trace files. Which are most likely in pcap format, so
linking in libpcap to read those files is a natural choice. Once the
program evolves a bit and there&#039;s a need for a simple filtering
mechanism, it&#039;s only natural to use what&#039;s already there. Very
soon there&#039;s a critical mass of filters around and there&#039;s no
point in investigating other solutions. Hopefully all the filters
remain simple, and we stay out of trouble.

&lt;p&gt;
And finally, to undermine my own point, the user-space network appliance kit
&lt;a href=&#039;https://github.com/SnabbCo/snabbswitch&#039;&gt;Snabb Switch&lt;/a&gt;
doesn&#039;t have any need to run code in the kernel, uses a distinct
implementation of the pcap language that&#039;s not backed by BPF (and thus
doesn&#039;t benefit from the pcap optimizer), and didn&#039;t just end up with
libpcap linked in early on to get some packets fed into the program.

&lt;p&gt;Despite having a total green-field opportunity, not only did they
choose to use the pcap filter language, but even went through the
trouble of commissioning that from-scratch &lt;a
href=&#039;https://github.com/Igalia/pflua&#039;&gt;luajit-based pcap
implementation&lt;/a&gt;, which eliminates my convenience argument from
above. (Andy Wingo&#039;s &lt;a
href=&#039;http://wingolog.org/archives/2014/09/02/high-performance-packet-filtering-with-pflua&#039;&gt;writeup
on pflua&lt;/a&gt; is worth reading, very cool stuff). I asked Luke
Gorrie why they went that route, and this is what he had to say:

&lt;blockquote&gt;
&lt;p&gt;
I think the pflua implementation actually strikes a nice balance:

&lt;ul&gt;
&lt;li&gt; Safe for end-users to configure
&lt;li&gt; Relatively general, familiar, efficient
&lt;li&gt; Small dependency (at least if you are using LuaJIT)
&lt;li&gt; Clean and simple code that can evolve in interesting directions (actions, profiling, fancy LuaJIT packet mangling library underneath, language cleanup, etc)
&lt;/ul&gt;

&lt;p&gt;
[...] I didn&#039;t want to carry libpcap as a dependency for too long but neither to say &amp;quot;please come back when we have invented the ultimate new packet filtering language&amp;quot;.
&lt;/blockquote&gt;

&lt;p&gt;
All of which is hard to argue with, and certainly the &lt;a href=&#039;https://github.com/SnabbCo/snabbswitch/pull/444#issuecomment-94007097&#039;&gt;whole-application
performance improvements from pflua&lt;/a&gt; seen in Snabb are pretty
impressive.

&lt;h3&gt;What&#039;s the alternative?&lt;/h3&gt;

&lt;p&gt;
Normally this would be the point where I proclaim the miraculous
discovery of a solution to exactly the problems I&#039;ve been describing
while still keeping the (not inconsiderable) benefits of pcap.
But alas, no. If anyone has recommendations for filter languages
that are as usable as pcap in the small, but that work better when
used in the large, I&#039;m all ears.

</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Mon, 18 May 2015 12:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-05-18-whats-wrong-with-pcap-filters/</guid></item><item><title>Podcast on mobile TCP optimization</title><link>https://www.snellman.net/blog/archive/2015-03-14-podcast-on-mobile-tcp-optimization/</link><description>
&lt;p&gt;
I was recently a guest on &lt;a href=&#039;http://www.ipspace.net/About_Ivan_Pepelnjak&#039;&gt;Ivan Pepelnjak&lt;/a&gt;&#039;s (ipspace.net) Software Gone Wild podcast, talking about TCP acceleration in mobile networks, as well as whining in general about how much radio networks suck ;-) Thanks a lot to Ivan for the opportunity, it was fun!

&lt;p&gt;
You can listen to the
podcast episode &lt;a href=&#039;http://blog.ipspace.net/2015/03/tcp-optimization-with-juho-snellman-on.html&#039;&gt;here&lt;/a&gt;.
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Sat, 14 Mar 2015 22:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-03-14-podcast-on-mobile-tcp-optimization/</guid></item><item><title>How buying a SSL certificate broke my entire email setup</title><link>https://www.snellman.net/blog/archive/2014-12-05-how-buying-a-ssl-certificate-broke-my-email-setup/</link><description>
&lt;h3&gt;Everything looks ok to me, the error must be on your end!&lt;/h3&gt;

&lt;p&gt;
Earlier this week I got a couple of emails from different people, both
telling me that they&#039;d just created a new user account, but hadn&#039;t yet
received the validation email. In both cases my mail server logs
showed that the destination SMTP server had rejected the incoming
message due to a DNS error while trying to resolve the hostname part
of the envelope sender. Since I knew very well that my DNS setup was
working at the time, I was already ready to reply with something along
the lines of &amp;quot;It&#039;s just some kind of a transient error somewhere,
just wait a while and it&#039;ll sort itself out&amp;quot;.

&lt;p&gt;
But I decided to check the outgoing mail queue just in case. It
contained 2700 messages, going around 5 days back, most with error
messages that looked at least a little bit DNS-related.

&lt;p&gt;
Oops.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
Now, 5 days for this server is usually something like 50-60k outgoing
email messages, so those 2700 queued messages represented a pretty
decent chunk of traffic. The mail logs suggested that the errors had
started weeks ago, around November 12th. And indeed while
an &lt;code&gt;A&lt;/code&gt; query for the hostname was working fine, an
&lt;code&gt;MX&lt;/code&gt; query returned no results.

&lt;h3&gt;I didn&#039;t touch anything, it just broke!&lt;/h3&gt;

&lt;p&gt;
I was completely sure that the &lt;code&gt;MX&lt;/code&gt; setup used to work just
fine. And I had not done any kind of DNS changes at all for at least
a year. Any computer setup will rot eventually, but it shouldn&#039;t be
happening quite this fast.

&lt;p&gt;
Wait... Was that date November 12th? That&#039;s when I bought a new SSL
certificate through my registrar, who is also doing my DNS
hosting. Hmm... And I even chose to use DNS-based domain ownership
validation rather than the &#039;email a confirmation code to
hostmaster@example.com&#039; method, and allowed my registrar to
automatically create and delete the temporary authentication record.

&lt;p&gt;
Ok, so technically I did make a change, even if it was just to
authorize another system to make an automated DNS configuration change
on my behalf. But clearly my registrar must have screwed up these
automated config changes, and completely deleted the &lt;code&gt;MX&lt;/code&gt;
record along the way!

&lt;h3&gt;That config looks valid, just apply it!&lt;/h3&gt;

&lt;p&gt;
Well kind of, but not really.

&lt;p&gt;
When I logged into the DNS management UI yesterday, it turned out that
the &lt;code&gt;MX&lt;/code&gt; record was still there, but it was marked with a
validation error complaining about a conflict with the
&lt;code&gt;CNAME&lt;/code&gt;. When I did the original configuration, I&#039;d set up
the relevant host with both
&lt;code&gt;MX&lt;/code&gt; and &lt;code&gt;CNAME&lt;/code&gt; records. That is apparently
not best practice and can cause problems with some mail servers. And
who am I to argue with that, even if it had seemingly worked for
the past year.

&lt;p&gt;
I changed the &lt;code&gt;CNAME&lt;/code&gt; to an appropriate &lt;code&gt;A&lt;/code&gt;
record, the validation error was cleared, and as if by magic the
&lt;code&gt;MX&lt;/code&gt; queries were now working. 5 minutes later the outgoing mail queue was
draining rapidly.
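
&lt;p&gt;
For reference, the problematic shape of the zone was roughly the
following (the names and the address are made up; the underlying rule is
that a &lt;code&gt;CNAME&lt;/code&gt; is not supposed to coexist with any other
record type at the same name):

&lt;pre&gt;
; Before: MX and CNAME on the same name (now flagged as invalid)
mail.example.com.   IN CNAME   server.example.com.
mail.example.com.   IN MX 10   server.example.com.

; After: the CNAME becomes an A record, and the conflict goes away
mail.example.com.   IN A       192.0.2.10
mail.example.com.   IN MX 10   server.example.com.
&lt;/pre&gt;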

&lt;p&gt;
So clearly my service provider had added some helpful functionality to
prevent bogus DNS setups from being exported. The old zone files would
still be there working properly during the transition period, but the
next time the user would make changes they&#039;d need to fix the setup
before it&#039;d be exported.

&lt;p&gt;
That&#039;s a reasonable design, and I&#039;m sure it would work marvelously
most of the time. But in this case the zone file export was actually
triggered by an automated process, so there was no way for me to
notice that the configuration was now considered erroneous, or to fix
it. The DNS host was serving a mostly functional zone; it was just
missing one record, and even that one record only mattered for a
fraction of my outgoing mail, making the problem even harder to
spot. (Just a fraction of my mail, since it looks like most mail
servers either don&#039;t validate the sender domain, or validate it with
some other kind of query).

&lt;p&gt;
There&#039;s a bit of guesswork involved in the last couple of paragraphs.
The error can no longer be replicated, since the management UI will no
longer allow me to create a setup analogous to the original. So it&#039;s
hard to be completely certain of the mechanisms on that management
UI&#039;s side of the story. But I&#039;m still fairly confident that this is at
least a pretty close approximation of what happened. The timings of
me buying the certificate and the start of a spike in DNS-related mail
delivery errors match up way too well for any other explanation to be
credible.

&lt;h3&gt;Of &lt;i&gt;course&lt;/i&gt; frobnicating the wibblerizer could break the dingit! Everyone knows that...&lt;/h3&gt;

&lt;p&gt;
There&#039;s all kinds of morals one could draw from this story. Proper
monitoring would have detected this immediately, the registrar should
have accounted for this corner case, I should maybe not have the
default assumption that the other party is at fault when something
breaks, you should always check the results of any kind of automated
config change when it&#039;s done for the first time, and probably many
other excellent lessons in either life or systems engineering.

&lt;p&gt;
But really I&#039;m just telling this story because I find the endpoints in
the chain of causality completely hilarious. In a sensible world my
action A really should not have led to the final result B, but it
did. It&#039;s unfortunate that the title of this blog post ended up
looking like linkbait of the worst kind, when it&#039;s actually the best
10 word summary I could think of :-)
</description><author>jsnell@iki.fi</author><category>GENERAL</category><category>NETWORKING</category><pubDate>Fri, 05 Dec 2014 13:30:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2014-12-05-how-buying-a-ssl-certificate-broke-my-email-setup/</guid></item></channel></rss>