<rss version='2.0'><channel><title>Juho Snellman's Weblog</title><link>https://www.snellman.net/blog/</link><description>Lisp, Perl Golf</description><item><title>Why PS4 downloads are so slow</title><link>https://www.snellman.net/blog/archive/2017-08-19-slow-ps4-downloads/</link><description>
  &lt;p&gt;
    Game downloads on PS4 have a reputation of being very slow, with many people
    reporting downloads being an order of magnitude faster on Steam or
    Xbox. This had long been on my list of things to look into, but at
    a pretty low priority.  After all, the PS4 operating system is
    based on a reasonably modern FreeBSD (9.0), so there should not be
    any crippling issues in the TCP stack. The implication is that the
    problem is something boring, like an inadequately dimensioned CDN.
  &lt;/p&gt;

  &lt;p&gt;
    But then I heard that people were successfully using local HTTP
    proxies as a workaround. It should be pretty rare for that to
    actually help with download speeds, which made this sound like a
    much more interesting problem.
  &lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

  &lt;p&gt;
    This is going to be a long-winded technical post.  If you&#039;re not
    interested in the details of the investigation but just want a
    recommendation on speeding up PS4 downloads, skip straight to the
    &lt;a href=&#039;#conclusions&#039;&gt;conclusions&lt;/a&gt;.
  &lt;/p&gt;

  &lt;h3&gt;Background&lt;/h3&gt;

  &lt;p&gt;
    Before running any experiments, it&#039;s good to have a mental model
    of how the thing we&#039;re testing works, and where the problems might
    be. If nothing else, it will guide the initial experiment design.
  &lt;/p&gt;

  &lt;p&gt;
    The speed of a steady-state TCP connection is basically defined by
    three numbers: the amount of data the client is willing to receive on
    a single round-trip (TCP receive window), the amount of data the
    server is willing to send on a single round-trip (TCP congestion
    window), and the round-trip latency between the client and the server (RTT).
    To a first approximation, the connection speed will be:
  &lt;/p&gt;

  &lt;pre&gt;
    speed = min(rwin, cwin) / RTT
&lt;/pre&gt;
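
  &lt;p&gt;
    To get a feel for the magnitudes involved, here&#039;s the formula as a
    quick Python calculation. The numbers are made up purely for
    illustration:
  &lt;/p&gt;

&lt;pre&gt;
# Hypothetical numbers, purely for illustration.
rwin = 128 * 1024      # bytes the client will accept per round trip
cwin = 1024 * 1024     # bytes the server is willing to send per round trip
rtt  = 0.030           # round-trip time in seconds (30 ms)

speed = min(rwin, cwin) / rtt      # bytes per second
print(speed * 8 / 1e6)             # =&amp;gt; roughly 35 Mbps
&lt;/pre&gt;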

  &lt;p&gt;
    With this model, how could a proxy speed up the connection?  Well,
    with a proxy the original connection will be split into two mostly
    independent parts; one connection between the client and the
    proxy, and another between the proxy and the server. The speed of
    the end-to-end connection will be determined by the slower of
    those two independent connections:
  &lt;/p&gt;

  &lt;pre&gt;
    speed_proxy_client = min(client rwin, proxy cwin) / client-proxy RTT
    speed_server_proxy = min(proxy rwin, server cwin) / proxy-server RTT
    speed = min(speed_proxy_client, speed_server_proxy)
&lt;/pre&gt;

  &lt;p&gt;
    With a local proxy the client-proxy RTT will be very low; that
    connection is almost guaranteed to be the faster one. The
    improvement will have to be from the server-proxy connection being
    somehow better than the direct client-server one. The RTT will not
    change, so there are just two options: either the client has a
    much smaller receive window than the proxy, or the client is
    somehow causing the server&#039;s congestion window to
    decrease. (E.g. the client is randomly dropping received packets,
    while the proxy isn&#039;t).
  &lt;/p&gt;
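
  &lt;p&gt;
    To put numbers on that: with the client&#039;s receive window as the
    bottleneck and a much larger window on the proxy, the split
    connection comes out well ahead even though the server-side RTT is
    unchanged. All of the numbers below are again invented for
    illustration:
  &lt;/p&gt;

&lt;pre&gt;
client_rwin = 128 * 1024
proxy_rwin = proxy_cwin = server_cwin = 1024 * 1024
server_rtt, local_rtt = 0.030, 0.001    # proxy-server RTT vs client-proxy RTT

direct    = min(client_rwin, server_cwin) / server_rtt
via_proxy = min(min(client_rwin, proxy_cwin) / local_rtt,    # client-proxy leg
                min(proxy_rwin, server_cwin) / server_rtt)   # proxy-server leg
print(direct * 8 / 1e6, via_proxy * 8 / 1e6)    # =&amp;gt; ~35 Mbps direct, ~280 Mbps via proxy
&lt;/pre&gt;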

  &lt;p&gt;
    Out of these two theories, the receive window one should be much
    more likely, so we should concentrate on it first. But that just
    replaces our original question with a new one: why would the
    client&#039;s receive window be so low that it becomes a noticeable
    bottleneck? There&#039;s a fairly limited number of causes for low
    receive windows that I&#039;ve seen in the wild, and they don&#039;t really
    seem to fit here.
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; Maybe the client doesn&#039;t support the TCP window scaling option,
      while the proxy does. Without window scaling, the receive window
      will be limited to 64kB. But since we know Sony started with a
      TCP stack that supports window scaling, they would have had to
      go out of their way to disable it. Slow downloads, for no benefit.
    &lt;li&gt; Maybe the actual downloader application is very slow. The operating
      system is supposed to have a certain amount of buffer space available
      for each connection. If the network is delivering data to the OS
      faster than the application is reading it, the buffer will start to
      fill up, and the OS will reduce the receive window as a form
      of back-pressure. But this can&#039;t be the reason; if the application
      is the bottleneck, it&#039;ll be a bottleneck with or without the
      proxy.
    &lt;li&gt; The operating system is trying to dynamically scale the
      receive window to match the actual network conditions, but
      something is going wrong. This would be interesting, so it&#039;s
      what we&#039;re hoping to find.
  &lt;/ul&gt;

  &lt;p&gt;
    The initial theories are in place; let&#039;s get digging.
  &lt;/p&gt;

  &lt;h3&gt;Experiment #1&lt;/h3&gt;

  &lt;p&gt;
    For our first experiment, we&#039;ll start a PSN download on a baseline
    non-Slim PS4, firmware 4.73. The network connection of the PS4 is
    bridged through a Linux machine, where we can add latency to the
    network using &lt;code&gt;tc netem&lt;/code&gt;. By varying the added latency,
    we should be able to find out two things: whether the receive
    window really is the bottleneck, and whether the receive window
    is being automatically scaled by the operating system.
  &lt;/p&gt;

  &lt;p&gt;
    This is what the client-server RTTs (measured from a packet
    capture using TCP timestamps) look like for the experimental
    period. Each dot represents 10 seconds of time for a single
    connection, with the Y axis showing the minimum RTT seen for that
    connection in those 10 seconds.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-rtt-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-rtt-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    The next graph shows the amount of data sent by the server in one
    round trip in red, and the receive windows advertised by the
    client in blue.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-win-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-win-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    First, since the blue dots are staying constantly at about 128kB,
    the operating system doesn&#039;t appear to be doing any kind of
    receive window scaling based on the RTT. (So much for that
    theory). Though at the very right end of the graph the receive
    window shoots up to 650kB, so it isn&#039;t totally
    fixed either.
  &lt;/p&gt;

  &lt;p&gt;
    Second, is the receive window the bottleneck here? If so, the
    blue dots would be close to the red dots. This is the case
    until about 10:50. And then mysteriously the bottleneck moves to
    the server.
  &lt;/p&gt;

  &lt;p&gt;
    So we didn&#039;t find quite what we were looking for, but there are a
    couple of very interesting things that are correlated with events
    on the PS4.
  &lt;/p&gt;

  &lt;p&gt;The download was in the foreground for the whole duration of the
    test. But that doesn&#039;t mean it was the only thing running on the
    machine. The Netflix app was still running in the background,
    completely idle &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;]. When the background app was
    closed at 11:00, the receive window increased dramatically. This
    suggests a second experiment, where different applications are
    opened / closed / left running in the background.
  &lt;/p&gt;

  &lt;p&gt;
    The time where the receive window stops being the bottleneck is
    very close to the PS4 entering rest mode. That looks like another
    thing worth investigating. Unfortunately it turns out not to be;
    rest mode is a red herring here. &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]
  &lt;/p&gt;

  &lt;h3&gt;Experiment #2&lt;/h3&gt;

  &lt;p&gt;
    Below is a graph of the receive windows for a second download,
    annotated with the timing of various noteworthy events.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl2-rwin-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl2-rwin-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    The differences in receive windows at different times are
    striking. And more important, the changes in the receive
    windows correspond very well to specific things I did on
    the PS4.
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; When the download was started, the game Styx: Shards of
      Darkness was running in the background (just idling in the title
      screen). The download was limited by a receive window of under
      7kB. This is an incredibly low value; it&#039;s basically going to
      cause the downloads to take &lt;b&gt;100 times longer than they should&lt;/b&gt;.
      And this was not a coincidence: whenever that game
      was running, the receive window would be that low.
    &lt;li&gt; Having an app running (e.g. Netflix, Spotify) limited the
      receive window to 128kB, for about a 5x reduction in potential
      download speed.
    &lt;li&gt; Moving apps, games, or the download window to the foreground
      or background didn&#039;t have any effect on the receive window.
    &lt;li&gt; Launching some other games (Horizon: Zero Dawn, Uncharted 4,
      Dreadnought) seemed to have the same effect as running an app.
    &lt;li&gt; Playing an online match in a networked game (Dreadnought) caused the
      receive window to be artificially limited to 7kB.
    &lt;li&gt; Playing around in a non-networked game (Horizon: Zero Dawn)
      had a very inconsistent effect on the receive window, with the
      effect seemingly depending on the intensity of gameplay. This
      looks like a genuine resource restriction (download process
      getting variable amounts of CPU), rather than an artificial
      limit.
    &lt;li&gt; I ran a speedtest at a time when downloads were limited to
      a 7kB receive window. It got a decent receive window of over
      400kB; the conclusion is that the artificial receive window
      limit appears to only apply to PSN downloads.
    &lt;li&gt; Putting the PS4 into rest mode had no effect.
    &lt;li&gt; Built-in features of the PS4 UI, like the web browser,
      do not count as apps.
    &lt;li&gt; When a game was started (causing the previously running game
      to be stopped automatically), the receive window could increase
      to 650kB for a very brief period of time. Basically it appears
      that the receive window gets unclamped when the old game stops,
      and then clamped again a few seconds later when the new game
      actually starts up.
  &lt;/ul&gt;

  &lt;p&gt;
    I did a few more test runs, and all of them seemed
    to support the above findings. The only additional information
    from that testing is that the rest mode behavior was dependent
    on the PS4 settings. Originally I had it set up to suspend apps
    when in rest mode. If that setting was disabled, the apps would
      be closed when entering rest mode, and the downloads would
    proceed at full speed.
  &lt;/p&gt;

  &lt;p&gt;A 7kB receive window will be absolutely crippling for any user.
     A 128kB window might be ok for users who have CDN servers very
     close by, or who don&#039;t have a particularly fast internet connection. For
     example at my location, a 128kB receive window would cap the downloads at about
     35Mbps to 75Mbps depending on which CDN the DNS RNG happens to give me.
     The lowest two speed tiers for my ISP are 50Mbps and 200Mbps.
     So either the 128kB would not be a noticeable problem (50Mbps), or it&#039;d mean
     that downloads are artificially limited to 25% speed (200Mbps).
  &lt;/p&gt;
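
  &lt;p&gt;
    To make the arithmetic explicit, here&#039;s the same calculation again,
    with RTTs picked to roughly reproduce that 35Mbps to 75Mbps range
    (the exact RTTs to the CDNs will obviously vary):
  &lt;/p&gt;

&lt;pre&gt;
rwin = 128 * 1024                   # bytes
for rtt in (0.014, 0.030):          # assumed RTTs to a nearby and a more distant CDN
    print(rwin * 8 / rtt / 1e6)     # =&amp;gt; about 75 Mbps and 35 Mbps
&lt;/pre&gt;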

  &lt;a name=&#039;conclusions&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;Conclusions&lt;/h3&gt;

  &lt;p&gt;If any applications are running, the PS4 appears to change the
    settings for PSN store downloads, artificially restricting their
    speed. Closing the other applications will remove the limit. There
    are a few important details:
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; Just leaving the other applications running in the background will
      &lt;b&gt;not help&lt;/b&gt;. The exact same limit is applied whether the download
      progress bar is in the foreground or not.
    &lt;li&gt; Putting the PS4 into rest mode might or might not help,
      depending on your system settings.
    &lt;li&gt;The artificial limit applies only to the PSN store downloads.
      It does &lt;b&gt;not&lt;/b&gt; affect e.g. the built-in speedtest. This
      is why the speedtest might report much higher speeds than the
      actual downloads, even though both are delivered from the same
      CDN servers.
    &lt;li&gt; Not all applications are equal; most of them will cause the
      connections to slow down by up to a factor of 5. Some
      games will cause a difference of about a factor of 100. Some
      games will start off with the factor of 5, and then migrate to
      the factor of 100 once you leave the start menu and start playing.
    &lt;li&gt; The above limits are artificial. In addition to that,
      actively playing a game can cause game downloads to slow down.
      This appears to be due to a genuine lack of CPU resources (with
      the game understandably having top priority).
  &lt;/ul&gt;

  &lt;p&gt;
    So if you&#039;re seeing slow downloads, just closing all the running
    applications might be worth a shot. (But it&#039;s obviously not
    guaranteed to help. There are other causes for slow downloads as
    well, this will just remove one potential bottleneck).
    To close the running applications, you&#039;ll need to
    long-press the PS button on the controller, and then select &amp;quot;Close
    applications&amp;quot; from the menu.
  &lt;/p&gt;

  &lt;p&gt;
    The PS4 doesn&#039;t make it very obvious exactly what programs are
    running. For games, the interaction model is that opening a new
    game closes the previously running one. This is not how other apps
    work; they remain in the background indefinitely until you
    explicitly close them.
  &lt;/p&gt;

  &lt;p&gt;
    And it gets worse than that. If your PS4 is configured to
    suspend any running apps when put to rest mode, you can seemingly
    power on the machine into a clean state, and still have a hidden
    background app that&#039;s causing the OS to limit your PSN download
    speeds.
  &lt;/p&gt;

  &lt;p&gt;
    This might explain some of the superstitions about this on the
    Internet. There are people who swear that putting the machine to
    rest mode helps with speeds, others who say it does nothing. Or
    how after every firmware update people will report increased
    download speeds. Odds are that nothing actually changed in the
    firmware; it&#039;s just that those people had done their first full
    reboot in a while, and finally had a system without a background
    app running.
  &lt;/p&gt;

  &lt;h3&gt;Speculation&lt;/h3&gt;

  &lt;p&gt;
    Those were the facts as I see them. Unfortunately this raises some
    new questions, which can&#039;t be answered experimentally. With no
    facts, there&#039;s no option except to speculate wildly!
  &lt;/p&gt;

  &lt;p&gt;&lt;b&gt;Q: Is this an intentional feature? If so, what is its purpose?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    Yes, it must be intentional. The receive window changes very
    rapidly when applications or games are opened/closed, but not for
    any other reason. It&#039;s not any kind of subtle operating system
    level behavior; it&#039;s most likely the PS4 UI explicitly
    manipulating the socket receive buffers.
  &lt;/p&gt;

  &lt;p&gt;
    But why? I think the idea here must be to not allow the network
    traffic of background downloads to take resources away from the
    foreground use of the PS4. For example if I&#039;m playing an online
    shooter, it makes sense to harshly limit the background download
    speeds to make sure the game is getting ping times that are
    both low and predictable. So there&#039;s at least some point in that
    7kB receive window limit in some circumstances.
  &lt;/p&gt;

  &lt;p&gt;
    It&#039;s harder to see the point of the 128kB receive window
    limit that applies whenever any app is running. A single game download from some
    random CDN isn&#039;t going to muscle out Netflix or Youtube... The
    only thing I can think of is that they&#039;re afraid that multiple
    simultaneous downloads, e.g. due to automatic updates, might cause
    problems for playing video. But even that seems like a stretch.
  &lt;/p&gt;

  &lt;p&gt;
    There&#039;s an alternate theory that this is due to some non-network
    resource constraints (e.g. CPU, memory, disk). I don&#039;t think that
    works. If the CPU or disk were the constraint, just having the
    appropriate priorities in place would automatically take care of
    this. If the download process gets starved of CPU or disk
    bandwidth due to a low priority, the receive buffer would fill up
    and the receive window would scale down dynamically, exactly when
    needed. And the amounts of RAM we&#039;re talking about here are
    minuscule on a machine with 8GB of RAM; less than a megabyte.
  &lt;/p&gt;
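
  &lt;p&gt;
    To spell out the dynamic behavior being described: the advertised
    receive window is roughly whatever buffer space is left after the
    application&#039;s unread data, so a starved reader throttles itself
    automatically. A simplified model, not actual PS4 or FreeBSD code:
  &lt;/p&gt;

&lt;pre&gt;
SOCKET_BUFFER = 650 * 1024

def advertised_window(unread_bytes):
    # Whatever the application has not read yet is still sitting in the
    # socket buffer; only the remaining space can be advertised.
    return max(0, SOCKET_BUFFER - unread_bytes)

print(advertised_window(0))            # fast reader:    full 650kB window
print(advertised_window(640 * 1024))   # starved reader: about 10kB left
&lt;/pre&gt;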

  &lt;p&gt;&lt;b&gt;Q: Is this feature implemented well?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    Oh dear God, no. It&#039;s hard to believe just how sloppy this
    implementation is.
  &lt;/p&gt;

  &lt;p&gt;
    The biggest problem is that the limits get applied based just on
    what games/applications are currently running.  That&#039;s just
    insane; what matters should be which games/applications someone is
    currently using. Especially in a console UI, it&#039;s a totally
    reasonable expectation that the foreground application gets
    priority. If I&#039;ve got the download progress bar in the foreground,
    the system had damn well better give that download priority. Not some
    application that was started a month ago, and hasn&#039;t been used
    since. Applying these limits in rest mode with suspended
    apps is beyond insane.
  &lt;/p&gt;

  &lt;p&gt;
    Second, these limits get applied per-connection.  So if you&#039;ve got
    a single download going, it&#039;ll get limited to 128kB of receive
    window. If you&#039;ve got five downloads, they&#039;ll all get 128kB, for a
    total of 640kB. That means the efficiency of the &amp;quot;make sure downloads
    don&#039;t clog the network&amp;quot; policy depends purely on how many downloads
    are active. That&#039;s rubbish. This is all controlled on the
    application level, and the application knows how many downloads
    are active. If there really were an optimal static receive window
    X, it should just be split evenly across all the downloads.
  &lt;/p&gt;
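
  &lt;p&gt;
    If there really were such an optimal aggregate window, splitting it
    would be a one-liner in the downloader. A purely hypothetical sketch:
  &lt;/p&gt;

&lt;pre&gt;
OPTIMAL_TOTAL_WINDOW = 128 * 1024    # hypothetical optimal aggregate budget

def per_download_window(active_downloads):
    # Divide the fixed budget across however many transfers are running,
    # instead of handing the full budget to every connection separately.
    return OPTIMAL_TOTAL_WINDOW // max(1, active_downloads)
&lt;/pre&gt;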

  &lt;p&gt;
    Third, the core idea of applying a static receive window as a
    means of fighting bufferbloat is just fundamentally broken.
    Using the receive window as the rate limiting mechanism just
    means that the actual transfer rate will depend on the RTT
    (this is why a local proxy helps). For this kind of thing to
    work well, you can&#039;t have the rate limit depend on the
    RTT. You also can&#039;t just have somebody come up with a number
    once, and apply that limit to everyone. The limit needs to
    depend on the actual network conditions.
  &lt;/p&gt;

   &lt;p&gt;
    There are ways to detect how congested the downlink is in the
    client-side TCP stack. The proper fix would be to implement them,
    and adjust the receive window of low-priority background downloads
    if and only if congestion becomes an issue. That would actually be
    a pretty valuable feature for this kind of appliance. But I can
    kind of forgive this one; it&#039;s not an off the shelf feature, and
    maybe Sony doesn&#039;t employ any TCP kernel hackers.
  &lt;/p&gt;
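
  &lt;p&gt;
    As a rough illustration of the idea, such a policy could use queuing
    delay as the congestion signal: clamp the background download&#039;s
    receive window only once the measured RTT has grown well past the
    connection&#039;s minimum. This is purely my own sketch, not anything the
    PS4 or any real TCP stack actually implements:
  &lt;/p&gt;

&lt;pre&gt;
UNCLAMPED = 650 * 1024    # window for background downloads when the link is idle
CLAMPED   = 16 * 1024     # window once the downlink looks congested

def background_download_window(min_rtt, current_rtt, threshold=1.5):
    # A large RTT increase over the minimum means queues are building up
    # somewhere on the path; only then throttle the background download.
    if current_rtt &amp;gt; min_rtt * threshold:
        return CLAMPED
    return UNCLAMPED
&lt;/pre&gt;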

  &lt;p&gt;
    Fourth, whatever method is being used to decide on whether a game
    is network-latency sensitive is broken. It&#039;s absurd that a demo of
    a single-player game idling in the initial title screen would
    cause the download speeds to be totally crippled. This really
    should be limited to actual multiplayer titles, and ideally just
    to periods where someone is actually playing the game online.
    Just having the game running should not be enough.
  &lt;/p&gt;

  &lt;p&gt;&lt;b&gt;Q: How can this still be a problem, 4 years after launch?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    I have no idea. Sony must know that the PSN download speeds have
    been the butt of jokes for years. It&#039;s probably the biggest
    complaint people have with the system. So it&#039;s hard to believe
    that nobody was ever given the task of figuring out why it&#039;s
    slow. And this is not rocket science; anyone bothering to look
    into it would find these problems in a day.&lt;/p&gt;

  &lt;p&gt;
    But it seems equally impossible that they know of the cause, but
    decided not to apply any of the trivial fixes to it. (Hell, it
    wouldn&#039;t even need to be a proper technical fix. It could just be
    a piece of text saying that downloads will work faster with all
    other apps closed).
  &lt;/p&gt;

  &lt;p&gt;
    So while it&#039;s possible to speculate in an informed manner about
    other things, this particular question will remain an open
    mystery.  Big companies don&#039;t always get things done very
    efficiently, eh?
  &lt;/p&gt;

  &lt;h3&gt;Footnotes&lt;/h3&gt;

  &lt;div class=footnotes&gt;

    &lt;p&gt; &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;]
      How idle? So idle that I hadn&#039;t even logged in; the app
        was still at the login screen.
    &lt;/p&gt;

    &lt;p&gt;
      &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] To be specific, the
        slowdown is caused by the artificial latency changes. The PS4
        downloads files in chunks, and each chunk can be served from a
        different CDN. The CDN that was being used from 10:51 to 11:00
        was using a delay-based congestion control algorithm, and
        reacting to the extra latency by reducing the amount of data
        sent. The CDN used earlier in the connection was using a
        packet-loss based congestion control algorithm, and did not
        slow down despite seeing the latency change in exactly the same
        pattern.
    &lt;/p&gt;
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><category>GAMES</category><pubDate>Sat, 19 Aug 2017 19:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2017-08-19-slow-ps4-downloads/</guid></item><item><title>The mystery of the hanging S3 downloads</title><link>https://www.snellman.net/blog/archive/2017-07-20-s3-mystery/</link><description>
  &lt;p&gt;
    A coworker was experiencing a strange problem with their Internet
    connection at home. Large downloads from most sites worked
    fine. The exception was that downloads from Amazon S3 would get
    up to a good speed (500Mbps), stall completely for a few seconds,
    restart for a while, stall again, and eventually hang
    completely. The problem seemed to be specific to S3;
    downloads from generic AWS VMs were ok.
  &lt;/p&gt;

  &lt;p&gt;
    What could be going on? It shouldn&#039;t be a problem with
    the ISP, or anything south of that: after all, connections to other
    sites were working. It should not be a problem between the ISP
    and Amazon, or there would have been problems with AWS too.
    But it also seems very unlikely that S3 would
    have a trivially reproducible problem causing large downloads to hang.
    It&#039;s not like this is some minor use case of the service.
  &lt;/p&gt;

  &lt;p&gt;
    If it had been a problem with e.g. viewing Netflix, one might
    suspect some kind of targeted traffic shaping. But an ISP
    throttling or forcibly closing connections to S3 but not to AWS in
    general? That&#039;s just silly talk.
  &lt;/p&gt;

  &lt;p&gt;
    The normal troubleshooting tips like reducing the MTU didn&#039;t help
    either. This sounded like a fascinating networking whodunit,
    so I couldn&#039;t resist butting in after hearing about it through
    the grapevine.
  &lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

  &lt;h3&gt;The packet captures&lt;/h3&gt;

  &lt;p&gt;
    The first step of debugging pretty much any networking problem is getting
    a packet capture from as many points in the network as possible. In this
    case we only had one capture point: the client machine. The problem
    could not be reproduced on anything but S3, and obviously taking a capture
    from S3 was not an option. Nor did we have access to any devices elsewhere
    on the traffic path. &lt;a id=&#039;fnref0&#039;&gt;[&lt;a href=&#039;#fn0&#039;&gt;0&lt;/a&gt;]
  &lt;/p&gt;

  &lt;p&gt;
    A superficial check of the ACK stream showed the following pattern.
    The traffic would be humming along nicely; from the sequence numbers
    we can see that about 57MB have already been downloaded in the first
    2.5 seconds.
  &lt;/p&gt;

&lt;pre&gt;00:00:02.543596 client &amp;gt; server: Flags [.], ack &lt;b&gt;57657817&lt;/b&gt;
00:00:02.543623 client &amp;gt; server: Flags [.], ack 57661318
00:00:02.543682 client &amp;gt; server: Flags [.], ack 57667046
&lt;/pre&gt;

  &lt;p&gt;Then, a single packet loss occurs. We can tell from the SACK block that 1432 bytes of payload are missing (the gap between the cumulative ACK of 57667046 and the left edge of the SACK block at 57668478). That&#039;s almost certainly a single packet.&lt;/p&gt;
&lt;pre&gt;&lt;b&gt;00:00:02.543734&lt;/b&gt; client &amp;gt; server: Flags [.], ack &lt;b&gt;57667046&lt;/b&gt;,
    options [sack 1 {&lt;b&gt;57668478&lt;/b&gt;:57669910}]
&lt;/pre&gt;

  &lt;p&gt;After the single packet loss, more data continues to be delivered
    with no problems. In the next 100ms a further 6MB gets delivered. But the
    missing data never arrives.&lt;/p&gt;
&lt;pre&gt;...
00:00:02.648316 client &amp;gt; server: Flags [.], ack 57667046,
    options [sack 1 {57668478:63829515}]
&lt;b&gt;00:00:02.648371&lt;/b&gt; client &amp;gt; server: Flags [.], ack 57667046,
    options [sack 1 {57668478:&lt;b&gt;63830947&lt;/b&gt;}]
&lt;/pre&gt;

  &lt;p&gt;In fact, no further ACKs are sent for 4 seconds. And even then it&#039;s not done
    by one 1432 byte packet like we expected, but by two 512 byte packets and one
    408 byte one. There&#039;s also an RTT-sized delay between the first and second
    packets.
  &lt;/p&gt;

  &lt;pre&gt;00:00:&lt;b&gt;06.751691&lt;/b&gt; client &amp;gt; server: Flags [.], ack &lt;b&gt;57667558&lt;/b&gt;,
    options [sack 1 {57668478:63830947}]
00:00:&lt;b&gt;06.792592&lt;/b&gt; client &amp;gt; server: Flags [.], ack &lt;b&gt;57668070&lt;/b&gt;,
    options [sack 1 {57668478:63830947}]
00:00:06.796277 client &amp;gt; server: Flags [.], ack &lt;b&gt;63830947&lt;/b&gt;
&lt;/pre&gt;

  &lt;p&gt;
    After that, the connection continues merrily along, but the exact same thing
    happens 3 seconds later.
  &lt;/p&gt;

  &lt;p&gt;What can we tell from this? Clearly the actual server would be
    retransmitting the lost packet much more quickly than with a 4 second
    delay. It also would not be re-packetizing the 1432 byte packet into three
    pieces. Instead what must be happening is that each retransmitted copy
    is getting lost. After a few seconds RFC 4821-style path MTU probing kicks in,
    and a smaller packet gets retransmitted. For some reason this retransmission
    makes it through; this makes the sender believe that the path MTU has been
    reduced, and it starts sending smaller packets.&lt;/p&gt;

  &lt;p&gt;Again this suggests there&#039;s something dodgy going on with MTUs, but as
    mentioned in the beginning, reducing the MTU did not help.&lt;/p&gt;

  &lt;p&gt;But it also suggests a mechanism for why the connection eventually hangs
    completely, rather than alternating between stalling and recovering.
    There&#039;s a limit to how far
    the MSS can be reduced. If nothing else, the segments will need to
    have at least one byte of payload. In practice most operating systems have
    a much higher limit on the MSS (something in the 80-160 byte range is
    typical). If even packets of the minimum size aren&#039;t making it through,
    the server can&#039;t react by sending smaller packets.&lt;/p&gt;

  &lt;p&gt;With the information from the ACK stream exhausted, it&#039;s time
    to look at the packets in both directions. And what do you know?
    We actually see the earlier retransmissions at the client, with
    beautiful exponential backoff.
    The packets were not lost in the network, but were silently rejected by
    the client for some reason.&lt;/p&gt;

 &lt;pre&gt;00:00:02.685557 server &amp;gt; client: Flags [.], seq 57667046:&lt;b&gt;57668478&lt;/b&gt;, ack 4257, length 1432
00:00:02.960249 server &amp;gt; client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:03.500500 server &amp;gt; client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:04.580168 server &amp;gt; client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:06.751657 server &amp;gt; client: Flags [.], seq 57667046:&lt;b&gt;57667558&lt;/b&gt;, ack 4257, length 512
00:00:06.751691 client &amp;gt; server: Flags [.], ack 57667558, win 65528,
    options [sack 1 {57668478:63830947}]
00:00:06.792565 server &amp;gt; client: Flags [.], seq &lt;b&gt;57667558:57668070&lt;/b&gt;, ack 4257, length 512
00:00:06.792567 server &amp;gt; client: Flags [.], seq &lt;b&gt;57668070:57668478&lt;/b&gt;, ack 4257, length 408
00:00:06.792592 client &amp;gt; server: Flags [.], ack 57668070,
    options [sack 1 {57668478:63830947}]
&lt;/pre&gt;

  &lt;p&gt;There are really just two reasons this would happen. The IP or
    TCP checksum could be wrong. But how could it be wrong for the
    same packet six times in a row? That&#039;s crazy talk; the expected packet
    corruption rate is more like one in a million. Alternatively
    the packet is too large. But damn it, we know that&#039;s not the
    problem, no matter how well this case is matching the common pattern.
    Let&#039;s just have a look at the checksums, to rule it out...&lt;/p&gt;

&lt;pre&gt;server &amp;gt; client: Flags [.], cksum &lt;b&gt;0x0000&lt;/b&gt; (incorrect -&amp;gt; &lt;b&gt;0xd7a7&lt;/b&gt;), seq 57667046:57668478, ack 4257, length 1432
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
...
&lt;/pre&gt;

  &lt;p&gt;Oh... Every single copy of that packet had a checksum of 0 instead of the
    expected checksum of 0xd7a7. (Checksums of 0 are often not real errors,
    but just artifacts of checksum offload: the packets get captured by software
    before the checksum has been computed by the hardware.
    That&#039;s not the case here; these are packets we&#039;re receiving rather than
    transmitting.) And it gets crazier when we look at the next
    instance of the problem a few seconds later.&lt;/p&gt;

&lt;pre&gt;server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server &amp;gt; client: Flags [.], cksum 0x0000 (incorrect -&amp;gt; 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
...
&lt;/pre&gt;

  &lt;p&gt;It&#039;s the exact same problem, all the way down to the problem
    appearing specifically with a TCP checksum of 0xd7a7. Further
    analysis of the captures verified that this was a systematic
    problem and not a coincidence. &lt;b&gt;Packets with an expected checksum of
    0xd7a7 would always have the checksum replaced with
    0. Packets with any other expected checksum would work just fine.&lt;/b&gt;
    &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;].&lt;/p&gt;

  &lt;p&gt;This explains why the path MTU probing temporarily fixes the problem:
    the repacketized segments have different checksums, and make it through
    unharmed.&lt;/p&gt;
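
  &lt;p&gt;
    For reference, the TCP checksum is the ones&#039;-complement Internet
    checksum (RFC 1071) computed over the header, the payload and a
    pseudo-header. A minimal sketch of the algorithm, mostly to show that
    re-packetizing (different lengths, sequence numbers and payload
    boundaries) almost always produces a different checksum:
  &lt;/p&gt;

&lt;pre&gt;
def inet_checksum(data):
    # RFC 1071: ones&#039;-complement sum of 16-bit big-endian words, complemented.
    if len(data) % 2:
        data = data + b&#039;\x00&#039;             # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += data[i] * 256 + data[i + 1]
        carry, total = divmod(total, 0x10000)
        total += carry                    # fold any carry back into the low 16 bits
    return 0xFFFF - total

print(hex(inet_checksum(b&#039;some segment&#039;)))
print(hex(inet_checksum(b&#039;some segment, repacketized&#039;)))   # different bytes, different sum
&lt;/pre&gt;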

  &lt;h3&gt;TCP Timestamps&lt;/h3&gt;

  &lt;p&gt;So, a problem internal to S3 is causing this very specific kind
    of packet corruption then?&lt;/p&gt;

  &lt;p&gt;Not so fast! It turns out that most TCP implementations would
    work around this kind of corruption by accident. The reason for
    that is TCP Timestamps. And while you don&#039;t need to actually know
    much about TCP Timestamps to understand this story, I have been
    looking for an excuse to rant about them.
  &lt;/p&gt;

  &lt;p&gt;With TCP Timestamps, every TCP packet will contain a TCP option
    with two extra values. One of them is the sender&#039;s latest
    timestamp. The other is an echo of the latest timestamp the sender
    received from the other party. For example here the client is
    sending the timestamp 805, and the server is echoing it back:&lt;/p&gt;

  &lt;pre&gt;client &amp;gt; server: Flags [.], ack 89,
    options [TS val &lt;b&gt;805&lt;/b&gt; ecr 10087]
server &amp;gt; client: Flags [P.], seq 89:450, ack 569,
    options [TS val 10112 ecr &lt;b&gt;805&lt;/b&gt;]
&lt;/pre&gt;
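
  &lt;p&gt;
    This echoing is also what makes timestamps convenient for RTT
    measurement, including passive measurement from a packet capture:
    record when each timestamp value was first sent, and match it against
    the first packet that echoes it back. A rough sketch, assuming the
    relevant fields have already been parsed out of a capture:
  &lt;/p&gt;

&lt;pre&gt;
def rtt_samples(packets):
    # packets: (capture_time, direction, tsval, tsecr) tuples in capture order,
    # with direction being either &#039;client&#039; or &#039;server&#039;.
    sent_at = {}
    for t, direction, tsval, tsecr in packets:
        if direction == &#039;client&#039;:
            sent_at.setdefault(tsval, t)      # first time we saw this value sent
        elif tsecr in sent_at:
            yield t - sent_at.pop(tsecr)      # echoed back: one RTT sample

print(list(rtt_samples([(0.000, &#039;client&#039;, 805, 10087),
                        (0.042, &#039;server&#039;, 10112, 805)])))   # =&amp;gt; [0.042]
&lt;/pre&gt;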

  &lt;p&gt;
    TCP Timestamps were added to TCP very early on, for two
    reasons, neither of which was very compelling in retrospect.&lt;/p&gt;

  &lt;p&gt;Reason number one was PAWS, Protection Against Wrapped-Around
    Sequence-Numbers. The idea was that very fast connections might
    require huge TCP window sizes, and minor packet reordering/duplication
    might cause an old packet to be interpreted as a new packet, due to the
    32 bit sequence number having wrapped around. I don&#039;t think that
    world ever really arrived, and PAWS is irrelevant to practically
    all TCP use cases.&lt;/p&gt;

  &lt;p&gt;The other original reason for timestamps was to enable TCP
    senders to measure RTTs in the presence of packet loss. But this
    can also be done with TCP Selective ACKs, a feature that&#039;s much
    more useful in general (and thus was widely deployed a lot sooner,
    despite being standardized later).
  &lt;/p&gt;

  &lt;p&gt;In exchange for these dubious benefits, every TCP packet (both
    data segments and pure control packets) is bloated by 12 bytes.
    This is in contrast to something like selective ACKs, where most
    packets don&#039;t grow in size. You only pay for selective ACKs when
    packets are lost or reordered. I &lt;a href=&#039;https://www.snellman.net/blog/archive/2016-12-01-quic-tou/&#039;&gt;think that the debuggability
    of network protocols is important&lt;/a&gt;, but with TCP you get basically
    everything you need from other sources. TCP timestamps have a high
    fixed cost, but give very little additional power.
  &lt;/p&gt;
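
  &lt;p&gt;
    To put those 12 bytes into perspective (using the usual 20-byte IPv4
    and TCP headers):
  &lt;/p&gt;

&lt;pre&gt;
pure_ack     = 20 + 20      # IPv4 header + TCP header, no payload, no other options
full_segment = 1500         # an MTU-sized data packet
ts_option    = 12           # timestamp option plus padding

print(ts_option / pure_ack)       # =&amp;gt; 0.3: every pure ACK grows by 30%
print(ts_option / full_segment)   # =&amp;gt; 0.008: well under 1% for full-sized segments
&lt;/pre&gt;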

  &lt;p&gt;If TCP Timestamps suck so much, why does everyone use
    them? I don&#039;t know anyone else&#039;s reasons for sure. I ended up
    implementing them purely due to an interoperability issue with the
    FreeBSD TCP stack. Basically FreeBSD uses a small static receive
    window for connections without TCP timestamps, while with TCP
    timestamps on it&#039;d scale the receive window up as necessary.
    For connections with even a bit of latency, you needed
    TCP timestamps to avoid the receive window becoming a bottleneck.
    (This was &lt;a href=&#039;https://svnweb.freebsd.org/base?view=revision&amp;revision=316676&#039;&gt;fixed in FreeBSD a few months ago&lt;/a&gt;. Yay!).&lt;/p&gt;

  &lt;p&gt;Now, performance of FreeBSD clients isn&#039;t a big deal for me as long as
    the connections work. But you know who else uses a FreeBSD-derived
    TCP stack? Apple. And when it comes to mobile networks, performance
    of iOS devices is about as important as it gets. Anyone who cares about
    large transfers to iOS or OS X clients must use TCP Timestamps,
    no matter how distasteful they find the feature.&lt;/p&gt;

  &lt;p&gt;&lt;i&gt;&quot;But Juho, what does any of this have to do with S3?&quot;&lt;/i&gt;, you ask.
    Well, S3 is one of those rare services that disable
    timestamps. And that actually makes for a big difference
    in this case. With timestamps, each retransmitted copy of a packet would use a
    different timestamp value &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;].
    And when any part of the TCP header changes, odds are that the
      checksum changes as well. Even if some packets are lost due to
      having the magic checksum, at least the retransmissions will
      make it through promptly.
  &lt;/p&gt;

  &lt;p&gt;To check this theory, I asked for a test with TCP timestamps
    disabled on the client. And immediately large downloads from
    anywhere - even the ISP&#039;s own speedtest server - started hanging.
    Success!&lt;/p&gt;

  &lt;h3&gt;Conclusion&lt;/h3&gt;

  &lt;p&gt;With this information I suggested my coworker call his ISP, and
    report the problem.
    He was smarter than that, and ran one more test: switching the
    cable modem from router mode to bridging mode. Bam, the problem
    was gone. In retrospect this makes sense: in router mode the cable
    modem needs to update the checksums for each packet that passes
    through the device. In bridging mode there&#039;s no NAT, so no
    checksum update is needed.
  &lt;/p&gt;

  &lt;p&gt;And that&#039;s how a dodgy cable modem caused downloads to fail with
    one service, but one service only. I&#039;ve seen many kinds of packet
    corruption before, but never anything that was so absurdly specific.
  &lt;/p&gt;

  &lt;h3&gt;Footnotes&lt;/h3&gt;

  &lt;div class=footnotes&gt;
    &lt;p&gt;
      &lt;a id=&#039;fn0&#039;&gt;[&lt;a href=&#039;#fnref0&#039;&gt;0&lt;/a&gt;] There are techniques
  around for routing the traffic such that we would have had a
  measurement point. One would have been using something like a VPN or
  a SOCKS proxy. But that&#039;s such a fundamental change to the traffic
  pattern that it doesn&#039;t make for a very interesting test. Odds are
  that the problem would just go away when you do that. The other
  option would be to use a fully transparent generic TCP proxy on some
  server with a public IP, have the client connect to the TCP
  proxy and the proxy connect to the actual server. But setting that
        up is tedious; certainly not worth doing as a first step.
    &lt;/p&gt;

    &lt;p&gt;It&#039;s also pretty common to only have one trace point to start
      with.  For analysis I&#039;d do for actual work purposes, we pretty
      often have just a trace from somewhere in the middle of the
      path, but nothing from the client or the server. Getting traces
      from multiple points is so much trouble that we usually need
      to roughly pinpoint the problem first with single-point packet
      capture, and only then ask for more trace points.&lt;/p&gt;

    &lt;p&gt;
  &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] As far as I can tell 0xd7a7 has no interesting special
    properties. The bytes are not printable ASCII characters. 0xd7a7
    isn&#039;t a value with any special significance in another TCP header
    field either. There are ways to screw up TCP checksum computations, but
    I think they&#039;re mostly to do with the way 0x0 and 0xffff are both
    zero values in a one&#039;s complement system.

    &lt;p&gt;
  &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] Assuming a sensible timestamp resolution. Not the rather impractical 500ms tick that e.g. OpenBSD uses.
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Thu, 20 Jul 2017 16:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2017-07-20-s3-mystery/</guid></item><item><title>The hidden cost of QUIC and TOU</title><link>https://www.snellman.net/blog/archive/2016-12-01-quic-tou/</link><description>
&lt;p&gt;
  Application specific UDP-based protocols have always been around,
  but with traffic volumes that are largely rounding errors. Recently the
  idea of using UDP has become a lot more respectable.
  IETF has started the ball rolling on &lt;a
  href=&#039;https://tools.ietf.org/html/draft-tsvwg-quic-protocol-00&#039;&gt;standardizing
  QUIC&lt;/a&gt;, Google&#039;s UDP-based combination of TCP+TLS+HTTP/2. And
  Facebook published Linux kernel patches to add an encrypted UDP
  encapsulation of TCP, &lt;a
  href=&#039;https://tools.ietf.org/html/draft-herbert-transports-over-udp-00&#039;&gt;TOU (Transports
  over UDP)&lt;/a&gt;. On a very high level, the approaches are dramatically
  different.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
  QUIC is a totally new design that can really experiment on the
  protocol level, but requires implementations to start from
  scratch. Some of the new features are compelling (e.g. proper
  multiplexing of multiple data streams), a few I have my doubts
  on (e.g. forward error correction). TOU is a conservative evolution,
  and pretty much just includes one actual new feature. But it can
  fully leverage the host TCP stack on the server. The client would
  still require a user space TCP stack and user space TOU
  encapsulation.

&lt;p&gt; But despite the difference in designs, the goals are very
  similar. Both proposals attempt to speed up protocol evolution by
  decoupling the protocol from the client OS, and moving it to the
  application. (The companies that designed these protocols happen to
  control the servers and the client application program, but not
  really the client OS). They&#039;d also both add support for connection
  migration in a way that should be more deployable than multipath
  TCP. It&#039;s hard to argue against either of these ideas.

&lt;p&gt;
  And then there&#039;s the third big commonality. Both proposals encrypt
  and authenticate the layer 4 headers. This is the bit that I&#039;m
  uneasy about.

&lt;p&gt;
  The recent movement to get all traffic encrypted has of course been
  great for the Internet. But the use of encryption in these protocols
  is different than in TLS. In TLS, the goal was to ensure the privacy
  and integrity of the payload. It&#039;s almost axiomatic that third
  parties should not be able to read or modify the web page you&#039;re
  loading over HTTPS. QUIC and TOU go further. They encrypt the
  control information, not just the payload. This provides no meaningful
  privacy or security benefits.

&lt;p&gt;
  Instead the apparent goal is to break the back of &lt;a href=&#039;https://en.wikipedia.org/wiki/Middlebox&#039;&gt;middleboxes&lt;/a&gt; &lt;a id=&#039;fnref0&#039;&gt;[&lt;a href=&#039;#fn0&#039;&gt;0&lt;/a&gt;].
  The idea is that TCP can&#039;t evolve due to middleboxes and is pretty
  much fully ossified. They interfere with connections in all kinds of
  ways, like stripping away unknown TCP options or dropping packets
  with unknown TCP options or with specific rare TCP flags set. The
  possibilities for breakage are endless, and any protocol extensions
  have to jump through a lot of hoops to try to minimize the damage.

&lt;p&gt;
  It&#039;s almost an extension of the &lt;a href=&#039;https://en.wikipedia.org/wiki/End-to-end_principle&#039;&gt;end-to-end principle&lt;/a&gt;. Not only
  should protocols be defined such that functionality that can&#039;t be
  implemented correctly in the network is defined in the application.
  Protocols should in addition be defined such that it&#039;s not possible
  for the network to know anything about the traffic, lest somebody
  try to add any features at that level. Dumb pipes all the way!

&lt;p&gt;
  It&#039;s a compelling story. I&#039;m even pretty sympathetic to it, since in
  my line of work I see a lot of cases where obsolete or badly
  configured middleboxes cause major performance degradation. (See
  this &lt;a href=&#039;https://news.ycombinator.com/item?id=11766875&#039;&gt;HN
  comment&lt;/a&gt; for an example).

&lt;p&gt;
  But let&#039;s take the recent findings about the &lt;a href=&#039;https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf&#039;&gt;deployability of TCP
  Fast Open&lt;/a&gt; as an example. The headline number is absolutely
  horrific: 20% failure rate! But actually that appears to be 20%
  where TCP Fast Open can&#039;t be successfully negotiated, not 20%
  where connections fail. And this is for the absolute worst case;
  Fast Open doesn&#039;t just add new TCP options, it effectively modifies the TCP
  state machine for the handshake. I&#039;ve implemented a bunch of TCP
  extensions over the years. TCP Fast Open was by far the hardest
  to get right.

&lt;p&gt;
  Compared to the reported 8% failure rates to negotiate a QUIC
  connection, that number looks totally reasonable. (In both cases
  there is a fallback to negotiate a different type of connection, and
  blacklists will be used to directly go to the fallback method the
  next time around). But somehow one of these is deemed acceptable,
  while the other is a sign of terminal ossification. &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;].

&lt;h3&gt;What you lose with encrypted headers&lt;/h3&gt;

&lt;p&gt;
  What&#039;s wrong with encrypted transport headers? One possible argument
  is that middleboxes actually serve a critical function in the network,
  and crippling them isn&#039;t a great idea. Do
  you really want a world where firewalls are unviable?  But I work on
  middleboxes, so of course I&#039;d say that. (Disclaimer: these are my
  own opinions, not my employer&#039;s). So let&#039;s ignore that. Even so,
  readable headers have one killer feature: troubleshooting.

&lt;p&gt;
  The typical network problem that my team gets to
  troubleshoot is some kind of traffic either not working at
  all, or working slower than it should be. So something like
  the following &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]:

&lt;ul&gt;
  &lt;li&gt; Users are complaining that Youtube videos only play in SD, but are
    choppy in HD.
  &lt;li&gt; Speedtest is showing 10Mbps on an LTE connection
    that should be able to do 50Mbps.
  &lt;li&gt;Large FTP transfers between machines in Germany and Singapore
    are only getting speeds of 2Mbps.
  &lt;li&gt; Uploads over a satellite link are so slow that they stall and
    get terminated rather than ever finish.
&lt;/ul&gt;

&lt;p&gt;
  To debug issues like this I start with a packet capture from the
  points in the network I have access to. Most of the time that&#039;s just
  a point in the middle (e.g. a mobile operator&#039;s core network). From
  just one trace, we can determine things such as the following:

&lt;ul&gt;
&lt;li&gt;Determine packet loss rates (on both sides, i.e. packets lost on
  the server -&amp;gt; core hop, and on the core -&amp;gt; client hop).
&lt;li&gt;Correlate packet loss with other events.
&lt;li&gt;Detect packet reordering rates (on both sides).
&lt;li&gt;Detect packet corruption rates (on both sides).
&lt;li&gt;Determine RTTs continuously over the lifetime of a connection, not
  just during a connection handshake (e.g. to use queuing as a
  congestion signal to establish the downlink as the bottleneck).
&lt;li&gt;Estimate sender congestion windows from observed delivery rates
  (to determine whether congestion control is the bottleneck).
&lt;li&gt;Inspect the TCP options (e.g. window scaling, mss) and the receive
  windows to determine whether the software on the client or the server
  is the bottleneck.
&lt;li&gt;Distinguish between pure control packets and data packets (e.g. to
  distinguish multiple separate HTTPS requests within a single TCP
  connection).
&lt;li&gt;Detect the presence of middleboxes that are interfering with the
  connection. (But only occasionally; more often you&#039;ll need multiple
  traces for this).
&lt;/ul&gt;

&lt;p&gt;
  We do most of this with some specialized tools. But it&#039;s essentially no
  different from opening up the trace in Wireshark, following a
  connection with disappointing performance, and figuring out what
  happened. That&#039;s something that every network engineer probably
  does on a regular basis.

&lt;p&gt;
  With encrypted control information you can&#039;t figure out any of this. The
  only solid data you get is the throughput (not even the goodput). For anything
  more, you need traces from multiple points in the network. Those are
  hard to get; sometimes it&#039;s even outright impossible.
  And to do the analysis, you need to correlate those
  multiple traces with each other. That&#039;s a significantly higher barrier
  than just opening up Wireshark.
  In practice the network becomes a total black box, even to the
  people who are supposed to keep it running. That&#039;s not going to be
  a great place to be in.

&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt; To conclude, I think encrypting the L4 headers is a step too
  far. If these protocols get deployed widely enough (a distinct
  possibility with standardization), the operational pain will be
  significant.
&lt;/p&gt;


&lt;p&gt; There would be a reasonable middle ground where the headers are
  authenticated but not encrypted. That prevents spoofing and
  modifying packets, but still leaves open the possibility of
  understanding what&#039;s actually happening to the traffic.

&lt;h3&gt;Footnotes&lt;/h3&gt;

&lt;div class=&#039;footnotes&#039;&gt;

&lt;p&gt;
&lt;a id=&#039;fn0&#039;&gt;[&lt;a href=&#039;#fnref0&#039;&gt;0&lt;/a&gt;] Whoops, that&#039;s not quite accurate. There is one specific kind of
  middlebox that a company like Google or Facebook needs: a load
  balancer. And very conveniently, both protocols introduce a new
  field containing the information a load balancer needs, and give it
  special treatment as the one field that gets to live outside the
  encryption envelope.

&lt;p&gt;
&lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] There seems to be a bit of a difference in how this is rolled
  out. If you want TCP Fast Open in Chrome, you&#039;ll need to enable it
  via a flag. Meanwhile my understanding is that QUIC is effectively
  rolled out by geographical regions; the switch that&#039;s getting flipped is
  server side. Presumably the latter rollout procedure includes
  working with the main service providers in that area to make sure
  any problems get fixed in advance. A process like that would be a
  lot more tractable than trying to fix the whole world at once.

&lt;p&gt;
&lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;]
  In order, the bottlenecks were:
&lt;ul&gt;
&lt;li&gt; The Google cache not sending data quickly enough. That&#039;s where
     our visibility ended, but it was still enough to say that the
     actual mobile network was fine and things were out of the
     operator&#039;s hands.
&lt;li&gt; Large amounts of packet loss in the access network, correlated
  with burstiness of traffic. That insight was sufficient to allow
  the customer to locate some specific switches with insufficient
  buffer space. (And we could add a feature that mitigated the problem
  centrally, rather than require upgrading thousands of network nodes).
&lt;li&gt; We found no indications of a protocol or network level bottleneck,
  so the problem had to be either with the application
  programs or OS configuration. Switching to a different FTP server
  did in fact solve the problem.
&lt;li&gt; A massive proportion (more than 5%) of packets from a subset of
  satellite endpoints had TCP checksum errors. This was a specific enough
  diagnosis to enable a binary search through the network path for the
  problematic link or device.
&lt;/ul&gt;
&lt;p&gt;
  What I&#039;m getting at here is that there&#039;s a seemingly unending supply
  of different potential network problems. So many that you need to have
  an idea of the nature of the underlying issue before you can try to
  pinpoint it exactly.
&lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Thu, 01 Dec 2016 16:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2016-12-01-quic-tou/</guid></item><item><title>The many ways of handling TCP RST packets</title><link>https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/</link><description>
      &lt;p&gt;
        What could be a simpler networking concept than TCP&#039;s RST
        packet? It just crudely closes down a connection, nothing
        subtle about it.  Due to some odd RST behavior we saw
        at &lt;a href=&#039;https://www.teclo.net/&#039;&gt;work&lt;/a&gt;,
        I went digging in the RFCs to check what the technically correct
        behavior is, and in different TCP implementations to see what&#039;s
        actually done in practice.
      &lt;/p&gt;

      &lt;read-more&gt;&lt;/read-more&gt;

      &lt;h2&gt;Background&lt;/h2&gt;

      &lt;p&gt;
        In the original TCP
        specification, &lt;a href=&#039;https://tools.ietf.org/html/rfc793&#039;&gt;RFC
        793&lt;/a&gt;, RSTs are defined in terms of the following TCP state
        variables:
      &lt;/p&gt;

      &lt;ul&gt;
        &lt;li&gt;RCV.NXT - The sequence number of the next byte of data the receiver is expecting from the sender
        &lt;li&gt;RCV.WND - The amount of receive window space the receiver is advertising
        &lt;li&gt;RCV.NXT + RCV.WND - The sequence number of the last byte of data the receiver is willing to accept at the moment.
        &lt;li&gt;SND.UNA - The first sequence number that the sender has not yet seen the receiver acknowledge.
        &lt;li&gt;SND.NXT - The sequence number of the next byte of payload data the sender would transmit.
      &lt;/ul&gt;

      &lt;p&gt;
        An RST is accepted if the sequence number is in the receiver&#039;s
        window (i.e. RCV.NXT &amp;lt;= SEG.SEQ &amp;lt; RCV.NXT+RCV.WND). The effect
        of the RST is to immediately close the connection. This is
        slightly different from a FIN, which just says that the other
        endpoint will no longer be transmitting any new data but can
        still receive some.
      &lt;/p&gt;
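
      &lt;p&gt;
        In code, that acceptance check looks roughly like the following.
        Sequence numbers are 32-bit and wrap around, so the comparison has
        to be done on the offset from RCV.NXT modulo 2^32; this is a
        sketch of the rule as stated in the RFC, not any particular
        stack&#039;s implementation:
      &lt;/p&gt;

&lt;pre&gt;
MOD = 2 ** 32

def rst_in_window(seg_seq, rcv_nxt, rcv_wnd):
    # Accept if RCV.NXT &amp;lt;= SEG.SEQ &amp;lt; RCV.NXT + RCV.WND, with sequence
    # number wrap-around handled by comparing offsets modulo 2^32.
    return (seg_seq - rcv_nxt) % MOD &amp;lt; rcv_wnd
&lt;/pre&gt;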

      &lt;p&gt;
        There are two types of event that cause a RST to be
        emitted. A) the connection is explicitly aborted by the
        endpoint, e.g. the process holding the socket being killed
        (just closing the socket normally is not grounds for RST, even
        if there is still unreceived data). B) the TCP stack receiving
        certain kinds of invalid packets, e.g. a non-RST packet for a
        connection that doesn&#039;t exist or has already been closed.

      &lt;p&gt;
        The RST packet that should be generated is slightly
        different for these two cases. For case A the sequence number
        for the RST packet should be SND.NXT of the connection. For
        case B the sequence number should be set to the sequence
        number ACKed by the received packet. In the latter case the
        ACK bit will not be set in the RST. (Where the distinction
        matters, I&#039;ll call the first one an RST-ABORT and the second an
        RST-REPLY.)  &lt;/p&gt;

      &lt;h2&gt;Receiving a RST&lt;/h2&gt;

      &lt;p&gt;
        The RCV.NXT state variable is doing double duty in the
        original RFC 793. It&#039;s defined as &amp;quot;next sequence number
        expected on an incoming connection&amp;quot;, but it&#039;s also
        implied that it&#039;s the most recent acknowledgement number sent
        out. This was true early on, but not after the introduction of
        delayed ACKs (&lt;a href=&#039;https://tools.ietf.org/html/rfc1122&#039;&gt;RFC
        1122&lt;/a&gt;). Which of these two interpretations should be used
        for checking whether the RST is in window?
      &lt;/p&gt;

      &lt;p&gt;
        Linux goes with the latter one, and splits the two roles
        out. RCV.NXT is strictly defined as the next expected sequence
        number, and RCV.WUP is the highest sequence number for which
        an ACK has actually been sent. RST handling is done using
        RCV.WUP, and in fact the following comment implies that RSTs
        are the main reason for this mechanism.
      &lt;/p&gt;

&lt;pre&gt;
 * Also, controls (RST is main one) are accepted using RCV.WUP instead
 * of RCV.NXT. Peer still did not advance his SND.UNA when we
 * delayed ACK, so that hisSND.UNA&amp;lt;=ourRCV.WUP.
 * (borrowed from freebsd)
&lt;/pre&gt;

With code like this:

&lt;pre&gt;
return !before(end_seq, tp-&amp;gt;rcv_wup) &amp;&amp;
       !after(seq, tp-&amp;gt;rcv_nxt + tcp_receive_window(tp));
&lt;/pre&gt;

&lt;p&gt;
The use of RCV.NXT instead of RCV.WUP for the second subexpression is a
clever ruse; tcp_receive_window() is roughly
max(0, rcv_wup + rcv_wnd - rcv_nxt), so the upper bound works out to
rcv_wup + rcv_wnd and the final result is relative to RCV.WUP instead of
RCV.NXT. But it is a slightly mysterious piece of code. Why is SND.UNA
relevant? It should be SND.NXT that matters.

&lt;p&gt;
  Sure, you can construct cases where a RST would be emitted with a
  sequence number where this makes a difference (but only for
  RST-REPLY, not for RST-ABORTs). In these particular cases you&#039;d save one
  roundtrip. An example would be something like: &lt;/p&gt;

&lt;table&gt;
  &lt;tr&gt;&lt;td&gt;Client&lt;td&gt;Server&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sends 1000:1100&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sends 1100:1200&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;receives 1000:1100&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;ACKs 1100&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;receives 1100:1200&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;delays ACK&lt;/tr&gt;

&lt;tr&gt;&lt;td&gt;aborts the connection&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;does not send a FIN&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sends RST seqnr 1200&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;RST is lost&lt;/tr&gt;

&lt;tr&gt;&lt;td&gt;receives ACK for 1100&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sends RST 1100 (not 1200)&lt;td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td&gt;&lt;td&gt;receives RST 1100&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;&lt;b&gt;closes connection iff using RCV.WUP&lt;/b&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td&gt;&lt;td&gt;ACKs 1200&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;receives ACK for 1200&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sends RST 1200&lt;td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;td&gt;receives RST 1200&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;
But that&#039;s extremely contrived, and would be easily broken by
delays (all that needs to happen is for that delayed ACK to actually
get emitted before the second RST is received, and it would get
rejected).

&lt;p&gt;
I tried to do a bit of archaeology to see if historical context would
help figure this out, but since it all happened in the dark pre-git
ages, it&#039;s a bit tricky. The commit is
in &lt;a href=&#039;https://kernel.googlesource.com/pub/scm/linux/kernel/git/davem/netdev-vger-cvs/+/585d5180%5E%21/&#039;&gt;585d5180
in netdev-vger-cvs&lt;/a&gt;, and while I couldn&#039;t track down any discussion
from 2001/2002 when the change was made, there is a mailing list post
from this decade
(&lt;a href=&#039;http://lists.openwall.net/netdev/2010/01/06/35&#039;&gt;here&lt;/a&gt;).
After reading these, I was still not really enlightened. Was/is there
kit around that generates RSTs using SND.UNA rather than SND.NXT when
closing a connection?

&lt;p&gt;
(I mean genuine endpoints. There are plenty
of &lt;a href=&#039;https://www.isoc.org/isoc/conferences/ndss/09/slides/08.pdf&#039;&gt;middleboxes&lt;/a&gt;
around that generate RSTs with all kinds of sequence numbers, but
those might not be the RSTs you want to be reacting to anyway.)

&lt;p&gt;
The
original &lt;a href=&#039;https://github.com/freebsd/freebsd/commit/bc0a68481777596d794e4114b6e17059938c1c16&#039;&gt;FreeBSD
commit&lt;/a&gt; where this concept was introduced has some additional
detail:

&lt;pre&gt;
* Note: this does not take into account delayed ACKs, so
*   we should test against last_ack_sent instead of rcv_nxt.
*   &lt;b&gt;Also, it does not make sense to allow reset segments with
*   sequence numbers greater than last_ack_sent to be processed
*   since these sequence numbers are just the acknowledgement
*   numbers in our outgoing packets being echoed back at us,&lt;/b&gt;
*   and these acknowledgement numbers are monotonically
*   increasing.
&lt;/pre&gt;

&lt;p&gt;
The bit I bolded does make it appear that the motivation for the
change was specifically RST-REPLYs, not RST-ABORTs. If you&#039;re
looking for an exact match with some ACK you&#039;ve sent, it&#039;d be insane
to try to match an ACK that had never been sent. But if
the other kind of RST had been considered, it would have been obvious
that in the presence of delayed ACKs a valid RST could have a
sequence number higher than last_ack_sent.

&lt;p&gt;
This rule of exact match was indeed later changed in various ways;
first reverted to accepting any RST in the window (i.e. just the plain
in-window check), then re-enabled while the socket is in the ESTABLISHED
state but with less strict checks in other states, then loosened to
accepting either RCV.NXT or last_ack_sent,
or accepting &lt;a href=&#039;https://github.com/freebsd/freebsd/commit/fb4a7a64bd2806ed348b750074a59b56589da9a&#039;&gt;either of those &amp;plusmn; 1&lt;/a&gt;, etc.

&lt;p&gt;
If you&#039;re already tracking last_ack_sent / RCV.WUP, doing the window
checks on that instead of RCV.NXT seems sensible. But it does feel
like an optional thing, rather than one of those pieces of necessary
TCP behavior that were only standardized in folklore.

&lt;p&gt;
But that&#039;s ancient history. In
&lt;a href=&#039;https://tools.ietf.org/html/rfc5961&#039;&gt;RFC 5961&lt;/a&gt; the
suggested rules for accepting a RST were changed to make RST spoofing
attacks harder. There are now three possible reactions:

&lt;ul&gt;
&lt;li&gt; RST completely out of window, do nothing
&lt;li&gt; RST matches RCV.NXT exactly, close connection
&lt;li&gt; RST in-window but not RCV.NXT, send an ACK (if the RST was genuine, the other endpoint will send a new one but this time with the correct sequence number).
&lt;/ul&gt;
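
&lt;p&gt;
As a sketch of that logic (names invented, ESTABLISHED state only, using
the same kind of serial arithmetic helper as in the earlier sketch):

&lt;pre&gt;
#include &amp;lt;cstdint&amp;gt;

enum class RstAction { Ignore, CloseConnection, SendChallengeAck };

static bool seq_lt(uint32_t a, uint32_t b) { return (int32_t)(a - b) &amp;lt; 0; }

// RFC 5961 style reaction to a RST received in the ESTABLISHED state.
RstAction handle_rst(uint32_t seg_seq, uint32_t rcv_nxt, uint32_t rcv_wnd) {
    if (seq_lt(seg_seq, rcv_nxt) || !seq_lt(seg_seq, rcv_nxt + rcv_wnd))
        return RstAction::Ignore;            // completely out of window
    if (seg_seq == rcv_nxt)
        return RstAction::CloseConnection;   // exact match with RCV.NXT
    return RstAction::SendChallengeAck;      // in window, but not RCV.NXT
}
&lt;/pre&gt;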

&lt;p&gt;
This means that even for the above contrived case, using RCV.WUP as
the start of the window doesn&#039;t make a substantial difference. One
extra round-trip is required no matter what; whether the ACK is the
delayed ACK or the challenge ACK to a semi-bogus RST is immaterial.

&lt;p&gt;
That&#039;s pretty fertile ground for different implementations. One
standard, one ambiguity that appears to have never been officially
clarified, and one proposed standard. I could go into more detail on
the archaeology or on how to perform the experiments for if the
source code isn&#039;t available. But that&#039;d get tedious, so here&#039;s a
summary of my best understanding of how various operating systems
work for RSTs received in the ESTABLISHED state (the picture assumes
a last ACK sent of 12, last sequence number received of 17, and a
window of 10).
&lt;/p&gt;

&lt;img src=&#039;/blog/stc/images/rst-spans.png&#039;&gt;&lt;/img&gt;

&lt;p&gt;
Yes, seven operating systems and no two even agree on what an
acceptable RST looks like! And there&#039;s still some scope for other
reasonable but different implementations beyond those. RFC 5961
handling but with the window based on RCV.NXT, and RFC 5961 handling
that accepts both RCV.WUP and RCV.NXT as the sequence number would be
obvious examples. (To be fair, some of these differences will
collapse away if no ACKs are currently being delayed when the RST is
received).

&lt;p&gt;
The one data point in that table that seems particularly strange is
RCV.NXT + 1 in OpenBSD. If you&#039;re going to widen the window of
acceptable sequence numbers by 1, I would have thought that adjusting
it downwards rather than up would be the obvious
choice. The &lt;a href=&#039;http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/netinet/tcp_input.c#rev1.200&#039;&gt;
commit message&lt;/a&gt; is pretty spartan, just blaming Windows.

&lt;p&gt;
You could get into some further complications with the window handling
for RSTs that include data (sigh, the awesome idea in RFC 1122 to
embed the reason for the RST as ASCII payload), but AFAIK those don&#039;t
exist in the wild so maybe this is enough for now.


&lt;h2&gt;Interlude on RST handling in TCP aware middleboxes&lt;/h2&gt;

&lt;p&gt;
The above was just considering RST handling from the point of view of
the endpoints, where the main question is whether to react to an RST,
not how you react to it. Things get different yet again for a
middlebox. The middlebox might not actually have the same
understanding of the state variables as either of the endpoints. If
the reaction of the middlebox to the RST is somehow destructive, the
middlebox has to be careful. As a trivial example, imagine a firewall
that blocks all packets for unknown connections. It would be quite bad
for that firewall to react to a RST packet that the actual endpoint
rejects.


&lt;p&gt;
What if the middlebox is not just TCP aware, but an active
participant in the connection? At work we do a fully transparent
non-terminating &lt;a href=&#039;http://teclo.net/mobile-operator-solutions/products/&#039;&gt;TCP optimizer for mobile networks&lt;/a&gt;
(see &lt;a href=&#039;https://www.snellman.net/blog/archive/2015-08-25-tcp-optimization-in-mobile-networks/&#039;&gt;previous
post for more technical details&lt;/a&gt; if that&#039;s your kind of thing). Despite not
terminating the connections, we still ACK data and thus take
responsibility for delivering it. How should a device like that
react to a RST if we still have undelivered but ACKed data? It
actually depends on which directions we&#039;ve seen a valid RST in. To a
first approximation we should stop sending new data toward the endpoint
we&#039;ve gotten RSTs from, but not close the connection or forward
the RST packet until all data that we&#039;ve acknowledged has been delivered
to the destination of the RST. (With the edge case of valid RSTs in both
directions forcing connection closure even with undelivered data).

&lt;h2&gt;Sending RSTs&lt;/h2&gt;

&lt;p&gt;
So there are a bunch of different ways to process a RST, but surely
sending RSTs is trivial? The rules there are simple and unambiguous,
and unlike for processing an incoming RST there should be no
motivation to deviate from them. There are good reasons to be
stricter about accepting RSTs, but there should be no reason to send
RSTs that will get rejected.

&lt;p&gt;
Ha-ha, just kidding.

&lt;p&gt;
First, what does SND.NXT really mean? RFC 793 only describes it as the
&amp;quot;next sequence number to be sent&amp;quot;. On the face of it this
could mean two things: either it&#039;s the lowest sequence number that
has never been sent, or it could be the next sequence number to be
transmitted, including retransmissions. The latter interpretation seems
morally bankrupt. The RFC describes a mechanism by which SND.NXT
increases, but not one by which it decreases. The glossary also
explicitly mentions retransmissions as falling between SND.UNA and
SND.NXT, which implies SND.NXT should not decrease.

&lt;p&gt;
But in reality you see a lot of clients reduce SND.NXT if they believe
there&#039;s been a full tail loss. Something like this:

&lt;pre&gt;
00:00:53.693026 client &gt; server: Flags [P.], seq 973096734:&lt;b&gt;973096882&lt;/b&gt;, ack 2107240749, win 4200, length 148
00:00:53.693043 server &gt; client: Flags [.], ack &lt;b&gt;973095334&lt;/b&gt;, win 4000, options [sack 1 {973096734:&lt;b&gt;973096882&lt;/b&gt;},eol], length 0
00:00:53.789417 client &gt; server: Flags [.], seq 973095334:&lt;b&gt;973096734&lt;/b&gt;, ack 2107240749, win 4200, length 1400
00:00:53.789434 server &gt; client: Flags [.], ack 973096882, win 4000, length 0
00:00:53.789455 client &gt; server: Flags [.], seq 973095334:973096734, ack 2107240749, win 4200, length 1400
00:00:53.789458 server &gt; client: Flags [.], ack 973096882, win 4000, length 0
00:00:53.789475 client &gt; server: Flags [R.], seq &lt;b&gt;973096734&lt;/b&gt;, ack 2107240749, win 0, length 0
&lt;/pre&gt;

&lt;p&gt;
In this example when the client sends the RST, SND.UNA should be
973095334 and SND.NXT should be 973096882 - highest sequence number
sent, and in fact selectively acked. The RST is not generated with
either of those, but the in-between sequence number of 973096734.

&lt;p&gt;
There&#039;s another problem, which will matter especially for FIN+RST
connection teardowns. Here&#039;s what closing down a connection
prematurely looks like on OS X:
&lt;/p&gt;

&lt;pre&gt;
00:43:09.299678 client &amp;gt; server: Flags [.], ack 3343152432, win 32722, length 0
00:43:09.300615 client &amp;gt; server: Flags [F.], seq &lt;b&gt;773912652&lt;/b&gt;, ack 3343155284, win 32768, length 0
00:43:09.300617 server &amp;gt; client: Flags [.], ack 773912653, win 6000, length 0
00:43:09.300981 client &amp;gt; server: Flags [R], seq &lt;b&gt;773912652&lt;/b&gt;, win 0, length 0
00:43:09.300983 client &amp;gt; server: Flags [R], seq 773912652, win 0, length 0
00:43:09.300984 client &amp;gt; server: Flags [R], seq 773912652, win 0, length 0
00:43:09.301449 client &amp;gt; server: Flags [R], seq &lt;b&gt;773912653&lt;/b&gt;, win 0, length 0
&lt;/pre&gt;

&lt;p&gt;
The first RST has the same sequence number as the FIN. That&#039;s the
correct behavior, since that RST was sent in reply to a packet that
was already in flight to the client by the time the server received
the packet. Apparently OS X had the RST get triggered by an incoming
packet before it got around to sending an RST with a useful sequence
number. So there&#039;s an extra round trip&#039;s delay, and possibly an
extra round trip&#039;s worth of packets. But that&#039;s not a huge deal.

&lt;p&gt;
What would really suck is if an operating system were to totally
ignore the rule to reply with an RST when receiving a packet on a
closed socket. Which is of course exactly what iOS does. I don&#039;t know
the exact parameters of when it happens (e.g. it might be something
that happens just when connected over cellular, not over Wifi), but
what you get is something like this:

&lt;pre&gt;
00:00:15.099703 client &amp;gt; server: Flags [.], ack 3593074068, win 8192, length 0
00:00:15.099705 client &amp;gt; server: Flags [F.], seq &lt;b&gt;80108342&lt;/b&gt;, ack 3593074068, win 8192, length 0
00:00:15.099706 server &amp;gt; client: Flags [.], ack 80108343, win 124, length 0
00:00:15.099706 client &amp;gt; server: Flags [R], seq &lt;b&gt;80108342&lt;/b&gt;, win 0, length 0
00:00:15.228490 server &amp;gt; client: Flags [.], seq 3593074068:3593075068, ack 80108343, win 125, length 1000
... [ crickets, while the same data gets regularly retransmitted to the client ] ...
00:02:11.646208 server &amp;gt; client: Flags [.], seq 3593074068:3593075068, ack 80108343, win 125, length 1000
&lt;/pre&gt;

&lt;p&gt;
We get a single RST, which for an active connection is almost guaranteed
to have a sequence number that will be rejected. Without at least one
valid RST, the server has to keep the connection open, and data gets
wastefully retransmitted over and over. This lasts until the server
times out the connection, which might take anything from ten seconds
to two minutes.

&lt;p&gt;
And just to be clear, the phone was still connected and functional for
the full duration of that trace. There were other active connections
to it, and those continued happily along. It&#039;s just the closed sockets
that become total black holes.

&lt;p&gt;
Note that the RCV.WUP trickery discussed in the earlier section alone
would not be sufficient to handle this kind of situation. In
addition you need to either delay ACKs to FINs (OS X), be more
forgiving about RSTs after receiving a FIN (OS X, FreeBSD 10.2), or
have a small amount of slack in the sequence number check (&amp;plusmn;1
in FreeBSD 10.2; OpenBSD has slack but in the wrong direction).

&lt;h2&gt;Some statistics&lt;/h2&gt;

&lt;p&gt;
To see how much of a difference these different variations have, I ran
some simulations against a trace from a varied real world traffic mix
and looked at what percentage of first RSTs received by the server
would have been accepted, versus dropped or answered with a challenge
ACK. (I grouped challenge ACKs together with dropping packets, since
the point here is to see which policy is the most efficient at
actually closing the connection as soon as possible. It&#039;s not to see
which ones manage to do it eventually. Either they&#039;ll all manage to do
it later since some packets will be sent eventually, or none of them
will do it since the other device isn&#039;t properly sending followup
RSTs. Also, an after the fact simulation can&#039;t possibly tell anything
about the efficiency of the challenge ACKs).

&lt;p&gt;
The data was filtered such that we only looked at connections matching
the following criteria. This was about 35k connections after the
filtering, so not a huge data set. This was almost exactly 1/3
RST-ABORTs and 2/3 RST-REPLYs.

&lt;ul&gt;
&lt;li&gt; The connection received an RST from the client at some point
&lt;li&gt; The RST arrived before the server had sent a FIN or a RST to the
  client
&lt;li&gt; The RST arrived after at least one data packet (not during
  handshake)
&lt;/ul&gt;

&lt;p&gt;
Also note that this is only comparing the different policies for RST
acceptance, not for example the effects of different delayed ACK
behavior in different TCP implementations.

&lt;p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td colspan=2&gt;% of first RSTs accepted&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;RST receive policy&lt;/td&gt;&lt;td&gt;RST-ABORT&lt;td&gt;RST-REPLY&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;FreeBSD 10.2&lt;td&gt;96.60&lt;td&gt;96.81&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;FreeBSD CURRENT&lt;td&gt;94.86&lt;td&gt;83.90&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Illumos&lt;td&gt;99.91&lt;td&gt;81.20&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Linux / Windows&lt;td&gt;96.45&lt;td&gt;81.20&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;OpenBSD&lt;td&gt;96.58&lt;td&gt;83.90&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;OS X&lt;td&gt;95.05&lt;td&gt;83.90&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;
Illumos is the only implementation here using the standard check
against the massive full window (tens or hundreds of kilobytes,
vs. 1-6 bytes for everything else). So it&#039;s not much of a surprise that
it&#039;s more effective than the others at closing the connection on just
about every RST-ABORT it receives. The tradeoff is that it&#039;s a lot more
susceptible to RST spoofing attacks.

&lt;p&gt;
When handling a RST-ABORT, we&#039;ll almost always have seen a FIN just
before. This means that RCV.WUP and RCV.NXT will be the same, so all
the other options are very close to each other. The minor differences
there come from:

&lt;ul&gt;
  &lt;li&gt; FreeBSD 10.2 accepts a superset of what OpenBSD does.
  &lt;li&gt; OpenBSD accepts a superset of what Linux and Windows do.
  &lt;li&gt; When there is no FIN, checking RCV.NXT is more predictive
    of the sequence number of a RST-ABORT than RCV.WUP would be,
    so anything not checking RCV.NXT loses out.
  &lt;li&gt; OS X relaxes the checks after receiving FIN, FreeBSD CURRENT
    no longer does.
&lt;/ul&gt;

&lt;p&gt;
For RST-REPLY, practically everyone gets the case of no FIN right.
But if there&#039;s a FIN (like in the iOS example above), FreeBSD 10.2 is
vastly more effective than anything else thanks to accepting RCV.NXT -
1. There&#039;s not much of a tradeoff. It&#039;s a good idea, and I&#039;m a bit
surprised that not only did it never spread outside of FreeBSD, but
that it has now been removed from there too.

&lt;p&gt;
The other results are boring, with just the RCV.WUP
vs. RCV.NXT difference (with the opposite results compared to RST-ABORTs).

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;
Almost every time I talk about TCP optimization to the general
development public, the reaction is something along the lines of
&amp;quot;ah, you&#039;re one of the guys who breaks the standard?&amp;quot;.
That&#039;s technically true but useless; nobody actually implements all the standards
exactly as written. Hopefully this dive into just one relatively
simple aspect shows that a) not everyone interprets the standards the
same way, b) there are reasons to intentionally deviate from the
standards, and c) everybody does it. You just can&#039;t be abusive when
deviating from the standards.

&lt;p&gt;
Oh, and I&#039;d already implemented accepting RSTs with sequence number
RCV.NXT - 1 before looking at these details (resetting the connections
promptly is kind of important for us due to reasons that I can&#039;t go into
here). But this investigation did make me feel a lot better about that
change.

</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Mon, 01 Feb 2016 16:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/</guid></item><item><title>Flow disruptor - a deterministic per-flow network condition simulator</title><link>https://www.snellman.net/blog/archive/2015-10-01-flow-disruptor/</link><description>
&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;
I finally got around to open sourcing
&lt;a href=&#039;https://github.com/teclo/flow-disruptor&#039;&gt;flow disruptor&lt;/a&gt;,
a tool I wrote at &lt;a href=&#039;https://www.teclo.net/&#039;&gt;work&lt;/a&gt; late last
year. What does it do? Glad you asked! Flow disruptor is a
deterministic per-flow network condition simulator.

&lt;p&gt;
To unpack that description a bit, per-flow means that the network
conditions are simulated separately for each TCP connection rather
than on the link layer. Deterministic means that we normalize as many
network conditions as possible (e.g. RTT, bandwidth), and any changes
in those conditions happen at preconfigured times rather than
randomly. For example the configuration could specify that the
connection experiences a packet loss exactly 5s after it was
initiated, and then a packet loss every 1s after that. Or that packet
loss happens at a specified level of bandwidth-limit-based queueing.

&lt;p&gt;
You can check the Github repo linked above for the code and for
documentation on e.g. configuration. This blog post is more on why
this tool exists and why it looks the way it does.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h2&gt;Motivation and an example&lt;/h2&gt;

&lt;p&gt;
Why write yet another network simulator? The use case here is fairly
specific; we want to compare the behavior of different TCP
implementations (our own and others&#039;) under controlled
conditions. Doing this with random packet loss, or with different
connections having different RTTs would be quite tricky. We also want
each TCP connection to be as isolated from the other traffic as possible.
With link level network simulation it&#039;s really easy to have test cases
bleed into each other or some background traffic to mess things up,
unless you&#039;re really careful.

&lt;p&gt;
Here&#039;s an example. This is looking at the behavior of half a dozen
large CDNs, and a few operating system installations with no tuning at
all, by downloading an equally large file from each of them.
This particular scenario sets up a 200ms base RTT (200ms no matter how
far away from the test client the server is; the amount of extra delay
added depends on the observed latencies from the proxy to the client
and the server), and alternates between a 2Mbps and a 4Mbps bandwidth
limit every 5 seconds.

&lt;pre&gt;
profile {
    id: &quot;200ms-variable-bandwidth&quot;
    filter: &quot;tcp and port 80 or 443&quot;
    target_rtt: 0.20
    dump_pcap: true

    downlink {
        throughput_kbps: 2000
    }
    uplink {
        throughput_kbps: 1000
    }

    timed_event {
        trigger_time: 5.0
        duration: 5.0
        repeat_interval: 10.0

        downlink {
            throughput_kbps_change: 2000
        }
    }
}
&lt;/pre&gt;

&lt;p&gt;
Below is the pattern of RTTs produced for a download of a 10MB
file. Time in seconds on X axis, time between segment sent and segment
acked on Y axis. It might look like a line graph, but actually it&#039;s a
scatterplot where every dot is a single segment. It&#039;s just that there
are a lot of RTT samples. I&#039;ve removed the legend from the graph,
since the data is pretty old and the point here isn&#039;t an argument
about what&#039;s reasonable TCP behavior and what isn&#039;t. But just so that
you have a frame of reference, the hot pink dots (4th from the top)
are Linux 3.16.

&lt;p&gt;
&lt;a href=&#039;/blog/stc/images/flow-disruptor/rtt-200ms-variable-bandwidth.png&#039; target=&#039;_blank&#039;&gt;&lt;img src=&quot;/blog/stc/images/flow-disruptor/rtt-200ms-variable-bandwidth.thumb.png&quot; /&gt;&lt;/a&gt;

&lt;p&gt;
What does that graph tell us? Well, when the available bandwidth
changes, it&#039;s the RTT that changes rather than the amount of data in
flight. This suggests that none of the tested servers uses a RTT-based
congestion control algorithm &lt;a href=&#039;#ftnt0&#039;
name=&#039;ftnt_ref0&#039;&gt;[0]&lt;/a&gt;. Which is of course not much of a
surprise given the challenges in deploying one of those, but you never
know. With no packet loss and no RTT feedback, it&#039;s not surprising
that there&#039;s some major bufferbloat happening.

&lt;p&gt;
For another point of view, a similar graph but this time looking at
the amount of data in flight (i.e. unacked) the moment each packet was
sent:

&lt;p&gt;
&lt;a href=&#039;/blog/stc/images/flow-disruptor/in-flight-200ms-variable-bandwidth.png&#039; target=&#039;_blank&#039;&gt;&lt;img src=&quot;/blog/stc/images/flow-disruptor/in-flight-200ms-variable-bandwidth.thumb.png&quot; /&gt;&lt;/a&gt;

&lt;p&gt;
You can see that a lot of CDNs clamp the congestion window
to reasonably sensible values (at around 256kB). Some others
appear to have neither any limit to the congestion window nor any
moral equivalent of a slow start threshold. Especially for a CDN that seems
a bit questionable; CDNs are supposed to have endpoints near to the user,
and there should be no need to keep such large amounts of data in flight
to a nearby user.

&lt;p&gt;
And does overstuffing the buffers like this help? At a first glance it
might appear to. After all, the &quot;lines&quot; with high amounts of data in
flight also end earlier. But that&#039;s just an artifact of them having
5-10s of data still undelivered when the last payload packet is
sent. If you look at the lower right corner of the graph, you can
see some orphan dots corresponding to the connection teardown. And
those are all happening at roughly the same time, as one would expect.

&lt;p&gt;
Another interesting thing is how some connections show up as one
smooth line, while others instead appear to consist of vertical
lines spaced out at fairly even intervals. This is probably the
difference between the server reacting to an ACK by sending out data
immediately, and the server reacting to an ACK only once sufficient
congestion window space has been opened up and then filling it all at
once. See for example how the olive green connection is extremely
bursty, sending out chunks of 256kB at a go.

&lt;p&gt;
So that&#039;s the kind of transport layer investigation this tool was
built for. You set up an interesting controlled scenario, run it
against a bunch of different implementations, and see if anything
crazy happens.  What happens with an early packet loss, what happens
with persistent packet losses, with different lengths of connection
freezes, and so on. That&#039;s also why the simulator has a facility for
generating trace files directly. It&#039;s rather useful for generating
graphs like this, and having them pre-split by connection.

&lt;h2&gt;Implementation&lt;/h2&gt;

&lt;p&gt;
This program was a bit of an experiment for me. On one hand it&#039;s
pretty similar to a lot of programs we write at work, and could have
shared a lot of code with the rest of the codebase. On the other hand
(as I&#039;ve written before) once a bit of code has been embedded in a
monorepo, it can be really hard to extract it out.

&lt;p&gt;
So I set out to write this in a separate repository, preferring open
source libraries instead of internal ones. In places where I wanted to
use our existing code I just copied it over. If it felt ugly or out of
place (as old but working code so often does), I rewrote it with no
regard for keeping compatibility with all the existing clients of that
code. You could say that the goal with this was to see whether life
would be more pleasant if some of that old baggage was replaced.

&lt;p&gt;
Some of those experiments were:

&lt;ul&gt;
  &lt;li&gt; Protocol buffers for configuration instead of our existing
    JSON infrastructure.
  &lt;li&gt; libev for the event loop, timers and signal handling instead
    of our in-house equivalents.
  &lt;li&gt; CMake for building.
  &lt;li&gt; More aggressive use of C++11 features than we can use in
    our real code, due to needing to support gcc 4.4.
&lt;/ul&gt;

&lt;h3&gt;Configuration&lt;/h3&gt;

&lt;p&gt;
Most of my time at Google was spent working on an in-house programming
language used for configuration &lt;a href=&#039;#ftnt1&#039;
name=&#039;ftnt_ref1&#039;&gt;[1]&lt;/a&gt;. It&#039;d take in 10KLOC programs
split across multiple files that described the configurations of
systems, compile them to 100KLOC protocol buffers, and those protobufs
would then be used as the actual configuration of record. So I&#039;m
pretty comfortable with using protocol buffers for configuration, but
less so for using the text protocol buffer format for it. There&#039;s a
reason configuration ended up requiring a separate language there. But
this is a simple task where configs are unlikely to be anywhere near
that size, so perhaps ascii protobufs would work just fine?

&lt;p&gt;
The protocol buffer text format actually turned out to be pretty
pleasant to use for this, which makes some sense given how similar it
is to JSON. One difference that didn&#039;t really matter is the lack of
maps. I never want them in my configs, so that&#039;s ok. The second big
difference was lists vs. repeated fields. Repeated fields of scalar
values are indeed kind of awkward. Compare this:

&lt;pre&gt;
&quot;a&quot;: [1, 2, 3]
&lt;/pre&gt;

Versus this:

&lt;pre&gt;
a: 1
a: 2
a: 3
&lt;/pre&gt;

&lt;p&gt;
But I rarely have lists of that sort in a configuration. It&#039;s always
a collection of some kind of complex compound structures. And there
the repeated field syntax feels more natural since it removes so much
of the noise.

&lt;pre&gt;
&quot;profiles&quot;: [{
  &quot;id&quot;: &quot;p80&quot;,
  &quot;filter&quot;: &quot;tcp and port 80&quot;,
  ...
}, {
  &quot;id&quot;: &quot;p443&quot;,
  &quot;filter&quot;: &quot;tcp and port 443&quot;,
  ...
}]
&lt;/pre&gt;

&lt;pre&gt;
profile {
  id: &quot;p80&quot;
  filter: &quot;tcp and port 80&quot;
}
profile {
  id: &quot;p443&quot;
  filter: &quot;tcp and port 443&quot;
  ...
}
&lt;/pre&gt;

&lt;p&gt;
What didn&#039;t end up working well was accessing the protocol buffers
using the generated code. It&#039;s of course great for accessing the
contents of the configuration tree in its raw form. But it turns out I
rather like for the reified configuration objects to have some smarts
built into them, not be just a pile of data. Here&#039;s what the protobuf
schema for the profiles looks like:

&lt;pre&gt;
message FlowDisruptorProfile {
    required string id = 1;
    optional string filter = 2;
    ...
}
&lt;/pre&gt;

&lt;p&gt;
This is what the same definition fragment would look like for our
existing JSON schemas:

&lt;pre&gt;
DECLARE_FIELD(string, id, &quot;&quot;)
DECLARE_FIELD(config_packet_filter_t, filter, config_packet_filter_t())
...
&lt;/pre&gt;

&lt;p&gt;
Yes, that&#039;s some ghetto syntax &lt;a href=&#039;#ftnt2&#039;
name=&#039;ftnt_ref2&#039;&gt;[2]&lt;/a&gt;. But the real difference is that
we&#039;ve been able to declare a more useful data type for
the &lt;code&gt;filter&lt;/code&gt; field than just string. When we parse a JSON
message with this field, the &lt;code&gt;filter&lt;/code&gt; field is
automatically parsed as a pcap filter, compiled to BPF, and the actual
compiled BPF program is then accessible through the config object. If
compilation fails, parsing of the JSON object as a whole fails. If a
configuration is copied or two configurations are merged together,
deep copies of the compiled filters will be done as appropriate.
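
&lt;p&gt;
To make that concrete, here is a rough sketch (not our actual code; all
names are invented) of a filter field type that compiles its pcap
expression to BPF when it is constructed, so that a bad filter string
fails the parse of the whole config:

&lt;pre&gt;
#include &amp;lt;pcap.h&amp;gt;
#include &amp;lt;stdexcept&amp;gt;
#include &amp;lt;string&amp;gt;

// Hypothetical config field type: the pcap expression is compiled to BPF
// at construction time, so a bad filter fails the whole config parse.
class PacketFilter {
public:
    explicit PacketFilter(const std::string&amp; expr) {
        // Compile against a dummy pcap handle; no capture device needed.
        pcap_t* dead = pcap_open_dead(DLT_EN10MB, 65535);
        if (pcap_compile(dead, &amp;prog_, expr.c_str(), 1,
                         PCAP_NETMASK_UNKNOWN) != 0) {
            std::string err = pcap_geterr(dead);
            pcap_close(dead);
            throw std::runtime_error(&quot;bad filter &#039;&quot; + expr + &quot;&#039;: &quot; + err);
        }
        pcap_close(dead);
    }
    ~PacketFilter() { pcap_freecode(&amp;prog_); }

    PacketFilter(const PacketFilter&amp;) = delete;
    PacketFilter&amp; operator=(const PacketFilter&amp;) = delete;

    const bpf_program* bpf() const { return &amp;prog_; }

private:
    bpf_program prog_;
};
&lt;/pre&gt;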

&lt;p&gt;
As far as I can tell this isn&#039;t really possible with the protocol
buffer solution. As soon as any data gets some slightly more complex
semantic meaning, you need a separate and manual validation layer. Or
even worse, as in this case, where we don&#039;t just want to validate the
data but actually want to generate a complex domain object from it at
parse time. This means there will actually be two parallel hierarchies of
objects, one for the raw protobuf Messages and another for the domain
objects. And these conversions aren&#039;t really amenable to any kind of
automation through e.g. reflection.

&lt;p&gt;
It&#039;s not just an issue with filters either; that&#039;s just an example. In
the real application we have custom field types for all kinds of
networking concepts. IP addresses, MAC addresses, subnets, and so
on. (And it&#039;s absolutely trivial to define new ones).

&lt;p&gt;
This isn&#039;t too big a deal in this program since it&#039;s just that one
field right now. It&#039;d be incredibly annoying in our real product that
has tens of config variables that require this kind of handling.
Manually maintaining the raw config vs. domain object mappings for all
of them would be painful and a source of lots of errors.

&lt;p&gt;
So this part didn&#039;t turn out at all like I expected. The part that I
was expecting to have some problems with was really nice at least
given my config sizes. The part that I thought protobufs would work
well with turned out to kind of suck. This is, of course, not a
problem with protocol buffers but with me trying to use them for the
wrong job. But it&#039;s strange that I never noticed this mismatch before.

&lt;a name=&#039;timers&#039;&gt;&lt;/a&gt;
&lt;h3&gt;Timers and other events&lt;/h3&gt;

&lt;p&gt;
An event loop that integrates all kinds of events is certainly easy to
program for. The event types in this program were timers, IO on fds,
and signal handlers (run synchronously as part of the event loop, even
if the signal was asynchronous). The bits of libev I didn&#039;t need right now
are things I could at least imagine needing at some point.

&lt;p&gt;
That&#039;s not how we write our traffic handling loops at work though. At
the innermost level you&#039;ll have poll-mode packet handling, running on
multiple interfaces and reading and handling at most a small number
(e.g. 10) packets from each interface in one go. Go one layer outward,
and we have a loop that interleaves the above packet handling with
updating our idea of time and running timers if necessary &lt;a href=&#039;#ftnt3&#039;
name=&#039;ftnt_ref3&#039;&gt;[3]&lt;/a&gt;. That loop
runs for a predetermined amount of time (e.g. 10ms) before yielding to
operational tasks for a short while. For example updating certain
kinds of counters that are sample-based rather than event based
(e.g. CPU utilization or NIC statistics), restarting child processes
that have died, or handling RPCs. But of course at most one RPC is
handled before returning back to traffic handling.
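
&lt;p&gt;
A heavily simplified sketch of that loop structure (every name and number
here is invented for illustration):

&lt;pre&gt;
#include &amp;lt;chrono&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;vector&amp;gt;

using Clock = std::chrono::steady_clock;

struct Interface {
    // Poll the NIC and handle at most max_packets frames; stubbed out here.
    void poll_and_handle(int max_packets) { (void)max_packets; }
};

void run_io_loop(std::vector&amp;lt;Interface&amp;gt;&amp; interfaces,
                 const std::function&amp;lt;void()&amp;gt;&amp; run_due_timers,
                 const std::function&amp;lt;void()&amp;gt;&amp; operational_work) {
    for (;;) {
        // Inner loop: interleave poll-mode packet handling with timer
        // processing for a fixed slice (e.g. 10ms)...
        auto deadline = Clock::now() + std::chrono::milliseconds(10);
        while (Clock::now() &amp;lt; deadline) {
            for (Interface&amp; iface : interfaces)
                iface.poll_and_handle(10);  // small batch per interface
            run_due_timers();
        }
        // ...then yield briefly to operational tasks (sampled counters,
        // restarting dead children, at most one RPC reply) before going
        // back to traffic handling.
        operational_work();
    }
}
&lt;/pre&gt;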

&lt;p&gt;
As far as I know, this kind of setup isn&#039;t uncommon for
single-threaded networking appliances. You need careful control of the
event loop to make sure that work is batched together in sensibly
sized chunks and that you&#039;re not away from the actual work for too
long.

&lt;p&gt;
I couldn&#039;t figure out whether there would be a way to coerce libev into
a structure like this. There are some options there for embedding
event loops inside other event loops, but it feels like it&#039;d be very
ugly if it worked at all. So it&#039;s a really nice library, but probably
not the right match for the application. (See also the digression
about memory allocation later).

&lt;h3&gt;Closures and auto&lt;/h3&gt;

&lt;p&gt;
The way I ended up using libev was with a couple of
&lt;a href=&#039;https://github.com/teclo/flow-disruptor/blob/master/src/state.h#L34&#039;&gt;
trivial wrapper classes&lt;/a&gt;, where the handler callback was passed in
as a C++11 closure. (Yes, yes, every other language has had this since
the &#039;70s. Doesn&#039;t mean it&#039;s any less nice to finally have it in this
environment too).

&lt;p&gt;
It&#039;s absolutely dreamy for defining timers, which need a callback
function and some kind of state for the callback to operate on,
usually passed in as an argument. So the difference is essentially
between declaring a timer in a class like this:

&lt;pre&gt;
    Timer tick_timer_;
&lt;/pre&gt;

&lt;p&gt;
And initializing it like this:

&lt;pre&gt;
      tick_timer_(state, [this] (Timer*) { tick(); }),
&lt;/pre&gt;

&lt;p&gt;
Versus the relative tedium of defining a new class with the callback +
the data as a poor man&#039;s closure, or by defining a separate trampoline
function for every kind of timer.
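
&lt;p&gt;
For reference, a minimal sketch of what such a wrapper might look like
(a simplification with invented details, not the code linked above):

&lt;pre&gt;
#include &amp;lt;ev.h&amp;gt;
#include &amp;lt;functional&amp;gt;

// A libev timer whose callback is a C++11 closure.
class Timer {
public:
    Timer(struct ev_loop* loop, std::function&amp;lt;void(Timer*)&amp;gt; callback)
        : loop_(loop), callback_(callback) {
        ev_init(&amp;watcher_, &amp;Timer::trampoline);
        watcher_.data = this;
    }
    ~Timer() { ev_timer_stop(loop_, &amp;watcher_); }

    void start_after(double seconds) {
        ev_timer_set(&amp;watcher_, seconds, 0.0);
        ev_timer_start(loop_, &amp;watcher_);
    }

private:
    // libev calls this C-style callback; it just bounces into the closure.
    static void trampoline(struct ev_loop*, ev_timer* w, int) {
        Timer* self = static_cast&amp;lt;Timer*&amp;gt;(w-&amp;gt;data);
        self-&amp;gt;callback_(self);
    }

    struct ev_loop* loop_;
    ev_timer watcher_;
    std::function&amp;lt;void(Timer*)&amp;gt; callback_;
};
&lt;/pre&gt;

&lt;p&gt;
With something like that in place, the initializer shown above just
passes in a lambda that captures &lt;code&gt;this&lt;/code&gt; and calls whatever
member function should run when the timer fires.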

&lt;p&gt;
And after having gone this route for event handlers, it was then very
easy to slip into doing it all over the place. Instead of maintaining
a vector of some sort of records that are later acted on, just
maintain a vector of closures instead that do the right thing when
called. For example the token bucket-style bandwidth throttler
implementation doesn&#039;t know anything about packets. It just gets costs
and opaque functions as input, and calls the function once the cost
can be covered.
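
&lt;p&gt;
A sketch of that shape of API (again with invented names, not the actual
implementation):

&lt;pre&gt;
#include &amp;lt;deque&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;utility&amp;gt;

// A token bucket that knows nothing about packets: callers queue a cost
// plus a closure, and the closure runs once enough tokens have
// accumulated to cover the cost.
class Throttler {
public:
    explicit Throttler(double tokens_per_second) : rate_(tokens_per_second) {}

    void enqueue(double cost, std::function&amp;lt;void()&amp;gt; action) {
        queue_.push_back(std::make_pair(cost, action));
    }

    // Called periodically from the event loop with the elapsed time.
    void advance(double elapsed_seconds) {
        tokens_ += rate_ * elapsed_seconds;  // a real one would cap this
        while (!queue_.empty() &amp;&amp; tokens_ &amp;gt;= queue_.front().first) {
            tokens_ -= queue_.front().first;
            std::function&amp;lt;void()&amp;gt; action = queue_.front().second;
            queue_.pop_front();
            action();  // e.g. &quot;send this packet now&quot;
        }
    }

private:
    double rate_;
    double tokens_ = 0.0;
    std::deque&amp;lt;std::pair&amp;lt;double, std::function&amp;lt;void()&amp;gt; &amp;gt; &amp;gt; queue_;
};
&lt;/pre&gt;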

&lt;p&gt;
There&#039;s a small problem with the event handler definitions, since you
need to allocate some memory for the closure in a separate block (the
value cells for the closed over variables need to live
somewhere). It&#039;s not obvious to me that any amount of template magic
would allow working around that. And we tend to be a bit paranoid
about lots of tiny separate allocations. So it&#039;s something I want, but
maybe not something I can actually justify using.

&lt;p&gt;
The new &lt;code&gt;for&lt;/code&gt;-loop and &lt;code&gt;auto&lt;/code&gt; are just pure win,
again not very surprising. I used both a lot, and even returning to
the code now a year later I didn&#039;t really see any places where the
shorthands make the code harder to understand.

&lt;h3&gt;Build system&lt;/h3&gt;

&lt;p&gt;
  I&#039;ve found that build systems sucking and being hated by all the
  users is a good baseline assumption to make. I didn&#039;t hate CMake
  though.  In fact it seems perfect for a C++ program of this size. It
  came with enough batteries included to do everything I wanted to,
  and the (horribly named) CMakeLists.txt file for this project has
  very little boilerplate. It even appears to do transitive
  propagation of compiler / linker flags through dependencies in a
  sane manner, which is a problem that usually drives me completely
  nuts.

&lt;p&gt;
  I wasn&#039;t able to build a mental model of how CMake works though.  At
  least there was never any hope of being able to correctly guess how
  to do something new, or change the behavior of something. It was
  always off to a web search for vaguely plausible keywords. So my
  first impression was that it&#039;s a collection of very specialized bits
  of functionality, and once you have to do anything that&#039;s not
  already supported, the complexity shoots up. But clearly it&#039;s
  worth giving CMake a good look the next time I want to burn our rat&#039;s
  nest of recursive makefiles to the ground and rebuild things.

&lt;h3&gt;Other experiments&lt;/h3&gt;

&lt;p&gt;
  I tried a couple of other funky things like using protocol buffers
  to represent the parsed packet headers rather than doing the usual
  trick of just casting the data to a packed C &lt;code&gt;struct&lt;/code&gt;.
  Not sure it really bought anything in the end; I can&#039;t figure out
  what the motivation was back when I wrote the code. Maybe to use
  them as the disk serialization format for traces?  But that&#039;s
  obviously stupid, since all of our analysis tools already work on
  pcap format. Or maybe I wanted to print some debug output, but
  didn&#039;t want to import our packet pretty-printer and all of its
  dependencies into this project?  (The curse of the monorepo, again).

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;
As I mentioned before, this tool was built for a specific purpose. But
when I described it to someone last week, they came up with a
completely different use that might be very interesting (though it&#039;ll
require a bit of extra work to automatically create some much more
intricate scenario definitions, so we&#039;ll see how practical it turns
out in practice). So maybe this has slightly wider applicability than
I initially thought. If you come up with a cool new use for the tool,
I&#039;d love to know about it.

&lt;h3&gt;Footnotes&lt;/h3&gt;

&lt;div class=&#039;footnotes&#039;&gt;
&lt;p&gt;
&lt;a href=&#039;#ftnt_ref0&#039; name=&#039;ftnt0&#039;&gt;[0]&lt;/a&gt;
Ok, it&#039;s plausible that the RTT sensitivity only kicks in after
a fairly high threshold. Say 2 seconds. So to be absolutely sure
you&#039;d need another test case with an even lower bandwidth limit,
to force all of the connections into the 3-4 second territory.

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref1&#039; name=&#039;ftnt1&#039;&gt;[1]&lt;/a&gt;
borgcfg / GCL

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref2&#039; name=&#039;ftnt2&#039;&gt;[2]&lt;/a&gt;
It gets translated to C++ using a couple of &lt;code&gt;cpp&lt;/code&gt;
invocations, which rather sharply limits the syntax. Who has time for
a real schema compiler when there&#039;s a product to be shipped?

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref3&#039; name=&#039;ftnt3&#039;&gt;[3]&lt;/a&gt;
I was once explaining the guts of our system to some people to
figure out how we might integrate their system and ours. And I could
just sense the disapproval when I admitted that we didn&#039;t have a bound
on the number of timers that could get triggered in a single timer
tick. How irresponsible, we have no idea of how long we might go
between calls to the packet processing layer of the IO loop!

&lt;/div&gt;

</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Thu, 01 Oct 2015 15:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-10-01-flow-disruptor/</guid></item><item><title>Mobile TCP optimization - lessons learned in production</title><link>https://www.snellman.net/blog/archive/2015-08-25-tcp-optimization-in-mobile-networks/</link><description>
&lt;p&gt;

I did a keynote presentation at the &lt;a
href=&#039;http://conferences.sigcomm.org/sigcomm/2015/hotmiddlebox.php&#039;&gt;SIGCOMM&#039;15
HotMiddlebox workshop&lt;/a&gt;, &amp;quot;Mobile TCP optimization -
Lessons Learned in Production&amp;quot;. The title was set before
I had any idea of what I&#039;d really be talking about, just that it&#039;d be
about some of the stuff we&#039;ve been working on at &lt;a
href=&#039;https://www.teclo.net/&#039;&gt;Teclo&lt;/a&gt;. So apologies if the content
isn&#039;t an exact match for the title.

&lt;p&gt;
This post contains my slides, interleaved with my speaker&#039;s notes for
that slide. It won&#039;t be an exact transcription of what I actually
ended up saying, they were just written to make sure that I had at
least something coherent to say re: each slide. We&#039;ve got an
endless supply of network horror story anecdotes, and I can&#039;t actually
remember which ones I ended up using in the talk :-/

&lt;p&gt;
I&#039;m particularly happy that my points
on &lt;a href=&#039;https://www.snellman.net/blog/archive/2015-08-25-tcp-optimization-in-mobile-networks/#slide-7&#039;&gt;transparency of optimization&lt;/a&gt; got a positive
reception. To us it&#039;s a key part of making optimization be a good
networking citizen, and has seemingly been getting short shrift so
far. Hilariously the other TCP optimization talk at the workshop
brought up a transparency issue we&#039;d never had to consider, lack of
MAC transparency causing a Wifi security gateway to think connections
were being spoofed.

&lt;p&gt;
Thanks to Teclo for letting me talk about some of this stuff publicly,
and to everyone who attended HotMiddlebox. It was a lot of fun, and I
got a bunch of useful information from the hallway discussions.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h2&gt;Table of Contents&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-1&#039;&gt;Introduction&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-2&#039;&gt;Background&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-3&#039;&gt;Implementation 1/2&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-4&#039;&gt;Implementation 2/2&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-5&#039;&gt;TCP optimization&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-6&#039;&gt;An optimized connection&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-7&#039;&gt;Transparency&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-8&#039;&gt;Simple optimizations&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-9&#039;&gt;Speedups&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-10&#039;&gt;Buffer management&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-11&#039;&gt;Effect on RTTs and packet loss&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-12&#039;&gt;Burst control&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-13&#039;&gt;Things we learned along the way&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-14&#039;&gt;Don&#039;t rely on hardware features&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-15&#039;&gt;Two mobile networks are never equal&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-16&#039;&gt;Reordering&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-17&#039;&gt;Strange packet loss patterns&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-18&#039;&gt;Bad or conflicting middleboxes&lt;/a&gt;
  &lt;li&gt; &lt;a href=&#039;#slide-19&#039;&gt;O&amp;amp;M is a lot of work&lt;/a&gt;
&lt;/ul&gt;

&lt;h2&gt;Presentation&lt;/h2&gt;

&lt;a id=&#039;slide-0&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-0.png&#039; /&gt;

&lt;p&gt;
  Hi, good morning everyone. I&#039;m Juho, and I&#039;ll be talking about the
  mobile TCP optimization system we&#039;ve been working on at Teclo
  Networks.

&lt;a id=&#039;slide-1&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-1.png&#039; /&gt;

&lt;p&gt;
  I&#039;ll start with a tiny bit of background on the product, then show how
  it works, how we think about TCP optimization and some results. And
  finally I&#039;ll go through some of the things we learned while
  growing this from a prototype to a product that can be deployed in
  real operator networks.

&lt;a id=&#039;slide-2&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-2.png&#039; /&gt;

&lt;p&gt;
  We&#039;re a Zurich based startup that&#039;s been working on TCP optimization
  for about 5 years now, with the first production deployments over 4
  years ago so we&#039;ve got a bit of experience at this point. We&#039;ve been
  in live traffic in around 50 mobile networks with about 20 commercial
  deployments. This includes all kinds of radio technologies (2G, 3G,
  LTE, WiMAX, CDMA) and anything from small MVNOs with 100Mbps of
  traffic to multi-site installations at major operator groups with
  100Gbps of traffic.

&lt;a id=&#039;slide-3&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-3.png&#039; /&gt;

&lt;p&gt;
  This is all done with standard hardware, normal Xeon CPUs and Intel
  82580 or 82599 NICs. The only exotic components in our typical setup
  are NICs with optical bypass for failover. This scales up to 10
  million connections and 20Gbps of optimization in a single 2U box.

&lt;p&gt;
  Our normal method of integration is to function as a bump in the
  wire with no L2/L3 address, preferably on the Gi link next to the
  GGSN. This is the last point in the core network that deals with raw IP
  before it gets GTP encapsulated. So we have two network ports, one
  is connected to the GGSN and the other is connected to the next hop
  switch. From their point of view it&#039;s just a very smart piece of
  wire.

&lt;a id=&#039;slide-4&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-4.png&#039; /&gt;

&lt;p&gt;
  We&#039;ve got a completely custom user space TCP stack. We started from
  scratch rather than from an existing implementation since our method
  of splitting the connection into two separate parts without
  sacrificing transparency would be hard to retrofit into an existing
  stack. Packet IO is done with our own user space NIC drivers;
  basically map the PCI registers and a big chunk of physical memory for
  frame storage, and manipulate the NIC rx and tx descriptor rings
  directly. It&#039;s not a lot of code, less than 1000 lines, and has some
  really nice properties like a complete zero copy implementation even for
  packets that we buffer for arbitrary amounts of time.

&lt;p&gt;
  The operating system is only involved with the control plane, the
  data plane is all in user space. One reason for that is obviously
  performance, but we think it&#039;s also a big win all around. Everything
  is always so much easier when you&#039;re working in user space;
  programming, debugging, testing, deployment.

&lt;a id=&#039;slide-5&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-5.png&#039; /&gt;

&lt;p&gt;
  So what do I mean by TCP optimization?

&lt;a id=&#039;slide-6&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-6.png&#039; /&gt;

&lt;p&gt;
  An optimized connection looks something like this. We pass the
  initial handshake through unmodified; yellow SYN from client to
  server, yellow SYNACK, and the final ACK in blue. Up to this point
  we&#039;re just a totally transparent network element. If there&#039;s
  anything odd about the connection setup, we&#039;ll just leave that
  connection unoptimized and continue forwarding any packets straight
  through.

&lt;p&gt;
  But in this case everything went fine, so from that point
  on we&#039;ll ACK any data packets and take responsibility for delivering
  them. So here in green we have the request; we send an ACK to the
  client, and the segment toward the server. The server sends the
  first part of the response, which we ACK, and it then sends a new batch
  of data to us.

&lt;p&gt;
  So it&#039;s kind of a hybrid. Not a terminating split-TCP proxy, but
  thanks to acknowledging data is a lot more effective than a Snoop
  proxy.

&lt;a id=&#039;slide-7&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-7.png&#039; /&gt;

&lt;p&gt;
  Since we don&#039;t terminate the connection, we can be fully transparent
  in TCP options and sequence numbers. This provides us with some
  really nice advantages.

&lt;p&gt;
  First, we can stop optimizing connections at almost any time without
  breaking them. As long as we have no undelivered data buffered for
  the connection, the endpoints agree on the connection state and can
  just pick it up. This means we can have pretty short idle timeouts,
  a couple of minutes rather than 15 minutes. And if something odd
  happens? Just stop optimizing. This is also really nice for
  upgrades; we just stop optimizing connections, wait about a minute
  for all the buffers to drain, and take the system to bypass without
  interruption of service.

&lt;p&gt;
  It deals very nicely with asymmetric routing. Sometimes something
  goes wrong with the integration, and we only get the uplink or
  downlink packets for some of the traffic. If we see only the SYN and
  the ACK but not the SYNACK, we&#039;ll just skip optimizing the
  connection. The same if there&#039;s a loopback route where we
  see the SYN twice, once in each direction.

&lt;p&gt;
  One problem with middleboxes is that they make it hard to introduce
  new TCP options, say multipath TCP or TCP fast open. Terminating
  middleboxes will essentially eat the unknown options. In our
  design a SYN with unknown options will just be passed straight through
  and we&#039;ll let the endpoints take care of the rest. There&#039;s a closely
  related issue of protocols that claim in the IP header to be TCP in
  order to bypass firewalls, but actually aren&#039;t. Again these would be
  broken by termination.

&lt;p&gt;
  Finally, terminating proxies will often end up with a different MSS
  on the two sides of the connection. So it&#039;s supposed to send packets
  of at most 1380 bytes toward the client, but the server sends it
  data in chunks of 1460 bytes. This repacketization increases load,
  but also increases protocol overhead when packets are split
  suboptimally. In the worst case there will be a substantial amount
  of tiny segments in the middle of the TCP flow, which is problematic
  for a lot of mobile networks. In our design the segments can just be
  passed through as-is, with at most a bit of tweaking to the IP and
  TCP headers.

&lt;a id=&#039;slide-8&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-8.png&#039; /&gt;


&lt;p&gt;
  Some of the optimizations we can do are standard fare. Latency
  splitting speeds up the initial phase of the connection, especially
  if the server is old and still has an initial congestion window of
  2/3/4. Likewise if the connection is bottlenecked on the receive
  window, reducing the effective latency improves steady state
  throughput.

&lt;p&gt;
  If there&#039;s packet loss, having the retransmission happen nearer to
  the edge means we react faster to it. We can also make better
  decisions thanks to knowing it&#039;s a radio network and applying some
  heuristics to the packet loss patterns (more on that later). We
  don&#039;t have any fancy congestion control algorithm, our experience is
  that it&#039;s just not an area where you can gain a lot.

&lt;p&gt;
  One thing you have a lot in mobile networks is either the radio
  uplink or downlink freezing completely for very long times relative
  to normal RTT. This triggers a lot of bogus retransmit timeouts in
  vanilla TCP stacks. We never use retransmit timers, and detect tail
  losses using probing instead. This allows faster recovery in cases
  where the full window was really lost. There&#039;s also no risk of
  misinterpreting an ACK of the original data as an ACK of the
  retransmitted data, which might cause confusion otherwise.

&lt;a id=&#039;slide-9&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-9.png&#039; /&gt;

&lt;p&gt;
  How well does it work? Here&#039;s some results from a trial in a
  European LTE network last winter. These are average throughput
  numbers for downloads, bucketed by transfer size. Optimized results
  in blue, unoptimized in red. What we see is little if any
  acceleration for tiny files (there&#039;s not much scope for optimization
  when all the data fits in the initial window), and anywhere from
  10-40% speedups for larger transfers.

&lt;p&gt;
  Just to clarify, these are results from live traffic rather than any
  kind of synthetic testing, looking at the throughput of all TCP
  connections going through the operator&#039;s network with alternating days
  of optimizing 100% of the traffic with days of optimizing none of it.

&lt;p&gt;
  (N.B. Measurement seemed to be a somewhat sensitive issue, with
  doubts expressed regarding whether we were really doing it
  in a statistically robust manner, and whether averages are really
  a sensible way to represent the information.

&lt;p&gt;
  Performance measurements in mobile networks are indeed a harder
  problem than you&#039;d think. Unfortunately it&#039;s also a subject I could
  talk about for an hour and this was a 40 minute talk that needed to
  cover a lot more ground. I might need to write a separate post on
  this subject.

&lt;p&gt;
  But suffice to say that we do the live traffic measurements in a way
  that tries to minimize effects from weekday / day of month trends,
  and over a long enough period of time that the diurnal cycle is
  irrelevant. Averages are indeed not a good measure from an academic or
  even engineering point of view, but using anything else is miserable
  commercially. You don&#039;t want to spend the first 30 minutes of a
  meeting by explaining exactly how to interpret probability density
  function or CDF graphs.)

&lt;a id=&#039;slide-10&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-10.png&#039; /&gt;

&lt;p&gt;
  Here&#039;s a more interesting optimization: buffer management, which is
  our feature for mitigating buffer bloat. Mobile networks are usually
  tuned to prefer queueing over dropping packets. And internet servers
  in turn are always going to ignore queueing as a congestion signal,
  since RTT-based congestion control schemes will always lose out in
  practice due to the tragedy of the commons. So the queues will be
  filled to the brim even in pretty normal use. The most extreme case
  we&#039;ve seen had queues of 30 seconds. Obviously unusable for
  anything, but even a few hundred milliseconds of extra RTT will
  make interactive use painful.

&lt;p&gt;
  Now, in mobile networks these queues are almost always per-user
  rather than per-flow or global. So what we do is handle all of the
  TCP flows of a single user as a unit. We determine the amount of
  data in flight across all flows that both keeps RTTs at an
  acceptable level and doesn&#039;t starve the radio network of
  packets. When the conditions change, we adjust that estimate. This
  quota of in-flight bytes is then split between all the flows of
  the user, and we give all connections their fair share of it. So the
  batch download won&#039;t completely crowd out the web browsing.
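
&lt;p&gt;
  Here&#039;s a rough sketch of just the quota-splitting step. This isn&#039;t our
  actual code; the Flow structure and split_user_quota are made-up names,
  and estimating the per-user budget itself is a separate problem that is
  simply passed in as a number here:

&lt;pre&gt;
// Split a per-user in-flight byte budget fairly across the flows of that
// user: every flow gets at most an equal share, and whatever the light
// flows do not need is handed to the flows that could use more.
#include &lt;algorithm&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;

struct Flow {
    uint32_t demand;   // bytes the flow could usefully keep in flight
    uint32_t quota;    // bytes the flow is allowed to keep in flight
};

void split_user_quota(std::vector&lt;Flow&gt;&amp; flows, uint32_t user_budget) {
    if (flows.empty()) return;
    uint32_t fair_share = uint32_t(user_budget / flows.size());
    uint32_t leftover = 0;
    // First pass: nobody gets more than an equal share of the budget.
    for (Flow&amp; f : flows) {
        f.quota = std::min(f.demand, fair_share);
        leftover += fair_share - f.quota;
    }
    // Second pass: redistribute what the light flows did not need, so a
    // bulk download can use spare capacity without crowding out the rest.
    for (Flow&amp; f : flows) {
        if (leftover == 0) break;
        uint32_t extra = std::min(leftover, f.demand - f.quota);
        f.quota += extra;
        leftover -= extra;
    }
}
&lt;/pre&gt;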

&lt;p&gt;
  This is independent of per-flow congestion control; there are some
  practical reasons why you only want to do packet-loss-based
  congestion control on a flow level. It won&#039;t work when done
  per-subscriber.

&lt;a id=&#039;slide-11&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-11.png&#039; /&gt;

&lt;p&gt;
  How well does it work? Here are some results from the same network as
  the previous example. We have results from 4 days of testing,
  with time on the X axis. The samples in blue are from the days when
  all traffic was being optimized, the red samples from days with
  optimization turned off. The graph on the left has average RTT on
  the Y axis. On the optimized days it&#039;s pretty stable at around
  165ms, on the unoptimized days there&#039;s more variation and the
  averages are much higher, 320ms. So it&#039;s an almost 50% reduction in
  average RTT. On the right hand graph we have retransmission rates on
  the Y axis. For optimized days they are mostly under 1% (average
  0.8%) and for unoptimized mostly above 2% (average 2.6%). That&#039;s a
  roughly 70% reduction in retransmissions.

&lt;p&gt;
  This data is from the same set of testing as the results from a
  couple of slides back. So these RTT reductions aren&#039;t coming at the
  expense of performance. Instead we&#039;re getting big improvements in both
  RTT and in throughput.

&lt;a id=&#039;slide-12&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-12.png&#039; /&gt;

&lt;p&gt;
  Another thing we&#039;ve noticed is that even surprisingly modest bursts
  of traffic can cause packet loss, for example when traffic gets
  switched from 10G to 1G links. And there are all kinds of mechanisms
  in TCP that can cause the generation of such bursts: ACK bunching,
  losing a large number of consecutive ACKs, or losing a packet when
  the full receive window&#039;s worth of data is in flight (the delivery
  of the retransmitted packet will open up the full window in one go).

&lt;p&gt;
  Whatever the mechanism, it turns out that you don&#039;t want the TCP
  implementation to send out hundreds of kilobytes for a single
  connection in a few microseconds. Instead it&#039;s better to spread the
  transmits over a longer period of time. So not 200kB at once but
  instead 20kB at 1ms intervals. In one network this kind of pacing
  reduced the observed packet loss rate on large test transfers from
  over 1% to under 0.2%.
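
&lt;p&gt;
  As a sketch of the idea (not our actual implementation; Pacer,
  send_segment and schedule_tick are made-up stand-ins for the real
  transmit path and timer wheel):

&lt;pre&gt;
// Spread a backlog over time instead of emitting it as one burst: send at
// most chunk_bytes per tick, and re-arm a timer while data is still queued.
#include &lt;cstddef&gt;
#include &lt;deque&gt;

struct Segment { size_t len; /* headers + payload */ };

struct Pacer {
    std::deque&lt;Segment&gt; queue;                       // data waiting to go out
    static constexpr size_t chunk_bytes = 20 * 1024; // roughly 20kB per tick
    static constexpr int tick_ms = 1;                // one tick per millisecond

    void send_segment(const Segment&amp;) { /* hand off to the transmit path */ }
    void schedule_tick(int /*ms*/)    { /* arm a timer that calls on_tick() */ }

    void on_tick() {
        size_t sent = 0;
        while (!queue.empty() &amp;&amp; sent + queue.front().len &lt;= chunk_bytes) {
            sent += queue.front().len;
            send_segment(queue.front());
            queue.pop_front();
        }
        if (!queue.empty())
            schedule_tick(tick_ms);   // keep pacing until the backlog drains
    }
};
&lt;/pre&gt;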

&lt;a id=&#039;slide-13&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-13.png&#039; /&gt;

&lt;p&gt;
  Here are some of the things we learned the hard way.

&lt;a id=&#039;slide-14&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-14.png&#039; /&gt;

&lt;p&gt;
  Every time we depend on a hardware feature we end up regretting
  it. They can never be used to save on development effort, because
  next month there will be new requirements that the hardware feature
  isn&#039;t flexible enough to handle. You always need to implement a pure
  software fallback that&#039;s fast enough to handle production loads. And
  if you&#039;ve already got a good enough software implementation, why go
  through the bother of doing a parallel hardware implementation? The
  only thing that&#039;ll happen is that you&#039;ll get inconsistent
  performance between use cases that get handled in hardware vs. use
  cases that get handled in software. A few examples:

&lt;p&gt;
  The most common issue is needing to deal with more and more exotic forms
  of encapsulation. VLANs are fine, hardware will always support
  that. Double VLANs might or might not be fine. Some forms of fixed
  size encapsulation are fine. Multiple nested layers of MPLS, or GTP
  with its variable length header are a lot more problematic. A
  canonical example here is checksum offload for both rx and tx; there&#039;s
  hardware support that always eventually ends up being insufficient,
  and you can compute the checksums very fast with vector instructions
  on modern CPUs.
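
&lt;p&gt;
  The software fallback for checksumming really is simple; here&#039;s the
  textbook RFC 1071 one&#039;s complement sum in scalar form (a production
  version would do the same thing over wider chunks with vector
  instructions):

&lt;pre&gt;
// RFC 1071 style Internet checksum, usable when checksum offload is not
// available or not flexible enough. The folding at the end handles the
// carries from summing 16-bit words into a wider accumulator.
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

uint16_t internet_checksum(const uint8_t* data, size_t len) {
    uint64_t sum = 0;
    while (len &gt;= 2) {                              // sum 16-bit words
        sum += (uint64_t(data[0]) &lt;&lt; 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)                                        // odd trailing byte
        sum += uint64_t(data[0]) &lt;&lt; 8;
    while (sum &gt;&gt; 16)                               // fold the carries back in
        sum = (sum &amp; 0xffff) + (sum &gt;&gt; 16);
    return uint16_t(~sum);
}
&lt;/pre&gt;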

&lt;p&gt;
  We&#039;ve been using multiple RX queues for parallelization. So we
  assign one process to each core, with each one having a separate
  receive queue. We originally did this with basic RSS hashing, but
  that quickly became insufficient. Smart people would have stopped
  there. But we&#039;re idiots, and read the documentation on this wonderful
  semi-programmable traffic distribution engine built into the NICs
  we used, called the Flow Director. Up to 32k rules with various kinds
  of matching. Awesome.

&lt;p&gt;
  But then we needed to deal with
  virtualization. SR-IOV looks just perfect for our needs; the virtual
  machine gets its virtual slice of the network card that&#039;s for all
  intents and purposes identical to the real hardware. Except... It only
  has a maximum of 8 queues when physical hardware can do up to
  128. We need more parallelization than 8 queues. And of course
  encapsulation is an issue too; even the Flow Director isn&#039;t flexible
  enough for every use case we&#039;ve seen. So everything needs to be
  architected around just a single RX queue and doing the traffic
  distribution fully in software.

&lt;a id=&#039;slide-15&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-15.png&#039; /&gt;

&lt;p&gt;
  We initially thought that we&#039;d just write the system once, deploy it
  everywhere, and roll in money while doing no work. But it turns out
  that no two networks are quite the same. There are often new kinds of
  performance issues or the existing methods of integration don&#039;t work
  in a network. We love the first case from a commercial point of
  view, since every problem is an opportunity for a performance
  improvement. We hate the latter, since supporting new integration
  methods brings no real value; it&#039;s just a cost of doing
  business. But in both cases the end result is that there&#039;s a fair
  bit of code that&#039;s not exercised in normal use and is liable to code
  rot.

&lt;p&gt;
  Automated deterministic unit testing has been absolutely key in
  maintaining our sanity with this, and my only regret there is not
  planning for it from the first line of code and having to retrofit
  it in. Compared to the normal network programming workflow it&#039;s
  just such a huge productivity boost to be able to run hundreds of
  TCP behavior unit and regression tests in a few seconds. Things that
  get fixed stay fixed. We&#039;re also able to automatically generate IPv6
  test cases from our IPv4 tests. If we didn&#039;t do that, there&#039;s a good
  chance that our IPv6 support would rot away very quickly.

&lt;a id=&#039;slide-16&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-16.png&#039; /&gt;

&lt;p&gt;
  I&#039;ll give a couple of examples of performance issues I thought were
  interesting.

&lt;p&gt;
  The folklore around mobile networks is that there will never be any
  reordering. So our initial design was actually based on aggressively
  making use of that assumption. Turns out not to be true in
  practice. For equipment from one particular vendor we&#039;ve seen small
  packets get massively reordered ahead of large ones. We&#039;re talking about
  reordering by 30 segments or over 50ms. This is particularly bad if
  there&#039;s a terminating HTTP proxy in the mix, and MTU mismatches on the
  southbound and northbound connections cause the proxy to generate lots
  of small packets. And reordering is poison for TCP. So we had to
  develop special heuristics to detect and gracefully handle this case.

&lt;a id=&#039;slide-17&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-17.png&#039; /&gt;

&lt;p&gt;
  Then there&#039;s strange patterns of packet loss or packet corruption. I
  talked earlier about the burst control feature that mitigates
  problems caused by packet loss from 10G to 1G switching. But this
  one is maybe even more mysterious.

&lt;p&gt;
  One network was regularly losing some or all packets right at the
  start of the connection. So the handshake would go through, the request
  would go out, and then the response would get dropped somewhere in the
  RAN. Losing the initial window of packets is of course just about
  the worst thing you can do to TCP. And this only happened in one
  geographical region that was using a different radio vendor than the
  rest of the country. Our best guess was that it was somehow related
  to the 3G state machine transition from low power to high power
  mode.

&lt;p&gt;
  We never did find out exactly what the issue was, getting that
  sorted out was the operator&#039;s job. But again we needed some
  specialized code, this time to handle packet loss after a period of
  no activity differently from packet loss in the middle of a
  high-activity period.

&lt;a id=&#039;slide-18&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-18.png&#039; /&gt;

&lt;p&gt;
  Operator core networks can have absurd numbers of chained
  middleboxes. So you&#039;ll have a chain of a traffic shaper, a TCP
  optimizer, a video optimizer, an image and text compressor, caching
  proxies, a NAT and a firewall, all from different vendors. When
  something goes wrong, it can take a lot of effort to just locate
  which component is at fault. The default assumption always seems to
  be that it&#039;s the most recently added box, which to be fair is a
  pretty good heuristic to apply. Things are even worse if it&#039;s not
  strictly a problem with a single network element, but in the
  interactions of several nodes. And these middleboxes are configured
  once, probably not looked at again for a very long time unless someone
  notices a problem, and the combination is never tuned holistically.

&lt;p&gt;
  Maybe a canonical example of this is MTU clamping. In mobile
  networks you generally want a maximum MSS of 1380 to account for the
  GTP protocol overhead. Often this is done by making e.g. the
  firewall clamp the MTU. This works great until a terminating proxy
  is added south of the firewall, such that the clamping doesn&#039;t apply
  to the communication between the client and the proxy but does apply
  to the communication between the proxy and the server. This is
  exactly the wrong way around, and it&#039;s easy to miss since
  things will still work, just inefficiently.
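
&lt;p&gt;
  For reference, here&#039;s roughly what the clamping operation itself does,
  which is part of why it&#039;s so easy to apply it on the wrong leg and never
  notice. This is just a sketch; clamp_mss is a made-up name and the
  checksum fixup is left out:

&lt;pre&gt;
// Walk the TCP options of a SYN and lower any MSS option that exceeds the
// target value (e.g. 1380 for GTP encapsulated links).
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

void clamp_mss(uint8_t* tcp, size_t tcp_len, uint16_t max_mss) {
    size_t hdr_len = size_t(tcp[12] &gt;&gt; 4) * 4;       // data offset, in bytes
    if (!(tcp[13] &amp; 0x02) || hdr_len &gt; tcp_len)      // only SYNs carry MSS
        return;
    size_t off = 20;                                 // options follow the fixed header
    while (off + 1 &lt; hdr_len) {
        uint8_t kind = tcp[off];
        if (kind == 0) break;                        // end of option list
        if (kind == 1) { ++off; continue; }          // NOP
        uint8_t len = tcp[off + 1];
        if (len &lt; 2 || off + len &gt; hdr_len) break;   // malformed options
        if (kind == 2 &amp;&amp; len == 4) {                 // MSS option
            uint16_t mss = uint16_t((tcp[off + 2] &lt;&lt; 8) | tcp[off + 3]);
            if (mss &gt; max_mss) {
                tcp[off + 2] = uint8_t(max_mss &gt;&gt; 8);
                tcp[off + 3] = uint8_t(max_mss &amp; 0xff);
                // a real implementation also updates the TCP checksum here
            }
        }
        off += len;
    }
}
&lt;/pre&gt;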

&lt;p&gt;
  HTTP proxies are frequently really badly configured from a TCP
  standpoint, which makes some sense since that&#039;s not their core
  competence. A HTTP proxy is usually there to do a specific function
  like caching, compression or legal intercept. It&#039;s not there to
  optimize speed. But you&#039;ll see things like the proxy not using window
  scaling for the connection to the server, or still using an initial
  congestion window of 2. These are things that should not exist in
  2015.

&lt;p&gt;
  There are proxies that freeze the connection for a few seconds on
  receiving a zero window, which is normally not a big deal since zero
  windows are rare. But a bit of a problem if you have another device
  right next to the proxy ACKing the data - for example a TCP
  optimizer. (We had to develop a special mode that&#039;d never emit zero
  windows and instead do flow control through delaying ACKs
  progressively more and more as the receive buffers fill up). We even
  saw a HTTP proxy that had been configured to retransmit 15 segments
  instead of one segment on a retransmit timeout. Since RTOs caused by
  delays rather than full packet loss are of course really common in
  mobile, this proxy was spewing out amazing amounts of spurious
  retransmissions.
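
&lt;p&gt;
  The ACK delaying trick mentioned above is roughly this (a sketch only;
  ack_delay_ms is a made-up name, and the thresholds and the curve would
  be tuned per deployment):

&lt;pre&gt;
// Instead of advertising a zero window when the receive buffer is filling
// up, keep advertising a small window but delay the ACKs progressively, so
// the sender slows down without us ever emitting a zero window.
#include &lt;algorithm&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

uint32_t ack_delay_ms(size_t buffered, size_t buffer_size) {
    double fill = double(buffered) / double(buffer_size);   // 0.0 .. 1.0
    if (fill &lt; 0.5)
        return 0;                       // plenty of room, ACK immediately
    double over = (fill - 0.5) / 0.5;   // how far into the danger zone we are
    // Ramp up quadratically and cap the delay at 200 ms.
    return uint32_t(std::min(200.0, over * over * 200.0));
}
&lt;/pre&gt;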

&lt;p&gt;
  One funny interaction we see all the time is having a TCP optimizer
  right next to a traffic shaper. So you have one box whose job it is
  to speed things up, and another that tries to slow things down. It&#039;s
  an insane way of doing things - these two tasks should maybe be done
  by the same device.

&lt;a id=&#039;slide-19&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-19.png&#039; /&gt;

&lt;p&gt;
  I&#039;ve just been talking about the data plane in this presentation,
  since that&#039;s the part I&#039;m responsible for. But it&#039;s really important
  to note that the data plane alone is not a product you can sell.

&lt;p&gt;
  You need quite a lot of sophistication on the control plane too to
  get something that can be deployed in operator networks. It&#039;s a lot
  more work than one might think; in our case the management system
  probably took at least as much effort as the traffic handling, with
  three full rewrites and a fourth one ongoing. So there&#039;s a
  Juniper-style CLI for configuration, a web UI for simple
  configuration and statistics, a counter database, support for all
  kinds of protocols for getting operational data in and out of the
  management system, and so on.

&lt;p&gt;
  If anyone here is planning on turning research into a product, you
  need to budget a lot more time for this stuff than you think.

&lt;a id=&#039;slide-20&#039; /&gt;
&lt;img class=&#039;slide&#039; src=&#039;/blog/stc/images/tcp-optimization-slides/slide-20.png&#039; /&gt;

&lt;p&gt;
  Thanks a lot for your attention, we should have some time
  for questions now. If you want to get in touch with me for some reason,
  here are my contact details.
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Tue, 25 Aug 2015 15:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-08-25-tcp-optimization-in-mobile-networks/</guid></item><item><title>Unit testing a TCP stack</title><link>https://www.snellman.net/blog/archive/2015-07-09-unit-testing-a-tcp-stack/</link><description>
&lt;p&gt;
Last year in an online discussion someone used in-kernel TCP stacks as
a canonical example of code that you can&amp;#39;t apply modern testing
practices to. Now, that might be true but if so the operative phrase
there is &amp;quot;in-kernel&amp;quot;, not &amp;quot;TCP stack&amp;quot;. When the
TCP implementation is just a normal user-space application,
there&amp;#39;s no particular reason it can&amp;#39;t be written in a way
that&amp;#39;s testable and amenable to a test driven development
approach.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
The first versions of &lt;a href=&#039;https://www.teclo.net&#039;&gt;Teclo&lt;/a&gt;&#039;s
TCP stack were written as a classic monolithic systems application
with lots of explicit and implicit global state, not as something that
could be treated as a library let alone something you could reasonably
mock out parts of. As such it was totally unsuited for any kind of
automated testing. The best you could do was run the system in various
kinds of simulated network environments and check that it was getting
roughly the same speeds from one release to the next. We&#039;ve also got
over 50 configuration parameters for tweaking the behavior of the TCP
algorithms, which would make for a hell of a test matrix. Repeated
manual testing of all these parameters would probably require a person
to do nothing but run those tests full time.

&lt;p&gt;
This was clearly not tenable in the long term, so getting some kind of
deterministic and automatable tests up was a pretty high priority. So
very soon after we were finished with the rush to ship a first
version, we refactored things a bit for better testability and
got at least some rudimentary tests up.

&lt;h3&gt;How we write tests&lt;/h3&gt;

&lt;p&gt;
What would make a TCP implementation particularly tricky to test?

&lt;p&gt;
Our TCP flow record has over 70 state variables and 10 timers (some of
which can interact with each other). And we need two of those records
for a single TCP connection, with the state of one flow potentially
affecting the behavior of the other &lt;a href=&#039;#ftnt1&#039;
name=&#039;ftnt_ref1&#039;&gt;[1]&lt;/a&gt;. With this much interlinked state it is hard
to feel confident about any testing that tries to artificially set up
only the relevant variables. Even if such setup is done correctly right
now, it would be very easy for those assumptions to break as the code
changes, invalidating the tests.

&lt;p&gt;
In general the appropriate unit of testing here is then the TCP stack
as a whole rather than e.g. somehow trying to test a feature like zero
window probing in isolation just by calling a method that implements
that feature. The latter would be an absurd idea, since interesting
TCP features end up being a lot more cross cutting than that. Of
course I don&amp;#39;t mean testing the application as a whole either. We
chop the application off at the core event loop, which would
normally handle polling the NICs for packets, update the system&amp;#39;s
idea of the current time between packets, run timers when appropriate,
and occasionally receive RPC messages from the management system. All
of this detail is luckily irrelevant for testing the core TCP
algorithms.

&lt;p&gt;
Instead for testing we create an instance of the TCP stack that
replaces the normal NIC-based IO backend with a callback based one. The
test driver will inject packets directly to the TCP stack by calling
the appropriate entry point. When the TCP stack wants to emit a
packet, that triggers a callback in the test driver and we can check
whether the contents of the packet are what were expected. Finally,
the other entry point to the TCP stack is implicit, through timer
callbacks. To handle this case, we need to replace the wall clock
based time source with a virtual one, and give the test driver the
responsibility for triggering it.
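
&lt;p&gt;
The rough shape of that arrangement might look something like the sketch
below. These are not our actual classes or names, just an illustration of
the seams involved: an IO interface the stack emits packets through, and a
time source the test driver can advance at will.

&lt;pre&gt;
#include &lt;cstdint&gt;
#include &lt;cstdlib&gt;
#include &lt;deque&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

struct Packet { std::vector&lt;uint8_t&gt; bytes; };

// The TCP stack only ever talks to these two interfaces.
struct IoBackend {
    virtual void emit(int interface, const Packet&amp; p) = 0;   // stack output
    virtual ~IoBackend() {}
};
struct TimeSource {
    virtual uint64_t now_us() const = 0;
    virtual ~TimeSource() {}
};

// Test driver implementations.
struct TestIo : IoBackend {
    std::deque&lt;std::string&gt; expected[2];        // per-interface expect queues

    void emit(int interface, const Packet&amp; p) override {
        // pretty_print() stands in for the packet-to-tcpdump-string code.
        std::string actual = pretty_print(p);
        if (expected[interface].empty() || expected[interface].front() != actual)
            std::abort();                       // the real driver reports file/line
        expected[interface].pop_front();
    }
    static std::string pretty_print(const Packet&amp;);
};

struct VirtualClock : TimeSource {
    uint64_t t_us = 0;
    uint64_t now_us() const override { return t_us; }
    void advance_us(uint64_t us) { t_us += us; } // the test driver owns time
};
&lt;/pre&gt;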

&lt;p&gt;
The second problem is expressing the test cases in a convenient
manner. The first thought here was expressing the test cases as pcap
format trace files &lt;a href=&#039;#ftnt2&#039;
name=&#039;ftnt_ref2&#039;&gt;[2]&lt;/a&gt;. The trace files would theoretically
have exactly right information: the exact packet contents and
microsecond accurate timing information for the test driver to work
by. This approach turns out not to be good for much. Artificial
test cases are very hard to create and update, debugging test failures
is painful, and testing for things other than the output packets
(e.g. counters) is impossible &lt;a href=&#039;#ftnt3&#039;
name=&#039;ftnt_ref3&#039;&gt;[3]&lt;/a&gt;. No, the test cases really need to be
expressed in code.

&lt;p&gt;
For that to work, there needs to be a simple way of describing packets
both for the purposes of generating packets as well as comparing
output packets to expectations. Now, we happened to have some code
around for pretty-printing many kinds of packets as JSON. That sounds
like a perfect tool for the job; we&#039;d just need a bit of code to
do the reverse operation. This is what the JSON looked like:

&lt;pre&gt;
{
    &amp;quot;ether&amp;quot;:{
        &amp;quot;source&amp;quot;:&amp;quot;31:32:33:34:35:36&amp;quot;,
        &amp;quot;dest&amp;quot;:&amp;quot;41:42:43:44:45:46&amp;quot;,
        &amp;quot;type&amp;quot;:&amp;quot;ip&amp;quot;
    },
    &amp;quot;ip&amp;quot;:{
        &amp;quot;version&amp;quot;:4,&amp;quot;hlen&amp;quot;:5,&amp;quot;tos.dscp&amp;quot;:0,&amp;quot;tos.ecn&amp;quot;:0,&amp;quot;len&amp;quot;:1040,&amp;quot;id&amp;quot;:0,
        &amp;quot;rf&amp;quot;:0,&amp;quot;df&amp;quot;:0,&amp;quot;mf&amp;quot;:0,&amp;quot;offs&amp;quot;:0,&amp;quot;ttl&amp;quot;:255,&amp;quot;proto&amp;quot;:&amp;quot;tcp&amp;quot;,&amp;quot;check&amp;quot;:0,
        &amp;quot;src&amp;quot;:&amp;quot;170.170.170.187&amp;quot;,&amp;quot;dst&amp;quot;:&amp;quot;8.0.0.8&amp;quot;
    },
    &amp;quot;tcp&amp;quot;:{
        &amp;quot;source&amp;quot;:80,&amp;quot;dest&amp;quot;:58999,&amp;quot;window&amp;quot;:48000,&amp;quot;check&amp;quot;:0,
        &amp;quot;seq&amp;quot;:1001000,&amp;quot;ack_seq&amp;quot;:1100,&amp;quot;urg_ptr&amp;quot;:0,
        &amp;quot;cwr&amp;quot;:0,&amp;quot;ece&amp;quot;:0,&amp;quot;urg&amp;quot;:0,&amp;quot;ack&amp;quot;:1,&amp;quot;psh&amp;quot;:0,&amp;quot;rst&amp;quot;:0,&amp;quot;syn&amp;quot;:0,&amp;quot;fin&amp;quot;:0,
        &amp;quot;options&amp;quot;: {},
        &amp;quot;data&amp;quot;:&amp;quot;...&amp;quot;
    }
}
&lt;/pre&gt;

&lt;p&gt;
Right... That just won&#039;t work. First, it&#039;s way too verbose since even a
simple test will involve tens of packets. You could eliminate some of
the verbosity in the packet generation by using lots of
defaulting. But that doesn&#039;t really work for checking outputs against
expects. This kind of 1:1 mapping to raw fields in the packet is also
not what&#039;s really needed. Things like advertised windows and sequence
ranges are pretty much the things I care most about when specifying a
test case. But the actual advertised window can&#039;t be determined from a
single packet; you need to know the window scaling factor that&#039;s in
use, which is only available in the SYN. Likewise the starting
sequence number of a segment is in the TCP header, but the ending
sequence number is implicit.

&lt;p&gt;
What we need is something designed for humans to read. The obvious
choice here was to pattern it after the &lt;code&gt;tcpdump&lt;/code&gt; output
format since we read packet dumps in that format every day.

&lt;pre&gt;
41:42:43:44:45:46 &amp;gt; 31:32:33:34:35:36, 8.0.0.8.58999 &amp;gt; 170.170.170.187.80: Flags [S], seq 999, win 32000 [mss 1460, sackOK, wscale 7, ts_val 123, ts_ecr 0]
&lt;/pre&gt;

&lt;p&gt;
That&#039;s still a bit chubby, but the way this works in practice is that we&#039;d make a PacketGenerator object that has defaults for the fields that are generally going to be constant over the lifetime of the connection (but just generally, if they need to change, no problem):

&lt;pre&gt;
// Set up a generator with defaults
PacketGenerator from_server(// MAC addresses
                            &amp;quot;123456&amp;quot;, &amp;quot;ABCDEF&amp;quot;,
                            // IP addresses
                            0xaaaaaabb, 0x8000008,
                            // Port numbers
                            80, 58999,
                            // Window scaling, traffic direction
                            4, true);

#define INJECT_FROM_SERVER(str) \
  tcp_test.inject(from_server.generate(str))

// Generate and inject a packet
INJECT_FROM_SERVER(&amp;quot;[A], seq 1001701:1003101, ack 161, win 192000&amp;quot;);
&lt;/pre&gt;

&lt;p&gt;
What about expects then? In the normal mode of operation we simply
push the string representation of expected packets to a per-interface
queue. When the TCP stack tries to emit a packet, it ends up instead
in a callback in the test driver. The callback pretty-prints the
packet, and compares it to the first string in the queue for the
output interface. If the strings don&#039;t match, or if the queue is
empty, the test fails (and thanks to having readable string
representations it&#039;s generally completely obvious where in the test
the problem was, and what the difference between expected and actual
results was). To eliminate the repetition in the pretty-printed
representation, we also typically have a small macro that fills in
the layer 2 / layer 3 information.

&lt;pre&gt;
#define EXPECT(M, E) tcp_test.expect(M, E, __FILE__, __LINE__)
#define EXPECT_TO_CLIENT(str) EXPECT(true, &amp;quot;31:32:33:34:35:36 &gt; 41:42:43:44:45:46, 170.170.170.187.80 &gt; 8.0.0.8.58999: Flags &amp;quot; str)

// Assert that the next packet to be output toward the client should look
// like this.
EXPECT_TO_CLIENT(&amp;quot;[A], seq 1001701:1003101, ack 161, win 48000&amp;quot;);

// N.B. packet generation is stateful when it comes to window scaling,
// pretty-printing is not. So this 48k corresponds to the unscaled 192k
// from the input.
&lt;/pre&gt;

&lt;a name=&#039;time&#039;&gt;&lt;/a&gt;
&lt;p&gt;
The last bit of basic functionality is manipulating time. This is very
simple, as long as it&#039;s easy to substitute some kind of a virtual
clock for a real clock. Just move the clock forward by the requested
amount, run any timers, and check that the expect queues are
empty. The only tricky bit here is advancing the clock one minimum
timer quantum at a time rather than all at once. This matters since a timer
getting run might cause a timer (either the same or different one) to
be (re)scheduled.

&lt;pre&gt;
void TcpTest::step(int ms, const char* file, int line) {
    int usec = TimerSet::USEC_PER_TICK;
    for (int i = 0; i &lt; (ms * 1000) / usec; ++i) {
        time_source_-&gt;step_ms(usec);
        timers()-&gt;run_expired();
    }
    assert_all_expects_satified(file, line);
}
&lt;/pre&gt;

&lt;p&gt;
There are a few cases where the textual representation is
insufficient. For example maybe some header field that needs to be
changed is too obscure to bother including in the parser and the
pretty-printer. For cases like this the packet structure returned by
generate() can be modified before being injected. Likewise there&#039;s
another version of expect that takes a callback function for doing
arbitrary checks on the packet, rather than just a string comparison.

&lt;p&gt;
Finally, it turns out that when testing edge cases of TCP behavior
it&#039;s often very convenient to run a bunch of alternate scenarios
starting from some particular socket state. A normal solution here
might be to make the state cloneable, but that&#039;s something we actively
don&#039;t want to do in the normal application, and maintaining the
copying code would be fragile and an unnecessary hassle. Instead for
testing we have little &lt;code&gt;BEGIN_FORK&lt;/code&gt;
and &lt;code&gt;END_FORK&lt;/code&gt; macros to run a block of code in a forked
process and quit the parent process if the child process errors
out, with the alternate scenarios each running in their own forked
process. It&#039;s not an ideal setup, since forking makes the experience
of using tools like gdb or valgrind a bit rough.

&lt;p&gt;
This also makes for pretty large tests. A typical test (containing
several subtests through the fork hack) is around 100-150 lines
long. Unsurprisingly the tests end up a lot longer than the code being
tested. Code coverage of the relevant files is at about 93% which is
good enough (most of the code that isn&#039;t covered is probably never
executed in production; it&#039;s old experimental features hidden behind
flags not enabled by default, paranoid error checking code for
situations that would be very hard or impossible to write a test
to trigger, etc).

&lt;h3&gt;What we can and can&#039;t test&lt;/h3&gt;

&lt;p&gt;
One implied objection to testing TCP implementations is that
you only really test completely trivial things, and most of the
trouble comes from the nature of TCP being a system with complex and
distributed state. So what kind of tests can you express using this
setup? Let&#039;s use the earlier example of zero windows. Cases you might
want to test for and which are easy enough to do (and some of which we
really want to test multiple times with different configuration
parameters; a sketch of one such test follows the list):

&lt;ul&gt;
&lt;li&gt; Receiving a zero window in the SYNACK, with the window getting opened by a separate ACK only once the 3WHS finishes.
&lt;li&gt; Receiving a zero window in the SYNACK, with the window never getting opened.
&lt;li&gt; The window starting at a reasonable value, but shrinking to zero during the connection, then opening up again (both naturally or as a reaction to
zero window probing)
&lt;li&gt; Probes getting sent at the expected timeouts if the zero window condition persists for too long.
&lt;li&gt; Advertising a zero window yourself to one of the endpoints when buffers
  are full. Check that anything sent in excess of the advertised window is
  properly dropped.
&lt;li&gt; Correctly reacting to zero window probes sent by that endpoint.
  (Both the &amp;quot;still zero&amp;quot; and &amp;quot;I have some space now&amp;quot; cases).
&lt;/ul&gt;
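
&lt;p&gt;
As a sketch of what one of these looks like in practice, here&#039;s roughly how
the &amp;quot;probes get sent at the expected timeouts&amp;quot; case might read. An
INJECT_FROM_CLIENT macro is assumed to exist as the mirror image of the
INJECT_FROM_SERVER macro shown earlier, and all the sequence numbers,
windows and timeouts below are made up for illustration:

&lt;pre&gt;
// The client ACKs everything it has received but advertises a zero window.
INJECT_FROM_CLIENT(&amp;quot;[A], seq 161, ack 1003101, win 0&amp;quot;);

// Nothing should be emitted toward the client until the probe timer fires;
// after that we expect a one-byte zero window probe. step() both advances
// the virtual clock and checks that every queued expect was satisfied.
EXPECT_TO_CLIENT(&amp;quot;[A], seq 1003100:1003101, ack 161, win 48000&amp;quot;);
tcp_test.step(5000, __FILE__, __LINE__);
&lt;/pre&gt;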

&lt;p&gt;
Are these kinds of tests interesting or useful? I&#039;d like to think
so. At least that list is a mix of things we got wrong at one time or
another, things we&#039;ve seen others get wrong, and tests done just in
case. Thinking about the test cases also gets you into an adversarial
mode of thought, where it&#039;s easier to see the cases that were left
unhandled.

&lt;p&gt;
Of course this kind of testing has its limits, and couldn&#039;t
possibly detect all failures. As I&#039;ve written
earlier, &lt;a href=&#039;https://www.snellman.net/blog/archive/2014-11-11-tcp-is-harder-than-it-looks.html&#039;&gt;TCP
is harder than it looks&lt;/a&gt; mainly because of the bizarre
interoperability failures. Unit testing can catch algorithm bugs
during development, but will at best act as a regression test for
problems encountered with endpoints that behave in completely unexpected
ways. Nor does this help at all with testing some other parts that are
on the critical traffic path like our custom device drivers.
But you can&#039;t let the perfect be the enemy of the good.

&lt;p&gt;
Even when there are corners of the system that you can&#039;t test,
I&#039;ve still found unit testing and a semi-TDD approach &lt;a href=&#039;#ftnt4&#039;
name=&#039;ftnt_ref4&#039;&gt;[4]&lt;/a&gt; to be hugely valuable in this
problem space, and I&#039;ve found myself leaning on writing the test cases
before code much more heavily than in any other project before. In
fact if we&#039;ve got a bug report and a theory about what could be going
on, the first step is writing a test case to verify or disprove the
theory. It&#039;s just an order of magnitude faster to set up a fully
controlled test case with this system than it would be to try to
recreate the hypothetical network conditions required for the bug to
manifest.

&lt;p&gt;
There are some nice side benefits too in addition to the typical gains
from testing. One is that we get IPv6 test coverage for essentially
free. We can run the same tests twice, once with the packet generator
making IPv4 packets and then with it generating IPv6 ones. It mostly
just requires a bit of finesse with the packet pretty-printing /
parsing to account for the different IP address
size.

&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;
Anyway, I&#039;m really happy with this setup for low level network
programming. If it truly is the case that in-kernel TCP stacks are
untestable, maybe that&#039;s just another reason to get networking
out of the OS and into the userspace.

&lt;h3&gt;Footnotes&lt;/h3&gt;

&lt;div class=&#039;footnotes&#039;&gt;
&lt;p&gt;
&lt;a href=&#039;#ftnt_ref1&#039; name=&#039;ftnt1&#039;&gt;[1]&lt;/a&gt; Our
TCP stack is part of a transparent performance
enhancing proxy. It splits every TCP connection in two parts without
terminating the connections. The TCP connection is only taken over by
the proxy after the initial handshake finishes, so both endpoints end
up having a compatible view of the TCP options and sequence numbers
used for the connection. This means that we essentially run a separate
and full TCP stack for both halves of the connection, but e.g. the
amount of data that has been acked on one half affects how much window
space we want to advertise on the other half.

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref2&#039;
name=&#039;ftnt2&#039;&gt;[2]&lt;/a&gt; One file per interface for
the inputs, one file per interface for expected outputs, have the test
driver compare actual outputs to expected
ones.

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref3&#039; name=&#039;ftnt3&#039;&gt;[3]&lt;/a&gt;
Not just guessing, I know it&#039;s basically useless since I later
implemented this model for creating regression tests for issues we
already had example traces for. Theoretically this allowed creating
new test cases with almost zero effort, but the tests were so annoying
to validate and maintain that we only ever made 4 of them. This is odd because in past lives this general form of testing
has been my tool of choice over lovingly handcrafted artisanal unit
tests.

&lt;p&gt;
&lt;a href=&#039;#ftnt_ref4&#039; name=&#039;ftnt4&#039;&gt;[4]&lt;/a&gt;
Semi-TDD, since the diehards
wouldn&#039;t be happy with testing essentially a single static entry point for
klocs and klocs of code.
&lt;/div&gt;

</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Thu, 09 Jul 2015 15:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-07-09-unit-testing-a-tcp-stack/</guid></item><item><title>What&#039;s wrong with pcap filters?</title><link>https://www.snellman.net/blog/archive/2015-05-18-whats-wrong-with-pcap-filters/</link><description>
&lt;h3&gt;Introduction&lt;/h3&gt;

&lt;p&gt;
I recently watched a video of
a &lt;a href=&#039;https://www.youtube.com/watch?v=XHlqIqPvKw8&#039;&gt;great talk on
the early days of pcap&lt;/a&gt; by Steve McCanne. The bit on how
the filtering language was designed - around the 26 minute mark but
you might want to start at 20 minutes if you&#039;re unfamiliar with BPF -
was one of the best stories about creating a new &amp;quot;little
language&amp;quot; I&#039;ve heard.

&lt;p&gt;
But that got me thinking a bit. This language is a tool that I use
daily, that I&#039;m generally happy with, but that also drives me
absolutely crazy sometimes.  This post is an attempt to look at some
classes of problems that the pcap filtering language fails on, why
those deficiencies exist, and why I continue using it even despite the
flaws.

&lt;p&gt;
Just to be clear, libpcap is an amazing piece of software. It was
originally written for one purpose, and it really is my fault that I
end up too often using it for a different one. There are three very
different use cases that I have for a packet filtering language
(others may have more).

&lt;ul&gt;
&lt;li&gt; Small and simple filters to pick out a specific slice of traffic
  (single protocol, single flow, or single host). I believe it&#039;s fair
  to say that this is what the language was originally designed for.
&lt;li&gt; Potentially complex filters for classifying traffic with real-time
  constraints and with no state, usually when using the filters for
  configuration rather than as an exploratory tool. This is where pcap
  is clumsy even when it generally works.
&lt;li&gt; Offline analysis at higher protocol layers that&#039;d also benefit
  from tracking the high level protocol state between packets.
  You can sometimes coerce pcap to work for this use case, but it&#039;s
  super-awkward. It&#039;s also worth noting that features that are
  beneficial for this use case would not be welcome in the
  others. (Being able to run an arbitrary PCRE regexp on the packet
  payload? Great when doing offline analysis, unacceptable for
  real-time classification).
&lt;/ul&gt;

&lt;p&gt;
I try to do the third case with tools better suited for that, and only
have a couple of complaints (e.g. VLAN support) about the first
case. Mostly the pain comes from the middle case. So as we start the
tour of annoyances, keep in mind that I&#039;ll often complain about a tool
not doing a job it wasn&#039;t meant for.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h3&gt;VLANs&lt;/h3&gt;

&lt;p&gt;
The VLAN support might be the oddest part of pcap filters.

&lt;p&gt;
First, let&#039;s start off with the way filters require the presence of
VLANs to be explicitly specified. Forgetting to do that might be the
most common mistake I&#039;ve seen people make (or done myself). They do a
tcpdump on all traffic on an interface, and get a bunch of
packets. Then they try to specify a filter, and no matter how liberal
the filter is no traffic shows up. The first suggestion to any report
I get of a filter not matching properly is &amp;quot;did you remember to add
a &lt;code&gt;vlan&lt;/code&gt; directive&amp;quot;. Note the contrast to e.g. IPv6
support, where pcap will automatically generate both IPv4 and IPv6
matching code for a filter that just e.g. has &lt;code&gt;tcp&lt;/code&gt;
directive.

&lt;p&gt;
But let&#039;s say that your users have managed to internalize this
requirement, and generally remember to specify exactly the right
number of &lt;code&gt;vlan&lt;/code&gt; directives. The usage then looks pretty
straightforward.

&lt;p&gt;
Match all TCP traffic on VLAN 11.
&lt;pre&gt;vlan 11 and tcp&lt;/pre&gt;

&lt;p&gt;
Match all TCP traffic on any VLAN.
&lt;pre&gt;vlan and tcp&lt;/pre&gt;

&lt;p&gt;
But there&#039;s a dark secret (a well-documented secret, mind you). This
filter will never match any well-formed traffic:

&lt;pre&gt;tcp and vlan&lt;/pre&gt;

&lt;p&gt;
The problem is that the &lt;code&gt;vlan&lt;/code&gt; directive actually affects
the remaining expression, but not the already compiled parts. (It
adjusts the offset of all later memory lookups to account space taken
by the VLAN header). So this filter first requires a packet to be a
non-VLAN tagged TCP packet, and then requires it to be some kind of
VLAN tagged packet. This is a pretty unlikely combination...

&lt;p&gt;
Ok, what about this filter for matching all TCP traffic, whether it&#039;s
VLAN tagged or not:

&lt;pre&gt;(vlan and tcp) or tcp&lt;/pre&gt;

&lt;p&gt;
Again that doesn&#039;t work. From reading the &lt;code&gt;man&lt;/code&gt; page too
generously, one might think that the change in the lookup offset is
scoped just to the parenthesized sub-expression that the
&lt;code&gt;vlan&lt;/code&gt; directive appears in. That&#039;s not the case. It
really affects the rest of the full filter expression. Instead the
source must be rearranged such that all the non-&lt;code&gt;vlan&lt;/code&gt; options
come first:

&lt;pre&gt;tcp or (vlan and tcp)&lt;/pre&gt;

&lt;p&gt;
But wait, it gets worse!

&lt;pre&gt;vlan 11 or vlan 12&lt;/pre&gt;

&lt;p&gt;
This does not match a packet with either a VLAN tag of 11 or 12, as
one might expect. Actually it matches any packet with a tag of 11, or
a double-tagged VLAN packet with an outer tag of anything and an inner
tag of 12. This is because the offset tweaking is purely a compile-time
effect. So the &lt;code&gt;vlan 11&lt;/code&gt; directive will
advance the offset &lt;i&gt;whether it matched or not&lt;/i&gt;.

&lt;p&gt;
One might expect that the right way to do this is something like
&lt;code&gt;vlan 11 or 12&lt;/code&gt;. But that doesn&#039;t even parse despite
being analogous to other pcap filter
constructs. Instead you need to dig out the relevant bits from the
ethernet header manually.

&lt;pre&gt;vlan and (ether[14:2] &amp; 0xfff == 11 or ether[14:2] &amp; 0xfff == 12)&lt;/pre&gt;

&lt;p&gt;
The generated BPF code is actually good since the repetition gets
optimized away, it&#039;s just very awkward to write an expression like
that.

&lt;p&gt;
And finally, the cruelest trick of the pcap VLAN support:

&lt;pre&gt;(not vlan) and tcp&lt;/pre&gt;

&lt;p&gt;
Non-VLAN tagged TCP, right? But as should be obvious at this point,
that&#039;s not how things work in this corner of pcap. Again the compiler
advances the offset to account for the VLAN header even though we&#039;ve
explicitly specified no VLANs. And thus the lookups for
the IP ethertype and TCP IP proto will be from the wrong locations.

&lt;p&gt;
It&#039;s all a bit of a mess. But even if anyone was willing to break
compatibility by fixing this, it&#039;d be tricky to do given the basic
model of handling the VLANs statically at compile time. Instead it&#039;d
need to be done dynamically, either through lots of duplicate code or
by maintaining a dynamic offset that&#039;s added to all subsequent memory
lookups. (And this latter option would presumably make things a lot
harder for the pcap BPF optimizer).

&lt;p&gt;
Even so, the behavior of &lt;code&gt;vlan&lt;/code&gt; is in stark contrast to the
extreme user-friendliness of most of the basic pcap filter constructs,
as well as the design principles outlined in McCanne&#039;s talk.

&lt;h3&gt;No abstraction facilities&lt;/h3&gt;

&lt;p&gt;
At &lt;a href=&#039;https://www.teclo.net&#039;&gt;Teclo&lt;/a&gt; a substantial part of the
configuration for our mobile TCP accelerator consists of pcap
filters. There are all kinds of decisions that&#039;d be incredibly hard to
cover with a static set of configuration parameters, but can easily be
done with simple filters. The filters are safe, and might even be
familiar to the network administrators unlike our bespoke
configuration options. What&#039;s not to like?

&lt;p&gt;
The problem is that once a filter-based decision making mechanism
exists, the filters will soon stop being simple. The pcap language
provides basically no tools at all for managing this complexity,
beyond discouraging complex filters by making writing them painful.
In fact the lack of any control structures virtually ensures massive
code duplication and all the maintenance problems that this implies.

&lt;p&gt;
Let&#039;s take for example our &lt;code&gt;optimize-filter&lt;/code&gt;, which can be
used to select at SYN-handling time which TCP flows to optimize and
which to just pass through. One use for such a filter might be to
disable optimization for a few hosts that misbehave, are being tested,
or something along those lines. Those are simple behaviors, and require
only simple filters.

&lt;p&gt;
But needs change, and soon you&#039;ll get cases like wanting to optimize
the traffic of half the mobile subscribers, but not optimizing the
traffic of the other half, and doing that split in some deterministic
manner. The goal is to gather statistics for both groups of users, and
hopefully show the benefits of optimization. That&#039;s simple, right? So
let&#039;s see where that kind of a requirement could take us.

&lt;p&gt;
The filter gets only TCP SYNs as input from the application layer, so
we don&#039;t need to check for that again in the filter. Just pull out the last two
bytes of the source IP, mix them together, and check the lowest
bit. If it&#039;s 1, optimize. If it&#039;s 0, don&#039;t optimize:
&lt;pre&gt;
((ip[15] + ip[14]) &amp; 1) == 1
&lt;/pre&gt;

&lt;p&gt;
There&#039;s an obvious problem. This is using the src address of the SYN, but
it&#039;s possible that the mobile device is actually the recipient rather
than sender of the SYN. To fix this, we need to check the subnets
of the endpoints, and choose the bits from either the source or
destination address.

&lt;pre&gt;
((src net 10.0.0.0/16) and ((ip[15] + ip[14]) &amp; 1) == 1) or
((dst net 10.0.0.0/16) and ((ip[19] + ip[18]) &amp; 1) == 1)
&lt;/pre&gt;

&lt;p&gt;
But wait, what if both the sender and recipient are on the mobile
subscriber subnet, and one is eligible for optimization and the other
isn&#039;t? We&#039;ll end up optimizing the traffic regardless of the direction,
which will distort the statistics. Instead we need to make it symmetric,
based on either the sender or the receiver. And we should really be
prepared for the case where we weren&#039;t given accurate info, and
neither the sender nor the receiver is in the mobile address pool:

&lt;pre&gt;
((src net 10.0.0.0/16 and not dst net 10.0.0.0/16) and
 ((ip[15] + ip[14]) &amp; 1) == 1) or
((dst net 10.0.0.0/16 and not src net 10.0.0.0/16) and
 ((ip[19] + ip[18]) &amp; 1) == 1) or
((src net 10.0.0.0/16 and dst net 10.0.0.0/16) and
 ((ip[15] + ip[14]) &amp; 1) == 1) or
((not src net 10.0.0.0/16 and not dst net 10.0.0.0/16) and
 ((ip[15] + ip[14]) &amp; 1) == 1)
&lt;/pre&gt;

&lt;p&gt;
It&#039;s of course unrealistic for a mobile operator to have just one
mobile address pool. There will be a couple of pools for premium
users with public IPs, a private IP pool for the main APN, another
private IP pool for a performance testing APN, and so on. So really
you can&#039;t just be checking for this single subnet.

&lt;pre&gt;
(((src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8) and
 not (dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8)) and
 ((ip[15] + ip[14]) &amp; 1) == 1) or
(((dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8) and
 not (src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8)) and
 ((ip[19] + ip[18]) &amp; 1) == 1) or
(((src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8) and
 (dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8)) and
 ((ip[15] + ip[14]) &amp; 1) == 1) or
((not (src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8) and
 not (dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/8)) and
 ((ip[15] + ip[14]) &amp; 1) == 1)
&lt;/pre&gt;

&lt;p&gt; You might be wondering about the ever increasing nesting level of
parentheses. It looks like the kind of Lisp joke that was already
stale in the &#039;70s. But that&#039;s just what these filters end up looking
like once they get complicated enough, since the users don&#039;t have
confidence in getting the precedence right. And I&#039;ll spare you from
the version with IPv6 support.

&lt;p&gt; Once you get into filters like this, the benefit of accessibility
and familiarity goes out the window. Anyone with some Unix or
networking background should be able to deal with simple pcap filters,
but you can&#039;t expect most people to do so with a 10 line filter with
parentheses nested 4 levels deep. The only way this would actually
get entered is by someone copy-pasting from our documentation. And even that
wouldn&#039;t necessarily work, since there&#039;s so much &lt;i&gt;stuff&lt;/i&gt;
in that expression that needs to be customized for the specific
network, and which needs to be updated in a consistent manner.

&lt;p&gt;
But the generated code isn&#039;t as horrible as the source, once
the optimizer has had its way with it. And this
filter could actually be really simple with just some way of
eliminating the redundancy. It&#039;s so close to still being a good tool.

&lt;p&gt;
But maybe that&#039;s a theoretical case. How about only (reliably)
matching packets that have the window scale TCP option set, another
kind of thing you might hypothetically want to base optimization
settings on in this use case? Well, assuming a maximum of 6 TCP options
I think you&#039;d need something like this:

&lt;pre&gt;
tcp and ((tcp[tcpflags] &amp; tcp-syn) != 0) and ((tcp[20] == 3) or ((tcp[20] != 1) and ((tcp[20 + tcp[21]] == 3) or ((tcp[20 + tcp[21]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] + 1]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] + 1] == 3))))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 2]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 2] == 3))))))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] + 1]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] + 1] == 3))))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 3]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] == 1) and ((tcp[20 + tcp[21] + tcp[20 + 
tcp[21] + 1] + 3] == 3))))))))) or ((tcp[20 + tcp[21]] == 1) and ((tcp[20 + tcp[21] + 1] == 3) or ((tcp[20 + tcp[21] + 1] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2]] == 3) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2]] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1]] == 3) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1]] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] + 1]] == 3))) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1]] == 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] + 1] == 3))))) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2]] == 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] == 3) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1 + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 2]] == 3))) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] == 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 2] == 3))))))) or ((tcp[20 + tcp[21] + 1] == 1) and ((tcp[20 + tcp[21] + 2] == 3) or ((tcp[20 + tcp[21] + 2] != 1) and ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3]] == 3) or ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3]] != 1) and ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3] + tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3] + 1]] == 3))) or ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3]] == 1) and ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3] + 1] == 3))))) or ((tcp[20 + tcp[21] + 2] == 1) and ((tcp[20 + tcp[21] + 3] == 3) or ((tcp[20 + tcp[21] + 3] != 1) and ((tcp[20 + tcp[21] + 3 + tcp[20 + tcp[21] + 4]] == 3))) or ((tcp[20 + tcp[21] + 3] == 1) and ((tcp[20 + tcp[21] + 4] == 3))))))))))) or ((tcp[20] == 1) and ((tcp[21] == 3) or ((tcp[21] != 1) and ((tcp[21 + tcp[22]] == 3) or ((tcp[21 + tcp[22]] != 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1]] == 3) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1]] != 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1]] == 3) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1]] != 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] + 1]] == 3))) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1]] == 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] + 1] == 3))))) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1]] == 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] == 3) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] != 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1 + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 2]] == 3))) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] == 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 2] == 3))))))) or ((tcp[21 + tcp[22]] == 1) and ((tcp[21 + tcp[22] + 1] == 3) or ((tcp[21 + tcp[22] + 1] != 1) and ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2]] == 3) or ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2]] != 1) and ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2] + tcp[21 + tcp[22] + 1 + tcp[21 + 
tcp[22] + 2] + 1]] == 3))) or ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2]] == 1) and ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2] + 1] == 3))))) or ((tcp[21 + tcp[22] + 1] == 1) and ((tcp[21 + tcp[22] + 2] == 3) or ((tcp[21 + tcp[22] + 2] != 1) and ((tcp[21 + tcp[22] + 2 + tcp[21 + tcp[22] + 3]] == 3))) or ((tcp[21 + tcp[22] + 2] == 1) and ((tcp[21 + tcp[22] + 3] == 3))))))))) or ((tcp[21] == 1) and ((tcp[22] == 3) or ((tcp[22] != 1) and ((tcp[22 + tcp[23]] == 3) or ((tcp[22 + tcp[23]] != 1) and ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1]] == 3) or ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1]] != 1) and ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1] + tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1] + 1]] == 3))) or ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1]] == 1) and ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1] + 1] == 3))))) or ((tcp[22 + tcp[23]] == 1) and ((tcp[22 + tcp[23] + 1] == 3) or ((tcp[22 + tcp[23] + 1] != 1) and ((tcp[22 + tcp[23] + 1 + tcp[22 + tcp[23] + 2]] == 3))) or ((tcp[22 + tcp[23] + 1] == 1) and ((tcp[22 + tcp[23] + 2] == 3))))))) or ((tcp[22] == 1) and ((tcp[23] == 3) or ((tcp[23] != 1) and ((tcp[23 + tcp[24]] == 3) or ((tcp[23 + tcp[24]] != 1) and ((tcp[23 + tcp[24] + tcp[23 + tcp[24] + 1]] == 3))) or ((tcp[23 + tcp[24]] == 1) and ((tcp[23 + tcp[24] + 1] == 3))))) or ((tcp[23] == 1) and ((tcp[24] == 3) or ((tcp[24] != 1) and ((tcp[24 + tcp[25]] == 3))) or ((tcp[24] == 1) and ((tcp[25] == 3))))))))))))
&lt;/pre&gt;

&lt;p&gt;
Does that 8kB monstrosity even work? It seems to be OK based on light
testing, and it all made sense when I wrote it, but I couldn&#039;t say for
sure. Note that it doesn&#039;t implement the EOL TCP option or check for
overflowing into the payload area of the packet, so a proper solution
would look even worse. This is a task that BPF itself is well suited
for, and one that shouldn&#039;t be unreasonable for a packet filter
language. But it is unreasonable in pcap, and so anything dealing
with TCP options must be done by means other than the filter
language.
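
&lt;p&gt;
For contrast, here is roughly what the same check looks like once you
step outside the filter language. This is a minimal C sketch, not code
from any real program: it walks the TCP options looking for option kind
3, which is what the filter above is searching for, and it also does the
overflow checking that the filter skips.

&lt;pre&gt;
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

/* tcp points at the start of the TCP header, tcp_len is the number of
 * bytes available.  Returns 1 if an option with the given kind (e.g. 3
 * for window scale) is present. */
static int tcp_has_option(const uint8_t *tcp, size_t tcp_len, uint8_t wanted)
{
    if (tcp_len &lt; 20)
        return 0;
    size_t hdr_len = (tcp[12] &gt;&gt; 4) * 4;       /* data offset, in bytes */
    if (hdr_len &lt; 20 || hdr_len &gt; tcp_len)
        return 0;

    size_t i = 20;                             /* options start here */
    while (i &lt; hdr_len) {
        uint8_t kind = tcp[i];
        if (kind == wanted)
            return 1;
        if (kind == 0)                         /* EOL: no more options */
            return 0;
        if (kind == 1) {                       /* NOP: single byte, no length */
            i += 1;
            continue;
        }
        if (i + 1 &gt;= hdr_len)                  /* truncated option */
            return 0;
        uint8_t len = tcp[i + 1];
        if (len &lt; 2)                           /* malformed length */
            return 0;
        i += len;
    }
    return 0;
}
&lt;/pre&gt;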

&lt;p&gt;
Due to the limits of the pcap language, filters become untenable as
a method of configuration surprisingly quickly, and the system
instead needs to trigger actions based on a set of static
configuration items. Unfortunately you&#039;ll never have thought of every
possibility in advance, and the next deployment could be one where
you&#039;re once again wading deep into an overly complex filter as a last
resort.

&lt;p&gt;
What kind of language changes could help, assuming the generally
declarative flavor needs to be maintained?  Just having some kind of
binding construct or alias would help. Consider the first case, with
the results of the &lt;code&gt;src net&lt;/code&gt; / &lt;code&gt;dst net&lt;/code&gt;
expressions evaluated once, the result bound to an identifier,
and then referred to through that identifier. It would be so much simpler
to read, let alone write.

&lt;p&gt;
An &lt;code&gt;if-else&lt;/code&gt;-expression or &lt;code&gt;case&lt;/code&gt;-expression would
remove the need to repeat the negation of a previous expression
to get properly disjoint subexpressions. Some kind of programmable
macro facility (expanding to pcap code, not doing arbitrary
computation) would allow extending the language a bit to cover use
cases like this.

&lt;p&gt;
Changing the language like that seems unrealistic, so people who need
to do these things while still using BPF as the filtering
mechanism have to compile from some other language or write raw
BPF assembler. (And then you&#039;re really not talking about the kind of
configuration that an end user could do, no matter what your definition
of end user is).
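
&lt;p&gt;
As a hedged sketch of what that looks like in practice: the embedding
program does the binding, generates the repetitive pcap source from a
single value, and only hands the finished string to the compiler. The
network, the shape of the expression and the surrounding program are
made up for illustration; only the libpcap calls are real.

&lt;pre&gt;
#include &lt;stdio.h&gt;
#include &lt;pcap/pcap.h&gt;

int main(void)
{
    const char *net = &quot;192.0.2.0/24&quot;;   /* the one binding */
    char filter[256];

    /* The repetition is generated, not maintained by hand. */
    snprintf(filter, sizeof filter,
             &quot;(src net %s and not dst net %s) or &quot;
             &quot;(dst net %s and not src net %s)&quot;,
             net, net, net, net);

    pcap_t *p = pcap_open_dead(DLT_EN10MB, 65535);
    struct bpf_program prog;
    if (pcap_compile(p, &amp;prog, filter, 1, PCAP_NETMASK_UNKNOWN) &lt; 0) {
        fprintf(stderr, &quot;bad filter: %s\n&quot;, pcap_geterr(p));
        return 1;
    }
    /* ... install with pcap_setfilter(), or hold on to prog ... */
    pcap_freecode(&amp;prog);
    pcap_close(p);
    return 0;
}
&lt;/pre&gt;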

&lt;h3&gt;Hostnames&lt;/h3&gt;

&lt;p&gt;
Allowing some kind of external interaction is the bane of all
programming languages that have intentionally limited power. Users
will almost instinctively attempt to use such facilities to do
something they&#039;re not supposed to. I &lt;i&gt;think&lt;/i&gt; the only such
facility in pcap is name resolution, and since the resolution happens
at compile time, it&#039;s hard to see how to abuse it. (But users are a
devious bunch, so don&#039;t quote me on that).

&lt;p&gt;
But it&#039;s not possible to cleanly disable the hostname resolution either.
That&#039;s problematic when there are real-time constraints on the
compilation phase too, not just the filter execution phase.

&lt;p&gt;
The first time I bumped into this, DNS was sufficiently flaky that
name resolution was taking several seconds. When a config change
including a filter with a hostname was propagated to the traffic
handling processes, they tried compiling the filter, hung while doing
the name lookup, a watchdog timer detected the system was unhealthy,
and everything was restarted.  Ok, not good.

&lt;p&gt;
But we&#039;re all responsible people, right? Let&#039;s just remember to always
use addresses rather than names. That worked for about a month. Then
somebody made a copy-paste error and ended up with some rubbish right
after a &lt;code&gt;host&lt;/code&gt; directive, something along the lines
of &lt;code&gt;host 10.0.0.1and port 80&lt;/code&gt; except much longer. Since the
name lookups happen during parsing, the name resolution was triggered
before the system decided that the expression was malformed
anyway. Time for the watchdog timer again!

&lt;p&gt;
What about doing the filter compilation in a separate thread from the
actual work? The problem is that, at least for my uses, it is not
acceptable for configuration changes to take tens of seconds or
minutes to propagate everywhere. It&#039;s even worse if configuration
change requests can queue up, and if the systems in charge of
propagating configuration changes don&#039;t have a good idea of when
a config change will make it through.

&lt;p&gt;
Fine, perhaps the filters need to be validated by an earlier phase in
the configuration pipeline, for example at the moment the
configuration change is submitted (the commit fails if the compilation
fails). Unfortunately if the filters get compiled again after the
validation for any reason, the lookup that worked originally can hang
now (or outright fail due to a DNS record having been removed). It&#039;s
an interesting question what kind of a fallback strategy you can use
if a configuration that used to be valid suddenly becomes invalid. But
it&#039;s better not to be in a situation where you need such a
strategy.

&lt;p&gt;
There are three workarounds that will actually work. Two of them
try to ensure that all name resolution fails immediately, which
ensures that no configuration using a host name will ever be
accepted in the first place. The third tries to make sure that
name resolution can&#039;t happen at a time when it&#039;d cause trouble.

&lt;ul&gt;
&lt;li&gt; Hack out the name resolution support completely from the library,
  or go the extra mile and make it an option.
&lt;li&gt; Make sure all hostname lookups always fail immediately, either
  through some global configuration or maybe by overriding the relevant
  library functions with LD_PRELOAD (a rough sketch of this follows
  the list).
&lt;li&gt; Always reify the filter source into a compiled program in a safe
  process and at a safe time (e.g. when the configuration change to
  use a new filter is made). Then pass the compiled program around,
  instead of the source. Or pass a source / binary pair, if it&#039;s
  important to be able to introspect the system and see what filters
  are in effect. This is actually a pretty decent solution, but requires
  a lot more changes.
&lt;/ul&gt;
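
&lt;p&gt;
To make the second workaround a bit more concrete, here is a rough
sketch of such an LD_PRELOAD shim, assuming a Linux/glibc-style
environment. The file names are made up, and exactly which resolver
functions need to be covered depends on how the libpcap in use was
built (newer versions go through &lt;code&gt;getaddrinfo&lt;/code&gt;, older ones
through &lt;code&gt;gethostbyname&lt;/code&gt;), so treat this as a starting point
rather than a drop-in solution.

&lt;pre&gt;
/* no_dns.c: build with   gcc -shared -fPIC -o no_dns.so no_dns.c -ldl
 * and run with           LD_PRELOAD=./no_dns.so your-packet-program
 * Numeric addresses still resolve; anything that would need actual DNS
 * fails immediately instead of hanging. */
#define _GNU_SOURCE
#include &lt;dlfcn.h&gt;
#include &lt;netdb.h&gt;
#include &lt;string.h&gt;

typedef int getaddrinfo_fn(const char *, const char *,
                           const struct addrinfo *, struct addrinfo **);

int getaddrinfo(const char *node, const char *service,
                const struct addrinfo *hints, struct addrinfo **res)
{
    static getaddrinfo_fn *real;
    struct addrinfo numeric;

    if (!real)
        real = (getaddrinfo_fn *)dlsym(RTLD_NEXT, &quot;getaddrinfo&quot;);

    memset(&amp;numeric, 0, sizeof numeric);
    if (hints)
        numeric = *hints;
    numeric.ai_flags |= AI_NUMERICHOST;   /* never go out to the network */
    return real(node, service, &amp;numeric, res);
}

struct hostent *gethostbyname(const char *name)
{
    (void)name;
    h_errno = HOST_NOT_FOUND;             /* fail fast, unconditionally */
    return NULL;
}
&lt;/pre&gt;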

&lt;p&gt;
So this is solvable by jumping through relatively minor hoops. But
it does at least break the golden rule of designing little sandboxed
languages, which is that no matter how insignificant an interface
to the external world is, it should be possible for the program that
embeds the language to disable the interface.

&lt;h3&gt;Just filtering, not classification&lt;/h3&gt;

&lt;p&gt;
Anyone dealing with packets will soon not be satisfied with simply
rejecting / accepting packets. Instead they&#039;ll have more than just two
categories, and want to assign a packet to one of them. While the BPF
instruction set can express this kind of operation through the use of
multiple distinct return values, the pcap filtering language can&#039;t.

&lt;p&gt;
A pcap-style solution is to have multiple separate filters, apply them
in order, and pick a category based on the first one to match. And it
is a pretty clean solution. But the annoying part is that by doing
that we&#039;re losing out on one of the main advantages of the language,
which is the optimizer. In situations with multiple filter categories
there&#039;s often a tremendous amount of overlap between the filters, all
of which would get optimized away by the compiler if separate filters
could be merged pre-optimization.
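
&lt;p&gt;
As a rough illustration of the ordered-filters approach, here is a
hedged C sketch. The categories and filter expressions are made up; the
point is the structure: each filter is compiled and optimized on its
own, so the overlap between them does not get optimized away the way it
would inside a single merged filter.

&lt;pre&gt;
#include &lt;stddef.h&gt;
#include &lt;pcap/pcap.h&gt;

/* Classification via an ordered list of separately compiled filters:
 * the first one that matches decides the category. */
static const char *filter_src[] = {
    &quot;tcp port 80 or tcp port 443&quot;,    /* category 0: web */
    &quot;udp port 53 or tcp port 53&quot;,     /* category 1: dns */
    &quot;ip&quot;,                             /* category 2: other ipv4 */
};
#define NCATEGORIES (sizeof filter_src / sizeof filter_src[0])

static struct bpf_program progs[NCATEGORIES];

/* Compile each filter once, up front. */
static int compile_categories(void)
{
    pcap_t *dead = pcap_open_dead(DLT_EN10MB, 65535);
    for (size_t i = 0; i &lt; NCATEGORIES; i++)
        if (pcap_compile(dead, &amp;progs[i], filter_src[i], 1,
                         PCAP_NETMASK_UNKNOWN) &lt; 0)
            return -1;
    pcap_close(dead);
    return 0;
}

/* Returns the index of the first matching filter, or -1 for no match. */
static int classify(const struct pcap_pkthdr *h, const unsigned char *pkt)
{
    for (size_t i = 0; i &lt; NCATEGORIES; i++)
        if (pcap_offline_filter(&amp;progs[i], h, pkt))
            return (int)i;
    return -1;
}
&lt;/pre&gt;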

&lt;p&gt;
It makes perfect sense for this to be the case. The language doesn&#039;t
really have a space for imperative constructs like a return directive.
Probably the best you could have is some kind of implicit toplevel
&lt;code&gt;or&lt;/code&gt; for full toplevel expressions, and somehow assigning a
distinct success code to each full expression. But that&#039;s straying
quite far from the current form of the language.

&lt;h3&gt;Why use pcap filters then?&lt;/h3&gt;

&lt;p&gt;
Given the above whining, why am I still writing pcap filters in
pretty much any networking program that I write? Why not find some
other filter language, write my own, or link in a jitted scripting
language and write the packet filters in that?  Well, pcap has a few
big advantages: running in the kernel, safety, ease of use for the
simple cases, and ubiquity.

&lt;p&gt;
If you want to run code in the kernel, it has to be through BPF. And
pcap is the best way I know of to generate good BPF code. (Note: this
might change with the recent introduction of eBPF to the Linux kernel,
which e.g. has an LLVM backend. This could have major effects on
language availability and ease of optimization).
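
&lt;p&gt;
For the kernel case the usual pattern is to compile the filter with
libpcap in userspace and hand the resulting classic BPF program to the
kernel. A hedged Linux-only sketch: it relies on
&lt;code&gt;struct bpf_insn&lt;/code&gt; and &lt;code&gt;struct sock_filter&lt;/code&gt;
having the same layout, which holds in practice but is not something
the headers promise, and the socket is assumed to have been opened
elsewhere.

&lt;pre&gt;
#include &lt;pcap/pcap.h&gt;
#include &lt;linux/filter.h&gt;
#include &lt;sys/socket.h&gt;

/* Compile a pcap expression and attach it to an already opened packet
 * socket, so non-matching packets are dropped in the kernel. */
static int attach_pcap_filter(int sock_fd, const char *expr)
{
    pcap_t *dead = pcap_open_dead(DLT_EN10MB, 65535);
    struct bpf_program prog;
    int err;

    if (pcap_compile(dead, &amp;prog, expr, 1, PCAP_NETMASK_UNKNOWN) &lt; 0) {
        pcap_close(dead);
        return -1;
    }

    struct sock_fprog fprog = {
        .len    = prog.bf_len,
        .filter = (struct sock_filter *)prog.bf_insns,
    };
    err = setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_FILTER,
                     &amp;fprog, sizeof fprog);

    pcap_freecode(&amp;prog);
    pcap_close(dead);
    return err;
}
&lt;/pre&gt;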

&lt;p&gt;
To me personally the kernel use case is irrelevant; my packet handling
happens in userspace.

&lt;p&gt;
Guaranteed termination, or even a guaranteed upper bound on execution
time, is of course a huge deal for packet filtering applications. It&#039;s
kind of bad if an accidental infinite loop in a packet capture filter
configuration field crashes a router. This probably excludes 99.9% of
non-bespoke languages; it&#039;s very hard to get people to give up their
Turing completeness. It can be hard to even get basic sandboxing
working for some scripting languages, so termination guarantees aren&#039;t
even on the radar. But the safety advantage doesn&#039;t matter if
the comparison is to other custom languages, which would certainly be
designed from the ground up to have the same properties.

&lt;p&gt;
No, for me it&#039;s the combination of the last two that does it. Even
when I&#039;m writing something to work on live network traffic, during the
early part of development it&#039;ll mainly be tested on already
existing trace files. Which are most likely in pcap format, so
linking in libpcap to read those files is a natural choice. Once the
program evolves a bit and there&#039;s a need for a simple filtering
mechanism, it&#039;s only natural to use what&#039;s already there. Very
soon there&#039;s a critical mass of filters around and there&#039;s no
point in investigating other solutions. Hopefully all the filters
remain simple, and we stay out of trouble.

&lt;p&gt;
And finally, to undermine my own point, the user-space network appliance kit
&lt;a href=&#039;https://github.com/SnabbCo/snabbswitch&#039;&gt;Snabb Switch&lt;/a&gt;
doesn&#039;t have any need to run code in the kernel, uses a distinct
implementation of the pcap language that&#039;s not backed by BPF (and thus
doesn&#039;t benefit from the pcap optimizer), and didn&#039;t just end up with
libpcap linked in early on to get some packets fed into the program.

&lt;p&gt;Despite having a total green-field opportunity, not only did they
choose to use the pcap filter language, but even went through the
trouble of commissioning that from-scratch &lt;a
href=&#039;https://github.com/Igalia/pflua&#039;&gt;luajit-based pcap
implementation&lt;/a&gt;, which eliminates my convenience argument from
above. (Andy Wingo&#039;s &lt;a
href=&#039;http://wingolog.org/archives/2014/09/02/high-performance-packet-filtering-with-pflua&#039;&gt;writeup
on pflua&lt;/a&gt; is worth reading, very cool stuff). I asked Luke
Gorrie why they went that route, and this is what he had to say:

&lt;blockquote&gt;
&lt;p&gt;
I think the pflua implementation actually strikes a nice balance:

&lt;ul&gt;
&lt;li&gt; Safe for end-users to configure
&lt;li&gt; Relatively general, familiar, efficient
&lt;li&gt; Small dependency (at least if you are using LuaJIT)
&lt;li&gt; Clean and simple code that can evolve in interesting directions (actions, profiling, fancy LuaJIT packet mangling library underneath, language cleanup, etc)
&lt;/ul&gt;

&lt;p&gt;
[...] I didn&#039;t want to carry libpcap as a dependency for too long but neither to say &amp;quot;please come back when we have invented the ultimate new packet filtering language&amp;quot;.
&lt;/blockquote&gt;

&lt;p&gt;
All of which is hard to argue with, and certainly the &lt;a href=&#039;https://github.com/SnabbCo/snabbswitch/pull/444#issuecomment-94007097&#039;&gt;whole-application
performance improvements from pflua&lt;/a&gt; seen in Snabb are pretty
impressive.

&lt;h3&gt;What&#039;s the alternative?&lt;/h3&gt;

&lt;p&gt;
Normally this would be the point where I proclaim the miraculous
discovery of a solution to exactly the problems I&#039;ve been describing
while still keeping the (not inconsiderable) benefits of pcap.
But alas, no. If anyone has recommendations for filter languages
that are as usable as pcap in the small, but that work better when
used in the large, I&#039;m all ears.

</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Mon, 18 May 2015 12:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-05-18-whats-wrong-with-pcap-filters/</guid></item><item><title>Podcast on mobile TCP optimization</title><link>https://www.snellman.net/blog/archive/2015-03-14-podcast-on-mobile-tcp-optimization/</link><description>
&lt;p&gt;
I was recently a guest on &lt;a href=&#039;http://www.ipspace.net/About_Ivan_Pepelnjak&#039;&gt;Ivan Pepelnjak&lt;/a&gt;&#039;s (ipspace.net) Software Gone Wild podcast, talking about TCP acceleration in mobile networks, as well as whining in general about how much radio networks suck ;-) Thanks a lot to Ivan for the opportunity, it was fun!

&lt;p&gt;
You can listen to the
podcast episode &lt;a href=&#039;http://blog.ipspace.net/2015/03/tcp-optimization-with-juho-snellman-on.html&#039;&gt;here&lt;/a&gt;.
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><pubDate>Sat, 14 Mar 2015 22:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-03-14-podcast-on-mobile-tcp-optimization/</guid></item><item><title>How buying a SSL certificate broke my entire email setup</title><link>https://www.snellman.net/blog/archive/2014-12-05-how-buying-a-ssl-certificate-broke-my-email-setup/</link><description>
&lt;h3&gt;Everything looks ok to me, the error must be on your end!&lt;/h3&gt;

&lt;p&gt;
Earlier this week I got a couple of emails from different people, both
telling me that they&#039;d just created a new user account, but hadn&#039;t yet
received the validation email. In both cases my mail server logs
showed that the destination SMTP server had rejected the incoming
message due to a DNS error while trying to resolve the hostname part
of the envelope sender. Since I knew very well that my DNS setup was
working at the time, I was already ready to reply with something along
the lines of &amp;quot;It&#039;s just some kind of a transient error somewhere,
just wait a while and it&#039;ll sort itself out&amp;quot;.

&lt;p&gt;
But I decided to check the outgoing mail queue just in case. It
contained 2700 messages, going around 5 days back, most with error
messages that looked at least a little bit DNS-related.

&lt;p&gt;
Oops.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
Now, 5 days for this server is usually something like 50-60k outgoing
email messages, so those 2700 queued messages represented a pretty
decent chunk of traffic. The mail logs suggested that the errors had
started weeks ago, around November 12th. And indeed while
an &lt;code&gt;A&lt;/code&gt; query for the hostname was working fine, an
&lt;code&gt;MX&lt;/code&gt; query returned no results.

&lt;h3&gt;I didn&#039;t touch anything, it just broke!&lt;/h3&gt;

&lt;p&gt;
I was completely sure that the &lt;code&gt;MX&lt;/code&gt; setup used to work just
fine. And I had not done any kind of DNS changes at all for at least
a year. Any computer setup will rot eventually, but it shouldn&#039;t be
happening quite this fast.

&lt;p&gt;
Wait... Was that date November 12th? That&#039;s when I bought a new SSL
certificate through my registrar, who is also doing my DNS
hosting. Hmm... And I even chose to use DNS-based domain ownership
validation rather than the &#039;email a confirmation code to
hostmaster@example.com&#039; method, and allowed my registrar to
automatically create and delete the temporary authentication record.

&lt;p&gt;
Ok, so technically I did make a change, even if it was just to
authorize another system to make an automated DNS configuration change
on my behalf. But clearly my registrar must have screwed up these
automated config changes, and completely deleted the &lt;code&gt;MX&lt;/code&gt;
record along the way!

&lt;h3&gt;That config looks valid, just apply it!&lt;/h3&gt;

&lt;p&gt;
Well kind of, but not really.

&lt;p&gt;
When I logged into the DNS management UI yesterday, it turned out that
the &lt;code&gt;MX&lt;/code&gt; record was still there, but it was marked with a
validation error complaining about a conflict with the
&lt;code&gt;CNAME&lt;/code&gt;. When I did the original configuration, I&#039;d set up
the relevant host with both
&lt;code&gt;MX&lt;/code&gt; and &lt;code&gt;CNAME&lt;/code&gt; records. That is apparently
not best practice and can cause problems with some mail servers. And
who am I to argue with that, even if it had seemingly worked for
the past year.

&lt;p&gt;
I changed the &lt;code&gt;CNAME&lt;/code&gt; to an appropriate &lt;code&gt;A&lt;/code&gt;
record, the validation error was cleared, and as if by magic the
&lt;code&gt;MX&lt;/code&gt; queries were now working. 5 minutes later the outgoing mail queue was
draining rapidly.
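
&lt;p&gt;
For reference, the problematic shape of the zone was roughly the
following (the names and the address are made up; the underlying rule is
that a &lt;code&gt;CNAME&lt;/code&gt; is not supposed to coexist with any other
record type at the same name):

&lt;pre&gt;
; Before: MX and CNAME on the same name (now flagged as invalid)
mail.example.com.   IN CNAME   server.example.com.
mail.example.com.   IN MX 10   server.example.com.

; After: the CNAME becomes an A record, and the conflict goes away
mail.example.com.   IN A       192.0.2.10
mail.example.com.   IN MX 10   server.example.com.
&lt;/pre&gt;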

&lt;p&gt;
So clearly my service provider had added some helpful functionality to
prevent bogus DNS setups from being exported. The old zone files would
still be there working properly during the transition period, but the
next time the user would make changes they&#039;d need to fix the setup
before it&#039;d be exported.

&lt;p&gt;
That&#039;s a reasonable design, and I&#039;m sure it would work marvelously
most of the time. But in this case the zone file export was actually
triggered by an automated process, so there was no way for me to
notice that the configuration was now considered erroneous, or to fix
it. The DNS host was serving a mostly functional zone; it was just
missing one record, and even that one record only mattered for a
fraction of my outgoing mail, making the problem even harder to
spot. (Just a fraction of my mail, since it looks like most mail
servers either don&#039;t validate the sender domain, or validate it with
some other kind of query).

&lt;p&gt;
There&#039;s a bit of guesswork involved in the last couple of paragraphs.
The error can no longer be replicated, since the management UI will no
longer allow me to create a setup analogous to the original. So it&#039;s
hard to be completely certain of the mechanisms on that management
UI&#039;s side of the story. But I&#039;m still fairly confident that this is at
least a pretty close approximation of what happened. The timings of
me buying the certificate and the start of a spike in DNS-related mail
delivery errors match up way too well for any other explanation to be
credible.

&lt;h3&gt;Of &lt;i&gt;course&lt;/i&gt; frobnicating the wibblerizer could break the dingit! Everyone knows that...&lt;/h3&gt;

&lt;p&gt;
There&#039;s all kinds of morals one could draw from this story. Proper
monitoring would have detected this immediately, the registrar should
have accounted for this corner case, I should maybe not have the
default assumption that the other party is at fault when something
breaks, you should always check the results of any kind of automated
config change when it&#039;s done for the first time, and probably many
other excellent lessons in either life or systems engineering.

&lt;p&gt;
But really I&#039;m just telling this story because I find the endpoints in
the chain of causality completely hilarious. In a sensible world my
action A really should not have led to the final result B, but it
did. It&#039;s unfortunate that the title of this blog post ended up
looking like linkbait of the worst kind, when it&#039;s actually the best
10 word summary I could think of :-)
</description><author>jsnell@iki.fi</author><category>GENERAL</category><category>NETWORKING</category><pubDate>Fri, 05 Dec 2014 13:30:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2014-12-05-how-buying-a-ssl-certificate-broke-my-email-setup/</guid></item></channel></rss>