Last year in an online discussion someone used in-kernel TCP stacks as a canonical example of code that you can't apply modern testing practices to. Now, that might be true, but if so the operative phrase there is "in-kernel", not "TCP stack". When the TCP implementation is just a normal user-space application, there's no particular reason it can't be written in a way that's testable and amenable to a test driven development approach.

The first versions of Teclo's TCP stack were written as a classic monolithic systems application with lots of explicit and implicit global state, not as something that could be treated as a library, let alone something you could reasonably mock out parts of. As such it was totally unsuited for any kind of automated testing. The best you could do was run the system in various kinds of simulated network environments and check that it was getting roughly the same speeds from one release to the next. We've also got over 50 configuration parameters for tweaking the behavior of the TCP algorithms, which would make for a hell of a test matrix; repeated manual testing of all these parameters would probably require a person doing nothing but running those tests full time.

This was clearly not tenable in the long term, so getting some kind of deterministic and automatable tests up was a pretty high priority. Very soon after the rush to ship a first version was over, we refactored things a bit for better testability and got at least some rudimentary tests running.

How we write tests

What would make a TCP implementation particularly tricky to test?

Our TCP flow record has over 70 state variables and 10 timers (some of which can interact with each other). And we need two of those records for a single TCP connection, with the state of one flow potentially affecting the behavior of the other [1]. With this much interlinked state it is hard to feel confident about any testing that tries to artificially set up only the relevant variables. Even if such setup is done correctly right now, it would be very easy for those assumptions to break as the code changes, invalidating the tests.

In general the appropriate unit of testing here is the TCP stack as a whole, rather than e.g. somehow trying to test a feature like zero window probing in isolation just by calling a method that implements that feature. The latter would be an absurd idea, since interesting TCP features end up being a lot more cross-cutting than that. Of course I don't mean testing the application as a whole either. We chop the application off at the core event loop, which would normally handle polling the NICs for packets, updating the system's idea of the current time between packets, running timers when appropriate, and occasionally receiving RPC messages from the management system. All of this detail is luckily irrelevant for testing the core TCP algorithms.

Instead, for testing we create an instance of the TCP stack that replaces the normal NIC-based IO backend with a callback-based one. The test driver injects packets directly into the TCP stack by calling the appropriate entry point. When the TCP stack wants to emit a packet, that triggers a callback in the test driver, where we can check whether the contents of the packet match what was expected. Finally, the other entry point to the TCP stack is implicit, through timer callbacks. To handle this case, we replace the wall clock based time source with a virtual one, and give the test driver the responsibility for advancing it.
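Sketched in code, the wiring might look something like this (all names here are illustrative, not our actual classes; TcpStack stands in for the real stack, assumed to take a time source and an output callback and to expose an input() entry point):

#include <cstdint>

// A virtual clock: it only moves when the test driver advances it.
class TestTimeSource {
public:
    uint64_t now_usec() const { return now_usec_; }
    void advance_usec(uint64_t usec) { now_usec_ += usec; }
private:
    uint64_t now_usec_ = 0;
};

class TcpTest {
public:
    TcpTest()
        : stack_(&time_source_,
                 // Instead of going out a NIC, emitted packets land in
                 // this callback and get checked against the expect queues.
                 [this](int iface, const Packet& p) {
                     check_against_expect_queue(iface, p);
                 }) {}

    // Inject a packet as if it had just been received from the network.
    void inject(const Packet& p) { stack_.input(p); }

private:
    void check_against_expect_queue(int iface, const Packet& p);

    TestTimeSource time_source_;
    TcpStack stack_;  // the stack under test (assumed interface)
};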

The second problem is expressing the test cases in a convenient manner. The first thought here was expressing the test cases as pcap format trace files [2]. The trace files would theoretically contain exactly the right information: the exact packet contents and microsecond-accurate timing information for the test driver to work by. This approach turns out not to be good for much: artificial test cases are very hard to create and update, debugging test failures is painful, and testing for anything other than the packets that are output (e.g. counters) is impossible [3]. No, the test cases really need to be expressed in code.

For that to work, there needs to be a simple way of describing packets, both for generating packets and for comparing output packets to expectations. Now, we happened to have some code around for pretty-printing many kinds of packets as JSON. That sounds like a perfect tool for the job; we'd just need a bit of code to do the reverse operation. This is what the JSON looked like:

{
    "ether":{
        "source":"31:32:33:34:35:36",
        "dest":"41:42:43:44:45:46",
        "type":"ip"
    },
    "ip":{
        "version":4,"hlen":5,"tos.dscp":0,"tos.ecn":0,"len":1040,"id":0,
        "rf":0,"df":0,"mf":0,"offs":0,"ttl":255,"proto":"tcp","check":0,
        "src":"170.170.170.187","dst":"8.0.0.8"
    },
    "tcp":{
        "source":80,"dest":58999,"window":48000,"check":0,
        "seq":1001000,"ack_seq":1100,"urg_ptr":0,
        "cwr":0,"ece":0,"urg":0,"ack":1,"psh":0,"rst":0,"syn":0,"fin":0,
        "options": {},
        "data":"..."
    }
}

Right... That just won't work. First, it's way too verbose, since even a simple test will involve tens of packets. You could eliminate some of the verbosity in packet generation with lots of defaulting, but that doesn't really work for checking outputs against expects. This kind of 1:1 mapping to raw fields in the packet is also not what's really needed. Things like advertised windows and sequence ranges are the things I care most about when specifying a test case, but the actual advertised window can't be determined from a single packet: you need to know the window scaling factor in use, which is only available in the SYN. Likewise the starting sequence number of a segment is in the TCP header, but the ending sequence number is implicit.
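Those two derived values might be computed roughly like this (a sketch; the header struct is an illustrative subset, not our real one):

#include <cstddef>
#include <cstdint>

struct TcpHeader {  // illustrative subset of the header fields
    uint32_t seq;
    uint16_t window;
    bool syn, fin;
};

// The real advertised window needs per-connection state (the window
// scale shift negotiated in the SYN), not just this packet's header.
uint32_t advertised_window(const TcpHeader& tcp, int wscale_shift) {
    return uint32_t(tcp.window) << wscale_shift;
}

// The header carries only the starting sequence number; the end of the
// range is implied by the payload length (SYN and FIN each consume one
// sequence number too).
uint32_t end_seq(const TcpHeader& tcp, size_t payload_len) {
    return tcp.seq + uint32_t(payload_len) + (tcp.syn ? 1 : 0) + (tcp.fin ? 1 : 0);
}

For example, with a shift of 2 a raw window field of 48000 corresponds to an actual window of 192000, which is the factor-of-4 relationship used in the examples below.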

What we need is something designed for humans to read. The obvious choice here was to pattern it after the tcpdump output format since we read packet dumps in that format every day.

41:42:43:44:45:46 > 31:32:33:34:35:36, 8.0.0.8.58999 > 170.170.170.187.80: Flags [S], seq 999, win 32000 [mss 1460, sackOK, wscale 7, ts_val 123, ts_ecr 0]

That's still a bit chubby, but the way this works in practice is that we make a PacketGenerator object with defaults for the fields that are generally going to stay constant over the lifetime of the connection (just generally; if they need to change, that's no problem):

// Set up a generator with defaults
PacketGenerator from_server(// MAC addresses, as raw bytes (these are
                            // 31:32:33:34:35:36 and 41:42:43:44:45:46)
                            "123456", "ABCDEF",
                            // IP addresses
                            0xaaaaaabb, 0x8000008,
                            // Port numbers
                            80, 58999,
                            // Window scaling, traffic direction
                            4, true);

#define INJECT_FROM_SERVER(str) \
  tcp_test.inject(from_server.generate(str))

// Generate and inject a packet
INJECT_FROM_SERVER("[A], seq 1001701:1003101, ack 161, win 192000"));

What about expects then? In the normal mode of operation we simply push the string representation of expected packets onto a per-interface queue. When the TCP stack tries to emit a packet, it ends up instead in a callback in the test driver. The callback pretty-prints the packet and compares it to the first string in the queue for the output interface. If the strings don't match, or if the queue is empty, the test fails (and thanks to the readable string representations it's generally completely obvious where in the test the problem was, and what the difference between the expected and actual results is). To eliminate the repetition in the pretty-printed representation, we also typically have a small macro that fills in the layer 2 / layer 3 information.

#define EXPECT(M, E) tcp_test.expect(M, E, __FILE__, __LINE__)
#define EXPECT_TO_CLIENT(str) EXPECT(true, "31:32:33:34:35:36 > 41:42:43:44:45:46, 170.170.170.187.80 > 8.0.0.8.58999: Flags " str)

// Assert that the next packet to be output toward the client should look
// like this.
EXPECT_TO_CLIENT("[A], seq 1001701:1003101, ack 161, win 48000");

// N.B. packet generation is stateful when it comes to window scaling,
// pretty-printing is not. So this 48k corresponds to the unscaled 192k
// from the input.
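
The checking callback from the harness sketch earlier might then look roughly like this (expect_queues_ and fail() are assumed helpers; pretty_print() is the pretty-printer described above):

// Pretty-print the emitted packet and compare it against the head of
// the expect queue for the interface it was emitted on.
void TcpTest::check_against_expect_queue(int iface, const Packet& p) {
    std::string actual = pretty_print(p);
    std::deque<std::string>& queue = expect_queues_[iface];
    if (queue.empty()) {
        fail("unexpected packet: " + actual);
        return;
    }
    std::string expected = queue.front();
    queue.pop_front();
    if (actual != expected) {
        fail("expected: " + expected + "\n  actual: " + actual);
    }
}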

The last bit of basic functionality is manipulating time. This is very simple, as long as it's easy to substitute a virtual clock for the real one: just move the clock forward by the requested amount, run any timers, and check that the expect queues are empty. The only tricky bit is advancing the clock one minimum timer quantum at a time rather than all at once. This matters since a timer getting run might cause a timer (either the same or a different one) to be (re)scheduled.

void TcpTest::step(int ms, const char* file, int line) {
    int usec = TimerSet::USEC_PER_TICK;
    for (int i = 0; i < (ms * 1000) / usec; ++i) {
        // Advance the virtual clock one timer tick at a time, running
        // any timers that expire along the way.
        time_source_->step_usec(usec);
        timers()->run_expired();
    }
    assert_all_expects_satisfied(file, line);
}
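
In use it looks something like this (a hypothetical test of the retransmission timer, assuming a STEP wrapper macro in the style of the ones above and a made-up 200 ms RTO):

#define STEP(ms) tcp_test.step(ms, __FILE__, __LINE__)

// The server sends a segment; it should be forwarded to the client.
EXPECT_TO_CLIENT("[A], seq 1001701:1003101, ack 161, win 48000");
INJECT_FROM_SERVER("[A], seq 1001701:1003101, ack 161, win 192000");

// No ack ever arrives. Nothing should happen just short of the RTO;
// once it fires, we expect a retransmission.
STEP(199);
EXPECT_TO_CLIENT("[A], seq 1001701:1003101, ack 161, win 48000");
STEP(1);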

There are a few cases where the textual representation is insufficient. For example, some header field that needs to be changed might be too obscure to be worth supporting in the parser and the pretty-printer. For cases like this the packet structure returned by generate() can be modified before being injected. Likewise there's another version of expect that takes a callback function for doing arbitrary checks on the packet, rather than just a string comparison.
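That variant might look something like this (the signature is a guess, patterned on the EXPECT macro above):

// Illustrative only: an expect() overload taking a predicate, for
// checks that the string comparison can't express.
tcp_test.expect(true, [](const Packet& p) {
    // e.g. (hypothetically) verify some obscure header field directly
    return p.ip().ttl == 254;
}, __FILE__, __LINE__);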

Finally, it turns out that when testing edge cases of TCP behavior it's often very convenient to run a bunch of alternate scenarios starting from some particular socket state. A normal solution here might be to make the state cloneable, but that's something we actively don't want in the normal application, and maintaining the copying code would be fragile and an unnecessary hassle. Instead, for testing we have little BEGIN_FORK and END_FORK macros that run a block of code in a forked process, with the parent quitting if the child errors out; the alternate scenarios each run in their own forked process. It's not an ideal setup, since forking makes the experience of using tools like gdb or valgrind a bit rough.
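A minimal sketch of how such macros can be built on fork() and waitpid() (our actual macros differ in detail):

#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// Run the enclosed block in a forked child process. The parent waits
// for the child and bails out if it failed, so the first broken
// scenario aborts the whole test.
#define BEGIN_FORK { pid_t pid_ = fork(); if (pid_ == 0) {

#define END_FORK exit(0);                                         \
    } else {                                                      \
        int status_ = 0;                                          \
        waitpid(pid_, &status_, 0);                               \
        if (!WIFEXITED(status_) || WEXITSTATUS(status_) != 0)     \
            exit(1);                                              \
    } }

Each alternate scenario then goes in its own BEGIN_FORK / END_FORK block after the shared setup code.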

This also makes for pretty large tests. A typical test (containing several subtests through the fork hack) is around 100-150 lines long. Unsurprisingly the tests end up a lot longer than the code being tested. Code coverage of the relevant files is at about 93%, which is good enough (most of the code that isn't covered is probably never executed in production; it's old experimental features hidden behind flags that aren't enabled by default, paranoid error checking for situations that would be very hard or impossible to trigger from a test, etc.).

What we can and can't test

One implied objection to testing TCP implementations is that you can only really test completely trivial things, with most of the trouble coming from TCP's nature as a system with complex and distributed state. So what kind of tests can you express using this setup? Let's use the earlier example of zero windows. Here are cases you might want to test for, all easy enough to set up (and some of which we really want to test multiple times with different configuration parameters); a sketch of one of them follows the list:

  • Receiving a zero window in the SYNACK, with the window getting opened by a separate ACK only once the 3WHS finishes.
  • Receiving a zero window in the SYNACK, with the window never getting opened.
  • The window starting at a reasonable value, but shrinking to zero during the connection, then opening up again (either naturally or as a reaction to zero window probing).
  • Probes getting sent at the expected timeouts if the zero window condition persists for too long.
  • Advertising a zero window yourself to one of the endpoints when buffers are full. Check that anything sent in excess of the advertised window is properly dropped.
  • Correctly reacting to zero window probes sent by that endpoint. (Both the "still zero" and "I have some space now" cases).
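
For instance the "shrinks to zero, then reopens" case might look roughly like this (assuming an INJECT_FROM_CLIENT macro built like INJECT_FROM_SERVER, the STEP macro from earlier, and made-up sequence numbers and timeouts):

// Client suddenly advertises a zero window mid-connection.
INJECT_FROM_CLIENT("[A], seq 161, ack 1003101, win 0");

// After the (made-up) 5 second persist timeout, a one-byte zero window
// probe should go out toward the client.
EXPECT_TO_CLIENT("[A], seq 1003100:1003101, ack 161, win 48000");
STEP(5000);

// The probe provokes a window update, and buffered data should flow again.
EXPECT_TO_CLIENT("[A], seq 1003101:1004501, ack 161, win 48000");
INJECT_FROM_CLIENT("[A], seq 161, ack 1003101, win 64000");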

Are these kinds of tests interesting or useful? I'd like to think so. At least that list is a mix of things we got wrong at one time or another, things we've seen others get wrong, and tests done just in case. Thinking about the test cases also gets you into an adversarial mode of thought, where it's easier to see the cases that were left unhandled.

Of course this kind of testing has its limits, and couldn't possibly detect all failures. As I've written earlier, TCP is harder than it looks mainly because of the bizarre interoperability failures. Unit testing can catch algorithm bugs during development, but will at best act as a regression test for problems caused by endpoints that behave in completely unexpected ways. Nor does it help at all with testing some other parts on the critical traffic path, like our custom device drivers. But you can't let the perfect be the enemy of the good.

Even when there are corners of the system that you can't test, I've still found unit testing and a semi-TDD approach [4] to be hugely valuable in this problem space, and I've found myself leaning on writing the test cases before code much more heavily than in any other project before. In fact, when we get a bug report and have a theory about what could be going on, the first step is writing a test case to verify or disprove the theory. It's just an order of magnitude faster to set up a fully controlled test case with this system than it would be to try to recreate the hypothetical network conditions required for the bug to manifest.

There are some nice side benefits in addition to the typical gains from testing. One is that we get IPv6 test coverage essentially for free: we can run the same tests twice, once with the packet generator making IPv4 packets and once with it generating IPv6 ones. It mostly just requires a bit of finesse with the packet pretty-printing / parsing to account for the different IP address size.
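In practice that can be as simple as parameterizing the test run over the IP version (a sketch; IpVersion and run_all_tests are illustrative names):

// Run the identical tests once per IP version; only the packet
// generator and the pretty-printer/parser need to know the difference.
for (IpVersion version : {IpVersion::V4, IpVersion::V6}) {
    TcpTest tcp_test(version);
    run_all_tests(tcp_test);
}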

Conclusion

Anyway, I'm really happy with this setup for low-level network programming. If it truly is the case that in-kernel TCP stacks are untestable, maybe that's just another reason to get networking out of the OS and into userspace.

Footnotes

[1] Our TCP stack is part of a transparent performance enhancing proxy. It splits every TCP connection into two halves without terminating the connections. The TCP connection is only taken over by the proxy after the initial handshake finishes, so both endpoints end up having a compatible view of the TCP options and sequence numbers used for the connection. This means that we essentially run a separate and full TCP stack for each half of the connection, but e.g. the amount of data that has been acked on one half affects how much window space we want to advertise on the other half.

[2] One file per interface for the inputs, one file per interface for expected outputs, have the test driver compare actual outputs to expected ones.

[3] Not just guessing: I know it's basically useless, since I later implemented this model for creating regression tests for issues we already had example traces for. Theoretically this allowed creating new test cases with almost zero effort, but the tests were so annoying to validate and maintain that we only ever made four of them. This is odd, because in past lives this general form of testing has been my tool of choice over lovingly handcrafted artisanal unit tests.

[4] Semi-TDD, since the diehards wouldn't be happy with testing essentially a single static entry point for klocs and klocs of code.