Introduction

I recently watched a video of a great talk on the early days of pcap by Steve McCanne. The bit on how the filtering language was designed - around the 26 minute mark but you might want to start at 20 minutes if you're unfamiliar with BPF - was one of the best stories about creating a new "little language" I've heard.

But that got me thinking a bit. This language is a tool that I use daily, that I'm generally happy with, but that also drives me absolutely crazy sometimes. This post is an attempt to look at some classes of problems that the pcap filtering language fails on, why those deficiencies exist, and why I continue using it even despite the flaws.

Just to be clear, libpcap is an amazing piece of software. It was originally written for one purpose, and it really is my fault that I too often end up using it for a different one. There are three very different use cases that I have for a packet filtering language (others may have more).

  • Small and simple filters to pick out a specific slice of traffic (single protocol, single flow, or single host). I believe it's fair to say that this is what the language was originally designed for.
  • Potentially complex filters for classifying traffic with real-time constraints and with no state, usually when using the filters for configuration rather than as an exploratory tool. This is where pcap is clumsy even when it generally works.
  • Offline analysis at higher protocol layers that'd also benefit from tracking the high-level protocol state between packets. You can sometimes coerce pcap to work for this use case, but it's super-awkward. It's also worth noting that features that are beneficial for this use case would not be welcome in the others. (Being able to run an arbitrary PCRE regexp on the packet payload? Great when doing offline analysis, unacceptable for real-time classification).

I try to do the third case with tools better suited for that, and only have a couple of complaints (e.g. VLAN support) about the first case. Mostly the pain comes from the middle case. So as we start the tour of annoyances, keep in mind that I'll often complain about a tool not doing a job it wasn't meant for.

VLANs

The VLAN support might be the oddest part of pcap filters.

First, let's start off with the way filters require the presence of VLANs to be explicitly specified. Forgetting to do that might be the most common mistake I've seen people make (or made myself). They do a tcpdump on all traffic on an interface, and get a bunch of packets. Then they try to specify a filter, and no matter how liberal the filter is, no traffic shows up. The first suggestion to any report I get of a filter not matching properly is "did you remember to add a vlan directive". Note the contrast to e.g. IPv6 support, where pcap will automatically generate both IPv4 and IPv6 matching code for a filter that just has e.g. a tcp directive.

But let's say that your users have managed to internalize this requirement, and generally remember to specify exactly the right number of vlan directives. The usage then looks pretty straightforward.

Match all TCP traffic on VLAN 11.

vlan 11 and tcp

Match all TCP traffic on any VLAN.

vlan and tcp

But there's a dark secret (a well-documented secret, mind you). This filter will never match any well-formed traffic:

tcp and vlan

The problem is that the vlan directive actually affects the remaining expression, but not the already compiled parts. (It adjusts the offset of all later memory lookups to account for the space taken by the VLAN header). So this filter first requires a packet to be a non-VLAN tagged TCP packet, and then requires it to be some kind of VLAN tagged packet. This is a pretty unlikely combination...

Ok, what about this filter for matching all TCP traffic, whether it's VLAN tagged or not:

(vlan and tcp) or tcp

Again that doesn't work. From reading the man page too generously, one might think that the change in the lookup offset is scoped just to the parenthesized sub-expression that the vlan directive appears in. That's not the case: it really affects the rest of the full filter expression. So the source must be rearranged such that all the non-vlan options come first:

tcp or (vlan and tcp)

But wait, it gets worse!

vlan 11 or vlan 12

This does not match a packet with either a VLAN tag of 11 or 12, as one might expect. Actually it matches any packet with a tag of 11, or a double-tagged VLAN packet with an outer tag of anything and an inner tag of 12. This is because the offset tweaking is purely a compile-time effect. So the vlan 11 directive will advance the offset whether it matched or not.

One might expect that the right way to do this is something like vlan 11 or 12. But that doesn't even parse despite being analogous to other pcap filter constructs. Instead you need to dig out the relevant bits from the ethernet header manually.

vlan and (ether[14:2] & 0xfff == 11 or ether[14:2] & 0xfff == 12)

The generated BPF code is actually good since the repetition gets optimized away, it's just very awkward to write an expression like that.
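Since nobody wants to type that by hand for more than a couple of IDs, one option is to generate it. The following is a minimal Python sketch, assuming single-tagged frames only; the function name vlan_any_of is made up for illustration and is not part of libpcap.

```python
def vlan_any_of(vlan_ids):
    """Return a pcap filter matching single-tagged frames whose VLAN ID is in vlan_ids."""
    ids = sorted(set(vlan_ids))
    if not ids:
        raise ValueError("need at least one VLAN ID")
    for vid in ids:
        if not 0 <= vid <= 0xFFF:
            raise ValueError(f"VLAN ID out of range: {vid}")
    # ether[14:2] is the tag control field; the low 12 bits are the VLAN ID.
    tests = " or ".join(f"ether[14:2] & 0xfff == {vid}" for vid in ids)
    return f"vlan and ({tests})"


print(vlan_any_of([11, 12]))
```

The output is just the hand-written expression from above, so it inherits all of the same caveats about where in the full filter it can be placed.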

And finally, the cruelest trick of the pcap VLAN support:

(not vlan) and tcp

Non-VLAN tagged TCP, right? But as should be obvious at this point, that's not how things work in this corner of pcap. Again the compiler advances the offset to account for the VLAN header even though we've explicitly specified no VLANs. And thus the lookups for the IP ethertype and TCP IP proto will be from the wrong locations.
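To make the compile-time nature of the offset concrete, here is a toy model in Python. It is a sketch of the behavior described above and not how libpcap is implemented: ToyCompiler, frame and the combinators are hypothetical names, only ethertype 0x8100 and IPv4 are considered, and the "compiler" state is a single shift that every vlan directive bumps by 4 bytes.

```python
import struct

ETH_IP, ETH_VLAN, PROTO_TCP = 0x0800, 0x8100, 6

def u8(p, off):
    # Out-of-range loads return -1 so the comparison fails (BPF would reject the packet).
    return p[off] if off < len(p) else -1

def u16(p, off):
    return struct.unpack_from("!H", p, off)[0] if off + 2 <= len(p) else -1

def frame(vlans=(), proto=PROTO_TCP):
    """Minimal Ethernet/IPv4 frame carrying the given VLAN IDs, outermost first."""
    eth = bytes(12)                                   # dst + src MAC, zeroed
    for vid in vlans:
        eth += struct.pack("!HH", ETH_VLAN, vid)      # TPID, then tag with the VLAN ID
    ip = bytearray(20)
    ip[0], ip[9] = 0x45, proto                        # version/IHL, protocol
    return eth + struct.pack("!H", ETH_IP) + bytes(ip)

class ToyCompiler:
    """Compile-time state: how far later lookups are shifted by earlier vlan directives."""
    def __init__(self):
        self.shift = 0

    def vlan(self, vid=None):
        s = self.shift
        self.shift += 4          # happens at compile time, whether or not the test matches later
        def test(p):
            tci = u16(p, 14 + s)
            return u16(p, 12 + s) == ETH_VLAN and tci >= 0 and (
                vid is None or (tci & 0x0FFF) == vid)
        return test

    def tcp(self):
        s = self.shift
        return lambda p: u16(p, 12 + s) == ETH_IP and u8(p, 14 + s + 9) == PROTO_TCP

def and_(a, b): return lambda p: a(p) and b(p)
def or_(a, b):  return lambda p: a(p) or b(p)
def not_(a):    return lambda p: not a(p)

def build(make):
    # Python evaluates call arguments left to right, which stands in for parse order.
    return make(ToyCompiler())

tcp_and_vlan     = build(lambda c: and_(c.tcp(), c.vlan()))
vlan11_or_vlan12 = build(lambda c: or_(c.vlan(11), c.vlan(12)))
not_vlan_and_tcp = build(lambda c: and_(not_(c.vlan()), c.tcp()))
tcp_or_vlan_tcp  = build(lambda c: or_(c.tcp(), and_(c.vlan(), c.tcp())))

for name, f in [("tcp and vlan", tcp_and_vlan), ("vlan 11 or vlan 12", vlan11_or_vlan12),
                ("(not vlan) and tcp", not_vlan_and_tcp), ("tcp or (vlan and tcp)", tcp_or_vlan_tcp)]:
    print(name, [f(frame(v)) for v in ([], [11], [12], [5, 12])])
```

In this model the second vlan in vlan 11 or vlan 12 is compiled with a shift of 4, so it can only ever look at an inner tag, and the tcp in (not vlan) and tcp looks for the ethertype 4 bytes too late.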

It's all a bit of a mess. But even if anyone was willing to break compatibility by fixing this, it'd be tricky to do given the basic model of handling the VLANs statically at compile time. Instead it'd need to be done dynamically, either through lots of duplicate code or by maintaining a dynamic offset that's added to all subsequent memory lookups. (And this latter option would presumably make things a lot harder for the pcap BPF optimizer).

Even so, the behavior of vlan is in stark contrast to the extreme user-friendliness of most of the basic pcap filter constructs, as well as the design principles outlined in McCanne's talk.

No abstraction facilities

At Teclo a substantial part of the configuration for our mobile TCP accelerator consists of pcap filters. There's all kinds of decisions that'd be incredibly hard to cover with a static set of configuration parameters, but can easily be done with simple filters. The filters are safe, and might even be familiar to the network administrators unlike our bespoke configuration options. What's not to like?

The problem is that once a filter-based decision making mechanism exists, the filters will soon stop being simple. The pcap language provides basically no tools at all for managing this complexity, beyond discouraging complex filters by making writing them painful. In fact the lack of any control structures virtually ensures massive code duplication and all the maintenance problems that this implies.

Let's take for example our optimize-filter, which can be used to select at SYN-handling time which TCP flows to optimize and which to just pass through. One use for such a filter might be to disable optimization for a few hosts that misbehave, are being tested, or something along those lines. Those are simple behaviors, and require only simple filters.

But needs change, and soon you'll get cases like wanting to optimize the traffic of half the mobile subscribers, but not optimizing the traffic of the other half, and doing that split in some deterministic manner. The goal is to gather statistics for both groups of users, and hopefully show the benefits of optimization. That's simple, right? So let's see where that kind of a requirement could take us.

The filter gets only TCP SYNs as input from the application layer, so we don't need to check for that again in the filter. Just pull out the last two bytes of the source IP, mix them together, and check the lowest bit. If it's 1, optimize. If it's 0, don't optimize:

((ip[15] + ip[14]) & 1) == 1

There's an obvious problem. This is using the src address of the SYN, but it's possible that the mobile device is actually the recipient rather than sender of the SYN. To fix this, we need to check the subnets of the endpoints, and choose the bits from either the source or destination address.

((src net 10.0.0.0/16) and ((ip[15] + ip[14]) & 1) == 1) or
((dst net 10.0.0.0/16) and ((ip[19] + ip[18]) & 1) == 1)

But wait, what if both the sender and recipient are on the mobile subscriber subnet, and one is eligible for optimization and the other isn't? We'll end up optimizing the traffic regardless of the direction, which will distort the statistics. Instead we need to make it symmetric, based on either the sender or the receiver. And we should really be prepared for the case where we weren't given accurate info, and neither the sender nor the receiver is in the mobile address pool:

((src net 10.0.0.0/16 and not dst net 10.0.0.0/16) and
 ((ip[15] + ip[14]) & 1) == 1) or
((dst net 10.0.0.0/16 and not src net 10.0.0.0/16) and
 ((ip[19] + ip[18]) & 1) == 1) or
((src net 10.0.0.0/16 and dst net 10.0.0.0/16) and
 ((ip[15] + ip[14]) & 1) == 1) or
((not src net 10.0.0.0/16 and not dst net 10.0.0.0/16) and
 ((ip[15] + ip[14]) & 1) == 1)
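When a filter grows to this size it helps to have the intended decision written down somewhere executable, if only to test the filter against it. Here is a minimal Python sketch of what the four branches above compute. The function name and the pools parameter are made up for illustration, and it only handles IPv4.

```python
import ipaddress

def should_optimize(src, dst, pools=("10.0.0.0/16",)):
    """Mirror the four-branch filter: pick one address, then test the low bit of a byte sum."""
    nets = [ipaddress.ip_network(p) for p in pools]
    s, d = ipaddress.IPv4Address(src), ipaddress.IPv4Address(dst)
    src_mobile = any(s in n for n in nets)
    dst_mobile = any(d in n for n in nets)
    # Only the "dst is mobile, src is not" branch keys on the destination address;
    # the other three branches all use the source address.
    key = d if (dst_mobile and not src_mobile) else s
    b = key.packed
    # b[3] and b[2] correspond to ip[15]/ip[14] (source) or ip[19]/ip[18] (destination).
    return ((b[3] + b[2]) & 1) == 1


print(should_optimize("10.0.3.4", "8.8.8.8"))   # source is the mobile side
print(should_optimize("8.8.8.8", "10.0.3.4"))   # destination is the mobile side
```

Seen this way the whole thing is one conditional and one arithmetic test, which is exactly what the pcap version cannot express without repeating itself.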

It's of course unrealistic for a mobile operator to have just one mobile address pool. There will be a couple of pools for premium users with public IPs, a private IP pool for the main APN, another private IP pool for a performance testing APN, and so on. So really you can't just be checking for this single subnet.

(((src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/24) and
 not (dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/24)) and
 ((ip[15] + ip[14]) & 1) == 1) or
(((dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/24) and
 not (src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/24)) and
 ((ip[19] + ip[18]) & 1) == 1) or
(((src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/24) and
 (dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/24)) and
 ((ip[15] + ip[14]) & 1) == 1) or
((not (src net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/24) and
 not (dst net 10.0.0.0/16 or 178.63.66.0/24 or 173.194.40.0/24)) and
 ((ip[15] + ip[14]) & 1) == 1)

You might be wondering about the ever increasing nesting level of parentheses. It looks like the kind of Lisp joke that was already stale in the '70s. But that's just what these filters end up looking like once they get complicated enough, since the users don't have confidence in getting the precedence right. And I'll spare you from the version with IPv6 support.

Once you get into filters like this, the benefit of accessibility and familiarity goes out the window. Anyone with some Unix or networking background should be able to deal with simple pcap filters, but you can't expect most people to do so with a 10 line filter with parentheses nested 4 levels deep. The only way this would actually get entered is by someone copy-pasting from our documentation. And even that wouldn't necessarily work, since there's so much stuff in that expression that needs to be customized for the specific network, and which needs to be updated in a consistent manner.

But once the optimizer has had its way with it, the generated code isn't as horrible as the source code. And this filter could actually be really simple with just some way of eliminating the redundancy. It's so close to still being a good tool.

But maybe that's a theoretical case. How about only (reliably) matching packets that have the window scale TCP option set, another kind of thing you might hypothetically want to affect optimization settings in this use case? Well, assuming a maximum of 6 TCP options I think you'd need something like this:

tcp and ((tcp[tcpflags] & tcp-syn) != 0) and ((tcp[20] == 3) or ((tcp[20] != 1) and ((tcp[20 + tcp[21]] == 3) or ((tcp[20 + tcp[21]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] + 1]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] + 1] == 3))))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 
tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 2]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 1] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] + 2] == 3))))))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2]] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2]] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] + 1]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2]] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] + 1] == 3))))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 1] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] == 3) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] != 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2 + tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 3]] == 3))) or ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 2] == 1) and ((tcp[20 + tcp[21] + tcp[20 + tcp[21] + 1] + 3] == 3))))))))) or ((tcp[20 + tcp[21]] == 1) and ((tcp[20 + tcp[21] + 1] == 3) or ((tcp[20 + tcp[21] + 1] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2]] == 3) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2]] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1]] == 3) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] 
+ 1]] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] + 1]] == 3))) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1]] == 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] + 1] == 3))))) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2]] == 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] == 3) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] != 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1 + tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 2]] == 3))) or ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 1] == 1) and ((tcp[20 + tcp[21] + 1 + tcp[20 + tcp[21] + 2] + 2] == 3))))))) or ((tcp[20 + tcp[21] + 1] == 1) and ((tcp[20 + tcp[21] + 2] == 3) or ((tcp[20 + tcp[21] + 2] != 1) and ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3]] == 3) or ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3]] != 1) and ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3] + tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3] + 1]] == 3))) or ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3]] == 1) and ((tcp[20 + tcp[21] + 2 + tcp[20 + tcp[21] + 3] + 1] == 3))))) or ((tcp[20 + tcp[21] + 2] == 1) and ((tcp[20 + tcp[21] + 3] == 3) or ((tcp[20 + tcp[21] + 3] != 1) and ((tcp[20 + tcp[21] + 3 + tcp[20 + tcp[21] + 4]] == 3))) or ((tcp[20 + tcp[21] + 3] == 1) and ((tcp[20 + tcp[21] + 4] == 3))))))))))) or ((tcp[20] == 1) and ((tcp[21] == 3) or ((tcp[21] != 1) and ((tcp[21 + tcp[22]] == 3) or ((tcp[21 + tcp[22]] != 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1]] == 3) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1]] != 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1]] == 3) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1]] != 1) and 
((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] + 1]] == 3))) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1]] == 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] + 1] == 3))))) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1]] == 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] == 3) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] != 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1 + tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 2]] == 3))) or ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 1] == 1) and ((tcp[21 + tcp[22] + tcp[21 + tcp[22] + 1] + 2] == 3))))))) or ((tcp[21 + tcp[22]] == 1) and ((tcp[21 + tcp[22] + 1] == 3) or ((tcp[21 + tcp[22] + 1] != 1) and ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2]] == 3) or ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2]] != 1) and ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2] + tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2] + 1]] == 3))) or ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2]] == 1) and ((tcp[21 + tcp[22] + 1 + tcp[21 + tcp[22] + 2] + 1] == 3))))) or ((tcp[21 + tcp[22] + 1] == 1) and ((tcp[21 + tcp[22] + 2] == 3) or ((tcp[21 + tcp[22] + 2] != 1) and ((tcp[21 + tcp[22] + 2 + tcp[21 + tcp[22] + 3]] == 3))) or ((tcp[21 + tcp[22] + 2] == 1) and ((tcp[21 + tcp[22] + 3] == 3))))))))) or ((tcp[21] == 1) and ((tcp[22] == 3) or ((tcp[22] != 1) and ((tcp[22 + tcp[23]] == 3) or ((tcp[22 + tcp[23]] != 1) and ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1]] == 3) or ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1]] != 1) and ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1] + tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1] + 1]] == 3))) or ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1]] == 1) and ((tcp[22 + tcp[23] + tcp[22 + tcp[23] + 1] + 1] == 3))))) or ((tcp[22 + tcp[23]] == 1) and ((tcp[22 + tcp[23] + 1] == 3) or 
((tcp[22 + tcp[23] + 1] != 1) and ((tcp[22 + tcp[23] + 1 + tcp[22 + tcp[23] + 2]] == 3))) or ((tcp[22 + tcp[23] + 1] == 1) and ((tcp[22 + tcp[23] + 2] == 3))))))) or ((tcp[22] == 1) and ((tcp[23] == 3) or ((tcp[23] != 1) and ((tcp[23 + tcp[24]] == 3) or ((tcp[23 + tcp[24]] != 1) and ((tcp[23 + tcp[24] + tcp[23 + tcp[24] + 1]] == 3))) or ((tcp[23 + tcp[24]] == 1) and ((tcp[23 + tcp[24] + 1] == 3))))) or ((tcp[23] == 1) and ((tcp[24] == 3) or ((tcp[24] != 1) and ((tcp[24 + tcp[25]] == 3))) or ((tcp[24] == 1) and ((tcp[25] == 3))))))))))))

Does that 8kB monstrosity even work? Seems to be ok based on light testing and it all made sense when I wrote it, but I couldn't say for sure. Note that it doesn't implement the EOL TCP option or checking for overflowing into the payload area of the packet, so a proper solution would look even worse. This is a task that BPF would be well suited for, and which shouldn't be an unreasonable task for a packet filter language. But it is unreasonable in pcap, and so anything dealing with TCP options must be done using means other than scripting.
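For comparison, here is roughly what that filter is trying to say when written as a loop. This is a Python sketch of the option walk and not code from any real product; the function name is made up, and it assumes the input buffer starts at the first byte of the TCP header. Unlike the pcap version it handles the EOL option and refuses to read past the end of the header.

```python
def has_window_scale(tcp: bytes) -> bool:
    """Return True if the TCP header carries the window scale option (kind 3)."""
    if len(tcp) < 20:
        return False
    header_len = (tcp[12] >> 4) * 4          # data offset, in 32-bit words
    if header_len < 20 or header_len > len(tcp):
        return False
    i = 20                                   # options start after the fixed header
    while i < header_len:
        kind = tcp[i]
        if kind == 0:                        # EOL: no more options
            return False
        if kind == 1:                        # NOP: single byte, no length field
            i += 1
            continue
        if i + 1 >= header_len:              # length byte would be outside the header
            return False
        length = tcp[i + 1]
        if length < 2 or i + length > header_len:
            return False                     # malformed option
        if kind == 3:
            return True
        i += length
    return False
```

The loop always terminates, since each iteration advances by at least one byte and the header is at most 60 bytes long. The pcap language has no way to say that, which is why the filter above has to spell out every possible path by hand.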

Due to the limits of the pcap language, filters become untenable as a method for configuration surprisingly quickly, and the system instead needs to trigger actions based on a set of static configuration items. Unfortunately you'll never have thought of every possibility in advance, and the next deployment could be one where you're once again wading deep into a too complex filter as a last resort.

What kind of language changes could help, assuming the generally declarative flavor needs to be maintained? Just having some kind of binding construct or alias would help. Consider the first case with the results of the src net / dst net expressions being executed once, the result bound to an identifier, and then referred to using that identifier. It would be so much simpler to read, let alone write.

An if-else-expression or case-expression would prevent having to repeat the negation of a previous expression to get properly disjoint subexpressions. Some kind of pre-programmable macro facility (expanding to pcap code, not doing arbitrary computation) would allow extending the language a bit to cover use cases like this.
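Lacking any of that in the language itself, the closest approximation is to do the expansion outside of it. Below is a minimal Python sketch of such a poor man's macro, generating the symmetric optimize-filter from a list of address pools. All the function names are hypothetical, and the output is only text that still has to be handed to the pcap compiler.

```python
def net_list(direction, pools):
    """Expand to e.g. '(src net 10.0.0.0/16 or 178.63.66.0/24)'."""
    return f"({direction} net " + " or ".join(pools) + ")"

def low_bit(which):
    """Parity test on the last two bytes of the source or destination address."""
    hi, lo = (14, 15) if which == "src" else (18, 19)
    return f"((ip[{lo}] + ip[{hi}]) & 1) == 1"

def optimize_filter(pools):
    src, dst = net_list("src", pools), net_list("dst", pools)
    # The four disjoint cases, each paired with the address the decision is based on.
    cases = [
        (f"({src} and not {dst})", low_bit("src")),
        (f"({dst} and not {src})", low_bit("dst")),
        (f"({src} and {dst})", low_bit("src")),
        (f"(not {src} and not {dst})", low_bit("src")),
    ]
    return " or\n".join(f"({cond} and\n {test})" for cond, test in cases)


print(optimize_filter(["10.0.0.0/16", "178.63.66.0/24"]))
```

This keeps the list of pools in exactly one place, but of course it also means that the thing the user edits is no longer a pcap filter.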

Changing the language like that seems unrealistic, so people who need to do these things while still using BPF as the filtering mechanism have to compile from some other language or write raw BPF assembler. (And then you're really not talking of the kind of configuration that an end user could do, no matter what your definition of end user is).

Hostnames

Allowing some kind of external interaction is the bane of all programming languages that have intentionally limited power. Users will almost instinctively attempt to use such facilities to do something they're not supposed to. I think the only such facility in pcap is name resolution, and since the resolution happens at compile time, it's hard to see how to abuse it. (But users are a devious bunch, so don't quote me on that).

But it's not possible to cleanly disable the hostname resolution either. This is problematic in situations with real-time constraints on the compilation phase too, not just the filter execution phase.

The first time I bumped into this, DNS was sufficiently flaky that name resolution was taking several seconds. When a config change including a filter with a hostname was propagated to the traffic handling processes, they tried compiling the filter, hung while doing the name lookup, a watchdog timer detected the system was unhealthy, and everything was restarted. Ok, not good.

But we're all responsible people, right? Let's just remember to always use addresses rather than names. That worked for about a month. Then somebody made a copy-paste error and ended up with some rubbish right after a host directive, something along the lines of host 10.0.0.1and port 80 except much longer. Since the name lookups happen during parsing, the name resolution was triggered before the system decided that the expression was malformed anyway. Time for the watchdog timer again!

What about doing the filter compilation in a separate thread from the actual work? The problem here is that at least for my uses it's very problematic if configuration changes can take tens of seconds or minutes to propagate everywhere. It's even worse if configuration change requests can queue up, and if the systems in charge of propagating configuration changes don't have a good idea of when a config change will make it through.

Fine, perhaps the filters need to be validated by an earlier phase in the configuration pipeline, for example at the moment the configuration change is submitted (the commit fails if the compilation fails). Unfortunately if the filters get compiled again after the validation for any reason, the lookup that worked originally can hang now. (Or outright fail due to a DNS record having been removed). It's an interesting question what kind of a fallback strategy you can use if a configuration that used to be valid suddenly becomes invalid. But it's better not to be in a situation where you need such a strategy.

There are three workarounds that will actually work. Two of them try to ensure that all name resolution fails immediately, which ensures that no configuration using a host name will ever be accepted in the first place. The third tries to make sure that name resolution can't happen at a time when it'd cause trouble.

  • Hack out the name resolution support completely from the library, or go the extra mile and make it an option.
  • Make sure all hostname lookups always fail immediately, either through some global configuration or maybe by overriding the relevant library functions with LD_PRELOAD.
  • Always reify all filter source to compiled form in a safe process and at a safe time (e.g. when the configuration change to use a new filter is made). Then pass the compiled program around, instead of the source. Or pass a source / binary pair, if it's important to be able to introspect the system and see what filters are in effect. This is actually a pretty decent solution, but requires a lot more changes.
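As an illustration of the third workaround, here is a minimal Python sketch of a source / binary pair. It assumes the compilation has already happened elsewhere and that the result is available in the decimal format printed by tcpdump -ddd (an instruction count followed by one "code jt jf k" line per instruction). The names CompiledFilter, parse_ddd and instructions are made up, and the SAMPLE text is what I'd expect for the filter ip on an Ethernet link, so treat the exact numbers as illustrative.

```python
import struct
from typing import NamedTuple

class CompiledFilter(NamedTuple):
    source: str      # kept only so the running system can be introspected
    program: bytes   # packed BPF instructions, safe to pass around

# Same field layout as struct bpf_insn: 16-bit code, 8-bit jt, 8-bit jf, 32-bit k.
INSN = struct.Struct("=HBBI")

def parse_ddd(source, text):
    """Turn `tcpdump -ddd`-style text into a CompiledFilter."""
    lines = text.strip().splitlines()
    count = int(lines[0])
    insns = [tuple(int(x) for x in line.split()) for line in lines[1:]]
    if len(insns) != count or any(len(i) != 4 for i in insns):
        raise ValueError("malformed -ddd output")
    return CompiledFilter(source, b"".join(INSN.pack(*i) for i in insns))

def instructions(cf):
    """Unpack the binary program back into (code, jt, jf, k) tuples."""
    return [INSN.unpack_from(cf.program, off)
            for off in range(0, len(cf.program), INSN.size)]

SAMPLE = """4
40 0 0 12
21 0 1 2048
6 0 0 262144
6 0 0 0
"""

cf = parse_ddd("ip", SAMPLE)
print(cf.source, len(cf.program), instructions(cf))
```

The traffic handling processes then only ever see the packed program, so nothing they do can trigger a name lookup.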

So this is solvable by jumping through relatively minor hoops. But it does at least break the golden rule of designing little sandboxed languages, which is that no matter how insignificant an interface to the external world is, it should be possible for the program that embeds the language to disable the interface.

Just filtering, not classification

Anyone dealing with packets will soon not be satisfied with simply rejecting / accepting packets. Instead they'll have more than just two categories, and want to assign a packet to one of them. While the BPF instruction set can express this kind of operation through the use of multiple distinct return values, the pcap filtering language can't.

A pcap-style solution is to have multiple separate filters, apply them in order, and pick a category based on the first one to match. And it is a pretty clean solution. But the annoying part is that by doing that we're losing out on one of the main advantages of the language, which is the optimizer. In situations with multiple filter categories there's often a tremendous amount of overlap between the filters, all of which would get optimized away by the compiler if separate filters could be merged pre-optimization.
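The multiple-filters approach looks something like the following Python sketch. The predicates are plain functions standing in for separately compiled filters, the rule set is invented for the example, and the packet layout is assumed to be untagged Ethernet carrying IPv4.

```python
def classify(packet, rules, default="other"):
    """Return the category of the first rule whose filter matches the packet."""
    for category, matches in rules:
        if matches(packet):
            return category
    return default

def is_tcp(p):
    # Ethertype IPv4 at offset 12, IP protocol 6 at offset 14 + 9.
    return len(p) >= 34 and p[12:14] == b"\x08\x00" and p[23] == 6

def dst_port(p):
    ihl = (p[14] & 0x0F) * 4                 # IP header length in bytes
    off = 14 + ihl + 2                       # destination port follows the source port
    return int.from_bytes(p[off:off + 2], "big") if len(p) >= off + 2 else None

# Each rule redoes the "is this TCP over IPv4?" work independently of the others.
RULES = [
    ("web",  lambda p: is_tcp(p) and dst_port(p) in (80, 443)),
    ("mail", lambda p: is_tcp(p) and dst_port(p) == 25),
    ("tcp",  is_tcp),
]
```

A packet that falls through to the last rule has had its ethertype and protocol fields checked three times, which is the kind of overlap a single merged program could have optimized away.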

It makes perfect sense for this to be the case. The language doesn't really have a space for imperative constructs like a return directive. Probably the best you could have is some kind of implicit toplevel "or" joining full toplevel expressions, with a distinct success code somehow assigned to each full expression. But that's straying quite far from the current form of the language.

Why use pcap filters then?

Given the above whining, why am I still writing pcap filters in pretty much any networking program that I write? Why not find some other filter language, write my own, or link in a jitted scripting language and write the packet filters in that? Well, pcap has a few big advantages: running in the kernel, safety, ease of use for the simple cases, and ubiquity.

If you want to run code in the kernel, it has to be through BPF. And pcap is the best way I know of to generate good BPF code. (Note: this might change with the recent introduction of eBPF to the Linux kernel, which e.g. has an LLVM backend. This could have major effects on language availability and ease of optimization).

To me personally the kernel use case is irrelevant, since my packet handling happens in userspace.

Guaranteed termination, or even a guaranteed upper bound on execution time, is of course a huge deal for packet filtering applications. It's kind of bad if an accidental infinite loop in a packet capture filter configuration field crashes a router. This probably excludes 99.9% of non-bespoke languages; it's very hard to get people to give up their Turing completeness. It can be hard to even get basic sandboxing working for some scripting languages, so termination guarantees aren't even on the radar. But the safety advantage doesn't matter if the comparison is to other custom languages, which would certainly be designed from the ground up to have the same properties.

No, for me it's the combination of the last two that does it. Even when I'm writing something to work on live network traffic, during the early part of development it'll mainly be tested on already existing trace files. Those are most likely in pcap format, so linking in libpcap to read those files is a natural choice. Once the program evolves a bit and there's a need for a simple filtering mechanism, it's only natural to use what's already there. Very soon there's a critical mass of filters around and there's no point in investigating other solutions. Hopefully all the filters remain simple, and we stay out of trouble.

And finally, to undermine my own point, the user-space network appliance kit Snabb Switch doesn't have any need to run code in the kernel, uses a distinct implementation of the pcap language that's not backed by BPF (and thus doesn't benefit from the pcap optimizer), and didn't just link libpcap in early on as a way to get some packets fed in.

Despite having a total green-field opportunity, not only did they choose to use the pcap filter language, but even went through the trouble of commissioning said from-scratch luajit-based pcap implementation which eliminates my convenience argument from above. (Andy Wingo's writeup on pflua on this is worth reading, very cool stuff). I asked Luke Gorrie why they went that route, and this is what he had to say:

I think the pflua implementation actually strikes a nice balance:

  • Safe for end-users to configure
  • Relatively general, familiar, efficient
  • Small dependency (at least if you are using LuaJIT)
  • Clean and simple code that can evolve in interesting directions (actions, profiling, fancy LuaJIT packet mangling library underneath, language cleanup, etc)

[...] I didn't want to carry libpcap as a dependency for too long but neither to say "please come back when we have invented the ultimate new packet filtering language".

All of which is hard to argue with, and certainly the whole-application performance improvements from pflua seen in Snabb are pretty impressive.

What's the alternative?

Normally this would be the point where I proclaim the miraculous discovery of a solution to exactly the problems I've been describing while still keeping the (not inconsiderable) benefits of pcap. But alas, no. If anyone has recommendations for filter languages that are as usable as pcap in the small, but that work better when used in the large, I'm all ears.