Starting a new project as open source feels like the simplest thing in the world. You just take the minimally working thing you wrote, slap on a license file, and push the repo to GitHub. The difficult bit is creating and maintaining a community that ensures the long-term continuity of the project, especially as some contributors leave and new ones arrive. But getting the code out in a way that could be useful to others is easy.

Things are different for existing codebases, in ways that are hard to appreciate if you haven't tried it. Code releases made with no attempt to build a community around them, and which aren't kept in sync with the proprietary version, are derided as "throwing the code over the wall". But getting even to that point can be non-trivial.

Almost from the start, we've had vague plans of eventually releasing some of Teclo's code as open source. Not the commercially vital bits like the core TCP stack, but at least some of the auxiliary programs and libraries. Setting those bits free should be a win all around. The public gets access to something they didn't have before. We get the warm and fuzzy feeling of contributing something back, maybe a tiny bit of good PR, maybe some useful code contributions, and potentially a future recruitment channel.

But it's always been just talk. Something of such low priority ("would be nice to do that some day") gets pushed back by more important things, and never actually gets done. Last week, though, I had two such discussions with people outside the company rather than internally. The two discussions turned out rather differently.

The unrealistically easy case

In the first discussion, Luke was looking for an SSE-accelerated IP checksum library to benchmark and maybe use in Snabb Switch. He knew that Teclo had one, and asked if he could get access to it. It was an ideal case: completely self-contained in one file, not providing any real competitive advantage to us, and in the dependency graph of stuff we'd already decided should be open. Realistically, the only reason not to have released that code ages ago was that we didn't know anyone had any use for it; hardware offload of checksum computation and validation is kind of the default these days.

So releasing that code to the wild took about 5 minutes of my time. And within a day the now-free code had sprouted a significantly faster AVX2 variant that we could easily merge back (thanks, Luke!). Checksum computations aren't a huge bottleneck for us, but it could still reduce our CPU usage by 1% once we upgrade to Haswell-based Xeons.
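
For context, the core of such a library is just the classic one's-complement sum from RFC 1071; the vectorized versions compute the same thing with wider accumulators and fold the carries at the end. A minimal scalar sketch looks roughly like this (purely illustrative, not the released code):

    #include <cstddef>
    #include <cstdint>

    // Scalar Internet checksum (RFC 1071): sum the data as 16-bit big-endian
    // words in one's-complement arithmetic, then return the complement.
    uint16_t ip_checksum(const uint8_t* data, size_t len) {
        uint64_t sum = 0;
        while (len > 1) {
            sum += (uint64_t(data[0]) << 8) | data[1];
            data += 2;
            len -= 2;
        }
        if (len == 1) {
            sum += uint64_t(data[0]) << 8;   // odd trailing byte, padded with zero
        }
        // Fold the carries back into the low 16 bits (end-around carry).
        while (sum >> 16) {
            sum = (sum & 0xffff) + (sum >> 16);
        }
        return uint16_t(~sum);
    }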

It's about as clean a win-win as you can get. If things were always this easy and productive, we'd have most of our codebase opened up in a jiffy :-)

A more realistic case

Unfortunately things aren't always that easy. The other discussion went roughly like this.

A customer of ours needs to do some load-testing. They know we've got a decent in-house traffic generator with some properties that no other freely available traffic generator has. (Basically, support for huge concurrent connection counts combined with generating valid bidirectional TCP flows that react correctly to network events such as lost packets; not just replaying a capture or generating semi-random packets. If you're aware of an existing open source traffic generator like that, please let me know in the comments.)

So they wondered if they could maybe have access to our traffic generator. Unfortunately, the program in its current state isn't useful to anyone other than us, since it uses a proprietary high-speed raw packet library that's specific to one obscure brand of NICs. We've got a few of those in the lab for load-testing (we stopped using them in production years ago, so they're useless for anything else), but it's unlikely that anyone else has them in this day and age. On hearing this, the customer suggested that they could just add DPDK support to the traffic generator themselves. And from there it was a short step to agreeing that it'd be in everyone's best interest for all of this to be released as open source. Again, it should be an obvious win-win.

(Of course this might not actually happen, so please don't take this as any kind of a promise. But it's an instructive example anyway.)
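
For illustration, this is roughly the shape of a pluggable packet IO interface that would let a DPDK backend slot in next to the proprietary one. The names here are invented for the sketch, not our actual API:

    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <string>

    struct Packet {
        const uint8_t* data;
        size_t length;
    };

    // Each raw packet IO backend (proprietary library, DPDK, AF_PACKET, ...)
    // implements this interface; the traffic generator sees only the abstraction.
    class PacketIoBackend {
    public:
        virtual ~PacketIoBackend() = default;
        virtual bool send(const Packet& packet) = 0;   // enqueue one frame for TX
        virtual bool receive(Packet* out) = 0;         // poll for one RX frame
    };

    // The concrete backend would be chosen at startup, e.g. from a command
    // line flag (declaration only; each backend supplies an implementation).
    std::unique_ptr<PacketIoBackend> make_backend(const std::string& name);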

The big difference from the previous case is that here the program is nowhere near self-contained. It'd pull in ridiculous amounts of cruft from all over our code base. Some of the cruft is just useless for a standalone open source release, some is a little embarrassing, and some of it is totally blocking. (And the blocking code can't just be cut out. You can maybe release a program in a rough state, but at a bare minimum it must compile and do something useful.)

  • First, there's a baroque build system that depends on dozens of makefile fragments and external scripts. Reasonable for a large project with a long history and odd requirements, not something anyone would like to work with for a small standalone tool. Even if we used a less insane tool than make (e.g. CMake), figuring out exactly which build steps we actually need and rewriting a minimal build file would take a while.
  • Then there's a custom JSON-over-UDP RPC system, for a program that doesn't fundamentally need to do any RPC. But it does it anyway, since as a matter of policy all of our programs are RPC servers, and all of them support a few RPC methods by default (e.g. querying counters). Nobody outside the company would actually want to use these facilities. They couldn't even use them, unless we also released some additional tools for actually making RPC requests... The downside of including the RPC server in the code is that it pulls in lots and lots of additional code that's not really relevant for this particular program. (See the sketch after this list.)
  • There's also a horrible system for generating C++ classes with automated JSON marshaling / unmarshaling from data structure description files. (It's written in sh and the C preprocessor, to give an idea of what a kludge it is; I feel a little bit ill just at the thought of more people seeing that code.) Including this code is pretty much necessary, since that's how the traffic generator is configured. (An illustration of the kind of class it generates also follows after this list.)
  • And there's a little mini-language that describes the link aggregation setup in use for a set of physical network interfaces. Any time we pass interface names around, e.g. on the command line, they're expressed in this language, because anything we run in production needs to handle link aggregation. But there's no compelling reason for a load generator to support EtherChannel (or whatnot), so all that code is just extra baggage.
  • There are various in-house utility libraries that we use all over the place. Timers, realtime clock libraries, select/epoll/etc wrappers, logging, minimal C++ wrappers for C libraries, small utilities made obsolete by C++11, and so on. Most of these are doing jobs that would be better done by mature open source libraries, but where we ended up with something in-house instead for some (usually path-dependent) reason. None of those reasons would actually apply to an open source program living completely outside our main repository.
  • And then there are hardwired dependencies on third-party libraries for minor conveniences, like linking in libunwind for backtraces. That's critical for anything we run in production, and thus habitually linked into everything, but somewhere between useless and mostly harmless for a tool like this.
  • Finally there are a bunch of raw packet IO backends. Some of them nobody in the world would actually want to use, and some we might not even be able to legally distribute in buildable form due to proprietary third-party library dependencies.
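
To make the RPC point a bit more concrete, here's a hypothetical sketch of the kind of default facility I mean: a UDP socket that answers a counter query with a JSON blob. The port, wire format and names are invented for the sketch, not our actual protocol:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    #include <cstdint>
    #include <cstring>
    #include <map>
    #include <string>

    // Answer {"method":"counters"} requests with a JSON object of counters.
    // Error handling omitted for brevity.
    int main() {
        std::map<std::string, uint64_t> counters = {
            {"packets_sent", 0}, {"packets_received", 0}};

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = htons(9000);   // arbitrary port for the sketch
        bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

        char buf[2048];
        for (;;) {
            sockaddr_in peer{};
            socklen_t peer_len = sizeof(peer);
            ssize_t n = recvfrom(fd, buf, sizeof(buf) - 1, 0,
                                 reinterpret_cast<sockaddr*>(&peer), &peer_len);
            if (n <= 0) continue;
            buf[n] = '\0';

            // A real server would parse the JSON request properly; this just
            // looks for the method name.
            if (std::strstr(buf, "\"counters\"") == nullptr) continue;

            std::string reply = "{";
            for (const auto& kv : counters) {
                if (reply.size() > 1) reply += ",";
                reply += "\"" + kv.first + "\":" + std::to_string(kv.second);
            }
            reply += "}";
            sendto(fd, reply.data(), reply.size(), 0,
                   reinterpret_cast<sockaddr*>(&peer), peer_len);
        }
    }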

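And to give an idea of the kind of marshaling code the generator described above spits out, here's a purely illustrative example; the description-file syntax, field names and method are made up, only the flavor is real:

    #include <cstdint>
    #include <sstream>
    #include <string>

    // Roughly what the generator might emit for a description like
    //   struct FlowConfig { uint32 connections; string target; }
    struct FlowConfig {
        uint32_t connections = 0;
        std::string target;

        // Generated marshaling: serialize the struct as a JSON object.
        // (String escaping omitted to keep the sketch short.)
        std::string to_json() const {
            std::ostringstream out;
            out << "{\"connections\":" << connections
                << ",\"target\":\"" << target << "\"}";
            return out.str();
        }
    };
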
And that was just what I could think of off the top of my head; there's probably stuff that I've forgotten about completely. As soon as you get beyond the "one self-contained file or directory" level of complexity, the threshold for releasing code becomes much higher. And every change made to a program in order to open source it makes it less likely that the internal and public versions can really be kept in sync in the long term.

In this case the core code is maybe 2k-3k lines and won't require much work; it's all the support infrastructure that's going to be an issue. Getting it to even a minimally releasable state is likely to be at least a few days of work, and that "minimum" is going to be rough. That's not a huge amount of work, and it's still probably worth doing. But it's significant enough that it needs to be properly scheduled and prioritized.

What should have been done differently?

At first this trouble seems like a sign of some kind of software engineering failure. Why do we have code around that's embarrassing; why not fix it? Why does the transitive dependency closure of the program include code that can't possibly get executed? Why is the build system monolithic and a bit cumbersome?

But after giving this some thought, I'm less sure. Except for a couple of unnecessary instances of NIH, it's not clear to me that there was any problem with the development process as such.

For example, one could imagine maintaining a strict separation between projects / components: "This is our RPC library, it lives in this repository, it has its own simple build system, its own existence independent of everything else, its own release process and schedule, and it could be trivially open sourced." The top-level build could be as simple as pulling in the latest version of each component, building them separately, and bundling them up somehow. Some people would argue that such forced modularization is a good thing. But I take it almost as axiomatic that a single repository for all code is a major benefit that you should not give up unless you really have to. (E.g. being able to atomically update both an interface and all users of that interface is just so nice.)

And what about all of that old and crufty code? Well, the reality is that a lot of the old code has been completely rewritten over the years; that was the code that actually caused trouble. What survives might be ugly, but it works. No matter how good the test coverage, rewriting or even refactoring a lot of that stuff would be a totally unnecessary and unjustifiable risk.

And it's obvious that when writing a tool mainly for internal use, you should use the normal internal conventions and convenience libraries. Writing crippled bespoke versions of those things just to make a possible future open source release easier would be insane.

Even if making real changes to the development process isn't feasible, there's one thing that could be improved with little effort: simply keep an eye out for situations where small, useful and mostly self-contained bits of code could be opened up. They're going to be the easiest to release, and they're well positioned to create a useful feedback cycle. If the bottleneck really is finding the time to do the work, this should give a pretty good bang for the buck.

Any obstacles to doing this are more likely to be psychological than technical or political. "Of course nobody else in the world could possibly want to do such a special-purpose task." "No point in releasing something that trivial." "Just how anticlimactic is it to release one file of code and a makefile, a lame near-empty repository?" Obviously these kinds of fears are nonsense.

So I'm going to make a conscious effort to find these opportunities, and actually take advantage of them. As an extra incentive I'm talking about it publicly in this post, so that you can mock me in a year if I've still just talked the talk :)