Juho Snellman's Weblog

Web Environment Integrity vs. Private Access Tokens - They're the same thing!

jsnell@iki.fi — Tue, 25 Jul 2023 18:30:00 GMT

I've seen a lot of discussions in the last week about the Web Environment Integrity proposal. Quite predictably from the moment it got called things like "DRM for the web", people have been arguing passionately against it on HN, Github issues, etc. The basic claims seem to be that it's going to turn the web into a walled garden, kill ad blockers, kill all small browsers, kill all small operating systems, kill accessibility tools like screen readers, etc.

The Web Environment Integrity proposal is basically:

A website can request an attestation from the browser
The browser forwards the attestation requests to an attester
The attester checks properties like hardware and software integrity
If they check out, the attester creates a token and signs it with its private key.
The attester hands off the signed token to the browser, which in turn sends it to the website.
The website checks that the token was signed by a trusted attester

Here's a funny thing I suspect few of those commenters know: A very similar mechanism already exists on the web, and is already deployed in production browsers (Safari), operating systems (iOS, OS X), and hosting infrastructure (Cloudflare, Fastly). That mechanism is Private Access Tokens / Privacy Pass.

Here's what PATs (as deployed by Apple, and on by default) do to the best of my understanding:

A website can request an attestation from the browser
The browser forwards the attestation requests to an attester
The attester checks properties like hardware and software integrity.
If they check out, the attester calls the website's trusted token issuer
The issuer checks whether to trust the attester and whether the information passed by the attester is sufficient, and then issues a token signed by its private key
The attester hands off the signed token to the browser, which passes it to the website.
The website checks that the token was signed by a trusted token issuer

This launch was hailed in the tech press as a win for privacy and security, not as an attempt to kill accessibility tools or build a walled garden. [1]

You might notice that the basic operating model of the two protocols is almost exactly the same. So is their intended use. From the "DRM for websites" perspective, I don't think there is a difference.
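To make the comparison concrete, here is a toy model of the two step lists above. It is only a sketch: a "signature" is modeled as the signer's name, there is no cryptography and no real wire protocol, and every type and function name is hypothetical.

```cpp
#include <set>
#include <string>

// Toy model only: a "signature" is just the signer's name, with no real
// cryptography. All type and function names here are hypothetical.
struct Token {
    std::string signer;
};

struct Device {
    bool genuine;  // stands in for "hardware and software integrity"
};

// WEI-style flow: the attester checks the device and signs the token itself.
bool wei_request(const Device& device, const std::string& attester,
                 const std::set<std::string>& site_trusted_attesters) {
    if (!device.genuine) return false;  // attester refuses to attest
    Token token{attester};              // attester signs the token
    // The website checks that the token was signed by a trusted attester.
    return site_trusted_attesters.count(token.signer) > 0;
}

// PAT-style flow: the attester checks the device, then calls the website's
// issuer. The issuer decides whether it trusts that attester.
bool pat_request(const Device& device, const std::string& attester,
                 const std::string& issuer,
                 const std::set<std::string>& issuer_trusted_attesters,
                 const std::set<std::string>& site_trusted_issuers) {
    if (!device.genuine) return false;
    if (issuer_trusted_attesters.count(attester) == 0) return false;
    Token token{issuer};  // issuer signs the token
    // The website checks that the token was signed by a trusted issuer.
    return site_trusted_issuers.count(token.signer) > 0;
}
```

In this model the gatekeeping is identical in both flows. The only thing that moves is who holds the list of trusted attesters: the website in the first flow, the issuer in the second.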

With both WEI and PATs, the website would be able to ask Apple to verify that the request is coming from a genuine non-jailbroken iPhone running Safari, and block the ones running Firefox on Linux. And in both, the intent is not for the API to be used for that kind of outright blocking.

Neither lists e.g. checking whether the browser is running an ad blocker extension as a use case. Both would have just the same technical capabilities for making that kind of thing happen, by just having the attester check for it, and I bet that in both cases the attester would be equally unmotivated in actually providing that kind of attestation.

It's also not that PATs would somehow make it easier for people to spin up new attesters for small or new platforms. Want to run your own attester for PATs? You could, but the issuers you care about will not trust it. [2]

Now, the technologies aren't quite identical, but the distinctions are subtle and would just matter for exactly the kind of anti-abuse work that both of the proposals were ostensibly meant for. The big one is the WEI proposal including the ability to content-bind the attestation to a specific operation. It's a feature anyone trying to use something like this for abuse prevention would think is needed, but that adds no power to the theorized "DRM for the web" use case. There is also a more obvious difference between the two: whether the attester and issuer are the same entity or split. But that too is irrelevant in the discussion on how the technology could be misused. [3]

In principle there could also be differences in the exact things that the APIs allow attesting for. But neither standard really defines the exact set of attestations, just the mechanisms.

Given the DRM narrative would have worked exactly the same for the two projects, why such a different reception? I can only think of two differences, both social rather than technical.

One is that the PAT (and related Privacy Pass) draft standards were written in the IETF and are dense standardese. There was no plaintext explainer. Effectively nobody outside of the internet standardization circles read those drafts, and if they had they wouldn't have known whether they needed to be outraged or not. The first time it actually broke through to the public was when Apple implemented it.

The other is the framing. PATs were sold to the public exclusively as a way of seeing fewer captchas. Who wouldn't want fewer captchas? WEI was pitched as a bunch of fairly abstract use cases and mostly from the perspective of the service provider, not for how it'd improve the user experience by reducing the need for invasive challenges and data collection.

This isn't the first time I've seen two attempts at a really similar project, with one getting lauded while the other gets trashed for something that's common to both. But it is the one where the two things are the most similar, and it feels like it should be instructive somehow.

If the takeaway is that standards proposals should be opaque and kept away from the public for as long as possible, before being launched straight to prod based on a draft spec, that'd be bad. If it's that standard proposals should be carefully written to highlight the benefit for the end user, even starting from the first draft, that's probably pretty good? And if it's that only Apple can launch any browser features without a massive backlash, it seems pretty damn bad.

[1] Just to be clear, the one significant HN discussion on PATs had similar arguments about it being DRM, so my claim is not that absolutely everyone loved PATs. But it didn't actually get traction as a hacker cause celebre, and as far as I can see the general media coverage was broadly positive.

[2] What's the process for getting Cloudflare or Fastly to trust a non-Apple attester anyway? I can't find any documentation.

[3] The split version seems kind of superior for deployment, since it means each site needs to only care about a single key (their chosen issuer). This makes e.g. the creation of a new attester a lot more tractable. You only need to convince half a dozen issuers to trust your new attester and ingest the keys, not try to sign up every single website in the world one by one.

A monorepo misconception - atomic cross-project commits

jsnell@iki.fi — Wed, 21 Jul 2021 11:00:00 GMT

In articles and discussions about monorepos, there's one frequently alleged key benefit: atomic commits across the whole tree let you make changes to both a library's implementation and the clients in a single commit. Many authors even go as far as to claim that this is the only benefit of monorepos.

I like monorepos, but that particular claim makes no sense! It's not how you'd actually make backwards incompatible changes, such as interface refactorings, in a large monorepo. Instead the process would be highly incremental, and more like the following:

1. Push one commit to change the library, such that it supports both the old and new behavior with different interfaces.
2. Once you're sure the commit from stage 1 won't be reverted, push N commits to switch each of the N clients to use the new interface.
3. Once you're sure the commits from stage 2 won't be reverted, push one commit to remove the old implementation and interface from the library.
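As a minimal sketch of what the library might look like after stage 1, assuming a made-up greeting library (none of these names come from a real codebase): the new interface is added next to the old one, and the old one forwards to it.

```cpp
#include <string>

// Hypothetical library used for illustration only.

// Stage 1: the new interface is added next to the old one.
struct GreetingOptions {
    std::string name;
    bool shout;
};

std::string greet(const GreetingOptions& options) {
    std::string text = "Hello, " + options.name;
    return options.shout ? text + "!" : text;
}

// Old interface, kept until stage 3. It forwards to the new implementation,
// so both behave identically while clients migrate one commit at a time.
// It could also be marked [[deprecated]] to discourage new callers.
std::string greet(const std::string& name) {
    return greet(GreetingOptions{name, false});
}
```

Each stage 2 commit then just rewrites one client's calls from the string overload to the options overload, and stage 3 deletes the forwarding overload.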

There's a bunch of reasons why this is a nicer sequencing than a single atomic commit, but they're mostly variations on the theme: mitigating risks. If something breaks, you want as few things as possible to break at once, and for the rollback to a known-good state to be simple. Here's how the risks are mitigated at the various stages in the process:

There is nothing risky at all about the first commit. It is just adding new code that's not yet used by anyone.
The commits for changing the clients can be done gradually, starting with the ones that the library owners are themselves working on, the projects that are most likely to detect bugs, or the clients that are most forgiving to errors. Depending on the risk profile of the change, you might even use these commits as a form of staged rollout, where you'll wait to see if the previous clients report any problems in production before sending the next batch of commits for code review.
The final commit to remove the old implementation can only break a minimal number of clients: the ones that just started using the library between the removal commit being reviewed and pushed, and did so using the old interface. The ideal environment would have tooling in place to prevent that kind of backslipping from happening in the first place (e.g. lint warnings on new uses of deprecated interfaces).

If anything goes wrong in stage 2, it's trivial to revert a commit that's only touching a couple of files. By contrast, reverting a commit that's spanning hundreds of projects would be quite painful, especially if the repo has any kind of per-directory ACLs (which I think is mandatory for a big monorepo). It gets worse if the breakage isn't detected immediately, since the more code the single change affects, the less likely it is that the revert applies cleanly.

If anything goes wrong in stage 3, it would also have gone wrong when using atomic commits. But with atomic commits the breakage in stage 3 is far more likely, since the new users will naturally use the old interface (the new one doesn't exist yet in their view of the world), and since the window between start of code review and committing will be wider. And again, the rollback will be far easier with the commit that's only touching the library and not the clients.

There are some additional reasons why the huge commit will be annoying. For example, getting a clean presubmit CI run will become progressively harder the more projects a single commit is changing.

Sure, the atomic commit will save a little bit of work in not needing to have the implementation support both interfaces at once. But that tiny saving is just not a worthwhile tradeoff when compared to how much work wrangling the huge commit would be.

It's particularly easy to see that the "atomic changes across the whole repo" story is rubbish when you move away from libraries, and also consider code that has any kind of more complicated deployment lifecycle, for example the interactions between services and client binaries that communicate over an RPC interface. Obviously you can't do an atomic change in that case, since you need to continue supporting the old server implementation until all client binaries have been upgraded (and are rollback-safe). The same goes for changes to database schemas, command line tools, synchronized client-side JavaScript + backend changes, etc.

I think it's true that monorepos make refactoring easier. So that's not the problem. It's also true that they have atomic commits across projects. But the two facts have nothing to do with each other. The reasons monorepos make refactoring simpler all boil down to everyone in the organization having a shared view of what the current state is:

A monorepo will, in practice, mean trunk-based development. You'll know that everybody really is on HEAD rather than actually doing their development on some year-old branch.
And conversely, you'll know that every user of the library is using your library from HEAD rather than pinning it to some year-old version.
It's trivial to find all the current callers, so that you know which clients need to be updated. (Once you've solved the highly non-trivial problem of having any kind of monorepo tooling at scale, of course.)

In theory you could do the exact same thing with multirepos assuming sufficient tool support, discipline about code organization, enforced trunk-based development in all repositories, a master list of all repositories in the org, and defaulting to all repositories being readable by every engineer with no hidden silos. That's all technically doable, but I suspect not culturally compatible with using multirepos in the first place.

Where does this misconception come from? It's certainly present in the Google monorepo paper, which somewhat contradicts itself on this. On one hand, they describe exactly this form of atomic refactoring as a benefit of monorepos:

The ability to make atomic changes is also a very powerful feature of the monolithic model. A developer can make a major change touching hundreds or thousands of files across the repository in a single consistent operation. For instance, a developer can rename a class or function in a single commit and yet not break any builds or tests.

But when it comes to the actual refactoring workflow, the process that's described is quite different:

A team of Google developers will occasionally undertake a set of wide-reaching code-cleanup changes to further maintain the health of the codebase. The developers who perform these changes commonly separate them into two phases. With this approach, a large backward-compatible change is made first. Once it is complete, a second smaller change can be made to remove the original pattern that is no longer referenced.

I suspect what happened here was that the atomic commits were identified as a benefit in the abstract, with refactoring being used as an illustration of a use case. This was then quite understandably read as a practical example of how you'd work with a monorepo.

There might be a few cases where atomic commits across the whole repository are the right solution, but it has to be exceedingly rare. The example of renaming a function with thousands of callers, for example, is probably better handled by just temporarily aliasing the function, or by temporarily defining the new function in terms of the old. (But this does suggest that languages, both programming languages and IDLs, should make aliasing and indirection easy for as many constructs as possible).
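A small sketch of the aliasing approach in C++, with invented names (`count_chars`, `text_length`, `LegacyBuffer`, and `TextBuffer` are all hypothetical): the new name is introduced as a thin wrapper or alias, callers move over gradually, and the old name is removed last.

```cpp
#include <cstddef>
#include <string>

// Hypothetical example: renaming count_chars() to text_length().

// The old function stays untouched while callers migrate.
std::size_t count_chars(const std::string& text) {
    return text.size();
}

// The new name is temporarily defined in terms of the old one. Once every
// caller uses text_length(), the body moves here and count_chars() goes away.
inline std::size_t text_length(const std::string& text) {
    return count_chars(text);
}

// For types, an alias declaration plays the same role during a rename.
struct LegacyBuffer {
    std::string data;
};
using TextBuffer = LegacyBuffer;
```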

Are there organizations with a large monorepo where atomic cross-project commits are routinely used to change both the implementation and the clients?

Writing a procedural puzzle generator

jsnell@iki.fi — Tue, 14 May 2019 15:00:00 GMT

This blog post describes the level generator for my puzzle game Linjat. The post is standalone, but might be a bit easier to digest if you play through a few levels. The source code is available; anything discussed below is in src/main.cc.

A rough outline of this post:

Linjat is a logic game of covering all the numbers and dots on a grid with lines.
The puzzles are procedurally generated by a combination of a solver, a generator, and an optimizer.
The solver tries to solve puzzles the way a human would, and assign a score for how interesting a given puzzle is.
The puzzle generator is designed such that it's easy to change one part of the puzzle (the numbers) and have other parts of the puzzle (the dots) get re-organized such that the puzzle remains solvable.
A puzzle optimizer repeatedly solves levels and generates new variations from the most interesting ones that have been found so far.

The rules

To understand how the level generator works, you unfortunately must first know the rules of the game. Luckily the rules are very simple. The puzzle consists of a grid containing empty squares, numbers, and dots. Like this:

The goal is to draw a vertical or horizontal line through each of the numbers, with three constraints:

The line going through a number must be of the same length as the number.
The lines can't cross.
All the dots need to be covered by a line.

Like this:

Whee! The game is all designed, the UI is implemented, now all I need are a few hundred good puzzles, and we're good to go. And for a game like this, there's really no point in trying to make those puzzles by hand. That's a job for a computer.

Requirements

What makes for a good puzzle in this game? I tend to think of puzzle games as coming in two categories. There's the ones where you're exploring a complicated state space from the start to the end (something like Sokoban or Rush Hour), and where it might not even be obvious exactly what states exist in the game. Then there are ones where all the states are known at the start, and you're slowly whittling the state space down by process of elimination (e.g. Sudoku or Picross). This game is clearly in the latter category.

Now, players have very different expectations for these two different kinds of puzzles. For this latter kind there's a very strong expectation that the puzzle is solvable just with deduction, and that there should never be a need for backtracking / guessing / trial and error. [0] [1]

It's not enough to know if a puzzle can be solved with just logic. In addition to that we need to have some idea of how good the produced puzzles are. Otherwise most of the levels might be just trivial dross. In an ideal world this could also be used for building a smooth progression curve, where the levels get progressively harder as the player progresses through the game.

The solver

The first step to meeting the above requirements is a solver for the game that's optimized for this purpose. A backtracking brute-force solver will be fast and accurate at telling whether the puzzle is solvable, and could also be changed to determine whether the solution is unique. But it can't give any idea of how challenging the puzzle actually is, since that's not how a human would solve these puzzles. The solver needs to imitate humans.

How does a human solve this puzzle? There's a couple of obvious moves, which the tutorial teaches:

If a dot can only be reached from one number, the line from that number should be extended to cover the dot. Here the dot can only be reached from the three, not the four:

Leading to:
If the line doesn't fit in one orientation, it must be placed in the other orientation instead. In the above example the 4 can no longer be placed vertically, so we know it has to be horizontal. Like this:
If a line of size X is known to be in a certain orientation and there isn't enough space to fit a line of X spaces on both sides, some of the squares in the middle must be covered. For example if in the above example the "4" had been a "3" instead, we wouldn't know whether it extended all the way to the right or to the left of the board. But we would know it must cover the two middle squares:
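The third rule boils down to intersecting the leftmost and rightmost possible placements of the line. Here's a minimal sketch of that computation in one dimension. The function name and the cell indexing are made up for illustration and are not taken from src/main.cc.

```cpp
#include <algorithm>
#include <utility>

// Cells are indexed along the line's known orientation. `lo` and `hi` are the
// first and last cells (inclusive) the line could occupy, `pos` is the cell
// holding the number, and `length` is the number itself.
// Returns the inclusive range of cells covered by every possible placement,
// or {1, 0} (an empty range) if the line doesn't fit at all.
std::pair<int, int> forced_cells(int lo, int hi, int pos, int length) {
    int min_start = std::max(lo, pos - length + 1);
    int max_start = std::min(pos, hi - length + 1);
    if (min_start > max_start) return {1, 0};
    // The leftmost placement covers [min_start, min_start + length - 1] and
    // the rightmost covers [max_start, max_start + length - 1]. Their overlap
    // is what must be covered no matter what.
    return {max_start, min_start + length - 1};
}
```

For a hypothetical row of four free squares (cells 0 to 3) with a 3 on cell 1, `forced_cells(0, 3, 1, 3)` gives the range 1 to 2, i.e. the two middle squares. With a 4 instead, the whole row is forced.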

This kind of thinking is the meat and potatoes of the game. You figure a way to extend one line a little bit, make that move, and then inspect the board again since that hopefully gave you the information to make a new deduction elsewhere. Writing a solver that follows these rules would be enough to determine if a human could solve the puzzle without backtracking.

It doesn't really say anything about how hard or interesting the level is though. In addition to the solvability, we need to somehow quantify the difficulty.

The obvious first idea for a scoring function is that a puzzle that takes more moves to finish is the harder one. That's probably a good metric in other games, but in this one the number of valid moves that the player has at any one time is probably more important. If there are 10 possible deductions a player could make, they'll find one of those very quickly. If there's only one valid move, it'll take longer.

So as a first approximation you want the solution tree to be deep and narrow: there's a long dependency chain of moves from start to finish, and at any one time there are only a few ways of moving forward on the chain. [2]

How do you figure out the width and depth of the tree? Just solving the puzzle once and evaluating the produced tree doesn't give a precise answer. The exact order in which you make the moves will end up affecting the shape of the tree. You'd need to look at all the possible solutions, and do something like optimize for the best worst-case. Now, I'm no stranger to brute-forcing puzzle game search graphs, but for this project I wanted a single-pass solver rather than any kind of exhaustive search. Due to the optimization phase, the goal was for the solver runtime to be measured in microseconds rather than seconds.

I decided not to do that. Instead my solver doesn't actually make one move at a time, but solves the puzzle by layers: given a state, find all valid moves that could be made. Then apply all of those moves at once. Finally start over from the new state. The number of layers and the maximum number of moves ever found in a single layer are then used as proxies for the depth and the width of the search tree as a whole.
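The layering idea can be sketched independently of the game. In the toy model below a puzzle is reduced to a list of moves, each of which becomes available once its prerequisite moves have been made. That dependency model is an assumption for illustration only; the real solver derives the moves from the board.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Abstract stand-in for a puzzle: move i becomes available once all moves in
// prerequisites[i] have been made. This model is hypothetical.
struct LayerStats {
    int depth = 0;      // number of layers
    int max_width = 0;  // most moves found in any single layer
    bool solved = false;
};

LayerStats solve_by_layers(const std::vector<std::vector<int>>& prerequisites) {
    std::vector<bool> done(prerequisites.size(), false);
    std::size_t remaining = prerequisites.size();
    LayerStats stats;
    while (remaining > 0) {
        // Find every move that is valid in the current state.
        std::vector<int> layer;
        for (std::size_t i = 0; i < prerequisites.size(); ++i) {
            if (done[i]) continue;
            bool ready = std::all_of(prerequisites[i].begin(), prerequisites[i].end(),
                                     [&](int p) -> bool { return done[p]; });
            if (ready) layer.push_back(static_cast<int>(i));
        }
        if (layer.empty()) return stats;  // stuck: no deduction applies
        // Apply all of them at once, then start over from the new state.
        for (int move : layer) done[move] = true;
        remaining -= layer.size();
        stats.depth += 1;
        stats.max_width = std::max(stats.max_width, static_cast<int>(layer.size()));
    }
    stats.solved = true;
    return stats;
}
```

A chain of three dependent moves comes out as depth 3 and width 1, while three independent moves come out as depth 1 and width 3.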

Here's what the solution for one of the harder puzzles looks like with this model. Dotted lines are the lines that were extended on that solver layer, solid ones didn't change. Green lines are of the right length, red are not yet complete.

The next problem is that not all moves a player makes are created equal. What was listed at the start of this section is really just common sense. Here's an example of a more complicated deduction rule, which would require some more thought to find. Consider a board like:

The dots at C and D can only be covered by the 5 and the middle 4 (and neither piece can cover both of them at the same time). This means that the middle 4 needs to cover one of the two, and thus can't be used to cover A. Instead A has to be covered with the lower left 4.

It'd clearly be silly to treat this chain of deductions the same as a one-step "this dot can only be reached from that number". Can these more complex rules just be weighted more heavily in the scoring function? Unfortunately not with the layer-based solver, since it's not guaranteed to find the lowest-cost solution. This isn't just a theoretical concern: in practice it's pretty common for a part of the board to be solvable in either a single complex deduction or a chain of several much simpler moves. The layer-based solver basically finds the shortest path, not the cheapest one, and that can't just be fixed in the scoring function.

The method I ended up using was to change the solver such that each layer consists of only one kind of deduction. The algorithm goes through the deduction rules in a rough order of difficulty. If a rule finds any moves, they're applied and the iteration is over, and the next iteration starts the list over from the beginning.

The solution is then scored by assigning each layer a cost based on the single rule used for it. This is still not guaranteed to find the cheapest solution, but with a good selection of weights it'll at least not find an expensive solution if a cheap solution exists.
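A sketch of that loop, again on a toy model rather than the real board: each deduction is tagged with the rule that finds it (0 being the easiest) and with the deductions that must be applied before it becomes available. The model, the names, and the weights are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Abstract stand-in for a deduction. Hypothetical, for illustration only.
struct Deduction {
    int rule;                        // index of the rule that finds it
    std::vector<int> prerequisites;  // deductions that must be applied first
};

// Returns the total cost of the solution found, or -1 if the solver gets stuck.
int score_by_rule_layers(const std::vector<Deduction>& deductions,
                         const std::vector<int>& rule_cost) {
    std::vector<bool> done(deductions.size(), false);
    std::size_t remaining = deductions.size();
    int total = 0;
    while (remaining > 0) {
        bool progressed = false;
        // Try the rules in rough order of difficulty.
        for (std::size_t rule = 0; rule < rule_cost.size() && !progressed; ++rule) {
            std::vector<std::size_t> layer;
            for (std::size_t i = 0; i < deductions.size(); ++i) {
                const Deduction& d = deductions[i];
                if (done[i] || d.rule != static_cast<int>(rule)) continue;
                bool ready = std::all_of(d.prerequisites.begin(), d.prerequisites.end(),
                                         [&](int p) -> bool { return done[p]; });
                if (ready) layer.push_back(i);
            }
            if (layer.empty()) continue;
            // One layer uses one rule: apply its moves, pay its cost, and
            // restart from the easiest rule on the next iteration.
            for (std::size_t i : layer) done[i] = true;
            remaining -= layer.size();
            total += rule_cost[rule];
            progressed = true;
        }
        if (!progressed) return -1;
    }
    return total;
}
```

Note that a hard deduction that is available from the very start still isn't used until the easy rules have run dry, which is the behavior the scoring relies on.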

It also seems to map out pretty well to how humans solve the puzzle. You look for the gimmes first, and only start thinking hard once there are no easy moves.

The Generator

The previous section took care of figuring out if a level is any good or not. But that alone isn't enough, you also need to somehow generate levels for the solver to score. It's quite unlikely that a randomly generated level would be solvable, let alone interesting.

The key idea (which is by no means novel) is to interleave the solver and the generator. Let's start with a puzzle that's probably unsolvable, consisting just of numbers 2-5 placed in random locations on the grid:

The solver runs until it can't make any more progress:

The generator then adds some information to the puzzle, in the form of a dot, and continues solving.

In this case that one added dot is not enough to allow the solver to make any progress. So the generator will keep on adding more dots until the solver is happy:

And then the solver resumes normal operation:

This process continues either until the puzzle is solved or there is no more information to add (i.e. every space that's reachable from a number is covered by a dot).

This method works only if the new information that's being added can't invalidate any of the previously made deductions. That would be tough to do when adding numbers to the grid [3]. But adding new dots to the board has that property, at least given the deduction rules I'm using in this program.

Where should the algorithm add the dots? What I ended up doing was to add them in the empty space that could have been covered by the most lines in the starting state, so each dot tends to give as little information as possible. There is no attempt to add it specifically to a location where it'll be useful in advancing the puzzle at the point where the solver got stuck. This produces a pretty neat effect where most of the dots will be totally useless at the start of the puzzle, which makes the puzzle seem harder than it is. There are all these apparent moves you could make, but somehow none of them quite work out. The puzzle generator ends up being a bit of a jerk.
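One way the dot placement heuristic could be written is sketched below. It assumes a particular reading of "could have been covered": a line from a number N can reach up to N - 1 squares in each of the four directions and is stopped by other numbers. The grid encoding and the function name are invented for this example.

```cpp
#include <utility>
#include <vector>

// grid[r][c] == 0 is an empty square, a positive value is a number, and -1 is
// a square that already has a dot. This encoding is hypothetical.
using Grid = std::vector<std::vector<int>>;

// Returns the (row, column) of the empty square reachable from the most
// numbers, or {-1, -1} if no empty square is reachable at all.
std::pair<int, int> pick_dot_location(const Grid& grid) {
    const int rows = static_cast<int>(grid.size());
    const int cols = rows > 0 ? static_cast<int>(grid[0].size()) : 0;
    std::vector<std::vector<int>> reach(rows, std::vector<int>(cols, 0));
    const int dr[] = {-1, 1, 0, 0};
    const int dc[] = {0, 0, -1, 1};
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            const int n = grid[r][c];
            for (int dir = 0; dir < 4 && n > 0; ++dir) {
                for (int step = 1; step < n; ++step) {
                    const int rr = r + dr[dir] * step;
                    const int cc = c + dc[dir] * step;
                    if (rr < 0 || rr >= rows || cc < 0 || cc >= cols) break;
                    if (grid[rr][cc] > 0) break;  // blocked by another number
                    if (grid[rr][cc] == 0) ++reach[rr][cc];
                }
            }
        }
    }
    std::pair<int, int> best{-1, -1};
    int best_count = 0;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            if (reach[r][c] > best_count) {
                best_count = reach[r][c];
                best = {r, c};
            }
        }
    }
    return best;
}
```

Ties are broken by scan order here, which is an arbitrary choice.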

This process will not always produce a solution, but it's pretty fast (on the order of 50-100 microseconds) so it can just be repeated a bunch of times until it generates a level. Unfortunately it'll generally produce a mediocre puzzle. There are too many obvious moves right at the start, the board gets filled in very quickly and the solution tree is quite shallow.

The optimizer

The above process produced a mediocre puzzle. In the final stage, we use that level as a seed for an optimization process. The process works as follows.

The optimizer sets up a pool of up to 10 puzzle variants. The pool is initialized with the newly generated random puzzle. On each iteration, the optimizer selects one puzzle from the pool and mutates it.

The mutation removes all the dots, and then changes the numbers a bit (e.g. reduce/increase the value of a randomly selected number, or move a number to a different location on the grid). Multiple mutations might be applied to the board in one go. We then run the solver in the special level-generation mode described in the previous section. This adds enough dots to the puzzle to make it solvable again.

After that, we run the solver again, this time in the normal mode. During this run, the solver keeps track of a) the depth of the solution tree, b) how often each of the various kinds of rules was needed, c) how wide the solution tree was at times. The puzzle is scored based on the above criteria. The scoring function will basically prefer deep and narrow solutions, and at higher difficulty levels also rewards puzzles that require use of one or more of the advanced deduction rules.

The new puzzle is then added to the pool. If the pool ever contains more than 10 puzzles, the worst one is discarded.
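The pool described above can be sketched generically. In this sketch `Puzzle`, `mutate`, and `score` are placeholders supplied by the caller, not the game's actual types or functions, and the mutate step is assumed to include regenerating the dots.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Generic pool of hill-climbers. All names are hypothetical.
template <typename Puzzle, typename Mutate, typename Score>
Puzzle optimize(const Puzzle& seed, int iterations, std::size_t pool_size,
                Mutate mutate, Score score, std::mt19937& rng) {
    struct Entry {
        Puzzle puzzle;
        double value;
    };
    auto worse = [](const Entry& a, const Entry& b) { return a.value < b.value; };

    // The pool is initialized with the newly generated puzzle.
    std::vector<Entry> pool;
    pool.push_back(Entry{seed, static_cast<double>(score(seed))});
    for (int i = 0; i < iterations; ++i) {
        // Select one puzzle from the pool and mutate it.
        std::uniform_int_distribution<std::size_t> pick(0, pool.size() - 1);
        Puzzle candidate = mutate(pool[pick(rng)].puzzle, rng);
        pool.push_back(Entry{candidate, static_cast<double>(score(candidate))});
        // If the pool is over capacity, discard the worst entry.
        if (pool.size() > pool_size) {
            pool.erase(std::min_element(pool.begin(), pool.end(), worse));
        }
    }
    // Return the highest-scoring puzzle found.
    return std::max_element(pool.begin(), pool.end(), worse)->puzzle;
}
```

Since only the worst entry is ever discarded, the best score in the pool can never decrease from one iteration to the next.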

This process is repeated a number of times (anything from 10k to 50k iterations seemed to be fine). After that, the version of the puzzle with the highest score is saved into the puzzle's level database. This is what the progress of the best puzzle looks like through one optimization run:

I tried a few other ways of structuring the optimization as well. One version used simulated annealing, the others were genetic algorithms with different crossover operations. None of these performed as well as the naive pool of hill-climbers.

Unique single solution

There's an interesting complication that arises when the puzzle has a single unique solution. Is it valid for the player to assume that's the case, and make deductions based on that? Is it fair for the puzzle generator to assume that the player will do so?

In a post on HN, I mentioned four options for how to deal with this:

1. State the "only a single solution" up front, and make the puzzle generator generate levels that require this form of deduction. This sucks, since it'll make the rules far more complicated to understand. And it's also exactly the kind of detail people would forget.
2. Don't guarantee a single solution: have potentially multiple solutions, and accept any of them. This doesn't really solve the problem, it just moves it around.
3. Punt, and just assume this is a very rare event that won't matter in practice. (This was the original implementation.)
4. Change the puzzle generator such that it doesn't generate puzzles where knowing that the solution is unique helps. (Probably the right thing to do, but also extra work.)

I originally went with the third option, and that was a horrible mistake. It turns out that I'd only considered one way in which the uniqueness of the solution leaks information, and that's indeed pretty rare. But there are others, and one was present in basically every level I'd generated, and often kind of trivialized the solution. So in May 2019 I updated the Hard and Expert mode levels to go with the last option instead.

The most annoying case is the 2 with the dotted line in the following board:

Why could a sneaky player make that deduction? The 2 can cover any of the 4 adjacent squares. None of them have any dots, so they don't necessarily need to be covered by anything. And the square that's downwards doesn't have any overlap with other pieces. If there's a single solution, it has to be the case that other pieces cover the other three squares, and the 2 covers the downwards square.

The solution is to add some dots when these cases are detected, like this:

Another common case was the dotted 2 on this board:

Nothing distinguishes the squares to the left and up of the 2. Neither has a dot, and neither is reachable from any other number. Any solution where the 2 covers the upward square would have a matching solution where it covers the leftward square instead, and vice versa. If there's a single unique solution, it can't be either and thus the 2 must cover the downward square instead.

This kind of case I just solved by the "if it hurts, just don't do it" method. I.e. having the solver use this rule very early on in the priority list, and assigning these moves a large negative weight. Puzzles with this kind of property will mostly end up discarded by the optimizer, and the few that make it through will be discarded when doing the final level selection for the published game.

This is not an exhaustive list, I found a lot of other unique-solution rules when adversarially play-testing. But most of them felt like they were rare and difficult enough to find that they're not really shortcuts. If somebody solves a puzzle using that kind of deduction, I'm not going to begrudge them that.

Conclusion

The game was originally designed as an experiment for procedural puzzle generation. The game design and the generator go hand in hand, so the exact techniques won't be directly applicable to existing games.

The part I can't answer is whether putting this much effort into the procedural generation was worth it. The feedback from players has been pretty inconsistent when it comes to the level design. A common theme for positive comments has been about how the puzzles always feel like there's a clever gotcha in there. The most common negative complaint has been that there's not enough of a difficulty gradient in the game.

I have a couple of other puzzle games in an embryonic stage, and felt good enough about this generator that I'd probably at least try similar procedural generation methods for those too. One thing I'd definitely do differently the next time around is to do adversarial playtesting from the start.

Footnotes

[0] Or at least that's what I believed. But when I observed a bunch of players in person, about half of them just made guesses and then iterated on those guesses. Oh, well.

[1] Anyone reading this should also read Solving Minesweeper and making it better by Magnus Hoff, which has a fascinating twist on the perceived need for puzzle games with hidden information to be guaranteed solvable.

[2] Just to be clear, this depth / narrowness of the tree is a metric that I thought was meaningful to this puzzle, not something that's going to be applicable to all or even most puzzles. For example there's a good argument to be made that a Rush Hour puzzle is interesting if there are multiple paths to the solution of almost but not quite the same length. But that's because Rush Hour is a game of finding the shortest solution, not just some solution.

[3] With the exception of adding 1s. The first version of the puzzle didn't have the dots, and the plan was to have the generator add 1s when it needed to add more information. But that felt a little too constrained.

Optimizing a breadth-first search

jsnell@iki.fi — Mon, 23 Jul 2018 16:00:00 GMT

A couple of months ago I finally had to admit I wasn't smart enough to solve a few of the levels in Snakebird, a puzzle game. The only way to salvage some pride was to write a solver, and pretend that writing a program to do the solving is basically as good as having solved the problem myself. The C++ code for the resulting program is on Github. Most of what's discussed in the post is implemented in search.h and compress.h. This post deals mainly with optimizing a breadth-first search that's estimated to use 50-100GB of memory to run on a memory budget of 4GB.

There will be a follow up post that deals with the specifics of the game. For this post, all you need to know is that I could not see any good alternatives to the brute force approach, since none of the usual tricks worked. There are a lot of states since there are multiple movable or pushable objects, and the shape of some of them matters and changes during the game. There were no viable conservative heuristics for algorithms like A* to narrow down the search space. The search graph was directed and implicit, so searching both forward and backward simultaneously was not possible. And a single move could cause the state to change in a lot of unrelated ways, so nothing like Zobrist hashing was going to be viable.

A back of the envelope calculation suggested that the biggest puzzle was going to have on the order of 10 billion states after eliminating all symmetries. Even after packing the state representation as tightly as possible, the state size was on the order of 8-10 bytes depending on the puzzle. 100GB of memory would be trivial at work, but this was my home machine with 16GB of RAM. And since Chrome needs 12GB of that, my actual memory budget was more like 4GB. Anything in excess of that would have to go to disk (the spinning rust kind).

How do we fit 100GB of data into 4GB of RAM? Either a) the states would need to be compressed to 1/20th of their original already optimized size, b) the algorithm would need to be able to efficiently page state to disk and back, c) a combination of the above, or d) I should buy more RAM or rent a big VM for a few days. Option D was out of the question due to being boring. Options A and C seemed out of the question after a proof of concept with gzip: a 50MB blob of states compressed to about 35MB. That's about 7 bytes per state, while my budget was more like 0.4 bytes per state. So option B it was, even though a breadth-first search looks pretty hostile to secondary storage.

This is a somewhat long post, so here's a brief overview of the sections ahead:

A textbook BFS - What's the normal formulation of breadth-first search like, and why is it not suitable for storing parts of the state on disk?
A sort + merge BFS - Changing the algorithm to efficiently do deduplications in batches.
Compression - Reducing the memory use by 100x with a combination of off-the-shelf and custom compression.
Oh no, I've cheated! - The first few sections glossed over something; it's not enough to know there is a solution, we need to know what the solution is. In this section the basic algorithm is updated to carry around enough data to reconstruct a solution from the final state.
Sort + merge with multiple outputs - Keeping more state totally negates the compression gains. The sort + merge algorithm needs to be updated to keep two outputs: one that compresses well used during the search, and another that's just used to reconstruct the solution after one is found.
Swapping - Swapping on Linux sucks even more than I thought.
Compressing new states before merging - So far the memory optimizations have just been concerned with the visited set. But it turns out that the list of newly generated states is much larger than one might think. This section shows a scheme for representing the new states more efficiently.
Saving space on the parent states - Investigate some CPU/memory tradeoffs for reconstructing the solution at the end.
What didn't or might not work - Some things that looked promising but I ended up reverting, and others that research suggested would work but my intuition said wouldn't for this case.

A textbook BFS

So what does a breadth-first search look like, and why would it be disk-unfriendly? Before this little project I'd only ever seen variants of the textbook formulation, something like this:

def bfs(graph, start, end):
    visited = {start}
    todo = [start]
    while todo:
        node = todo.pop_first()
        if node == end:
            return True
        for kid in adjacent(node):
            if kid not in visited:
                visited.add(kid)
                todo.push_back(kid)
    return False

As the program produces new candidate nodes, each node is checked against a hash table of already visited nodes. If it's already present in the hash table, we ignore the node. Otherwise it's added both to the queue and the hash table. Sometimes the 'visited' information is carried in the nodes rather than in a side-table; but that's a dodgy optimization to start with, and totally impossible when the graph is implicit rather than explicit.
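For reference, here is a minimal runnable version of the textbook formulation in plain Python. It is only a sketch: the graph is passed as a neighbour function (matching the implicit-graph setting), and toy_adjacent is a made-up graph used purely for illustration.

```python
from collections import deque

def bfs(adjacent, start, end):
    """Textbook BFS over an implicit graph; adjacent(node) returns the neighbours."""
    visited = {start}        # hash set of every state seen so far
    todo = deque([start])    # FIFO queue of states still to expand
    while todo:
        node = todo.popleft()
        if node == end:
            return True
        for kid in adjacent(node):
            if kid not in visited:   # one random-access lookup per generated state
                visited.add(kid)
                todo.append(kid)
    return False

# Hypothetical toy graph: from n you can move to n + 1 or 2 * n, capped at 100.
def toy_adjacent(n):
    return [m for m in (n + 1, 2 * n) if m <= 100]
```

The lookup marked in the inner loop is the operation that the rest of the post works to eliminate.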

Why is a hash table problematic? Because hash tables will tend to have a totally random memory access pattern. If they don't, it's a bad hash function and the hash table will probably perform terribly due to collisions. This random access pattern can cause performance issues even when the data fits in memory: an access to a huge hash table is pretty likely to cause both a cache and TLB miss. But if a significant chunk of the data is actually on disk rather than in memory? It'd be disastrous: something on the order of 10ms per lookup.

With 10G unique states we'd be looking at about four months of waiting for disk IO just for the hash table accesses. That can't work; the problem absolutely needs to be transformed such that the program can process big batches of data in one go.

A sort + merge BFS

If we wanted to batch the data access as much as possible, what would be the maximum achievable coarseness? Since the program can't know which nodes to process on depth layer N+1 before layer N has been fully processed, it seems obvious that we have to do our deduplication of states at least once per depth.

Dealing with a whole layer at one time allows ditching hash tables, and representing the visited set and the new states as sorted streams of some sort (e.g. file streams, arrays, lists). We can trivially find the new visited set with a set union on the streams, and equally trivially find the todo set with a set difference.

The two set operations can be combined to work on a single pass through both streams. Basically peek into both streams, process the smaller element, and then advance the stream that the element came from (or both streams if the elements at the head were equal). In either case, add the element to the new visited set. When advancing just the stream of new states, also add the element to the new todo set:

def bfs(graph, start, end):
    visited = Stream()
    todo = Stream()
    visited.add(start)
    todo.add(start)
    while todo:
        new = []
        for node in todo:
            if node == end:
                return True
            for kid in adjacent(node):
                new.push_back(kid)
        new_stream = Stream()
        for node in new.sorted().uniq():
            new_stream.add(node)
        todo, visited = merge_sorted_streams(new_stream, visited)
    return False

# Merges sorted streams new and visited. Return a sorted stream of
# elements that were just present in new, and another sorted
# stream containing the elements that were present in either or
# both of new and visited.
def merge_sorted_streams(new, visited):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            if visited.peek() == new.peek():
                out_visited.add(visited.pop())
                new.pop()
            elif visited.peek() < new.peek():
                out_visited.add(visited.pop())
            elif visited.peek() > new.peek():
                out_todo.add(new.peek())
                out_visited.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek())
            out_visited.add(new.pop())
    return out_todo, out_visited

The data access pattern is now perfectly linear and predictable, there are no random accesses at all during the merge. Disk latency thus becomes irrelevant, and the only thing that matters is throughput.
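The same idea can be sketched in runnable form with sorted Python lists standing in for the streams. This is an illustrative in-memory version, not the code from the actual solver; merge_sorted and toy_adjacent are names made up for this example.

```python
from itertools import groupby

def merge_sorted(new, visited):
    """One pass over two sorted, duplicate-free lists.
    Returns (todo, visited2): the states only in new, and the union of both."""
    out_todo, out_visited = [], []
    i = j = 0
    while i < len(new) or j < len(visited):
        if j == len(visited) or (i < len(new) and new[i] < visited[j]):
            out_todo.append(new[i])        # never seen before
            out_visited.append(new[i])
            i += 1
        elif i == len(new) or visited[j] < new[i]:
            out_visited.append(visited[j])
            j += 1
        else:                              # equal heads: a duplicate
            out_visited.append(visited[j])
            i += 1
            j += 1
    return out_todo, out_visited

def bfs(adjacent, start, end):
    visited, todo = [start], [start]
    while todo:
        new = []
        for node in todo:
            if node == end:
                return True
            new.extend(adjacent(node))
        new.sort()
        uniq = [k for k, _ in groupby(new)]   # deduplicate within the layer
        todo, visited = merge_sorted(uniq, visited)
    return False

# Hypothetical toy graph: from n you can move to n + 1 or 2 * n, capped at 100.
def toy_adjacent(n):
    return [m for m in (n + 1, 2 * n) if m <= 100]
```

Both inputs are only ever read front to back, which is what makes the access pattern friendly to disk-backed streams.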

What does the theoretical performance look like with the simplified data distribution of 100 depth levels and 100M states per depth? The average state will be both read and written 50 times. That's 10 bytes/state * 5G states * 50 = 2.5TB. My hard drive can supposedly read and write at a sustained 100MB/s, which would mean (2 * 2.5TB) / (100MB/s) =~ 50k seconds =~ 13 hours spent on the IO. That's a couple of orders of magnitude better than the earlier four month estimate!
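The estimate is easy to reproduce. The snippet below just replays the figures quoted in the paragraph above (10 bytes per state, 5G states, 50 passes, 100MB/s); the variable names are made up for illustration.

```python
BYTES_PER_STATE = 10
STATES = 5_000_000_000           # 5G states, as quoted above
PASSES = 50                      # average number of times a state is read and written
DISK_BYTES_PER_SEC = 100_000_000 # 100MB/s sustained

one_way_bytes = BYTES_PER_STATE * STATES * PASSES         # 2.5TB read (and the same written)
io_seconds = 2 * one_way_bytes // DISK_BYTES_PER_SEC      # reads plus writes
io_hours = io_seconds / 3600

print(one_way_bytes, io_seconds, round(io_hours, 1))
```

That works out to 50,000 seconds, or a little under 14 hours, in line with the rough figure above.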

It's worth noting that this simplistic model is not considering the size of the newly generated states. Before the merge step, they need to be kept in-memory for the sorting + deduplication. We'll look closer at that in a later section.

Compression

In the introduction I mentioned that compressing the states didn't look very promising in the initial experiments, with a 30% compression ratio. But after the above algorithm change the states are now ordered. That should be a lot easier to compress.

To test this theory, I used zstd on a puzzle of 14.6M states, with each state being 8 bytes. After the sorting they compressed to an average of 1.4 bytes per state. That seems like a solid improvement. Not quite enough to run the whole program in memory, but it could plausibly cut the disk IO to just a couple of hours.

Is there any way to do better than a state of the art general purpose compression algorithm, if you know something about the structure of the data? Almost certainly. One good example is the PNG format. Technically the compression is just a standard Deflate pass. But rather than compress the raw image data, the image is first transformed using PNG filters. A PNG filter is basically a formula for predicting the value of a byte in the raw data from the value of the same byte on the previous row and/or the same byte of the previous pixel. For example the 'up' filter transforms each byte by subtracting the previous row's value from it during compression, and doing the inverse when decompressing. Given the kinds of images PNG is meant for, the result will probably mostly consist of zeroes or numbers close to zero. Deflate can compress these far better than the raw data.

Can we apply a similar idea to the state records of the BFS? Seems like it should be possible. Just like in PNGs, there's a fixed row size, and we'd expect adjacent rows to be very similar. The first tries with a subtraction/addition filter followed by zstd resulted in another 40% improvement in compression ratios: 0.87 bytes per state. The filtering operations are trivial, so this was basically free from a CPU consumption point of view.

It wasn't clear if one could do a lot better than that, or whether this was a practical limit. In image data there's a reasonable expectation of similarity between adjacent bytes of the same row. For the state data that's not true. But actually slightly more sophisticated filters could still improve on that number. The one I ended up using worked like this:

Let's assume we have adjacent rows R1 = [1, 2, 3, 4] and R2 = [1, 2, 6, 4]. When outputting R2, we compare each byte to the same byte on the previous row, with a 0 for match and 1 for mismatch: diff = [0, 0, 1, 0]. We then emit that bitmap encoded as a VarInt, followed by just the bytes that did not match the previous row. In this example, the two bytes '0b00000100 6'. This filter alone compressed the benchmark to 2.2 bytes / state. But combining this filter + zstd got it down to 0.42 bytes / state. Or to put it another way, that's 3.36 bits per state, which is just a little bit over what the back of the envelope calculation suggested was needed to fit in RAM.
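Here is a small runnable sketch of that filter. The bit order (bit i of the bitmap corresponds to byte i of the row) is inferred from the example above, and the LEB128-style VarInt encoding is an assumption; the function names are made up for illustration.

```python
def encode_varint(n):
    """LEB128-style VarInt: 7 bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_row(prev, row):
    """Emit the mismatch bitmap as a VarInt, then only the bytes that differ from prev."""
    bitmap = 0
    changed = bytearray()
    for i, (p, b) in enumerate(zip(prev, row)):
        if p != b:
            bitmap |= 1 << i
            changed.append(b)
    return encode_varint(bitmap) + bytes(changed)

def decode_row(prev, data, pos=0):
    """Inverse of encode_row. Returns (row, position just past the encoded row)."""
    bitmap = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        bitmap |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    row = bytearray(prev)            # start from the previous row
    for i in range(len(prev)):
        if (bitmap >> i) & 1:        # byte i was stored explicitly
            row[i] = data[pos]
            pos += 1
    return bytes(row), pos

r1 = bytes([1, 2, 3, 4])
r2 = bytes([1, 2, 6, 4])
encoded = encode_row(r1, r2)         # bytes([0b00000100, 6])
```

In the full scheme the concatenated output for all rows would then be handed to a general purpose compressor such as zstd.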

In practice the compression ratios improve as the sorted sets get more dense. Once the search gets to a point where memory starts becoming an issue, the compression ratios can get a lot better than that. The largest problem turned out to have 4.6G distinct visited states in the end. These states took 405MB when sorted and compressed with the above scheme. That's 0.7 bits per state. The compression and decompression end up taking about 25% of the program's CPU time, but that seems like a great tradeoff for cutting memory use to 1/100th.

The filter above does feel a bit wasteful due to the VarInt header on every row. It seems like it should be easy to improve on it with very little extra cost in CPU or complexity. I tried a bunch of other variants that transposed the data to a column-major order, or wrote the bitmasks in bigger blocks, etc. These variants invariably got much better compression ratio by themselves, but then didn't do as well when the output of the filter was compressed with zstd. It wasn't just due to some quirk of zstd either, the results were similar with gzip and bzip2. I don't have any great theories on why this particular encoding ended up compressing much better than the alternatives.

Another mystery is that the compression ratio ended up far better when the data was sorted little-endian rather than big-endian. I initially thought it was due to the little-endian sort ending up with more leading zeros on the VarInt-encoded bitmask. But this difference persisted even for filters that didn't have such dependencies.

(There's a lot of research on compressing sorted sets of integers, since they're a basic building block of search engines. I didn't find a lot on compressing sorted fixed-size records though, and didn't want to start jumping through the hoops of representing my data as arbitrary precision integers.)

Oh no, I've cheated!

You might have noticed that the above pseudocode implementations of BFS were only returning a boolean for solution found / not found. That's not very useful. For most purposes you need to be able to produce a list of the exact steps of the solution, not just state that a solution exists.

On the surface the solution is easy. Rather than collect sets of states, collect mappings from states to a parent state. Then after finding a solution, just trace back the list of parent states from the end to the start. For the hash table based solution, it'd be something like:

def bfs(graph, start, end):
    visited = {start: None}
    todo = [start]
    while todo:
        node = todo.pop_first()
        if node == end:
            return trace_solution(node, visited)
        for kid in adjacent(node):
            if kid not in visited:
                visited[kid] = node
                todo.push_back(kid)
    return None

def trace_solution(state, visited):
    if state is None:
        return []
    # Recurse on the parent of this state, then append the state itself.
    return trace_solution(visited[state], visited) + [state]

Unfortunately this has two problems. First, it will totally kill the compression gains from the last section; the core assumption was that adjacent rows would be very similar. That was true when we just looked at the states themselves. But there is no reason to believe that's going to be true for the parent states; they're effectively random data. Second, the sort + merge solution has to read and write back all seen states on each iteration. To maintain the state / parent state mapping, we'd also have to read and write all this badly compressing data to disk on each iteration.

Sort + merge with multiple outputs

The program only needs the state/parent mappings at the very end, when tracing back the solution. We can thus maintain two data structures in parallel. 'Visited' is still the set of visited states, and gets recomputed during the merge just like before. 'Parents' is a mostly sorted list of state/parent pairs, which doesn't get rewritten. Instead the new states + their parents get appended to 'parents' after each merge operation.

def bfs(graph, start, end):
    parents = Stream()
    visited = Stream()
    todo = Stream()
    parents.add((start, None))
    visited.add(start)
    todo.add(start)
    while todo:
        new = []
        for node in todo:
            if node == end:
                return trace_solution(node, parents)
            for kid in adjacent(node):
                # Record the parent along with each new state.
                new.push_back((kid, node))
        new_stream = Stream()
        # Sort and deduplicate by state (the first element of each pair).
        for node in new.sorted().uniq():
            new_stream.add(node)
        todo, visited = merge_sorted_streams(new_stream, visited, parents)
    return None

# Merges sorted streams new and visited. New contains pairs of
# key + value (just the keys are compared), visited contains just
# keys.
#
# Returns a sorted stream of keys that were just present in new,
# another sorted stream containing the keys that were present in either or
# both of new and visited. Also adds the keys + values to the parents
# stream for keys that were only present in new.
def merge_sorted_streams(new, visited, parents):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            visited_head = visited.peek()
            new_head = new.peek()[0]
            if visited_head == new_head:
                out_visited.add(visited.pop())
                new.pop()
            elif visited_head < new_head:
                out_visited.add(visited.pop())
            elif visited_head > new_head:
                out_todo.add(new_head)
                out_visited.add(new_head)
                parents.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek()[0])
            out_visited.add(new.peek()[0])
            parents.add(new.pop())
    return out_todo, out_visited

This gives us the best of both worlds from a runtime and working set perspective, but does mean using more secondary storage. A separate copy of the visited states grouped by depth turns out to also be useful later on for other reasons.
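A runnable sketch of the two-output variant, again with Python lists standing in for streams. The names merge_with_parents and toy_adjacent are invented for this example, and building a dict from the parents list at the very end is a shortcut that only makes sense at toy scale.

```python
def merge_with_parents(new, visited, parents):
    """new: sorted (state, parent) pairs, unique by state. visited: sorted states.
    Appends the pairs for previously unseen states to parents.
    Returns (todo, visited2)."""
    out_todo, out_visited = [], []
    i = j = 0
    while i < len(new) or j < len(visited):
        if j == len(visited) or (i < len(new) and new[i][0] < visited[j]):
            out_todo.append(new[i][0])
            out_visited.append(new[i][0])
            parents.append(new[i])     # append-only, never rewritten
            i += 1
        elif i == len(new) or visited[j] < new[i][0]:
            out_visited.append(visited[j])
            j += 1
        else:                          # duplicate state: keep the existing entry
            out_visited.append(visited[j])
            i += 1
            j += 1
    return out_todo, out_visited

def trace_solution(state, parent_of):
    path = []
    while state is not None:
        path.append(state)
        state = parent_of[state]
    return path[::-1]

def bfs(adjacent, start, end):
    parents = [(start, None)]
    visited, todo = [start], [start]
    while todo:
        new = []
        for node in todo:
            if node == end:
                return trace_solution(node, dict(parents))
            new.extend((kid, node) for kid in adjacent(node))
        new.sort()
        # Keep a single (state, parent) pair per state.
        uniq = [p for k, p in enumerate(new) if k == 0 or new[k - 1][0] != p[0]]
        todo, visited = merge_with_parents(uniq, visited, parents)
    return None

# Hypothetical toy graph: from n you can move to n + 1 or 2 * n, capped at 100.
def toy_adjacent(n):
    return [m for m in (n + 1, 2 * n) if m <= 100]

path = bfs(toy_adjacent, 1, 10)
```

Only the visited list is rewritten on every iteration; the parents list just grows by the new layer's pairs.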

Swapping

Another detail ignored in the snippets of pseudocode is that there is no explicit code for disk IO, just an abstract interface Stream. The Stream might be a file stream or an in-memory array, but we've been ignoring that implementation detail. Instead the pseudocode is concerned with having a memory access pattern that would be disk friendly. In a perfect world that'd be enough, and the virtual memory subsystem of the OS would take care of the rest.

At least with Linux that doesn't seem to be the case. At one point (before the working set had been shrunk to fit in memory) I'd gotten the program to run in about 11 hours when the data was stored mostly on disk. I then switched the program to use anonymous pages instead of file-backed ones, and set up sufficient swap on the same disk. After three days the program had gotten a quarter of the way through, and was still getting slower over time. My optimistic estimate was that it'd finish in 20 days.

Just to be clear, this was exactly the same code and exactly the same access pattern. The only thing that changed was whether the memory was backed by an explicit on-disk file or by swap. It's pretty much axiomatic that swapping tends to totally destroy performance on Linux, whereas normal file IO doesn't. I'd always assumed it was due to programs having the gall to treat RAM as something to be randomly accessed. But that wasn't the case here.

Turns out that file-backed and anonymous pages are not treated identically by the VM subsystem after all. They're kept in separate LRU caches with different expiration policies, and they also appear to have different readahead / prefetching properties.

So now I know: Linux swapping will probably not work well even under optimal circumstances. If parts of the address space are likely to be paged out for a while, it's better to arrange manually for them to be file-backed than to trust swap. I did it by implementing a custom vector class that started off as a purely in-memory implementation, and after a size threshold is exceeded switches to mmap on an unlinked temporary file.
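The general shape of such a container can be sketched in Python, as a stand-in for the C++ vector class described above. Everything here (the SpillBuffer name, the byte-oriented interface, the doubling growth policy) is an assumption made for illustration rather than a description of the real implementation.

```python
import mmap
import tempfile

class SpillBuffer:
    """Append-only byte buffer: in memory until `threshold` bytes, then file-backed."""

    def __init__(self, threshold=1 << 20):
        self.threshold = threshold
        self.size = 0
        self.mem = bytearray()   # used while the buffer is small
        self.file = None         # set once the buffer has spilled
        self.map = None

    def append(self, data):
        if self.map is None and self.size + len(data) > self.threshold:
            self._spill()
        if self.map is None:
            self.mem.extend(data)
        else:
            self._reserve(self.size + len(data))
            self.map[self.size:self.size + len(data)] = data
        self.size += len(data)

    def _spill(self):
        # On POSIX, TemporaryFile has no visible directory entry,
        # so the space is reclaimed as soon as the file is closed.
        self.file = tempfile.TemporaryFile()
        capacity = max(2 * self.threshold, mmap.PAGESIZE)
        self.file.truncate(capacity)
        self.map = mmap.mmap(self.file.fileno(), capacity)
        self.map[:self.size] = self.mem
        self.mem = bytearray()

    def _reserve(self, needed):
        capacity = len(self.map)
        if needed > capacity:
            while capacity < needed:
                capacity *= 2
            # Grow the file and map it again at the larger size.
            self.map.close()
            self.file.truncate(capacity)
            self.map = mmap.mmap(self.file.fileno(), capacity)

    def read(self, start, end):
        src = self.mem if self.map is None else self.map
        return bytes(src[start:end])
```

Callers see the same interface before and after the switch, so the choice between anonymous and file-backed memory stays an implementation detail of the container.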

Compressing new states before merging

In the simplified performance model the assumption was that there would be 100M new states per depth. That turned out not to be too far off reality (the most difficult puzzle peaked at about 150M unique new states from one depth layer). But it's also not the right thing to measure; the working set before the merge isn't related to just the unique states, but all the states that were output for this iteration. This measure peaks at 880M output states / depth. These 880M states a) need to be accessed with a random access pattern for the sorting, b) can't be compressed efficiently due to not being sorted, and c) need to be stored along with the parent state. That's a roughly 16GB working set.

The obvious solution would be to use some form of external sorting. Just write all the states to disk, do an external sort, do a deduplication, and then execute the merge just as before. This is the solution I went with first, but while it mostly solved problem A, it did nothing for B and C.

The alternative I ended up with was to collect the states into an in-memory array. If the array grows too large (e.g. more than 100M elements), it's sorted, deduplicated and compressed. This gives us a bunch of sorted runs of states, with no duplicates inside the run but potentially some between the runs. The code for merging the new and visited states is fundamentally the same; it's still based on walking through the streams in lockstep. The only change is that instead of walking through just the two streams, there's a separate stream for each of the sorted runs of new states.

The compression ratios for these 100M state runs are of course not quite as good as for compressing the set of all visited states. But even so, it cuts down both the working set and the disk IO requirements by a ton. There's a little bit of extra CPU from having to maintain a priority queue of streams, but it was still a great tradeoff.
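A minimal sketch of the sorted-runs idea, with the compression step left out and a tiny run limit so the behaviour is visible. RunCollector and RUN_LIMIT are made-up names; heapq.merge from the standard library plays the role of the priority queue of streams.

```python
import heapq
from itertools import groupby

RUN_LIMIT = 4  # tiny for illustration; the text suggests something like 100M

class RunCollector:
    def __init__(self):
        self.buffer = []   # unsorted states waiting to be turned into a run
        self.runs = []     # each run is sorted and duplicate-free

    def add(self, state):
        self.buffer.append(state)
        if len(self.buffer) >= RUN_LIMIT:
            self.flush()

    def flush(self):
        if self.buffer:
            self.buffer.sort()
            # A real implementation would also compress the run here.
            self.runs.append([k for k, _ in groupby(self.buffer)])
            self.buffer = []

    def merged(self):
        """Yield all collected states in sorted order, without duplicates."""
        self.flush()
        # heapq.merge keeps a heap holding one head element per run.
        return (k for k, _ in groupby(heapq.merge(*self.runs)))

collector = RunCollector()
for s in [5, 1, 5, 3, 3, 9, 1, 2, 7]:
    collector.add(s)
new_states = list(collector.merged())
```

The output of merged() can be fed into the same lockstep merge against the visited set as before.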

Saving space on the parent states

At this point the vast majority of the space used by this program is spent on storing the parent states, so that we can reconstruct the solution after finding it. They are unlikely to compress well, but is there maybe a CPU/memory tradeoff to be made?

What we need is a mapping from a state S' at depth D+1 to its parent state S at depth D. If we could iterate all possible parent states of S', we could simply check if any of them appear at depth D in our visited set. (We've already produced the visited set grouped by depth as a convenient byproduct when outputting the state/parent mappings from merge). Unfortunately that doesn't work for this problem; it's simply too hard to generate all the possible states S given S'. It'd probably work just fine for many other search problems though.

If we can only generate the state transitions forward, not backward, how about just doing that then? Let's iterate through all the states at depth D, and see what output states they have. If some state produces S' as an output, we've found a workable S. The issue with the plan is that it increases the total CPU usage of the program by 50%. (Not 100%, since on average we find S after looking at half the states of depth D).

So I don't like either of the extremes, but at least there is a CPU/memory tradeoff available there. Is there maybe a more palatable option somewhere in the middle? What I ended up doing was to not store the pair (S', S), but instead (S', H(S)), where H is an 8 bit hash function. To find an S given S', again iterate through all the states at depth D. But before doing anything else, compute the same hash. If the output doesn't match H(S), this isn't the state we're looking for, and we can just skip it. This optimization means doing the expensive re-computation for just 1/256 states, which is a negligible CPU increase, while cutting down the memory spent for storing the parent states from 8-10 bytes to 1 byte.
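The lookup can be sketched as follows. The choice of hash function is an assumption (any stable 8 bit hash would do; blake2b is used here only because it is in the standard library), and h8, find_parent and toy_adjacent are names invented for the example.

```python
import hashlib

def h8(state):
    """Stable 8-bit hash of a state."""
    return hashlib.blake2b(repr(state).encode(), digest_size=1).digest()[0]

def find_parent(child, parent_hash, states_at_depth, adjacent):
    """Recover some parent S of child S', given only H(S) and the states at depth D."""
    for candidate in states_at_depth:
        if h8(candidate) != parent_hash:
            continue                       # cheap rejection for most candidates
        if child in adjacent(candidate):   # expensive re-expansion
            return candidate
    return None

# Hypothetical toy graph: from n you can move to n + 1 or 2 * n, capped at 100.
def toy_adjacent(n):
    return [m for m in (n + 1, 2 * n) if m <= 100]

# 8 was first reached from 4, so only the pair (8, h8(4)) was stored.
parent = find_parent(8, h8(4), [3, 4], toy_adjacent)
```

A candidate that collides on the hash but does not actually produce S' is rejected by the second check, so collisions cost CPU time but never correctness.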

What didn't or might not work

The previous sections go through a sequence of high level optimizations that worked. There were other things that I tried that didn't work, or that I found in the literature but decided would not actually work in this particular case. Here's a non-exhaustive list.

At one point I was not recomputing the full visited set at every iteration. Instead it was kept as multiple sorted runs, and those runs were occasionally compacted. The benefit was fewer disk writes and less CPU spent on compression. The downside was more code complexity and a worse compression ratio. I originally thought this design made sense since in my setup writes were more expensive than reads. But in the end the compression ratio was worse by a factor of 2. The tradeoffs are non-obvious, but in the end I reverted back to the simpler form.

There is a little bit of research into executing huge breadth-first searches for implicit graphs on secondary storage; a 2008 survey paper is a good starting point. As one might guess, the idea of doing the deduplication in a batch with sort+merge, on secondary storage, isn't novel. The surprising part is that it was apparently only discovered in 1993. That's pretty late! There are then some later proposals for secondary storage breadth-first search that don't require a sorting step.

One of them was to map the states to integers, and to maintain an in-memory bitmap of the visited states. This is totally useless for my case, since the sizes of the encodable vs. actually reachable state spaces are so different. And I'm a bit doubtful about there being any interesting problems where this approach works.

The other viable sounding alternative is based on temporary hash tables. The visited states are stored unsorted in a file. Store the outputs from depth D in a hash table. Then iterate through the visited states, and look them up in the hash table. If the element is found in the hash table, remove it. After iterating through the whole file, only the non-duplicates remain. They can then be appended to the file, and used to initialize the todo list for the next iteration. If the number of outputs is so large that the hash table doesn't fit in memory, both the files and the hash tables can be partitioned using the same criteria (e.g. top bits of state), with each partition getting processed independently.
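For one partition, the scheme described above boils down to something like the following sketch. A Python list stands in for the on-disk file of visited states and a set for the temporary hash table; the function name is made up, and partitioning is left out.

```python
def dedup_with_hash_table(visited_file, outputs):
    """visited_file: unsorted list standing in for the file of visited states.
    outputs: states generated from depth D (may contain duplicates).
    Returns the todo list for the next iteration."""
    table = set(outputs)           # temporary in-memory hash table
    for state in visited_file:     # one sequential pass over the file
        table.discard(state)       # already visited: drop it from the table
    todo = list(table)             # only the non-duplicates remain
    visited_file.extend(todo)      # append them to the file
    return todo

visited = [4, 1, 2]
todo = dedup_with_hash_table(visited, [2, 3, 3, 5])
```

Note that the visited file is only ever appended to, so it stays unsorted.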

While there are benchmarks claiming the hash-based approach is roughly 30% faster than sort+merge, the benchmarks don't really seem to consider compression. I just don't see how giving up the compression gains could be worth it, so didn't experiment with these approaches at all.

The other relevant branch of research that seemed promising was database query optimization. The deduplication problem seems very much related to database joins, with exactly the same sort vs. hash dilemma. Obviously some of these findings should carry over to a search problem. The difference might be that the output of a database join is transient, while the outputs of a BFS deduplication persist for the rest of the computation. It feels like that changes the tradeoffs: it's not just about how to process one iteration most efficiently, it's also about having the outputs in the optimal format for the next iteration.

Conclusion

That concludes the things I learned from this project that seem generally applicable to other brute force search problems. These tricks combined to get the hardest puzzles of the game from an effective memory footprint of 50-100GB down to 500MB, while degrading gracefully if the problem exceeds available memory and spills to disk. The result is also 50% faster than a naive hash table based state deduplication even for puzzles that fit into memory.

The next post will deal with optimizing grid-based spatial puzzle games in general, as well as some issues specific just to this particular game.

In the meanwhile, Snakebird is available at least on Steam, Google Play, and the App Store. I recommend it for anyone interested in a very hard but fair puzzle game.

Numbers and tagged pointers in early Lisp implementations

jsnell@iki.fi — Mon, 04 Sep 2017 15:00:00 GMT

There was a bit of discussion on HN about data representations in dynamic languages, and specifically having values that are either pointers or immediate data, with the two cases being distinguished by use of tag bits in the pointer value:

If there's one takeway/point of interest that I'd recommend looking at, it's the novel way that Ruby shares a pointer value between actual pointers to memory and special "immediate" values that simply occupy the pointer value itself [1].
This is usual in Lisp (compilers/implementations) and i wouldn't be surprised if it was invented on the seventies once large (i.e. 36-bit long) registers were available.

I was going to nitpick a bit with the following:

The core claim here is correct; embedding small immediates inside pointers is not a novel technique. It's a good guess that it was first used in Lisp systems. But it can't be the case that its invention is tied into large word sizes, those were in wide use well before Lisp existed. (The early Lisps mostly ran on 36 bit computers.)

It seems more likely that this was tied into the general migration from word-addressing to byte-addressing. Due to alignment constraints, byte-addressed pointers to word-sized objects will always have unused bits around. It's harder to arrange for that with a word-addressed system.

But the latter part of that was speculation, maybe I should try to check the facts first before being tediously pedantic? Good call, since that speculation was wrong. Let's take a tour through some early Lisp implementations, and look at how they represented data in general, and numbers in particular.

The problem with integers
LISP I
LISP 1.5
Basic PDP-1 LISP
M 460 LISP
PDP-6 LISP
BBN LISP
Conclusion

The problem with integers

Before we get started, let's state the problem that tagged pointers solve. In a dynamically typed programming language, the language implementation must be able to distinguish between values of different types. The obvious implementation is boxing; all values are treated as blobs of memory allocated somewhere on the heap, with an envelope containing metadata such as the type and (maybe) the size of the object.

But this means that integers now have tons of overhead. They use up heap space, need to be garbage collected, and new memory needs to be constantly allocated for the results of arithmetic operations. Since integers are so critical to almost all kinds of computing, it would be great to minimize the overhead. And ultimately, to eliminate the overhead completely by encoding small integers as recognizably invalid pointers.

LISP I

I wasn't super hopeful about finding out exactly what numbers looked like in the original Lisp implementation. As far as I know, the source code hasn't been preserved. Now, the original paper describing Lisp ( Recursive Functions of Symbolic Expressions and their Computation by Machine, Part I ) isn't quite as theoretical as the title suggests. For example it describes the memory allocator and garbage collector on a reasonable systems level. But it doesn't mention numbers at all; this is a system for symbolic computation, so numbers might as well not exist.

The LISP I Programmer's Manual from 1960 is more illuminating, though not entirely consistent. In one place the manual claims that LISP I only supports floats, and you'll need to wait until LISP II to use integers. But the rest of the document happily describes the exact memory layout of integers, so who can tell.

A floating point value looks like this:

Let's say we have the value 1.0 in a LISP I program. This value is actually a pointer to a word. How do we know what the type of the pointed-to word is? If the upper half of that word is -1, it's a symbol. Otherwise it's a cons. (The use of -1.0 and 1.0 as the example floats in this picture is unfortunate, since it looks like the -1.0 and -1 are somehow related. That's not the case, -1 is the universal tag value for atoms, and independent of the exact floating point values.)

So the number 1.0 is a symbol? Technically yes, since at this stage of Lisp's evolution everything is either a symbol or a cons. There are no other atoms. We can find out if the symbol represents a number by following the linked list starting from the cdr of the symbol (a pointer stored in the lower half of the word). If we find the symbol NUMB on the list, it's some kind of number. If we find the symbol FLO, it's a floating point number, and the property list will be pointing to a word that contains the raw floating point value that this number represents.

There's a detail here that's kind of amazing. Notice that 1.0 and -1.0 share the same list structure. The only difference is that -1.0 has the symbol MINUS in the list, after which the list merges with the list of 1.0. What a fabulously inefficient representation! Not only do you have to do a bunch of pointer chasing just to find the actual value of a number, but then you'll get to do it again to find out the sign!
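To make the amount of pointer chasing concrete, here's a loose Python model of that structure. It is only a sketch: the Cons and RawWord classes, the string indicators, and the exact order of the indicators on the list are my assumptions, not something taken from the manual.

```python
class Cons:
    """One list cell; car and cdr stand in for the two halves of a word."""
    def __init__(self, car, cdr=None):
        self.car = car
        self.cdr = cdr

class RawWord:
    """A full word holding raw floating point bits rather than pointers."""
    def __init__(self, bits):
        self.bits = bits

ATOM_TAG = -1  # a car of -1 marks the word as a symbol rather than a cons

def number_value(symbol):
    """Walk the symbol's property list to find out if it's a number, and which."""
    if symbol.car != ATOM_TAG:
        raise TypeError("not an atom")
    is_number = negative = False
    magnitude = None
    cell = symbol.cdr
    while cell is not None:
        if cell.car == "NUMB":
            is_number = True
        elif cell.car == "MINUS":
            negative = True
        elif cell.car == "FLO":
            cell = cell.cdr              # the next cell points at the raw word
            magnitude = cell.car.bits
        cell = cell.cdr
    if not is_number or magnitude is None:
        raise TypeError("not a number")
    return -magnitude if negative else magnitude

# 1.0 and -1.0 share everything except the MINUS cell (assumed ordering).
shared = Cons("NUMB", Cons("FLO", Cons(RawWord(1.0))))
one = Cons(ATOM_TAG, shared)
minus_one = Cons(ATOM_TAG, Cons("MINUS", shared))
```

Even in this simplified model, reading -1.0 means touching the symbol word, four list cells and the raw word.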

The question I can't answer just from reading this document is how exactly the raw floating point value is handled. Surely the garbage collector must know not to interpret those raw bits as pointer data? There is a very detailed example of the memory layout for an integer on pages 94-95, but even with that example I just don't see where the type information is stored. It's clearly not based on address ranges (the raw values are mixed in with the other words), nor the pointer value (all the pointers are stored as 2's complement), nor the 6 unused bits in the machine word.

Suggestions welcome. My best guess is that the example is inaccurate.

LISP 1.5

The LISP 1.5 Programmer's Manual from 1962 explains in a very concise manner how numbers worked in that implementation:

Numbers are still considered to be symbols, and symbols are still marked with -1 as the car. But the standard symbol property list is now gone; instead the symbol is pointing directly to the memory that stores the raw integer value. How does the program know not to follow that pointer as a list? As the document says, that's specified by "certain bits in the tag".

The tag? What's the tag? The IBM 704 had a 36-bit word size but just a 15 bit address space. The words were split (on the ISA level) into a 3 bit "prefix", 15 bit "address", 3 bit "tag", and 15 bit "decrement". Since Lisp values are pointers, only the two 15 bit regions are useful for that. One of the 3 bit regions has been repurposed by the Lisp implementation to mark the pointers to raw data.

This is a clear improvement over LISP I, but a number is still represented as an untagged pointer to a tagged pointer to the raw value. Why is the intermediate word there at all, why not go directly with a tagged pointer to the raw value? Maybe code size?

In parallel to that, the address space has now been split into multiple separate pieces, with the cons cells being allocated from a different range of addresses than plain data like numbers and string segments. It could well be that the tagged pointer is irrelevant to the GC, which just makes its decisions on what's a pointer based on whether the pointer is contained in the "full word space" or the "free space". The tags would then be used just for implementing NUMBERP.

Basic PDP-1 LISP

For an L. Peter Deutsch joint, The LISP implementation for the PDP-1 Computer proves to be a surprisingly unsatisfying document. It's almost exclusively user documentation, with no information on the systems architecture. Well, except a full source code listing. Guess we'll have to look at that, then. NUMBERP is the easiest starting point:

/// ("is a number")
/NUMBERP
nmp,    lac i 100
        and (jmp
        sad (jmp
        jmp tru
        jmp fal

The main thing that needs to be known from the rest of the code is that the interpreter stores a pointer to the Lisp value that's currently being operated on at address 100 (octal).

First "lac i 100" follows the pointer to read the first data word of the value into the accumulator. The next line looks bizarre; due to the way the PDP-1 macro-assembler works, "and (jmp" effectively means "and 600000". So this instruction is masking away all but the top two bits of the accumulator, and "sad (jmp" is checking whether the result of the masking equals octal 600000. It appears that there is nothing special about the pointer to a number, but numbers are identified by having the top two bits set in the pointed-to value.

The next step in understanding the layout is the code for reading the raw value of a number.

/get numeric value
vag,    lio i 100
        cla
        rcl 2s
        sas (3
        jmp qi3
        idx 100
        lac i 100
        rcl 8s
        rcl 8s
        jmp x

"lio i 100" loads the current Lisp value into the IO register. "cla" sets the accumulator to zero. "rcl 2s" then rotates the combination of the IO register and accumulator by 2 bits. The accumulator now contains as its low bits the previous high two bits of the IO register. "sas (3" compares the accumulator to 3; if they're not equal we jump to qi3 (the error routine for "non-numeric arg for arith"). "idx 100" moves the pointer to the next word of the value, and "lac i 100" reads that word into the accumulator. And finally the combination of the two registers is rotated by 16 bits, so that we end up with the raw 18 bit value in the accumulator. Written out step by step the process looks like this:

    . == Bit with value of 0
    ! == Bit with value of 1
    ? == Bit with unknown value
    0-9, A-H == bits of the integer value

    X                    X+1
------------------------------------------------
    [!!23456789ABCDEFGH] [................01]

    IO                   AC
------------------------------------------------
Load IO from address X
    [!!23456789ABCDEFGH] [??????????????????]
Clear AC
    [!!23456789ABCDEFGH] [..................]
Rotate left by 2
    [23456789ABCDEFGH..] [................!!]
Check AC == 3
Load AC from address X+1
    [23456789ABCDEFGH..] [................01]
Rotate left by 8
    [ABCDEFGH..........] [........0123456789]
Rotate left by 8
    [..................] [0123456789ABCDEFGH]

Clearly an integer is now represented by a pointer to two words that has a special tag in the high bits of the first word. This implementation got rid of the extra layer of indirection in LISP 1.5; an integer is now just a pointer to tagged data. But we're still left with the storage of a one-word integer requiring three words.
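The same steps can be replayed in Python as a sanity check of my reading of the listing. This is a sketch, not a PDP-1 emulator: the function names are made up, rcl models the rotate of the combined AC and IO registers, and encode_number is my reconstruction of the layout from the diagram above (including the assumption that the unused bits of the second word are zero).

```python
WORD_BITS = 18
WORD_MASK = (1 << WORD_BITS) - 1      # 0o777777
TAG_MASK = 0o600000                   # top two bits of an 18-bit word

def is_number(first_word):
    """NUMBERP: mask to the top two bits and check that both are set."""
    return first_word & TAG_MASK == TAG_MASK

def rcl(ac, io, n):
    """Rotate the combined 36-bit AC:IO register left by n bits."""
    total = 2 * WORD_BITS
    combined = (ac << WORD_BITS) | io
    combined = ((combined << n) | (combined >> (total - n))) & ((1 << total) - 1)
    return combined >> WORD_BITS, combined & WORD_MASK

def encode_number(value):
    """Pack an 18-bit value into the two-word layout from the diagram."""
    first = TAG_MASK | (value & 0o177777)   # tag bits plus the low 16 value bits
    second = value >> 16                    # top 2 value bits, in the low end
    return first, second

def get_numeric_value(first, second):
    io = first                    # lio i 100
    ac = 0                        # cla
    ac, io = rcl(ac, io, 2)       # rcl 2s: the tag lands in the low bits of AC
    if ac != 3:                   # sas (3 / jmp qi3
        raise TypeError("non-numeric arg for arith")
    ac = second                   # idx 100 / lac i 100
    ac, io = rcl(ac, io, 8)       # rcl 8s
    ac, io = rcl(ac, io, 8)       # rcl 8s
    return ac
```

Round-tripping a value through encode_number and get_numeric_value gives back the original 18 bits, which at least confirms that the layout in the diagram and the rotate sequence agree with each other.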

Why use a layout that requires shuffling data around this much, instead of just having the tag in X and the raw value in X+1? It seems awfully inconvenient. My best guess is that the top 1-2 bits of the second word are reserved for the GC, e.g. for use as mark bits. But understanding exactly how the GC works is maybe a project for another day.

M 460 LISP

Before starting research for this article, I'd never heard of the early Lisp implementation for the Univac M 460. A description of the system can be found in the 1964 collection The programming language LISP: Its operation and applications .

Numbers and print names are placed in free storage using the device that sufficiently small (i.e., less than 2^10) half-word quantities appear to point into the bit table area and so don't cause the garbage collector any trouble. A number is stored as a list of words (a flag-word and from 1 to 3 number words, as required), each number word containing in its CAR part 10 significant bits and sign. Thus an integer whose absolute value is less than 2^11 will occupy the same amount of storage (2 words) as in 7090 LISP 1.5.

This is another bit of progress! The key insight on the road to tagged pointers is that invalid parts of the address space can be used to distinguish between pointers and immediate data. Another important insight in this paper is that most numbers in a program are going to be small, so it might make sense to have variable representations for numbers of different magnitude. But it's not a full realization of the concept yet: immediate small numbers are not accessible directly by the user. They are internal to the implementation, used as a building block for boxed integers of various levels of inefficiency.

The paper gets even better once we get a few more pages in, since for characters M 460 Lisp does take that final step:

Each character in the character set available on the M 460 (including tab, carriage return, and others) is represented internally by an 8-bit code (6 bits for the character (up to case), 1 bit for case, and 1 bit for color). To facilitate the manipulation of character strings within our LISP system, we permit such character literals to appear in list structure as if they were atoms, i.e. pointers to property lists. These literals can, where necessary, be distinguished from atoms since they are less than 2^8 in magnitude and hence, viewed as pointers, don't point into free storage (where, as in 7090 LISP, property lists are stored). The predicate charp simply makes this magnitude test.

That's about as clear a case of embedding immediate data in pointers as it gets. It's just that the tag is rather large (22 highest bits, rather than the 1-4 lowest bits you'd expect today). And it's also dealing with characters rather than numbers, so let's carry on with the investigation a bit longer.

PDP-6 LISP

The June 1966 report on PDP-6 LISP has the following to say on integers:

Fixed-point numbers >= 0 and < about 4000 are represented by a "pointer" 1 greater than their value, and no additional list structure. All other numbers use a pointer to full-word space as part of an atom header with a FIXNUM or FLONUM indicator.

This is starting to get close to the modern fixnum, except for no facility for immediate negative numbers and a tiny range. (This is a machine with 36 bit words and 18 bit pointers; one would hope for a bit more than 12 bits for immediate integers).
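In code the scheme is about as simple as it sounds. This is a sketch with made-up names; the report only says "about 4000", so the exact cutoff used here is an assumption.

```python
INUM_LIMIT = 4000   # "about 4000" in the report; the exact cutoff is assumed

def encode_small(n):
    """Small non-negative integers become a 'pointer' one greater than the value."""
    if 0 <= n < INUM_LIMIT:
        return n + 1
    raise ValueError("out of immediate range, needs a FIXNUM atom header")

def is_small(pointer):
    """Pointers in 1..INUM_LIMIT are really integers, not addresses."""
    return 1 <= pointer <= INUM_LIMIT

def decode_small(pointer):
    return pointer - 1
```

Note that with the offset of 1, pointer 0 is never an integer.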

BBN LISP

Structure of a LISP system using two-level storage is a wonderful systems design paper from November 1966, describing BBN LISP for a PDP-1 with 16K of core memory, 88K of absurdly slow drum memory, and no hardware paging support. How do you make efficient use of the drum memory? By some clever data layout, software-driven paging, and a locality-optimizing memory allocator.

So it's actually a paper I thought was totally worth reading just for its own sake. But for the purposes of this post, this is the money quote:

LISP assumes that it is operating in an environment containing 128K words, that is from 0 to 400,000 octal. Only 88K actually exist on the drum. The remaining portion of the address space is used for representation of small integers between -32,767 and 32,767 (offset by 300,000 octal), as described below.

The paper describes a machine with both an 18-bit word size and address space, with 16-bit signed fixnums embedded in the pointers. That's about as good as it gets. (Though not quite optimal; they're using bit 17 as the integer tag, but what happened to bit 18? The paper doesn't say, but odds are that it's again a GC mark bit).
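Here's the encoding from the quote as a small sketch. The offset and range are taken from the paper; the function names are invented, and testing the 0o200000 bit is my reading of "bit 17 as the integer tag" rather than something the paper spells out.

```python
SMALL_INT_OFFSET = 0o300000   # offset given in the paper
SMALL_INT_MAX = 32767         # range given in the paper: -32,767 to 32,767
INT_TAG_BIT = 0o200000        # assumed: bit 17 marks a value as a small integer

def box_small(n):
    """Represent n as an address in the part of the address space with no memory behind it."""
    if abs(n) > SMALL_INT_MAX:
        raise OverflowError("needs a heap-allocated number")
    return SMALL_INT_OFFSET + n

def is_small_int(pointer):
    return pointer & INT_TAG_BIT != 0

def unbox_small(pointer):
    return pointer - SMALL_INT_OFFSET
```

The full range maps to the addresses 0o200001 through 0o377777, so the whole upper half of the 128K address space ends up holding integers.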

The particularly observant reader might have noticed that this machine had 104K words of physical memory, but the described tagging scheme only leaves 64K words addressable. What's up with that? On one level it's exactly what M 460 LISP and PDP-6 Lisp were doing: that 40K of address space stores things that can't be directly pointed to from another Lisp value. But those other implementations were just opportunistically reusing the parts of address space that contained native code.

By contrast, BBN LISP carefully arranged for there to exist as much of such storage as possible, and for it to be located above the address 200,000 (octal).

The most clever example of that is the representation of symbols. The first implementations we saw just implemented symbols as a list of properties indexed by name (e.g. name, value cell, function cell, etc). An obvious optimization is to allocate a symbol as a single larger block of memory with fixed slots for the most common properties, and a generic property list slot to contain anything else.

What BBN Lisp does instead is allocate a symbol in multiple separate blocks rather than a single contiguous one. A pointer to the symbol will point to the block of value cells, so reading the value cell is trivial. What if you want to read another property, e.g. the function? We look at the offset of the value cell pointer to the start of the value cell block, and access the function cell block at the same offset. In modern parlance it ends up as a structure-of-arrays layout rather than an array-of-structures.

In addition to getting more address space for fixnums, they also got exactly the same kind of locality improvements that a structure-of-arrays would be used for today. So it was an all-around neat optimization.
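A toy version of that symbol layout, to show the offset trick. The block addresses, the flat memory list and the function names are all hypothetical; the paper's real memory map is not reproduced here.

```python
# Hypothetical block base addresses, chosen only for this sketch.
VALUE_BLOCK = 0
FUNCTION_BLOCK = 100
memory = [None] * 200

def alloc_symbol(index, value, function):
    """A symbol is just a slot index; its cells live in separate blocks."""
    pointer = VALUE_BLOCK + index        # a symbol pointer points at its value cell
    memory[pointer] = value
    memory[FUNCTION_BLOCK + index] = function
    return pointer

def symbol_value(pointer):
    return memory[pointer]               # the value cell is a direct read

def symbol_function(pointer):
    offset = pointer - VALUE_BLOCK       # offset within the value cell block
    return memory[FUNCTION_BLOCK + offset]   # same offset in the function block
```

Only the value cell block needs to be reachable through a Lisp pointer. The function cell block can live at addresses that no Lisp value ever points to directly.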

There is also an early design document for BBN 940 LISP from almost the same time as the above paper. It appears to describe the kind of elaborate tagging scheme that a modern Lisp might use, and places the tags in the low bits where they're easier to test for/eliminate. And they even call heap-allocated numbers "boxed"! I had no idea this terminology was in use 50 years ago. The relevant section:

There will be a maximum of 16 pointer types of objects in the 940 LISP System. These are (numbered in octal)

00. S-expressions (nonatomic)
01. Identifiers (literal atoms)
02. Small Integers
03. Boxed Large Integers
04. Boxed Floating Point Numbers
05. Compiled Function - Lambda Type
06. Compiled Function - Lambda Type - Indef Args
07. Compiled Function - Mu Type - Args Paired
10. Compiled Function - Mu Type - List of Args
11. Compiled Function - Macro
12. Array - Pointers
13. Array - Integers
14. Array - FP #s
15. Strings - Packed Character Arrays
16.
17. Pushdown List Pointers

Each pointer will be contained in one 940 word of 24 bits. Bits 0 and 1 will be nominally empty, and may in some cases be used by the system (e.g. bit 0 for garbage collection) or perhaps even the user (in S-expressions). The four bits 2-5 will contain the type number for this pointer. The 18 bits 6-23 will contain an effective address (in the LISP drum file) where the referenced information is stored.

It looks like they ended up not using this design for BBN 940 LISP, and it instead uses an extended version of the segmented memory scheme from the PDP-1 implementation described earlier in this section. But even if these particular bits weren't practical to use with that hardware, at this point just about all the ideas for tagged pointers have definitely been invented.

Conclusion

The initial LISP I implementation in 1960 had the least efficient implementation of numbers this side of Church numerals, where even just getting the value might imply chasing half a dozen pointers. But new implementations optimized that layout aggressively. By 1964, the M 460 LISP implementation had arrived at the general solution of using pointers to invalid parts of the address space for storing immediate data, but user-accessible integers were still boxed; the only use for the unboxed integers was as an internal building block. In 1966 PDP-6 LISP applied the idea of tagged immediate data to tiny positive integers, and the PDP-1 based BBN LISP took the idea to the logical conclusion, and allowed immediate storage of integers of almost the full machine word.

I would not have guessed that these optimizations were discovered and applied so early and so aggressively. It's also noteworthy that this was independent of the machine word size, the address space size, and the addressing mode of the machine. The first fully fledged implementation I found was on a machine with 18 bit words, 18 bits of address space, and word-addressing. That should have been just about the worst case!

There's an interesting tangent with how MacLISP ended up reversing this progress in the '70s and going back to boxed integers, since they wanted to have just a single integer representation. I won't go into the details since this post already grew longer than intended. But for those interested in the subject AI Memo 421 is a fun read.

Was the technique definitely first used in Lisp? These implementations are early enough that there aren't a ton of other possibilities. The only ones I can think of would be APL and Dartmouth BASIC. If anyone can find documentation on earlier uses of storing immediate data in tagged pointers, please let me know and I'll edit the article.

Why PS4 downloads are so slow

jsnell@iki.fi — Sat, 19 Aug 2017 19:00:00 GMT

Game downloads on PS4 have a reputation of being very slow, with many people reporting downloads being an order of magnitude faster on Steam or Xbox. This had long been on my list of things to look into, but at a pretty low priority. After all, the PS4 operating system is based on a reasonably modern FreeBSD (9.0), so there should not be any crippling issues in the TCP stack. The implication is that the problem is something boring, like an inadequately dimensioned CDN.

But then I heard that people were successfully using local HTTP proxies as a workaround. It should be pretty rare for that to actually help with download speeds, which made this sound like a much more interesting problem.

This is going to be a long-winded technical post. If you're not interested in the details of the investigation but just want a recommendation on speeding up PS4 downloads, skip straight to the conclusions.

Background

Before running any experiments, it's good to have a mental model of how the thing we're testing works, and where the problems might be. If nothing else, it will guide the initial experiment design.

The speed of a steady-state TCP connection is basically defined by three numbers: the amount of data the client is willing to receive on a single round-trip (TCP receive window), the amount of data the server is willing to send on a single round-trip (TCP congestion window), and the round trip latency between the client and the server (RTT). To a first approximation, the connection speed will be:

    speed = min(rwin, cwin) / RTT

With this model, how could a proxy speed up the connection? Well, with a proxy the original connection will be split into two mostly independent parts; one connection between the client and the proxy, and another between the proxy and the server. The speed of the end-to-end connection will be determined by the slower of those two independent connections:

    speed_proxy_client = min(client rwin, proxy cwin) / client-proxy RTT
    speed_server_proxy = min(proxy rwin, server cwin) / proxy-server RTT
    speed = min(speed_proxy_client, speed_server_proxy)

With a local proxy the client-proxy RTT will be very low; that connection is almost guaranteed to be the faster one. The improvement will have to be from the server-proxy connection being somehow better than the direct client-server one. The RTT will not change, so there are just two options: either the client has a much smaller receive window than the proxy, or the client is somehow causing the server's congestion window to decrease. (E.g. the client is randomly dropping received packets, while the proxy isn't).
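The model is simple enough to play with directly. A sketch in Python, with made-up function names; all the window sizes and RTTs in the example are hypothetical inputs chosen to illustrate the first theory, not measurements.

```python
def tcp_speed(rwin_bytes, cwin_bytes, rtt_seconds):
    """First-approximation steady-state throughput, in bits per second."""
    return min(rwin_bytes, cwin_bytes) * 8 / rtt_seconds

def proxied_speed(client_rwin, proxy_cwin, client_proxy_rtt,
                  proxy_rwin, server_cwin, proxy_server_rtt):
    """End-to-end speed is set by the slower of the two split connections."""
    speed_proxy_client = tcp_speed(client_rwin, proxy_cwin, client_proxy_rtt)
    speed_server_proxy = tcp_speed(proxy_rwin, server_cwin, proxy_server_rtt)
    return min(speed_proxy_client, speed_server_proxy)

# Hypothetical numbers: a client with a small receive window, a proxy and
# server with large windows, 30ms to the server and 1ms to the local proxy.
KB, MB = 1024, 1024 * 1024
direct = tcp_speed(128 * KB, 4 * MB, 0.030)
proxied = proxied_speed(128 * KB, 4 * MB, 0.001, 4 * MB, 4 * MB, 0.030)
```

With these inputs the direct connection is stuck at roughly 35Mbps, while the proxied one comes out around 1Gbps, since the small client window now only has to cover the 1ms local hop.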

Out of these two theories, the receive window one should be much more likely, so we should concentrate on it first. But that just replaces our original question with a new one: why would the client's receive window be so low that it becomes a noticeable bottleneck? There's a fairly limited number of causes for low receive windows that I've seen in the wild, and they don't really seem to fit here.

Maybe the client doesn't support the TCP window scaling option, while the proxy does. Without window scaling, the receive window will be limited to 64kB. But since we know Sony started with a TCP stack that supports window scaling, they would have had to go out of their way to disable it. Slow downloads, for no benefit.
Maybe the actual downloader application is very slow. The operating system is supposed to have a certain amount of buffer space available for each connection. If the network is delivering data to the OS faster than the application is reading it, the buffer will start to fill up, and the OS will reduce the receive window as a form of back-pressure. But this can't be the reason; if the application is the bottleneck, it'll be a bottleneck with or without the proxy.
The operating system is trying to dynamically scale the receive window to match the actual network conditions, but something is going wrong. This would be interesting, so it's what we're hoping to find.

The initial theories are in place, let's get digging.

Experiment #1

For our first experiment, we'll start a PSN download on a baseline non-Slim PS4, firmware 4.73. The network connection of the PS4 is bridged through a Linux machine, where we can add latency to the network using tc netem. By varying the added latency, we should be able to find out two things: whether the receive window really is the bottleneck, and whether the receive window is being automatically scaled by the operating system.

This is what the client-server RTTs (measured from a packet capture using TCP timestamps) look like for the experimental period. Each dot represents 10 seconds of time for a single connection, with the Y axis showing the minimum RTT seen for that connection in those 10 seconds.

The next graph shows the amount of data sent by the server in one round trip in red, and the receive windows advertised by the client in blue.

First, since the blue dots are staying constantly at about 128kB, the operating system doesn't appear to be doing any kind of receive window scaling based on the RTT. (So much for that theory). Though at the very right end of the graph the receive window shoots out to 650kB, so it isn't totally fixed either.

Second, is the receive window the bottleneck here? If so, the blue dots would be close to the red dots. This is the case until about 10:50. And then mysteriously the bottleneck moves to the server.

So we didn't find quite what we were looking for, but there are a couple of very interesting things that are correlated with events on the PS4.

The download was in the foreground for the whole duration of the test. But that doesn't mean it was the only thing running on the machine. The Netflix app was still running in the background, completely idle [1]. When the background app was closed at 11:00, the receive window increased dramatically. This suggests a second experiment, where different applications are opened / closed / left running in the background.

The time where the receive window stops being the bottleneck is very close to the PS4 entering rest mode. That looks like another thing worth investigating. Unfortunately, that's not true, and rest mode is a red herring here. [2]

Experiment #2

Below is a graph of the receive windows for a second download, annotated with the timing of various noteworthy events.

The differences in receive windows at different times are striking. And more important, the changes in the receive windows correspond very well to specific things I did on the PS4.

When the download was started, the game Styx: Shards of Darkness was running in the background (just idling in the title screen). The download was limited by a receive window of under 7kB. This is an incredibly low value; it's basically going to cause the downloads to take 100 times longer than they should. And this was not a coincidence, whenever that game was running, the receive window would be that low.
Having an app running (e.g. Netflix, Spotify) limited the receive window to 128kB, for about a 5x reduction in potential download speed.
Moving apps, games, or the download window to the foreground or background didn't have any effect on the receive window.
Launching some other games (Horizon: Zero Dawn, Uncharted 4, Dreadnought) seemed to have the same effect as running an app.
Playing an online match in a networked game (Dreadnought) caused the receive window to be artificially limited to 7kB.
Playing around in a non-networked game (Horizon: Zero Dawn) had a very inconsistent effect on the receive window, with the effect seemingly depending on the intensity of gameplay. This looks like a genuine resource restriction (download process getting variable amounts of CPU), rather than an artificial limit.
I ran a speedtest at a time when downloads were limited to 7kB receive window. It got a decent receive window of over 400kB; the conclusion is that the artificial receive window limit appears to only apply to PSN downloads.
Putting the PS4 into rest mode had no effect.
Built-in features of the PS4 UI, like the web browser, do not count as apps.
When a game was started (causing the previously running game to be stopped automatically), the receive window could increase to 650kB for a very brief period of time. Basically it appears that the receive window gets unclamped when the old game stops, and then clamped again a few seconds later when the new game actually starts up.
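The "5x" and "100 times" figures above are just ratios against the unclamped window. As a quick sketch, assuming the receive window is the only bottleneck and the RTT stays the same (the function name is made up):

```python
UNCLAMPED_KB = 650   # the window seen when nothing else was running

def slowdown(clamped_kb, unclamped_kb=UNCLAMPED_KB):
    """How many times slower a window-limited download gets at a fixed RTT."""
    return unclamped_kb / clamped_kb

app_factor = slowdown(128)   # an app running in the background
game_factor = slowdown(7)    # e.g. an online match in progress
```

That works out to a bit over 5x for the 128kB clamp and a bit over 90x for the 7kB one.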

I did a few more test runs, and all of them seemed to support the above findings. The only additional information from that testing is that the rest mode behavior was dependent on the PS4 settings. Originally I had it set up to suspend apps when in rest mode. If that setting was disabled, the apps would be closed when entering rest mode, and the downloads would proceed at full speed.

A 7kB receive window will be absolutely crippling for any user. A 128kB window might be ok for users who have CDN servers very close by, or who don't have a particularly fast internet connection. For example at my location, a 128kB receive window would cap the downloads at about 35Mbps to 75Mbps depending on which CDN the DNS RNG happens to give me. The lowest two speed tiers for my ISP are 50Mbps and 200Mbps. So either the 128kB would not be a noticeable problem (50Mbps) or it'd mean that downloads are artificially limited to 25% speed (200Mbps).
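For a sense of scale, here's the arithmetic behind those caps as a sketch. It uses the speed = window / RTT model from the background section, assumes 1kB means 1024 bytes, and the function names are invented.

```python
KB = 1024   # assumed: 1kB = 1024 bytes

def implied_rtt_ms(window_bytes, speed_bps):
    """RTT at which a given receive window caps the connection at speed_bps."""
    return window_bytes * 8 / speed_bps * 1000

def window_cap_mbps(window_bytes, rtt_ms):
    """Maximum speed a given receive window allows at a given RTT."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6

slow_cdn_rtt = implied_rtt_ms(128 * KB, 35e6)   # about 30ms
fast_cdn_rtt = implied_rtt_ms(128 * KB, 75e6)   # about 14ms
tiny_window_cap = window_cap_mbps(7 * KB, 30)   # about 1.9Mbps
```

So the 35Mbps to 75Mbps range corresponds to CDN servers somewhere between 30ms and 14ms away. At the far end of that range, the 7kB window would allow less than 2Mbps.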

Conclusions

If any applications are running, the PS4 appears to change the settings for PSN store downloads, artificially restricting their speed. Closing the other applications will remove the limit. There are a few important details:

Just leaving the other applications running in the background will not help. The exact same limit is applied whether the download progress bar is in the foreground or not.
Putting the PS4 into rest mode might or might not help, depending on your system settings.
The artificial limit applies only to the PSN store downloads. It does not affect e.g. the built-in speedtest. This is why the speedtest might report much higher speeds than the actual downloads, even though both are delivered from the same CDN servers.
Not all applications are equal; most of them will cause the connections to slow down by up to a factor of 5. Some games will cause a difference of about a factor of 100. Some games will start off with the factor of 5, and then migrate to the factor of 100 once you leave the start menu and start playing.
The above limits are artificial. In addition to that, actively playing a game can cause game downloads to slow down. This appears to be due to a genuine lack of CPU resources (with the game understandably having top priority).

So if you're seeing slow downloads, just closing all the running applications might be worth a shot. (But it's obviously not guaranteed to help. There are other causes for slow downloads as well, this will just remove one potential bottleneck). To close the running applications, you'll need to long-press the PS button on the controller, and then select "Close applications" from the menu.

The PS4 doesn't make it very obvious exactly what programs are running. For games, the interaction model is that opening a new game closes the previously running one. This is not how other apps work; they remain in the background indefinitely until you explicitly close them.

And it gets worse than that. If your PS4 is configured to suspend any running apps when put to rest mode, you can seemingly power on the machine into a clean state, and still have a hidden background app that's causing the OS to limit your PSN download speeds.

This might explain some of the superstitions about this on the Internet. There are people who swear that putting the machine to rest mode helps with speeds, others who say it does nothing. Or how after every firmware update people will report increased download speeds. Odds are that nothing actually changed in the firmware; it's just that those people had done their first full reboot in a while, and finally had a system without a background app running.

Speculation

Those were the facts as I see them. Unfortunately this raises some new questions, which can't be answered experimentally. With no facts, there's no option except to speculate wildly!

Q: Is this an intentional feature? If so, what is its purpose?

Yes, it must be intentional. The receive window changes very rapidly when applications or games are opened/closed, but not for any other reason. It's not any kind of subtle operating system level behavior; it's most likely the PS4 UI explicitly manipulating the socket receive buffers.

But why? I think the idea here must be to not allow the network traffic of background downloads to take resources away from the foreground use of the PS4. For example if I'm playing an online shooter, it makes sense to harshly limit the background download speeds to make sure the game is getting ping times that are both low and predictable. So there's at least some point in that 7kB receive window limit in some circumstances.

It's harder to see what the point of the 128kB receive window limit for running any app is. A single game download from some random CDN isn't going to muscle out Netflix or Youtube... The only thing I can think of is that they're afraid that multiple simultaneous downloads, e.g. due to automatic updates, might cause problems for playing video. But even that seems like a stretch.

There's an alternate theory that this is due to some non-network resource constraints (e.g. CPU, memory, disk). I don't think that works. If the CPU or disk were the constraint, just having the appropriate priorities in place would automatically take care of this. If the download process gets starved of CPU or disk bandwidth due to a low priority, the receive buffer would fill up and the receive window would scale down dynamically, exactly when needed. And the amounts of RAM we're talking about here are minuscule on a machine with 8GB of RAM; less than a megabyte.

Q: Is this feature implemented well?

Oh dear God, no. It's hard to believe just how sloppy this implementation is.

The biggest problem is that the limits get applied based just on what games/applications are currently running. That's just insane; what matters should be which games/applications someone is currently using. Especially in a console UI, it's a totally reasonable expectation that the foreground application gets priority. If I've got the download progress bar in the foreground, the system had damn well better give that download priority. Not some application that was started a month ago, and hasn't been used since. Applying these limits in rest mode with suspended apps is beyond insane.

Second, these limits get applied per-connection. So if you've got a single download going, it'll get limited to 128kB of receive window. If you've got five downloads, they'll all get 128kB, for a total of 640kB. That means the efficiency of the "make sure downloads don't clog the network" policy depends purely on how many downloads are active. That's rubbish. This is all controlled on the application level, and the application knows how many downloads are active. If there really were an optimal static receive window X, it should just be split evenly across all the downloads.

Third, the core idea of applying a static receive window as a means of fighting bufferbloat is just fundamentally broken. Using the receive window as the rate limiting mechanism just means that the actual transfer rate will depend on the RTT (this is why a local proxy helps). For this kind of thing to work well, you can't have the rate limit depend on the RTT. You also can't just have somebody come up with a number once, and apply that limit to everyone. The limit needs to depend on the actual network conditions.
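To put rough numbers on the RTT dependence: with a fixed receive window, at most one window's worth of data can be in flight per round trip, so the transfer rate is capped at about window / RTT. A minimal sketch, assuming kB means 1024 bytes and using made-up RTT values (none of these are measurements from the PS4):

```python
def max_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one connection's throughput with a fixed receive window."""
    return window_bytes * 8 / rtt_seconds / 1e6


# The two limits discussed in the text; treating 1 kB as 1024 bytes is an assumption.
WINDOWS = {"7kB": 7 * 1024, "128kB": 128 * 1024}
# Illustrative RTTs in milliseconds (assumed, not measured).
RTTS_MS = [1, 20, 100, 200]


def throughput_table():
    """Return (window name, RTT in ms, max Mbps) for every combination."""
    rows = []
    for name, window in WINDOWS.items():
        for rtt_ms in RTTS_MS:
            rows.append((name, rtt_ms, max_throughput_mbps(window, rtt_ms / 1000)))
    return rows


for name, rtt_ms, mbps in throughput_table():
    print(f"{name:>6} window, {rtt_ms:>3} ms RTT: at most {mbps:8.2f} Mbps")
```

Under these assumptions the 128kB limit works out to roughly 10 Mbps at a 100 ms RTT, and five parallel downloads simply get five times that, which is the per-connection problem described above.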

There are ways to detect how congested the downlink is in the client-side TCP stack. The proper fix would be to implement them, and adjust the receive window of low-priority background downloads if and only if congestion becomes an issue. That would actually be a pretty valuable feature for this kind of appliance. But I can kind of forgive this one; it's not an off the shelf feature, and maybe Sony doesn't employ any TCP kernel hackers.

Fourth, whatever method is being used to decide on whether a game is network-latency sensitive is broken. It's absurd that a demo of a single-player game idling in the initial title screen would cause the download speeds to be totally crippled. This really should be limited to actual multiplayer titles, and ideally just to periods where someone is actually playing the game online. Just having the game running should not be enough.

Q: How can this still be a problem, 4 years after launch?

I have no idea. Sony must know that the PSN download speeds have been the butt of jokes for years. It's probably the biggest complaint people have with the system. So it's hard to believe that nobody was ever given the task of figuring out why it's slow. And this is not rocket science; anyone bothering to look into it would find these problems in a day.

But it seems equally impossible that they know of the cause, but decided not to apply any of the trivial fixes to it. (Hell, it wouldn't even need to be a proper technical fix. It could just be a piece of text saying that downloads will work faster with all other apps closed).

So while it's possible to speculate in an informed manner about other things, this particular question will remain as an open mystery. Big companies don't always get things done very efficiently, eh?

Footnotes

[1] How idle? So idle that I hadn't even logged in, the app was in the login screen.

[2] To be specific, the slowdown is caused by the artificial latency changes. The PS4 downloads files in chunks, and each chunk can be served from a different CDN. The CDN that was being used from 10:51 to 11:00 was using a delay-based congestion control algorithm, and reacting to the extra latency by reducing the amount of data sent. The CDN used earlier in the connection was using a packet-loss based congestion control algorithm, and did not slow down despite seeing the latency change in exactly the same pattern.

The mystery of the hanging S3 downloads

jsnell@iki.fi — Thu, 20 Jul 2017 16:00:00 GMT

A coworker was experiencing a strange problem with their Internet connection at home. Large downloads from most sites worked fine. The exception was that downloads from Amazon S3 would get up to a good speed (500Mbps), stall completely for a few seconds, restart for a while, stall again, and eventually hang completely. The problem seemed to be specific to S3; downloads from generic AWS VMs were ok.

What could be going on? It shouldn't be a problem with the ISP, or anything south of that: after all, connections to other sites were working. It should not be a problem between the ISP and Amazon, or there would have been problems with AWS too. But it also seems very unlikely that S3 would have a trivially reproducible problem causing large downloads to hang. It's not like this is some minor use case of the service.

If it had been a problem with e.g. viewing Netflix, one might suspect some kind of targeted traffic shaping. But an ISP throttling or forcibly closing connections to S3 but not to AWS in general? That's just silly talk.

The normal troubleshooting tips like reducing the MTU didn't help either. This sounded like a fascinating networking whodunit, so I couldn't resist butting in after hearing about it through the grapevine.

The packet captures

The first step of debugging pretty much any networking problem is getting a packet capture from as many points in the network as possible. In this case we only had one capture point: the client machine. The problem could not be reproduced on anything but S3, and obviously taking a capture from S3 was not an option. Nor did we have access to any devices elsewhere on the traffic path. [0]

A superficial check of the ACK stream showed the following pattern. The traffic would be humming along nicely, from the sequence numbers we can see that about 57MB have already been downloaded in the first 2.5 seconds.

00:00:02.543596 client > server: Flags [.], ack 57657817
00:00:02.543623 client > server: Flags [.], ack 57661318
00:00:02.543682 client > server: Flags [.], ack 57667046

Then, a single packet loss occurs. We can tell from the SACK block that 1432 bytes of payload are missing. That's almost certainly a single packet.

00:00:02.543734 client > server: Flags [.], ack 57667046,
    options [sack 1 {57668478:57669910}]
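The arithmetic behind that 1432 byte figure, as a small sketch: the hole is the gap between the cumulative ACK and the left edge of the SACK block. The regular expressions assume the tcpdump-style text format shown above (joined onto one line here), and `missing_bytes` is a name made up for this example.

```python
import re

# The ACK from the capture above, joined onto a single line.
ACK_LINE = (
    "00:00:02.543734 client > server: Flags [.], ack 57667046, "
    "options [sack 1 {57668478:57669910}]"
)


def missing_bytes(line: str) -> int:
    """Size of the hole between the cumulative ACK and the first SACK block."""
    # \b keeps "ack" from matching inside "sack".
    ack = int(re.search(r"\back (\d+)", line).group(1))
    sack_left = int(re.search(r"\{(\d+):(\d+)\}", line).group(1))
    return sack_left - ack


print(missing_bytes(ACK_LINE))  # 1432
```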

After the single packet loss, more data continues to be delivered with no problems. In the next 100ms a further 6MB gets delivered. But the missing data never arrives.

...
00:00:02.648316 client > server: Flags [.], ack 57667046,
    options [sack 1 {57668478:63829515}]
00:00:02.648371 client > server: Flags [.], ack 57667046,
    options [sack 1 {57668478:63830947}]

In fact, no further ACKs are sent for 4 seconds. And even then the gap isn't filled by one 1432 byte packet like we'd expect, but by two 512 byte packets and one 408 byte one. There's also an RTT-sized delay between the first and second packets.

00:00:06.751691 client > server: Flags [.], ack 57667558,
    options [sack 1 {57668478:63830947}]
00:00:06.792592 client > server: Flags [.], ack 57668070,
    options [sack 1 {57668478:63830947}]
00:00:06.796277 client > server: Flags [.], ack 63830947

After that, the connection continues merrily along, but the exact same thing happens 3 seconds later.

What can we tell from this? Clearly the actual server would be retransmitting the lost packet much more quickly than with a 4 second delay. It also would not be re-packetizing the 1432 byte packet into three pieces. Instead what must be happening is that each retransmitted copy is getting lost. After a few seconds RFC 4821-style path MTU probing kicks in, and a smaller packet gets retransmitted. For some reason this retransmission makes it through; this makes the sender believe that the path MTU has been reduced, and it starts sending smaller packets.

Again this suggests there's something dodgy going on with MTUs, but as mentioned in the beginning, reducing the MTU did not help.

But it also suggests a mechanism for why the connection eventually hangs completely, rather than alternating between stalling and recovering. There's a limit to how far the MSS can be reduced. If nothing else, the segments will need to have at least one byte of payload. In practice most operating systems have a much higher limit on the MSS (something in the 80-160 byte range is typical). If even packets of the minimum size aren't making it through, the server can't react by sending smaller packets.

With the information from the ACK stream exhausted, it's time to look at the packets in both directions. And what do you know? We actually see the earlier retransmissions at the client, with beautiful exponential backoff. The packets were not lost in the network, but were silently rejected by the client for some reason.

00:00:02.685557 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:02.960249 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:03.500500 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:04.580168 server > client: Flags [.], seq 57667046:57668478, ack 4257, length 1432
00:00:06.751657 server > client: Flags [.], seq 57667046:57667558, ack 4257, length 512
00:00:06.751691 client > server: Flags [.], ack 57667558, win 65528,
    options [sack 1 {57668478:63830947}]
00:00:06.792565 server > client: Flags [.], seq 57667558:57668070, ack 4257, length 512
00:00:06.792567 server > client: Flags [.], seq 57668070:57668478, ack 4257, length 408
00:00:06.792592 client > server: Flags [.], ack 57668070,
    options [sack 1 {57668478:63830947}]
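The backoff is easy to check from the timestamps. A quick sketch that takes the five server-side retransmission times from the capture above and computes the gaps between them; if the sender is doing exponential backoff, each gap should be about twice the previous one. The helper names are mine, only the timestamps come from the trace.

```python
# Arrival times of the retransmitted copies, taken from the capture above.
RETRANSMIT_TIMES = [
    "00:00:02.685557",
    "00:00:02.960249",
    "00:00:03.500500",
    "00:00:04.580168",
    "00:00:06.751657",
]


def to_seconds(timestamp: str) -> float:
    """Convert an HH:MM:SS.ffffff capture timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def intervals(times):
    """Gaps in seconds between consecutive timestamps."""
    secs = [to_seconds(t) for t in times]
    return [round(later - earlier, 6) for earlier, later in zip(secs, secs[1:])]


gaps = intervals(RETRANSMIT_TIMES)
# Ratio of each gap to the one before it; about 2 for exponential backoff.
ratios = [round(later / earlier, 2) for earlier, later in zip(gaps, gaps[1:])]

print("gaps:  ", gaps)
print("ratios:", ratios)
```

The gaps come out at roughly 0.27 s, 0.54 s, 1.08 s and 2.17 s, so each retransmission timeout is double the one before.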

There are really just two reasons this would happen. The IP or TCP checksum could be wrong. But how could it be wrong for the same packet six times in a row? That's crazy talk, the expected packet corruption rate is more like one in a million. Alternatively the packet is too large. But damn it, we know that's not the problem, no matter how well this case is matching the common pattern. Let's just have a look at the checksums, to rule it out...

server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 57667046:57668478, ack 4257, length 1432
...

Oh... Every single copy of that packet had a checksum of 0 instead of the expected checksum of 0xd7a7. (Checksums of 0 are often not real errors, but just artifacts of checksum offload: the packets are captured by software before the checksum is computed by hardware. That's not the case here; these are packets we're receiving rather than transmitting.) And it gets crazier when we look at the next instance of the problem a few seconds later.

server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
server > client: Flags [.], cksum 0x0000 (incorrect -> 0xd7a7), seq 70927740:70928764, ack 4709, length 1024
...

It's the exact same problem, all the way down to the problem appearing specifically with a TCP checksum of 0xd7a7. Further analysis of the captures verified that this was a systematic problem and not a coincidence. Packets with an expected checksum of 0xd7a7 would always have the checksum replaced with 0. Packets with any other expected checksum would work just fine. [1].

This explains why the path MTU probing temporarily fixes the problem: the repacketized segments have different checksums, and make it through unharmed.
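To illustrate why repacketizing changes the checksum, here is a sketch of the standard Internet checksum (the 16-bit one's complement sum from RFC 1071) applied to a toy segment. This is a simplification: a real TCP checksum also covers a pseudo-header and the full TCP header, and the payload bytes here are made up. Only the sequence number and the 1432/512/512/408 split come from the capture.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071: one's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def toy_segment_checksum(seq: int, payload: bytes) -> int:
    """Checksum over a toy 'header' (just the sequence number) plus the payload."""
    return internet_checksum(struct.pack("!I", seq) + payload)


SEQ = 57667046             # sequence number of the lost packet in the capture
PAYLOAD = b"\x01" * 1432   # made-up payload of the same size

whole = toy_segment_checksum(SEQ, PAYLOAD)
pieces = [
    toy_segment_checksum(SEQ, PAYLOAD[:512]),
    toy_segment_checksum(SEQ + 512, PAYLOAD[512:1024]),
    toy_segment_checksum(SEQ + 1024, PAYLOAD[1024:]),
]

print("one 1432 byte segment:", hex(whole))
print("512 + 512 + 408 bytes:", [hex(p) for p in pieces])
```

None of the three smaller segments shares the checksum of the original one, so whatever was mangling that one specific value would leave them alone.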

TCP Timestamps

So, a problem internal to S3 is causing this very specific kind of packet corruption then?

Not so fast! It turns out that most TCP implementations would work around this kind of corruption by accident. The reason for that is TCP Timestamps. And while you don't need to actually know much about TCP Timestamps to understand this story, I have been looking for an excuse to rant about them.

With TCP Timestamps, every TCP packet will contain a TCP option with two extra values. One of them is the sender's latest timestamp. The other is an echo of the latest timestamp the sender received from the other party. For example here the client is sending the timestamp 805, and the server is echoing it back:

client > server: Flags [.], ack 89,
    options [TS val 805 ecr 10087]
server > client: Flags [P.], seq 89:450, ack 569,
    options [TS val 10112 ecr 805]

TCP Timestamps were added to TCP very early on, for two reasons, neither of which was very compelling in retrospect.

Reason number one was PAWS, Protection Against Wrapped-Around Sequence-Numbers. The idea was that very fast connections might require huge TCP window sizes, and minor packet reordering/duplication might cause an old packet to be interpreted as a new packet, due to the 32 bit sequence number having wrapped around. I don't think that world ever really arrived, and PAWS is irrelevant to practically all TCP use cases.

The other original reason for timestamps was to enable TCP senders to measure RTTs in the presence of packet loss. But this can also be done with TCP Selective ACKs, a feature that's much more useful in general (and thus was widely deployed a lot sooner, despite being standardized later).

In exchange for these dubious benefits, every TCP packet (both data segments and pure control packets) is bloated by 12 bytes. This is in contrast to something like selective ACKs, where most packets don't grow in size. You only pay for selective ACKs when packets are lost or reordered. I think that the debuggability of network protocols is important, but with TCP you get basically everything you need from other sources. TCP timestamps have a high fixed cost, but give very little additional power.

If TCP Timestamps suck so much, why does everyone use them? I don't know for sure anyone else's reasons. I ended up implementing them purely due to an interoperability issue with the FreeBSD TCP stack. Basically FreeBSD uses a small static receive window for connections without TCP timestamps, while with TCP timestamps on it'd scale the receive window up as necessary. With connections with even a bit of latency, you needed TCP timestamps to avoid the receive window becoming a bottleneck. (This was fixed in FreeBSD a few months ago. Yay!).

Now, performance of FreeBSD clients isn't a big deal for me as long as the connections work. But you know who else uses a FreeBSD-derived TCP stack? Apple. And when it comes to mobile networks, performance of iOS devices is about as important as it gets. Anyone who cares about large transfers to iOS or OS X clients must use TCP Timestamps, no matter how distasteful they find the feature.

"But Juho, what does any of this have to do with S3?", you ask. Well, S3 is one of those rare services that disable timestamps. And that actually makes for a big difference in this case. With timestamps, each retransmitted copy of a packet would use a different timestamp value [2]. And when any part of the TCP header changes, odds are that the checksum changes as well. Even if some packets are lost due to having the magic checksum, at least the retransmissions will make it through promptly.

To check this theory, I asked for a test with TCP timestamps disabled on the client. And immediately large downloads from anywhere - even the ISP's own speedtest server - started hanging. Success!

Conclusion

With this information I suggested my coworker call his ISP, and report the problem. He was smarter than that, and ran one more test: switching the cable modem from router mode to bridging mode. Bam, the problem was gone. In retrospect this makes sense: in router mode the cable modem needs to update the checksums for each packet that passes through the device. In bridging mode there's no NAT, so no checksum update is needed.
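For context on what "updating the checksums" involves: a NAT rewrites addresses and ports, and the TCP checksum covers those fields, so it has to be patched on every packet. The usual way is the incremental update from RFC 1624 rather than a full recomputation. I have no idea what this particular modem actually does internally; the sketch below just shows the standard technique on a toy packet with made-up ports and payload.

```python
import struct


def fold(total: int) -> int:
    """Fold carries above 16 bits back in (one's complement addition)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum over the given bytes."""
    if len(data) % 2:
        data += b"\x00"
    words = struct.unpack(f"!{len(data) // 2}H", data)
    return ~fold(sum(words)) & 0xFFFF


def incremental_update(old_checksum: int, old_word: int, new_word: int) -> int:
    """RFC 1624, equation 3: HC' = ~(~HC + ~m + m')."""
    total = (~old_checksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    return ~fold(total) & 0xFFFF


# Toy "packet": a 16-bit port followed by a payload. All values are invented.
OLD_PORT, NEW_PORT = 50000, 40000
payload = b"hello, S3!"
before = struct.pack("!H", OLD_PORT) + payload
after = struct.pack("!H", NEW_PORT) + payload

old_checksum = internet_checksum(before)
patched = incremental_update(old_checksum, OLD_PORT, NEW_PORT)
recomputed = internet_checksum(after)

print(hex(patched), hex(recomputed))
```

The patched and recomputed values agree, which is what a correct implementation should produce for every input, including the ones that happen to end up at 0xd7a7.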

And that's how a dodgy cable modem caused downloads to fail with one service, but one service only. I've seen many kinds of packet corruption before, but never anything that was so absurdly specific.

Footnotes

[0] There are techniques around for routing the traffic such that we would have had a measurement point. One would have been using something like a VPN or a Socks proxy. But that's such a fundamental change to the traffic pattern that it doesn't make for a very interesting test. Odds are that the problem would just go away when you do that. The other option would be to use a fully transparent generic TCP proxy on some server with a public IP, have the client connect to the TCP proxy and the proxy connect to the actual server. But setting that up is tedious; certainly not worth doing as a first step.

It's also pretty common to only have one trace point to start with. For analysis I'd do for actual work purposes, we pretty often have just a trace from somewhere in the middle of the path, but nothing from the client or the server. Getting traces from multiple points is so much trouble that we usually need to roughly pinpoint the problem first with single-point packet capture, and only then ask for more trace points.

[1] As far as I can tell 0xd7a7 has no interesting special properties. The bytes are not printable ASCII characters. 0xd7a7 isn't a value with any special significance in another TCP header field either. There are ways to screw up TCP checksum computations, but I think they're mostly to do with the way 0x0 and 0xffff are both zero values in a one's complement system.

[2] Assuming sensible timestamp resolution. Not the rather impractical 500ms tick that e.g. OpenBSD uses.

I don't want no 'wantarray'

jsnell@iki.fi — Tue, 18 Jul 2017 18:00:00 GMT

A while back, I got a bug report for json-to-multicsv. The user was getting the following error for any input file, including the one used as an example in the documentation:

    , or } expected while parsing object/hash, at character offset 2 (before "n")

The full facts of the matter were:

The JSON parser was failing on the third character of the file.
That was also the end of the first line in the file. (I.e. the first line of the JSON file contained just the opening bracket).
The user was running it on Windows.
The same input file worked fine for me on Linux.

Now, there's an obvious root cause here. It's almost impossible not to blame this on Windows using CR-LF line endings, where Unix uses just LF. The pattern match is irresistible: works on Linux, fails on Windows, fails at the end of the first line. And I almost answered the email based on this assumption.

Except... Something feels off with that theory. What would be the root cause here? "Wow, I can't believe that the JSON spec missed specifying the CR as whitespace"? No, that makes no sense, nobody would define a text-based file format that sloppily. [0]

How about: "Wow, I can't believe the JSON module of a major programming language has a bug making it fail on all inputs on a major operating system, and it took a decade for anyone to notice". That doesn't seem plausible either.

So I tried to reproduce the problem, by making a file with DOS line endings and running it through the script on Linux. That worked fine. Hm. Put in some invalid garbage, and you get a parser error as expected. Double-hm. But the error message I got was very different from that in the bug report. Could it be that it's using a totally different JSON module altogether?

Turns out that's basically what was going on. Perl's JSON module doesn't actually do any parsing itself. It's mostly a shim layer, the actual work is done by one of several different parser modules. On Linux, I'd been getting JSON::XS as the backend (XS is Perl-talk for "native code"). In cases where JSON::XS is not available, the shim module would use a pure Perl fallback, e.g. JSON::PP.

Ok, so force the JSON module to dispatch to JSON::PP. Success! Problem reproduced. Guess it really was a buggy parser after all. Remove the DOS line endings, just to be sure... And it's still failing. WTF?

A bit more digging revealed that the error message was actually a lie. The problem wasn't with the whitespace, but with there being an end of file right after said whitespace. The input to JSON::PP contained just a single line, not the whole file! At that point, the actual problem becomes obvious and the fix trivial:

-    my $json = decode_json read_file $file;
+    my $json = decode_json scalar read_file $file;

I was using the read_file function from File::Slurp to read the contents of the file. Unfortunately that function behaves differently in scalar and list contexts. In scalar context, it returns the contents of the file in a single string. In list context, an array of strings. What had to be happening was that the context was changing based on the backend.

And just why would changing the parser backend change the context for that read_file call? As it happens, the JSON module does not actually define decode_json, but directly aliases to the matching function in the backend. For example:

*{"JSON::decode_json"} = \&{"JSON::XS::decode_json"};

JSON::XS declares the function with a $ prototype forcing the argument to be evaluated in scalar context. JSON::PP uses no prototype and thus the arguments defaulted to being evaluated in list context.
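Python has no notion of calling context, so the following is only an analogy for the symptom, not a port of the Perl code. It shows why a JSON parser that is handed just the first line of a multi-line document fails right after the opening brace. The document contents and function names are invented for the example.

```python
import json

# A small multi-line document whose first line is just the opening brace.
DOCUMENT = '{\n  "name": "example",\n  "items": [1, 2, 3]\n}\n'


def parse_whole_file(text: str):
    """Analogous to read_file in scalar context: one string, whole file."""
    return json.loads(text)


def parse_first_line_only(text: str):
    """Analogous to read_file in list context feeding a one-argument parser."""
    lines = text.splitlines(keepends=True)
    return json.loads(lines[0])  # only the first element ever gets parsed


print(parse_whole_file(DOCUMENT))

try:
    parse_first_line_only(DOCUMENT)
except json.JSONDecodeError as err:
    # The parser runs out of input right after "{" and the newline.
    print("first line only:", err)
```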

The blame game

So, that's the bug. But what was the real culprit? I could come up with the following suspects.

Me, for using File::Slurp for this in the first place. "Oh, I just always pass a file-handle to decode_json" said one coworker when I described this bug. And that would indeed have side-stepped the problem, and read_file is just saving a couple of lines of code. But it's exactly the couple of lines of code I don't want to be writing: pairing up file opens/closes, and boilerplate error handling.
Me, for not realizing that the code was only working by accident. I knew read_file works differently in scalar and list contexts. I also knew this case needed scalar context, and had no special reason to believe that decode_json would provide it. The default assumption should have been for this code not to work. When it did, I should not have accepted it, but figured out why it worked and whether it was guaranteed to work in the future.
The JSON module, for not explicitly documenting the inconsistent prototypes as part of the interface. I don't know that anyone would actually notice that in the documentation though. It might end up as just cover-your-ass documentation.
The JSON module, for directly exposing the backend functions with aliasing, for a minimal performance gain. It's a shim: isn't the whole point to hide away the implementation differences from the user?
The File::Slurp module, for using wantarray to switch behavior of read_file based on the context.
Perl for having the concept of different contexts in the first place.
Perl for allowing random library code to detect different contexts via wantarray.

The thing that really sticks out to me here is overloading of File::Slurp::read_file based on the context. Returning a file as a single string vs. an array of lines are very different operations. There is absolutely no reason for them to share a name. It'd be simpler to implement, simpler to use, and simpler to document. It's even already in a library, so it's not like there would be any kind of namespace pollution by using different names. (Unlike for the uses of context-sensitive overloading in core Perl. Sure, count would probably make more sense than scalar grep. But it would be a new name in the global namespace).

What about wantarray? It's what's enabling this bogus overloading in the first place. I've been using Perl for 20 years, writing some pretty hairy stuff. As far as I can remember, I haven't used wantarray once. And what's more, I don't remember ever using a library that used it to good effect. The reason context-sensitivity works in core Perl is the limited set of operations. One can reasonably learn the entire set of context-sensitive operations, and their (sometimes surprising) behavior. It's a lot less reasonable to expect people to learn this for arbitrary amounts of user code.

It's a bit unfortunate that function aliasing can cause action at a distance like this. But at least that's a feature with solid use cases.

So I think that's where I fall on this. It's all because of a horrible and mostly unnecessary language feature, used to particularly bad effect in a library. It feels like avoiding this kind of problem on the consumer side is almost impossible; it'd just require superhuman levels of attention to detail. Avoiding it on the producer side is really easy: wantarray: just say no.

Footnotes

[0] Did you nod and agree at "that makes no sense"? Haha. The original JSON spec does say that "whitespace can be inserted between any two tokens", but doesn't actually define whitespace.

The origins of XXX as FIXME

jsnell@iki.fi — Mon, 17 Apr 2017 18:00:00 GMT

The token XXX is frequently used in source code comments as a way of marking some code as needing attention. (Similar to a FIXME or TODO, though at least to me XXX signals something far to the hacky end of the spectrum, and perhaps even outright broken).

It's a bit of an odd and non-obvious string though, unlike FIXME and TODO. Where did this convention come from? I did a little bit of light software archaeology to try to find out. To start with, my guesses in order were:

MIT (since it sometimes feels like that's the source of 90% of ancient hacker shibboleths)
Early Unix (probably the most influential codebase that's ever existed)
Some kind of DEC thing (because really, all the world was a PDP)

Other uses of `XXX`

It turns out that XXX and xxx are incredibly annoying things to search for in old code. I'd bet it's the most common sequence of 3+ identical letters in source code. That means there's a ton of false positives to sift through. Here's a few examples of the kind of stuff that will be found.

By far the most common use of XXX in old code is for it to be some kind of a template placeholder. This makes some sense; x for an unknown value has an obvious long history that predates computing. These templates might be used to describe the exact data layout of something, like in the following bits from the Apollo guidance computer:

# 17    ASTRONAUT TOTAL ATTITUDE      3COMP   XXX.XX DEG FOR EACH
# 18    AUTO MANEUVER BALL ANGLES     3COMP   XXX.XX DEG FOR EACH
# 19    BYPASS ATTITUDE TRIM MANEUVER 3COMP   XXX.XX DEG FOR EACH
# 20    ICDU ANGLES                   3COMP   XXX.XX DEG FOR EACH
# 21    PIPAS                         3COMP   XXXXX. PULSES FOR EACH
# 22    NEW ICDU ANGLES               3COMP   XXX.XX DEG FOR EACH
# 23    SPARE
# 24    DELTA TIME FOR AGC CLOCK      3COMP   00XXX. HRS. DEC ONLY

Or as just a wildcard for a bunch of related names, like in this Lisp Machine source code:

;Q-FASL-xxxx refers to functions which load into the cold load, and
; return a "Q", i.e. a list of data-type and address-expression.
;M-FASL-xxxx refers to functions which load into Maclisp, and
; return a Lisp object.

Or as actual templates-as-program, with parts of an input remaining while others (those marked with XXX) are programmatically replaced. For example, temporary file generation in UNIXv5:

                f = ranname("/usr/lpd/dfxxx");

And finally, it could denote parts of persistent data structures that were reserved for future use (or no longer used), for example in CPM:

/* THE FILE CONTROL BLOCK FORMAT IS SH0WN BELOW:
   --------------------------------------------------------
   /    1 BY / 8 BY / 3 BY / 1 BY /2BY/1 BY/ 16 BY /
   /F1LETYPE/   NAME / EXT / REEL NO/XXX/RCNT/DM0 DM15/
   --------------------------------------------------------

   FILETYPE     :       0E5H IF AVAILABLE (OTHERWISE UNDEFINED NOW)
...
   XXX          :       UNUSED FOR NOW
   RCNT         :       RECORD COUNT IN FILE (0 TO , 127)

A less savoury use of XXX is as an identifier for something that didn't even qualify to have a real name. Most commonly it'd be the name of a branch target, like in a very early version of the C compiler:

    xxx:
        if (o==KEYW) {
                if (cval==EXTERN) {
                        o = symbol();
                        goto xxx;
                }

It could also be used to name variables. The following is from the FORTRAN II compiler for the IBM 704 from 1958. (I don't read 704 assembler, so maybe I'm misinterpreting what's going on in that program. It seems funny enough that I wanted to include it here anyway).

XXXXXX SYN 0  THE APPEARANCE OF THIS SYMBOL IN   F4400370
       REM       THE LISTING INDICATES THAT ITS  F4400380
       REM       VALUE IS SET BY THE PROGRAM.    F4400390

Some DEC code seems to have gone really overboard with this, with single source files having half a dozen different XXXYYY identifiers. (Sorry, had to use YYY as the placeholder there for obvious reasons).

Finally, there are all kinds of bizarre one-off uses. TENEX seems to have used XXX for implementing rubout. That is, when you'd press backspace to delete something you've typed, it'd print out XXX on the teletype to mark the deletion. (Rather than try to move the cursor back). Some kind of proto-instant messaging program from 1976 written in Interlisp that I found would just print XXX as the error message for invalid user input.

Now, sorry if the above parts were kind of tedious. But there is actually a point here. Turns out that XXX is a really stupid marker to use for a FIXME. Looking at the Panda TOPS-20 distribution, there are 3083 instances of XXX, none of which are FIXMEs. Just about anything else would be easier to find. This makes its use as one of the three main FIXME-markers all the more puzzling.

`XXX` as a `FIXME`

To get the negative results out of the way, there is absolutely no sign of this being an MIT or DEC thing. XXX as FIXME doesn't appear on ITS or TOPS-20 disks, nor does it appear in any of the mountains of really old Lisp code that I happened to have around; I don't think it makes it to Lisp-land until the mid-'80s. It's also absent in smaller collections of old code from other sources.

No, this seems to definitely be a Unix thing. There are a couple of interesting possibilities in early BSD. First, there's the following lines in a package of troff macros that first appeared in 2BSD, with a copyright date of 1978:

..
.de (t                 \" XXX temp ref to (z
.(z \\$1 \\$2
..
.de )t                 \" XXX temp ref to )t
.)z \\$1 \\$2

I'm pretty sure these are not actually a FIXME. It looks like the convention in this code was to mark .de commands with three character tags depending on their type, as explained in the beginning of the file:

+.\"	Code on .de commands:
+.\"		***	a user interface macro.
+.\"		&&&	a user interface macro which is redefined
+.\"			when used to be the real thing.
+.\"		$$$	a macro which may be redefined by the user
+.\"			to provide variant functions.
+.\"		---	an internal macro.

These lines seem to have been commands that didn't fit into those existing categories, and needed a new tag.

Next up, there's a bunch of very promising looking changes to the troff C source in the summer of 1980. Stuff like:

if(j == ' '){
        storeword(i,width(i));  /* XXX */
        continue;
}

That certainly looks like a classic FIXME. But I think this is another dead end. It turns out that after this change there are 37 /* XXX */ comments in code that didn't use to have any. And when comparing to Unix v7 source code, it looks like basically every single line that was changed got marked with one. So it's unlikely that these are actual FIXMEs. I think this was just the author making sure they could identify their changes, in case they wanted to reintegrate with "upstream".

Soon after that BSD moves to SCCS, and we start getting fine-grained changes rather than huge code-dumps. From there, it's easy to find the first /* XXX */ commit from Nov 9, 1981. This one is interesting in a few ways:

This is definitely a FIXME; just a few very special parts of the code got tagged, and many of them got rewritten soon after.
After this commit, the use of /* XXX */ starts spreading quickly through the BSD codebase and eventually to other authors.
A closer reading of the commit shows something interesting: a bunch of /* ### */ comments. Going through the earlier history, it seems that Bill Joy had been marking his FIXMEs with ###, and halfway through this commit changed to using XXX. I don't know why, or whether these two markers were intended to have slightly different semantics (like ### was code that needed to be fixed, XXX was code that was commented out and needed to be fixed and re-enabled). But XXX quickly became the preferred form.

                        if (rcv_empty(tp)) {                    /* 16 */
-                               tcp_close(tp, UCLOSED);
+                               sowakeup(tp->t_socket); /* ### */
+/* XXX */                      /* tcp_close(tp, UCLOSED); */
                                nstate = CLOSED;
                        } else

(On a personal note, as someone who goes out of their way to read through any published TCP stacks, I'm kind of amused that a search for random historical trivia led me to a damn TCP stack.)

Leaving it at that seems like a good story. And I'd already checked basically all of the Bell Labs code that I could find. It's not in Unix v2-v7 and not in the Programmer's Workbench. But then I decided to check Unix v1 just for completeness' sake, and got very confused. Because...

/ XXX fix me, I dont quite understand what to do here or
/ what is done in the similar code below e407:
/ cmp   r5, u.count / see if theres enough room
/ bgt   1f
mov     r5,u.count / read text+data into core

WTF? It doesn't get any clearer than that. But where did it come from? And if this convention was used at Bell Labs in 1970, where did XXX disappear for a decade?

Turns out this was a false alarm. The only reason we have the Unix v1 source code in the first place is that a team of people transcribed the source from PDF scans to text. Then they went on to make it possible to compile the code and run it in an emulator. As part of this latter work, a block of code was added to the source. And a bit unfortunately it was this patched version rather than the "original" that made it to the Unix History Repo. This comment was actually from 2008, not 1971.

There's actually an interesting story behind that extra block of code, as told by Toomey. After finally getting the v1 kernel transcribed, compiled, and running, they hit the problem of only having two userland programs available: init and sh. Everything else was using a more recent executable header. To be able to do anything at all with the system, they needed to add support for "0407 binaries" as opposed to the "0405" ones the kernel supported natively.

What about C code outside of Unix distributions? It's actually kind of hard to find any of that from before 1982. There might be an earlier instance in Gosling Emacs, though it differs from the modern form by going for a full 9 Xs:

#ifdef HalfBaked
/*    sigset (SIGINT, InterruptKey); *//*XXXXXXXXX*/
    sigset (SIGINT, InterruptKey);/*XXXXXXXXX*/
#endif

And there's a Changelog entry from July 1981, which seems to match up perfectly with both the functionality of the code, and the surrounding ifdef:

Tue Jul  7 12:51:44 1981  James Gosling  (jag at VLSI-Vax)
        ... I also installed Dave
        Dyer's hack to allow ^G's to interrupt execution immediatly.  This
        has a rather major bug, and is the reason that I didn't implement
        it a long time ago: if you type ^G while Emacs is doing output,
        then all queued-but-not-printed characters get lost and Emacs no
        longer has any idea of what the screen looks like. It is pretty
        much impossible for Emacs to tell whether or not this has
        happened. You end up having to type ^L now and then.  The
        "HalfBaked" switch in config.h controls the compilation of this
        facility, ...

But thankfully this code has RCS history starting from 1986, and somebody did in fact edit this code in 1986 with no functional changes, but adding the commented out copy and the XXXXXXXXX:

 #ifdef HalfBaked
-    sigset (SIGINT, InterruptKey);
+/*    sigset (SIGINT, InterruptKey); *//*XXXXXXXXX*/
+    sigset (SIGINT, InterruptKey);/*XXXXXXXXX*/
 #endif

And those are the only signs of XXX in applications that could predate the BSD usage. Both were red herrings, caused by how difficult it is to actually find pristine copies of source code that old. It was very lucky that the Gosling Emacs comment was added after the code was put into RCS, and not in the five year interval between the original commit and the project starting to use RCS.

So it seems likely that this convention was invented by Bill Joy in BSD. If he wasn't the first one, he was certainly the one that popularized it. Why he chose to switch to the rather inconvenient XXX from ### is unclear.

If you can find an earlier occurrence (or know of good collections of pre-1981 C source code), please let me know and I'll update the post.

Computing multiple hash values in parallel with AVX2

jsnell@iki.fi — Sun, 19 Mar 2017 12:00:00 GMT

I wanted to compute some hash values in a very particular way, and couldn't find any existing implementations. The special circumstances were:

The keys are short (not sure exactly what size they'll end up, but almost certainly in the 12-40 byte range).
The keys are all of the same length.
I know the length at compile time.
I have a batch of keys to process at once.

Given the above constraints, it seems obvious that doing multiple keys in a batch with SIMD could speed things up over computing each one individually. Now, typically small data sizes aren't a good sign for SIMD. But that's not the case here, since the core problem parallelizes so neatly.

After a couple of false starts, I ended up with a version of xxHash32 that computes hash values for 8 keys at the same time using AVX2. The code is at parallel-xxhash.

Benchmarks

Before heading off into the weeds with the details, below are a couple of pretty graphs showing performance with different key sizes for a few different implementations: CityHash64 since it's been my default hash function for years, xxHash64 since my parallel implementation was based on xxHash32, and MetroHash64 since I saw people suggesting it was the fastest option for small keys. I did not include FarmHash since it was consistently slower than CityHash for all key sizes.

Finally, to isolate the benefits of specializing for the statically known key sizes, I've included a scalar version of xxHash32. It has exactly the same structure as the parallel version, except for not using SIMD [0].

All implementations computed hashes for the same number of keys; the parallel implementations did it 8 keys at a time, the others did them sequentially. The tests were run on an i7-6700 and GCC 6.3.0, with -O3 -march=native -fno-strict-aliasing. The benchmark code is in the repository, but you'll need to bring your own copies of the external hash table libraries.

First, let's look at the time taken per key for key sizes relevant to my use case (this graph is 4-72 bytes, but as mentioned before the most interesting range for me is around 12-40 bytes):

That looks pretty nice, with very significant speedups compared to the alternatives on all the key sizes. With larger key sizes the parallel Murmur3 (my first try) quickly runs out of steam, but the parallel xxHash32 stayed ahead of the pack. We'll switch to showing time per byte rather than time per key here.

And at 512 bytes or so, the time per byte has flattened out completely:

Don't look under the rug

So what are the downsides? Why wouldn't everyone use this?

The most glaring problem is that most applications don't do hash computations in parallel. Either it's going to be fundamentally impossible, or at least it will require a major restructuring.

Second, I've swept a small detail under the rug: the parallel implementations were using column-major order for the data. It's the natural way to structure this. The timings above do not include a row-major to column-major conversion step. That's because my application was already using column-major anyway. But if that weren't the case, it's totally possible that the conversion step would wipe away a good chunk of the gains. (What about scatter-gather? See below).
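For readers who haven't met the two layouts, here's a minimal sketch of the conversion step that the timings leave out. It assumes keys made of 32-bit words and a batch of 8; the names kLanes and to_column_major are made up for illustration and are not taken from parallel-xxhash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of keys per batch; 8 matches the eight 32-bit lanes of a 256-bit register.
constexpr std::size_t kLanes = 8;

// Row-major input: the 8 keys are stored back to back, `words` 32-bit words each,
// so word w of key k lives at rows[k * words + w].
// Column-major output: word w of all 8 keys is contiguous, at cols[w * kLanes + k],
// which is the shape a single 256-bit load wants.
std::vector<std::uint32_t> to_column_major(const std::vector<std::uint32_t>& rows,
                                           std::size_t words) {
    assert(rows.size() == kLanes * words);  // exactly one batch of fixed-size keys
    std::vector<std::uint32_t> cols(rows.size());
    for (std::size_t k = 0; k < kLanes; ++k)
        for (std::size_t w = 0; w < words; ++w)
            cols[w * kLanes + k] = rows[k * words + w];
    return cols;
}
```

Every word gets touched once just to move it, which is why this step could plausibly eat into the gains for short keys.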

Third, I suspect that most uses of hash tables use strings as keys. This code will not work at all in that use case. Not only do the sizes of keys have to be statically known, but (another detail I skimmed over above) they also need to be a multiple of 4 bytes long. Basically, I want to use structures as hash keys; not sure how many other people also need that.

And fourth, the parallel implementations were using the 32-bit variants of the algorithms due to reasons that I'll explain later. That does not make the benchmarks unfair (the 64-bit versions are faster than the 32-bit ones). But some applications will need those extra bits in the hash value. This code can't provide them.

So while this should work fine for me (though that still remains to be seen), it might not be a very large ecological niche.

What's interesting about this?

Converting from the scalar version to the parallel version is a fairly mindless process, not many insights to be had in that part. But while doing this, I bumped into some interesting aspects on the periphery.

Rotates

All the fast and high quality hash functions I looked at seemed to be descendants of Murmur, and used rotates as their primitive of choice for moving bits down. This is most likely because x86 has a dedicated rotate instruction, while most other methods require two instructions, e.g. shift+xor. For AVX that's not the case, and you need to synthesize the rotate from two shifts and a xor/or.
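To make the "two shifts and an or" concrete, here's a scalar sketch of the synthesized rotate, applied lane by lane to 8 values. The function names are made up for illustration; a SIMD version would do the same three operations on a whole vector at once.

```cpp
#include <cassert>
#include <cstdint>

// Rotate left built from two shifts and an OR: the bits shifted out at the top
// are brought back in at the bottom.
inline std::uint32_t rotl32(std::uint32_t x, unsigned r) {
    assert(r > 0 && r < 32);  // r == 0 or r == 32 would need a full-width shift, which is undefined
    return (x << r) | (x >> (32 - r));
}

// The same rotate applied independently to each of 8 lanes.
inline void rotl32_x8(std::uint32_t (&v)[8], unsigned r) {
    for (std::uint32_t& lane : v) lane = rotl32(lane, r);
}
```

Since the two shifted halves never overlap, xor, or and add all give the same result for the final combining step.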

Based on some quick testing, a single-instruction replacement could give a 40% speedup, and a two instruction replacement a 20% speedup. There's not a huge number of single instruction options available though: horizontal 16-bit addition/subtraction, or the 8-bit shuffles. I suspect neither would work very well due to the effects aligning at an 8 bit boundary. With two instructions a shift+xor is probably the best option. Would be interesting to see if the best speed/quality tradeoff is different for AVX than for x86.

Multiplies

These days new hash functions are mostly built with 64*64->64 multiplies. We won't have that in SIMD until AVX-512 (and given the way things are going, I wonder if a general purpose CPU using AVX-512 will actually ever launch). Synthesizing a 64-bit multiply from 32-bit multiplies doesn't seem viable for this use case. So here we really want to look at the hash functions defined a few years ago rather than the latest hotness.
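As a rough illustration of why the synthesized version is unattractive, here's a scalar sketch of a 64*64->64 multiply built only from 32*32->64 multiplies. The function names are made up; a SIMD version would need the same three multiplies plus the shifts and adds for every vector, in place of a single instruction.

```cpp
#include <cassert>
#include <cstdint>

// Low 64 bits of a * b, computed from 32-bit halves:
//   a * b = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 32) + ((a_hi*b_hi) << 64)
// The last term falls entirely outside the low 64 bits and is dropped.
inline std::uint64_t mul64_from_32(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t a_lo = a & 0xFFFFFFFFull, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xFFFFFFFFull, b_hi = b >> 32;
    const std::uint64_t low = a_lo * b_lo;                 // full 64-bit product
    const std::uint64_t cross = a_lo * b_hi + a_hi * b_lo; // only its low 32 bits survive the shift
    return low + (cross << 32);
}

// Sanity check against the native 64-bit multiply.
inline void check_mul64(std::uint64_t a, std::uint64_t b) {
    assert(mul64_from_32(a, b) == a * b);
}
```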

Memory layout

Like I mentioned earlier, my data is already in column-major order so I didn't need to worry about wrangling that. But at one point I thought that it'd be nice to provide an alternate version that would work on row-major data. That's what scatter-gather is for, right?

Nope, the gather instructions are just unbelievably slow, and additionally for some reason prevented compilers from unrolling the Murmur3 loop, for a 4x performance loss. (Even on GCC 6.3 and clang 3.8). In theory the xxHash inner loop should be better for the gather instructions, since at least there you're not depending on the compiler unrolling to get multiple parallel loads going. But the results there were only marginally less bad.

Auto-vectorization

After having written the version using intrinsics, it occurred to me that I really should have started off with just writing out plain C++ with the same semantics, and see if it auto-vectorizes. Because this really looks like it should be a very easy case. And while the transformation from a scalar version to using intrinsics is not too bad, the transformation to standard C++ expressing the same order of operations on the same memory layout is easier yet. The theoretically auto-vectorizable code certainly looks very pretty compared to the AVX intrinsic soup.
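To show what "plain C++ with the same semantics" could look like, here's a sketch: a scalar xxHash32 specialized for a fixed number of 32-bit words, and a batched version that applies each step to all 8 lanes of column-major data before moving on. This is not the code from parallel-xxhash. It's written from my reading of the xxHash32 algorithm, the names are made up, and it should be checked against the reference implementation before being relied on. Whether a given compiler actually vectorizes the inner loops is a separate question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 8;
constexpr std::uint32_t P1 = 2654435761u, P2 = 2246822519u, P3 = 3266489917u,
                        P4 = 668265263u, P5 = 374761393u;

inline std::uint32_t rotl(std::uint32_t x, unsigned r) { return (x << r) | (x >> (32 - r)); }

inline std::uint32_t avalanche(std::uint32_t h) {
    h ^= h >> 15; h *= P2; h ^= h >> 13; h *= P3; h ^= h >> 16;
    return h;
}

// Scalar version: one key of exactly `Words` 32-bit words.
template <std::size_t Words>
std::uint32_t xxhash32_fixed(const std::uint32_t* key, std::uint32_t seed) {
    std::size_t w = 0;
    std::uint32_t h = seed + P5;
    if (Words >= 4) {
        std::uint32_t v[4] = {seed + P1 + P2, seed + P2, seed, seed - P1};
        for (; w + 4 <= Words; w += 4)
            for (std::size_t i = 0; i < 4; ++i)
                v[i] = rotl(v[i] + key[w + i] * P2, 13) * P1;
        h = rotl(v[0], 1) + rotl(v[1], 7) + rotl(v[2], 12) + rotl(v[3], 18);
    }
    h += static_cast<std::uint32_t>(Words * 4);  // mix in the key length in bytes
    for (; w < Words; ++w) h = rotl(h + key[w] * P3, 17) * P4;
    return avalanche(h);
}

// Batched version: keys[w * kLanes + k] is word w of key k (column-major).
// Every innermost loop runs over the 8 lanes with no dependency between them.
template <std::size_t Words>
void xxhash32_x8(const std::uint32_t* keys, std::uint32_t seed, std::uint32_t* out) {
    assert(keys != nullptr && out != nullptr);
    std::size_t w = 0;
    std::uint32_t h[kLanes];
    for (std::size_t k = 0; k < kLanes; ++k) h[k] = seed + P5;
    if (Words >= 4) {
        std::uint32_t v[4][kLanes];
        for (std::size_t k = 0; k < kLanes; ++k) {
            v[0][k] = seed + P1 + P2; v[1][k] = seed + P2;
            v[2][k] = seed;           v[3][k] = seed - P1;
        }
        for (; w + 4 <= Words; w += 4)
            for (std::size_t i = 0; i < 4; ++i)
                for (std::size_t k = 0; k < kLanes; ++k)
                    v[i][k] = rotl(v[i][k] + keys[(w + i) * kLanes + k] * P2, 13) * P1;
        for (std::size_t k = 0; k < kLanes; ++k)
            h[k] = rotl(v[0][k], 1) + rotl(v[1][k], 7) + rotl(v[2][k], 12) + rotl(v[3][k], 18);
    }
    for (std::size_t k = 0; k < kLanes; ++k) h[k] += static_cast<std::uint32_t>(Words * 4);
    for (; w < Words; ++w)
        for (std::size_t k = 0; k < kLanes; ++k)
            h[k] = rotl(h[k] + keys[w * kLanes + k] * P3, 17) * P4;
    for (std::size_t k = 0; k < kLanes; ++k) out[k] = avalanche(h[k]);
}
```

The batched function should produce, lane for lane, the same values as calling the scalar one on each key, which makes the scalar version a handy test oracle.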

But ignoring aesthetics, the results were mixed. GCC 6.3 seemed to vectorize everything perfectly. GCC 4.9 [1] missed something (I didn't track down exactly what) that cost about 25% performance. And Clang 3.8 did nothing at all, with the plain-C++ version being 150% slower than the version using intrinsics. So still a bit on the fragile side. But this is the best showing for auto-vectorization that I've experienced so far.

(The GCC 4.9 case is particularly annoying; it would have been easy to write the auto-vectorizable version first, see the speedups and think auto-vectorization was working, but miss that it was still leaving a lot of performance on the table).

32-bit output values

The other advantage of 64-bit operations is that the natural implementation will end up producing a 64-bit hash value. Now, for normal hash tables I'm totally OK with a single 32-bit hash value. But there's some use cases like Cuckoo hash tables or Bloom Filters where one would really like more key material.

Before moving from Murmur3 to xxHash, I experimented a bit with a version that would not only compute results for multiple different keys at once, but also do it with multiple different seed values. It was actually pretty efficient. I didn't end up redoing that work for the xxHash version though. Primarily since I don't actually need that version right now, and secondarily since I'm actually not sure whether the different seed values will give different enough outputs for use in probabilistic data structures.

(If anyone knows for sure whether the last bit is true or not, please let me know).

Is there a faster non-parallel hash function here?

As mentioned multiple times, computing multiple keys in parallel is a very niche use case. But based on the benchmark graphs for large key sizes, I wonder if there's a decent non-parallel hash function hidden here: compute the 8 32-bit streams in parallel and combine them at the end (or at certain block boundaries). After all, that's already what xxHash does on a smaller scale.

This seems like something that people would already have explored in the quest for faster and faster hashing for large key sizes. But I can't find any trace of such an implementation. Maybe everyone had already moved to 64-bit multiplies by the time AVX2 started to be widely deployed and 32-bit multiplies became the faster option again. Or maybe 32-bit hash values for large key sizes aren't actually a useful point in the design space.

Designing hash functions is hard. I explicitly did not want to invent a new one here, but just re-implement existing algorithms. I even went so far as to add in the "mix in the length of the key" steps, just so that I could verify my code against the reference implementations. Sure, it's a useless step given the length is constant. But it doesn't cost that much to do either, and lets me not worry about accidentally destroying the hash quality.

But if I wanted to burn some brain cycles on designing one and a lot of CPU cycles on running SMHasher... 32-bit multiplies + shift-xor, working 64 bytes at a time, and code organized in a way that makes it easy to auto-vectorize could be a pretty interesting place to start from.

Footnotes

[0] Note that I tried to make sure to isolate this to specializing for the key size, not to e.g. be able to hoist any computations outside the benchmark loop. AFAIK all implementations went through the same number of non-inlined function calls.

[1] Yes, it's a couple of years old. But that's Debian stable for you. And to be honest, a year ago our main compiler at work was still GCC 4.4. Compared to that, 4.9 feels pretty darn luxurious.
