Juho Snellman's Weblog

Writing a procedural puzzle generator

jsnell@iki.fi — Tue, 14 May 2019 15:00:00 GMT

This blog post describes the level generator for my puzzle game Linjat. The post is standalone, but might be a bit easier to digest if you play through a few levels. The source code is available; anything discussed below is in src/main.cc.

A rough outline of this post:

Linjat is a logic game of covering all the numbers and dots on a grid with lines.
The puzzles are procedurally generated by a combination of a solver, a generator, and an optimizer.
The solver tries to solve puzzles the way a human would, and assigns a score for how interesting a given puzzle is.
The puzzle generator is designed such that it's easy to change one part of the puzzle (the numbers) and have other parts of the puzzle (the dots) get re-organized such that the puzzle remains solvable.
A puzzle optimizer repeatedly solves levels and generates new variations from the most interesting ones that have been found so far.

The rules

To understand how the level generator works, you unfortunately must first know the rules of the game. Luckily the rules are very simple. The puzzle consists of a grid containing empty squares, numbers, and dots. Like this:

The goal is to draw a vertical or horizontal line through each of the numbers, with three constraints:

The line going through a number must be of the same length as the number.
The lines can't cross.
All the dots need to be covered by a line.

Like this:

Whee! The game is all designed, the UI is implemented, now all I need are a few hundred good puzzles, and we're good to go. And for a game like this, there's really no point in trying to make those puzzles by hand. That's a job for a computer.

Requirements

What makes for a good puzzle in this game? I tend to think of puzzle games as coming in two categories. There's the ones where you're exploring a complicated state space from the start to the end (something like Sokoban or Rush Hour), and where it might not even be obvious exactly what states exist in the game. Then there are ones where all the states are known at the start, and you're slowly whittling the state space down by process of elimination (e.g. Sudoku or Picross). This game is clearly in the latter category.

Now, players have very different expectations for these two different kinds of puzzles. For this latter kind there's a very strong expectation that the puzzle is solvable just with deduction, and that there should never be a need for backtracking / guessing / trial and error. [0] [1]

It's not enough to know if a puzzle can be solved with just logic. In addition to that we need to have some idea of how good the produced puzzles are. Otherwise most of the levels might be just trivial dross. In an ideal world this could also be used for building a smooth progression curve, where the levels get progressively harder as the player progresses through the game.

The solver

The first step to meeting the above requirements is a solver for the game that's optimized for this purpose. A backtracking brute-force solver will be fast and accurate at telling whether the puzzle is solvable, and could also be changed to determine whether the solution is unique. But it can't give any idea of how challenging the puzzle actually is, since that's not how a human would solve these puzzles. The solver needs to imitate humans.

How does a human solve this puzzle? There's a couple of obvious moves, which the tutorial teaches:

If a dot can only be reached from one number, the line from that number should be extended to cover the dot. Here the dot can only be reached from the three, not the four:

Leading to:
If the line doesn't fit in one orientation, it must be placed in the other orientation instead. In the above example the 4 can no longer be placed vertically, so we know it has to be horizontal. Like this:
If a line of size X is known to be in a certain orientation and there isn't enough space to fit a line of X spaces on both sides, some of the squares in the middle must be covered. For example if in the above example the "4" had been a "3" instead, we wouldn't know whether it extended all the way to the right or to the left of the board. But we would know it must cover the two middle squares:

This kind of thinking is the meat and potatoes of the game. You figure a way to extend one line a little bit, make that move, and then inspect the board again since that hopefully gave you the information to make a new deduction elsewhere. Writing a solver that follows these rules would be enough to determine if a human could solve the puzzle without backtracking.

It doesn't really say anything about how hard or interesting the level is though. In addition to the solvability, we need to somehow quantify the difficulty.

The obvious first idea for a scoring function is that a puzzle that takes more moves to finish is the harder one. That's probably a good metric in other games, but in this one the number of valid moves that the player has at any one time is probably more important. If there are 10 possible deductions a player could make, they'll find one of those very quickly. If there's only one valid move, it'll take longer.

So as a first approximation you want the solution tree to be deep and narrow: there's a long dependency chain of moves from start to finish, and at any one time there are only a few ways of moving forward on the chain. [2]

How do you figure out the width and depth of the tree? Just solving the puzzle once and evaluating the produced tree doesn't give a precise answer. The exact order in which you make the moves will end up affecting the shape of the tree. You'd need to look at all the possible solutions, and do something like optimize for the best worst-case. Now, I'm no stranger to brute-forcing puzzle game search graphs, but for this project I wanted a single-pass solver rather than any kind of exhaustive search. Due to the optimization phase, the goal was for the solver runtime to be measured in microseconds rather than seconds.

I decided not to do that. Instead my solver doesn't actually make one move at a time, but solves the puzzle by layers: given a state, find all valid moves that could be made. Then apply all of those moves at once. Finally start over from the new state. The number of layers and the maximum number of moves ever found in a single layer are then used as proxies for the depth and the width of the search tree as a whole.
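The layer-by-layer loop can be sketched roughly as follows. This is a minimal illustration, not the code from src/main.cc: `solve_by_layers`, its callback parameters and the toy dependency table are all hypothetical names made up for the example.

```python
def solve_by_layers(state, find_moves, apply_move, is_solved):
    """Solve in layers: find every valid move, apply them all, repeat.

    Returns (solved, depth, width), where depth is the number of layers
    and width is the largest number of moves found in a single layer.
    """
    depth = 0
    width = 0
    while not is_solved(state):
        moves = find_moves(state)
        if not moves:
            return False, depth, width  # stuck: no deduction applies
        for move in moves:
            state = apply_move(state, move)
        depth += 1
        width = max(width, len(moves))
    return True, depth, width


# Toy stand-in for a puzzle (an assumption for this sketch): a move becomes
# available once all the moves it depends on have been made.
DEPENDS_ON = {"a": [], "b": [], "c": ["a"], "d": ["a", "b"], "e": ["c", "d"]}

def find_moves(done):
    return [m for m, deps in DEPENDS_ON.items()
            if m not in done and all(d in done for d in deps)]

def apply_move(done, move):
    return done | {move}

def is_solved(done):
    return len(done) == len(DEPENDS_ON)

result = solve_by_layers(frozenset(), find_moves, apply_move, is_solved)
print(result)  # (True, 3, 2)
```

In the toy run the layers are {a, b}, {c, d} and {e}, so the proxy depth is 3 and the proxy width is 2.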

Here's what the solution for one of the harder puzzles looks like with this model. Dotted lines are the lines that were extended on that solver layer, solid ones didn't change. Green lines are of the right length, red are not yet complete.

The next problem is that not all moves a player makes are created equal. What was listed at the start of this section is really just common sense. Here's an example of a more complicated deduction rule, which would require some more thought to find. Consider a board like:

The dots at C and D can only be covered by the 5 and the middle 4 (and neither piece can cover both of them at the same time). This means that the middle 4 needs to cover one of the two, and thus can't be used to cover A. Instead A has to be covered with the lower left 4.

It'd clearly be silly to treat this chain of deductions the same as a one-step "this dot can only be reached from that number". Can these more complex rules just be weighted more heavily in the scoring function? Unfortunately not with the layer-based solver, since it's not guaranteed to find the lowest cost solution. This is not just a theoretical concern: in practice it's pretty common for a part of the board to be solvable in either a single complex deduction or a chain of several much simpler moves. The layer-based solver basically finds the shortest path, not the cheapest one, and that can't just be fixed in the scoring function.

The method I ended up using was to change the solver such that each layer consists of only one kind of deduction. The algorithm goes through the deduction rules in a rough order of difficulty. If a rule finds any moves, they're applied and the iteration is over, and the next iteration starts the list over from the beginning.

The solution is then scored by assigning each layer a cost based on the single rule used for it. This is still not guaranteed to find the cheapest solution, but with a good selection of weights it'll at least not find an expensive solution if a cheap solution exists.
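A rough sketch of this "one kind of deduction per layer" loop with per-rule costs is below. The rule names, the costs 1 and 5, and the toy `CELLS` table are assumptions invented for illustration; the real rules and weights live in the game's source.

```python
def score_solution(state, rules, is_solved):
    """rules is a list of (name, cost, find_moves, apply_move), easiest first.

    Each layer uses only the first rule that finds any moves, then the scan
    restarts from the easiest rule. Returns (total_cost, rules_used), or
    None if the solver gets stuck.
    """
    total_cost = 0
    used = []
    while not is_solved(state):
        for name, cost, find_moves, apply_move in rules:
            moves = find_moves(state)
            if moves:
                for move in moves:
                    state = apply_move(state, move)
                total_cost += cost
                used.append(name)
                break  # next iteration starts over from the easiest rule
        else:
            return None  # no rule found a move
    return total_cost, used


# Toy puzzle (hypothetical): each cell needs a given kind of rule and
# becomes deducible once the cells it depends on are solved.
CELLS = {"a": ("easy", []), "b": ("hard", ["a"]),
         "c": ("easy", ["b"]), "d": ("easy", ["b"])}

def make_rule(kind, cost):
    def find_moves(done):
        return [c for c, (k, deps) in CELLS.items()
                if k == kind and c not in done and all(d in done for d in deps)]
    def apply_move(done, cell):
        return done | {cell}
    return (kind, cost, find_moves, apply_move)

rules = [make_rule("easy", 1), make_rule("hard", 5)]
scored = score_solution(frozenset(), rules, lambda done: len(done) == len(CELLS))
print(scored)  # (7, ['easy', 'hard', 'easy'])
```

The expensive rule is only consulted once the cheap one has run dry, which is what keeps the solver from reporting an expensive solution when a cheap one exists.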

It also seems to map out pretty well to how humans solve the puzzle. You look for the gimmes first, and only start thinking hard once there are no easy moves.

The Generator

The previous section took care of figuring out if a level is any good or not. But that alone isn't enough, you also need to somehow generate levels for the solver to score. It's quite unlikely that a randomly generated level would be solvable, let alone interesting.

The key idea (which is by no means novel) is to interleave the solver and the generator. Let's start with a puzzle that's probably unsolvable, consisting just of numbers 2-5 placed in random locations on the grid:

The solver runs until it can't make any more progress:

The generator then adds some information to the puzzle, in the form of a dot, and continues solving.

In this case that one added dot is not enough to allow the solver to make any progress. So the generator will keep on adding more dots until the solver is happy:

And then the solver resumes normal operation:

This process continues either until the puzzle is solved or there is no more information to add (i.e. every space that's reachable from a number is covered by a dot).

This method works only if the new information that's being added can't invalidate any of the previously made deductions. That would be tough to do when adding numbers to the grid [3]. But adding new dots to the board has that property, at least given the deduction rules I'm using in this program.

Where should the algorithm add the dots? What I ended up doing was to add them in the empty space that could have been covered by the most lines in the starting state, so each dot tends to give as little information as possible. There is no attempt to add it specifically to a location where it'll be useful in advancing the puzzle at the point where the solver got stuck. This produces a pretty neat effect where most of the dots will be totally useless at the start of the puzzle, which makes the puzzle seem harder than it is. There are all these apparent moves you could make, but somehow none of them quite work out. The puzzle generator ends up being a bit of a jerk.
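The interleaving of solver and generator described in this section could look something like the sketch below. It is a toy model under stated assumptions, not the real generator: `generate`, the `Toy` class, and the `DEPENDS_ON` and `REACH` tables are hypothetical, with `REACH` standing in for "how many lines could have covered this square in the starting state".

```python
def generate(puzzle, solve_until_stuck, is_solved, candidate_hints, add_hint):
    """Alternate between solving and adding the least informative hint."""
    while True:
        solve_until_stuck(puzzle)
        if is_solved(puzzle):
            return True
        candidates = candidate_hints(puzzle)  # hint -> reach count
        if not candidates:
            return False  # nothing left to add, give up on this attempt
        # Pick the hint that the most lines could have reached.
        add_hint(puzzle, max(candidates, key=candidates.get))


# Toy model: facts are deduced once their dependencies (other facts or
# hints) are available. "x", "y" and "z" play the role of dots.
DEPENDS_ON = {"a": [], "b": ["a", "x"], "c": ["b", "y"]}
REACH = {"x": 1, "y": 3, "z": 4}

class Toy:
    def __init__(self):
        self.known = set()
        self.hints = []

def solve_until_stuck(p):
    progress = True
    while progress:
        progress = False
        for fact, deps in DEPENDS_ON.items():
            if fact not in p.known and all(d in p.known or d in p.hints for d in deps):
                p.known.add(fact)
                progress = True

def is_solved(p):
    return p.known == set(DEPENDS_ON)

def candidate_hints(p):
    return {h: n for h, n in REACH.items() if h not in p.hints}

def add_hint(p, hint):
    p.hints.append(hint)

toy = Toy()
ok = generate(toy, solve_until_stuck, is_solved, candidate_hints, add_hint)
print(ok, toy.hints)  # True ['z', 'y', 'x']
```

Note how the first hint added ("z") never helps the toy solver at all, which mirrors the "useless dots" effect described above. Adding hints is safe here because a new hint can only enable more deductions, never invalidate earlier ones.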

This process will not always produce a solution, but it's pretty fast (on the order of 50-100 microseconds) so it can just be repeated a bunch of times until it generates a level. Unfortunately it'll generally produce a mediocre puzzle. There are too many obvious moves right at the start, the board gets filled in very quickly and the solution tree is quite shallow.

The optimizer

The above process produced a mediocre puzzle. In the final stage, we use that level as a seed for an optimization process. The process works as follows.

The optimizer sets up a pool of up to 10 puzzle variants. The pool is initialized with the newly generated random puzzle. On each iteration, the optimizer selects one puzzle from the pool and mutates it.

The mutation removes all the dots, and then changes the numbers a bit (e.g. reduce/increase the value of a randomly selected number, or move a number to a different location on the grid). It might be possible to apply multiple mutations to the board in one go. We then run the solver in the special level-generation mode described in the previous section. This adds enough dots to the puzzle to make it solvable again.

After that, we run the solver again, this time in the normal mode. During this run, the solver keeps track of a) the depth of the solution tree, b) how often each of the various kinds of rules was needed, c) how wide the solution tree was at times. The puzzle is scored based on the above criteria. The scoring function will basically prefer deep and narrow solutions, and at higher difficulty levels also rewards puzzles that require use of one or more of the advanced deduction rules.

The new puzzle is then added to the pool. If the pool ever contains more than 10 puzzles, the worst one is discarded.

This process is repeated a number of times (anything from 10k to 50k iterations seemed to be fine). After that, the version of the puzzle with the highest score is saved into the puzzle's level database. This is what the progress of the best puzzle looks like through one optimization run:

I tried a few other ways of structuring the optimization as well. One version used simulated annealing, the others were genetic algorithms with different crossover operations. None of these performed as well as the naive pool of hill-climbers.
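For reference, the pool-based loop described in this section can be sketched in a few lines. This is only an illustration: `optimize`, the toy `mutate` and the toy `score` are hypothetical stand-ins. In the real generator the mutation step also regenerates the dots, and the score comes from a full solver run.

```python
import random

def optimize(seed, mutate, score, iterations=1000, pool_size=10, rng=None):
    """Keep a small pool of variants, mutate a random one per iteration,
    and drop the worst whenever the pool overflows."""
    rng = rng or random.Random()
    pool = [(score(seed), seed)]
    for _ in range(iterations):
        _, parent = rng.choice(pool)
        child = mutate(parent, rng)
        pool.append((score(child), child))
        if len(pool) > pool_size:
            pool.remove(min(pool, key=lambda entry: entry[0]))
    return max(pool, key=lambda entry: entry[0])


# Toy problem: a "puzzle" is a tuple of numbers in 2..5, and the made-up
# score prefers tuples whose sum is close to 20.
def mutate(puzzle, rng):
    numbers = list(puzzle)
    i = rng.randrange(len(numbers))
    numbers[i] = min(5, max(2, numbers[i] + rng.choice([-1, 1])))
    return tuple(numbers)

def score(puzzle):
    return -abs(sum(puzzle) - 20)

seed = (2, 2, 2, 2, 2, 2)
best_score, best = optimize(seed, mutate, score, rng=random.Random(1))
print(best_score, best)
```

Since only the worst entry is ever discarded, the best score in the pool never gets worse from one iteration to the next.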

Unique single solution

There's an interesting complication that arises when the puzzle has a single unique solution. Is it valid for the player to assume that's the case, and make deductions based on that? Is it fair for the puzzle generator to assume that the player will do so?

In a post on HN, I mentioned four options for how to deal with this:

State the "only a single solution" up front, and make the puzzle generator generate levels that require this form of deduction. This sucks, since it'll make the rules far more complicated to understand. And it's also exactly the kind of detail people would forget.
Don't guarantee a single solution: have potentially multiple solutions, and accept any of them. This doesn't really solve the problem, it just moves it around.
Punt, and just assume this is a very rare event that won't matter in practice. (This was the original implementation.)
Change the puzzle generator such that it doesn't generate puzzles where knowing the solution is unique helps. (Probably the right thing to do, but also extra work.)

I originally went with the third option, and that was a horrible mistake. It turns out that I'd only considered one way in which the uniqueness of the solution leaks information, and that's indeed pretty rare. But there's others, and one was present in basically every level I'd generated, and often kind of trivialized the solution. So in May 2019 I updated the Hard and Expert mode levels to go with the last option instead.

The most annoying case is the 2 with the dotted line in the following board:

Why could a sneaky player make that deduction? The 2 can cover any of the 4 adjacent squares. None of them have any dots, so they don't necessarily need to be covered by anything. And the square that's downwards doesn't have any overlap with other pieces. If there's a single solution, it has to be the case that other pieces cover the other three squares, and the 2 covers the downwards square.

The solution is to add some dots when these cases are detected, like this:

Another common case was the dotted 2 on this board:

Nothing distinguishes the squares to the left and up of the 2. Neither has a dot, and neither is reachable from any other number. Any solution where the 2 covers the upward square would have a matching solution where it covers the leftward square instead, and vice versa. If there's a single unique solution, it can't be either and thus the 2 must cover the downward square instead.

This kind of case I just solved by the "if it hurts, just don't do it" method. I.e. having the solver use this rule very early on in the priority list, and assigning these moves a large negative weight. Puzzles with this kind of property will mostly end up discarded by the optimizer, and the few that make it through will be discarded when doing the final level selection for the published game.

This is not an exhaustive list, I found a lot of other unique-solution rules when adversarially play-testing. But most of them felt like they were rare and difficult enough to find that they're not really shortcuts. If somebody solves a puzzle using that kind of deduction, I'm not going to begrudge them that.

Conclusion

The game was originally designed as an experiment for procedural puzzle generation. The game design and the generator go hand in hand, so the exact techniques won't be directly applicable to existing games.

The part I can't answer is whether putting this much effort into the procedural generation was worth it. The feedback from players has been pretty inconsistent when it comes to the level design. A common theme for positive comments has been about how the puzzles always feel like there's a clever gotcha in there. The most common negative complaint has been that there's not enough of a difficulty gradient in the game.

I have a couple of other puzzle games in an embryonic stage, and felt good enough about this generator that I'd probably at least try similar procedural generation methods for those too. One thing I'd definitely do differently the next time around is to do adversarial playtesting from the start.

Footnotes

[0] Or at least that's what I believed. But when I observed a bunch of players in person, about half of them just made guesses and then iterated on those guesses. Oh, well.

[1] Anyone reading this should also read Solving Minesweeper and making it better by Magnus Hoff, which has a fascinating twist on the perceived need for puzzle games with hidden information to be guaranteed solvable.

[2] Just to be clear, this depth / narrowness of the tree is a metric that I thought was meaningful to this puzzle, not something that's going to be applicable to all or even most puzzles. For example there's a good argument to be made that a Rush Hour puzzle is interesting if there are multiple paths to the solution of almost but not quite the same length. But that's because Rush Hour is a game of finding the shortest solution, not just some solution.

[3] With the exception of adding 1s. The first version of the puzzle didn't have the dots, and the plan was to have the generator add 1s when it needed to add more information. But that felt a little too constrained.

Optimizing a breadth-first search

jsnell@iki.fi — Mon, 23 Jul 2018 16:00:00 GMT

A couple of months ago I finally had to admit I wasn't smart enough to solve a few of the levels in Snakebird, a puzzle game. The only way to salvage some pride was to write a solver, and pretend that writing a program to do the solving is basically as good as having solved the problem myself. The C++ code for the resulting program is on Github. Most of what's discussed in the post is implemented in search.h and compress.h. This post deals mainly with optimizing a breadth-first search that's estimated to use 50-100GB of memory to run on a memory budget of 4GB.

There will be a follow up post that deals with the specifics of the game. For this post, all you need to know is that I could not see any good alternatives to the brute force approach, since none of the usual tricks worked. There are a lot of states since there are multiple movable or pushable objects, and the shape of some of them matters and changes during the game. There were no viable conservative heuristics for algorithms like A* to narrow down the search space. The search graph was directed and implicit, so searching both forward and backward simultaneously was not possible. And a single move could cause the state to change in a lot of unrelated ways, so nothing like Zobrist hashing was going to be viable.

A back of the envelope calculation suggested that the biggest puzzle was going to have on the order of 10 billion states after eliminating all symmetries. Even after packing the state representation as tightly as possible, the state size was on the order of 8-10 bytes depending on the puzzle. 100GB of memory would be trivial at work, but this was my home machine with 16GB of RAM. And since Chrome needs 12GB of that, my actual memory budget was more like 4GB. Anything in excess of that would have to go to disk (the spinning rust kind).

How do we fit 100GB of data into 4GB of RAM? Either a) the states would need to be compressed to 1/20th of their original already optimized size, b) the algorithm would need to be able to efficiently page state to disk and back, c) a combination of the above, or d) I should buy more RAM or rent a big VM for a few days. Option D was out of the question due to being boring. Options A and C seemed out of the question after a proof of concept with gzip: a 50MB blob of states compressed to about 35MB. That's about 7 bytes per state, while my budget was more like 0.4 bytes per state. So option B it was, even though a breadth-first search looks pretty hostile to secondary storage.

This is a somewhat long post, so here's a brief overview of the sections ahead:

A textbook BFS - What's the normal formulation of breadth-first search like, and why is it not suitable for storing parts of the state on disk?
A sort + merge BFS - Changing the algorithm to efficiently do deduplications in batches.
Compression - Reducing the memory use by 100x with a combination of off-the-shelf and custom compression.
Oh no, I've cheated! - The first few sections glossed over something; it's not enough to know there is a solution, we need to know what the solution is. In this section the basic algorithm is updated to carry around enough data to reconstruct a solution from the final state.
Sort + merge with multiple outputs - Keeping more state totally negates the compression gains. The sort + merge algorithm needs to be updated to keep two outputs: one that compresses well used during the search, and another that's just used to reconstruct the solution after one is found.
Swapping - Swapping on Linux sucks even more than I thought.
Compressing new states before merging - So far the memory optimizations have just been concerned with the visited set. But it turns out that the list of newly generated states is much larger than one might think. This section shows a scheme for representing the new states more efficiently.
Saving space on the parent states - Investigate some CPU/memory tradeoffs for reconstructing the solution at the end.
What didn't or might not work - Some things that looked promising but I ended up reverting, and others that research suggested would work but my intuition said wouldn't for this case.

A textbook BFS

So what does a breadth-first search look like, and why would it be disk-unfriendly? Before this little project I'd only ever seen variants of the textbook formulation, something like this:

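from collections import deque

def bfs(graph, start, end):
    # adjacent(node) yields the successors of node; for an implicit
    # graph it computes them on the fly.
    visited = {start}
    todo = deque([start])
    while todo:
        node = todo.popleft()
        if node == end:
            return True
        for kid in adjacent(node):
            if kid not in visited:
                visited.add(kid)
                todo.append(kid)
    return False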

As the program produces new candidate nodes, each node is checked against a hash table of already visited nodes. If it's already present in the hash table, we ignore the node. Otherwise it's added both to the queue and the hash table. Sometimes the 'visited' information is carried in the nodes rather than in a side-table; but that's a dodgy optimization to start with, and totally impossible when the graph is implicit rather than explicit.

Why is a hash table problematic? Because hash tables will tend to have a totally random memory access pattern. If they don't, it's a bad hash function and the hash table will probably perform terribly due to collisions. This random access pattern can cause performance issues even when the data fits in memory: an access to a huge hash table is pretty likely to cause both a cache and TLB miss. But if a significant chunk of the data is actually on disk rather than in memory? It'd be disastrous: something on the order of 10ms per lookup.

With 10G unique states we'd be looking at about four months of waiting for disk IO just for the hash table accesses. That can't work; the problem absolutely needs to be transformed such that the program can process big batches of data in one go.

A sort + merge BFS

If we wanted to batch the data access as much as possible, what would be the maximum achievable coarseness? Since the program can't know which nodes to process on depth layer N+1 before layer N has been fully processed, it seems obvious that we have to do our deduplication of states at least once per depth.

Dealing with a whole layer at one time allows ditching hash tables, and representing the visited set and the new states as sorted streams of some sort (e.g. file streams, arrays, lists). We can trivially find the new visited set with a set union on the streams, and equally trivially find the todo set with a set difference.

The two set operations can be combined to work on a single pass through both streams. Basically peek into both streams, process the smaller element, and then advance the stream that the element came from (or both streams if the elements at the head were equal). In either case, add the element to the new visited set. When advancing just the stream of new states, also add the element to the new todo set:

def bfs(graph, start, end):
    visited = Stream()
    todo = Stream()
    visited.add(start)
    todo.add(start)
    # Stop once a layer produces no unvisited states.
    while todo:
        new = []
        for node in todo:
            if node == end:
                return True
            for kid in adjacent(node):
                new.append(kid)
        new_stream = Stream()
        # Sort and deduplicate the newly generated states.
        for node in sorted(set(new)):
            new_stream.add(node)
        todo, visited = merge_sorted_streams(new_stream, visited)
    return False

# Merges sorted streams new and visited. Returns a sorted stream of
# elements that were present only in new, and another sorted
# stream containing the elements that were present in either or
# both of new and visited.
def merge_sorted_streams(new, visited):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            if visited.peek() == new.peek():
                out_visited.add(visited.pop())
                new.pop()
            elif visited.peek() < new.peek():
                out_visited.add(visited.pop())
            elif visited.peek() > new.peek():
                out_todo.add(new.peek())
                out_visited.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek())
            out_visited.add(new.pop())
    return out_todo, out_visited

The data access pattern is now perfectly linear and predictable, there are no random accesses at all during the merge. Disk latency thus becomes irrelevant, and the only thing that matters is throughput.

What does the theoretical performance look like with the simplified data distribution of 100 depth levels and 100M states per depth? The average state will be both read and written 50 times. That's 10 bytes/state * 5G states * 50 = 2.5TB. My hard drive can supposedly read and write at a sustained 100MB/s, which would mean (2 * 2.5TB) / (100MB/s) =~ 50k seconds =~ 13 hours spent on the IO. That's a couple of orders of magnitude better than the earlier four month estimate!
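Spelled out as a quick calculation, using only the figures quoted in the paragraph above (the variable names are made up for this sketch):

```python
BYTES_PER_STATE = 10
STATES = 5e9                # state count used in the estimate above
PASSES = 50                 # average number of merges a state takes part in
DISK_BYTES_PER_SEC = 100e6  # sustained 100MB/s

bytes_one_way = BYTES_PER_STATE * STATES * PASSES   # data read (or written)
seconds = 2 * bytes_one_way / DISK_BYTES_PER_SEC    # read + write
hours = seconds / 3600

print(bytes_one_way / 1e12, "TB each way")
print(seconds, "seconds, about", round(hours, 1), "hours")
```

The factor of 50 comes from the averaging: with 100 layers, a state found on layer k gets rewritten on each of the remaining 100 - k merges, which averages out to about 50.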

It's worth noting that this simplistic model is not considering the size of the newly generated states. Before the merge step, they need to be kept in-memory for the sorting + deduplication. We'll look closer at that in a later section.

Compression

In the introduction I mentioned that compressing the states didn't look very promising in the initial experiments, with a 30% compression ratio. But after the above algorithm change the states are now ordered. That should be a lot easier to compress.

To test this theory, I used zstd on a puzzle of 14.6M states, with each state being 8 bytes. After the sorting they compressed to an average of 1.4 bytes per state. That seems like a solid improvement. Not quite enough to run the whole program in memory, but it could plausibly cut the disk IO to just a couple of hours.

Is there any way to do better than a state of the art general purpose compression algorithm, if you know something about the structure of the data? Almost certainly. One good example is the PNG format. Technically the compression is just a standard Deflate pass. But rather than compress the raw image data, the image is first transformed using PNG filters. A PNG filter is basically a formula for predicting the value of a byte in the raw data from the value of the same byte on the previous row and/or the same byte of the previous pixel. For example the 'up' filter transforms each byte by subtracting the previous row's value from it during compression, and doing the inverse when decompressing. Given the kinds of images PNG is meant for, the result will probably mostly consist of zeroes or numbers close to zero. Deflate can compress these far better than the raw data.

Can we apply a similar idea to the state records of the BFS? Seems like it should be possible. Just like in PNGs, there's a fixed row size, and we'd expect adjacent rows to be very similar. The first tries with a subtraction/addition filter followed by zstd resulted in another 40% improvement in compression ratios: 0.87 bytes per state. The filtering operations are trivial, so this was basically free from a CPU consumption point of view.
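A minimal sketch of such a subtraction/addition filter over fixed-size records is below. It makes a few assumptions for the sake of being self-contained: `sub_filter` and `add_filter` are hypothetical names, the input is random sorted integers rather than real game states, and zlib from the Python standard library stands in for zstd. The sizes it prints therefore say nothing about the numbers quoted above.

```python
import random
import zlib

ROW = 8  # bytes per state, as in the benchmark above

def sub_filter(data, row=ROW):
    """Replace each byte by its difference (mod 256) from the same byte
    of the previous row. The first row is left as-is."""
    out = bytearray(data)
    for i in range(row, len(data)):
        out[i] = (data[i] - data[i - row]) & 0xFF
    return bytes(out)

def add_filter(data, row=ROW):
    """Inverse of sub_filter: add back the (already restored) previous row."""
    out = bytearray(data)
    for i in range(row, len(out)):
        out[i] = (out[i] + out[i - row]) & 0xFF
    return bytes(out)

# Made-up test data: sorted random values packed as fixed-size records.
rng = random.Random(1)
states = sorted(rng.sample(range(1 << 32), 20000))
raw = b"".join(s.to_bytes(ROW, "big") for s in states)

filtered = sub_filter(raw)
assert add_filter(filtered) == raw  # the filter is lossless

print("raw:", len(raw))
print("compressed:", len(zlib.compress(raw)))
print("filtered + compressed:", len(zlib.compress(filtered)))
```

Both directions are a single pass of byte arithmetic, which is why the filter costs next to nothing compared to the compressor that runs after it.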

It wasn't clear if one could do a lot better than that, or whether this was a practical limit. In image data there's a reasonable expectation of similarity between adjacent bytes of the same row. For the state data that's not true. But actually slightly more sophisticated filters could still improve on that number. The one I ended up using worked like this:

Let's assume we have adjacent rows R1 = [1, 2, 3, 4] and R2 = [1, 2, 6, 4]. When outputting R2, we compare each byte to the same byte on the previous row, with a 0 for match and 1 for mismatch: diff = [0, 0, 1, 0]. We then emit that bitmap encoded as a VarInt, followed by just the bytes that did not match the previous row. In this example, the two bytes '0b00000100 6'. This filter alone compressed the benchmark to 2.2 bytes / state. But combining this filter + zstd got it down to 0.42 bytes / state. Or to put it another way, that's 3.36 bits per state, which is just a little bit over what the back of the envelope calculation suggested was needed to fit in RAM.
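Here is a small sketch of that filter. The details the text doesn't spell out are assumptions on my part: bit i of the bitmap corresponds to byte i of the row (which matches the '0b00000100' in the example), the VarInt is the common 7-bits-per-byte encoding with a continuation bit, and the first row is diffed against an all-zero row. The function names are hypothetical.

```python
def encode_varint(n):
    """7 bits per byte, least significant group first, high bit = 'more'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_rows(rows):
    """Emit, per row, a VarInt bitmap of changed bytes plus those bytes."""
    out = bytearray()
    prev = bytes(len(rows[0])) if rows else b""
    for row in rows:
        bitmap = 0
        changed = bytearray()
        for i, (old, new) in enumerate(zip(prev, row)):
            if old != new:
                bitmap |= 1 << i
                changed.append(new)
        out += encode_varint(bitmap) + changed
        prev = row
    return bytes(out)

def decode_rows(data, width):
    """Inverse of encode_rows for rows of the given width."""
    rows = []
    prev = bytes(width)
    pos = 0
    while pos < len(data):
        bitmap = shift = 0
        while True:  # read the VarInt bitmap
            byte = data[pos]
            pos += 1
            bitmap |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        row = bytearray(prev)
        for i in range(width):
            if (bitmap >> i) & 1:
                row[i] = data[pos]
                pos += 1
        prev = bytes(row)
        rows.append(prev)
    return rows

r1 = bytes([1, 2, 3, 4])
r2 = bytes([1, 2, 6, 4])
encoded = encode_rows([r1, r2])
print(list(encoded))  # [15, 1, 2, 3, 4, 4, 6]
```

The last two bytes of the output are the `0b00000100 6` from the example. Rows wider than 7 bytes simply get a two-byte bitmap.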

In practice the compression ratios improve as the sorted sets get more dense. Once the search gets to a point where memory starts becoming an issue, the compression ratios can get a lot better than that. The largest problem turned out to have 4.6G distinct visited states in the end. These states took 405MB when sorted and compressed with the above scheme. That's 0.7 bits per state. The compression and decompression end up taking about 25% of the program's CPU time, but that seems like a great tradeoff for cutting memory use to 1/100th.

The filter above does feel a bit wasteful due to the VarInt header on every row. It seems like it should be easy to improve on it with very little extra cost in CPU or complexity. I tried a bunch of other variants that transposed the data to a column-major order, or wrote the bitmasks in bigger blocks, etc. These variants invariably got much better compression ratio by themselves, but then didn't do as well when the output of the filter was compressed with zstd. It wasn't just due to some quirk of zstd either, the results were similar with gzip and bzip2. I don't have any great theories on why this particular encoding ended up compressing much better than the alternatives.

Another mystery is that the compression ratio ended up far better when the data was sorted little-endian rather than big-endian. I initially thought it was due to the little-endian sort ending up with more leading zeros on the VarInt-encoded bitmask. But this difference persisted even for filters that didn't have such dependencies.

(There's a lot of research on compressing sorted sets of integers, since they're a basic building block of search engines. I didn't find a lot on compressing sorted fixed-size records though, and didn't want to start jumping through the hoops of representing my data as arbitrary precision integers.)

Oh no, I've cheated!

You might have noticed that the above pseudocode implementations of BFS were only returning a boolean for solution found / not found. That's not very useful. For most purposes you need to be able to produce a list of the exact steps of the solution, not just state that a solution exists.

On the surface the solution is easy. Rather than collect sets of states, collect mappings from states to a parent state. Then after finding a solution, just trace back the list of parent states from the end to the start. For the hash table based solution, it'd be something like:

def bfs(graph, start, end):
    visited = {start: None}
    todo = [start]
    while todo:
        node = todo.pop_first()
        if node == end:
            return trace_solution(node, visited)
        for kid in adjacent(node):
            if kid not in visited:
                visited[kid] = node
                todo.push_back(kid)
    return None

def trace_solution(state, visited):
    if state is None:
        return []
    return trace_solution(visited[state], visited) + [state]

Unfortunately this will totally kill the compression gains from the last section. First, the core assumption was that adjacent rows would be very similar. That was true when we just looked at the states themselves. But there is no reason to believe that's going to be true for the parent states; they're effectively random data. Second, the sort + merge solution has to read and write back all seen states on each iteration. To maintain the state / parent state mapping, we'd also have to read and write all this badly compressing data to disk on each iteration.

Sort + merge with multiple outputs

The program only needs the state/parent mappings at the very end, when tracing back the solution. We can thus maintain two data structures in parallel. 'Visited' is still the set of visited states, and gets recomputed during the merge just like before. 'Parents' is a mostly sorted list of state/parent pairs, which doesn't get rewritten. Instead the new states + their parents get appended to 'parents' after each merge operation.

def bfs(graph, start, end):
    parents = Stream()
    visited = Stream()
    todo = Stream()
    parents.add((start, None))
    visited.add(start)
    todo.add(start)
    while todo:
        new = []
        for node in todo:
            if node == end:
                return trace_solution(node, parents)
            for kid in adjacent(node):
                # Remember the parent alongside each new state.
                new.push_back((kid, node))
        new_stream = Stream()
        # Sort by state and keep one (state, parent) pair per state.
        for pair in new.sorted().uniq():
            new_stream.add(pair)
        todo, visited = merge_sorted_streams(new_stream, visited, parents)
    return None

# Merges sorted streams new and visited. New contains pairs of
# key + value (just the keys are compared), visited contains just
# keys.
#
# Returns a sorted stream of the keys that were present only in new, and
# another sorted stream containing the keys that were present in either or
# both of new and visited. Also adds the keys + values to the parents
# stream for keys that were only present in new.
def merge_sorted_streams(new, visited, parents):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            visited_head = visited.peek()
            new_head = new.peek()[0]
            if visited_head == new_head:
                out_visited.add(visited.pop())
                new.pop()
            elif visited_head < new_head:
                out_visited.add(visited.pop())
            elif visited_head > new_head:
                out_todo.add(new_head)
                out_visited.add(new_head)
                parents.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek()[0])
            out_visited.add(new.peek()[0])
            parents.add(new.pop())
    return out_todo, out_visited

This gives us the best of both worlds from a runtime and working set perspective, but does mean using more secondary storage. A separate copy of the visited states grouped by depth turns out to also be useful later on for other reasons.
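For experimenting with the idea, here is a small runnable sketch of the same scheme that uses plain Python lists in place of the Stream abstraction. It is a simplification under my own assumptions: the per-depth deduplication uses a dict rather than sort + uniq, nothing is compressed or written to disk, and the toy `adjacent` function is made up for the example.

```python
def merge_sorted(new, visited, parents):
    """new: sorted list of (state, parent) pairs with unique states.
    visited: sorted list of states. parents: list, appended to in place.
    Returns (todo, merged_visited)."""
    todo, out_visited = [], []
    i = j = 0
    while i < len(new) or j < len(visited):
        if j == len(visited) or (i < len(new) and new[i][0] < visited[j]):
            # Only present in new: a genuinely unseen state.
            state = new[i][0]
            todo.append(state)
            out_visited.append(state)
            parents.append(new[i])
            i += 1
        elif i < len(new) and new[i][0] == visited[j]:
            # Seen before: keep the visited entry, drop the duplicate.
            out_visited.append(visited[j])
            i += 1
            j += 1
        else:
            out_visited.append(visited[j])
            j += 1
    return todo, out_visited


def trace_solution(state, parents):
    lookup = dict(parents)  # only built once, at the very end
    path = []
    while state is not None:
        path.append(state)
        state = lookup[state]
    return path[::-1]


def bfs(start, end, adjacent):
    parents = [(start, None)]  # append-only, never rewritten
    visited, todo = [start], [start]
    while todo:
        new = {}
        for node in todo:
            if node == end:
                return trace_solution(node, parents)
            for kid in adjacent(node):
                new.setdefault(kid, node)  # keep the first parent seen
        todo, visited = merge_sorted(sorted(new.items()), visited, parents)
    return None


def adjacent(n):
    # Toy implicit graph: from n you can go to n + 1 or 2 * n, up to 20.
    return [k for k in (n + 1, n * 2) if k <= 20]


path = bfs(1, 10, adjacent)
```

Note that `parents` is only ever appended to during the search, while `visited` is rebuilt by every merge.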

Swapping

Another detail ignored in the snippets of pseudocode is that there is no explicit code for disk IO, just an abstract interface Stream. The Stream might be a file stream or an in-memory array, but we've been ignoring that implementation detail. Instead the pseudocode is concerned with having a memory access pattern that would be disk friendly. In a perfect world that'd be enough, and the virtual memory subsystem of the OS would take care of the rest.

At least with Linux that doesn't seem to be the case. At one point (before the working set had been shrunk to fit in memory) I'd gotten the program to run in about 11 hours when the data was stored mostly on disk. I then switched the program to use anonymous pages instead of file-backed ones, and set up sufficient swap on the same disk. After three days the program had gotten a quarter of the way through, and was still getting slower over time. My optimistic estimate was that it'd finish in 20 days.

Just to be clear, this was exactly the same code and exactly the same access pattern. The only thing that changed was whether the memory was backed by an explicit on-disk file or by swap. It's pretty much axiomatic that swapping tends to totally destroy performance on Linux, whereas normal file IO doesn't. I'd always assumed it was due to programs having the gall to treat RAM as something to be randomly accessed. But that wasn't the case here.

Turns out that file-backed and anonymous pages are not treated identically by the VM subsystem after all. They're kept in separate LRU caches with different expiration policies, and they also appear to have different readahead / prefetching properties.

So now I know: Linux swapping will probably not work well even under optimal circumstances. If parts of the address space are likely to be paged out for a while, it's better to arrange manually for them to be file-backed than to trust swap. I did it by implementing a custom vector class that started off as a purely in-memory implementation, and after a size threshold is exceeded switches to mmap on an unlinked temporary file.
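The real vector class is C++, but the idea can be sketched in a few lines of Python. Everything here is hypothetical (the `SpillBuffer` name, the threshold, the doubling growth policy); it only illustrates the switch from heap memory to a file-backed mapping. On POSIX systems `tempfile.TemporaryFile` creates a file with no visible directory entry, which stands in for the unlinked temporary file.

```python
import mmap
import tempfile


class SpillBuffer:
    """Append-only byte buffer: in RAM at first, file-backed mmap once large."""

    def __init__(self, threshold=1 << 20):
        self.threshold = threshold
        self.size = 0
        self.mem = bytearray()  # in-memory storage until the threshold is hit
        self.file = None
        self.map = None
        self.capacity = 0

    def _remap(self, capacity):
        # Grow the backing file, then map it again at the new size.
        if self.map is not None:
            self.map.close()
        self.file.truncate(capacity)
        self.map = mmap.mmap(self.file.fileno(), capacity)
        self.capacity = capacity

    def append(self, data: bytes):
        needed = self.size + len(data)
        if self.file is None and needed > self.threshold:
            # Switch over: copy what we have into a file-backed mapping.
            self.file = tempfile.TemporaryFile()
            self._remap(max(2 * self.threshold, needed))
            self.map[:self.size] = self.mem
            self.mem = None
        if self.file is None:
            self.mem += data
        else:
            if needed > self.capacity:
                self._remap(max(2 * self.capacity, needed))
            self.map[self.size:needed] = data
        self.size = needed

    def read(self, start, end):
        src = self.mem if self.file is None else self.map
        return bytes(src[start:end])

    @property
    def spilled(self):
        return self.file is not None


buf = SpillBuffer(threshold=16)
buf.append(b"abcdefgh")       # still in memory
buf.append(b"x" * 100)        # crosses the threshold, now file-backed
```

The pages of the mapping are file-backed, so the kernel can write them out and drop them like any other file data rather than treating them as swap.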

Compressing new states before merging

In the simplified performance model the assumption was that there would be 100M new states per depth. That turned out not to be too far off reality (the most difficult puzzle peaked at about 150M unique new states from one depth layer). But it's also not the right thing to measure; the working set before the merge isn't related to just the unique states, but all the states that were output for this iteration. This measure peaks at 880M output states / depth. These 880M states a) need to be accessed with a random access pattern for the sorting, b) can't be compressed efficiently due to not being sorted, and c) need to be stored along with the parent state. That's a roughly 16GB working set.

The obvious solution would be to use some form of external sorting. Just write all the states to disk, do an external sort, do a deduplication, and then execute the merge just as before. This is the solution I went with first, but while it mostly solved problem A, it did nothing for B and C.

The alternative I ended up with was to collect the states into an in-memory array. If the array grows too large (e.g. more than 100M elements), it's sorted, deduplicated and compressed. This gives us a bunch of sorted runs of states, with no duplicates inside the run but potentially some between the runs. The code for merging the new and visited states is fundamentally the same; it's still based on walking through the streams in lockstep. The only change is that instead of walking through just the two streams, there's a separate stream for each of the sorted runs of new states.

The compression ratios for these 100M state runs are of course not quite as good as for compressing the set of all visited states. But even so, it cuts down both the working set and the disk IO requirements by a ton. There's a little bit of extra CPU from having to maintain a priority queue of streams, but it was still a great tradeoff.
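A rough sketch of the multi-run merge, assuming integer states and plain lists instead of compressed streams. `heapq.merge` maintains the priority queue of run heads, and `groupby` drops the duplicates that appear across runs. The function names and the tiny run size are made up for illustration.

```python
import heapq
from itertools import groupby


def make_runs(states, run_size):
    """Split the raw output states into sorted, duplicate-free runs."""
    return [sorted(set(states[i:i + run_size]))
            for i in range(0, len(states), run_size)]


def merge_runs(runs, visited):
    """Merge sorted runs of new states against the sorted visited list.

    Returns (todo, new_visited): the states not seen before, and the
    union of old and new states, both sorted.
    """
    # One sorted stream over all runs, with cross-run duplicates removed.
    merged_new = (state for state, _ in groupby(heapq.merge(*runs)))
    todo, out_visited = [], []
    old = iter(visited)
    old_head = next(old, None)
    for state in merged_new:
        # Copy over visited states that sort before the new state.
        while old_head is not None and old_head < state:
            out_visited.append(old_head)
            old_head = next(old, None)
        if old_head is not None and old_head == state:
            continue  # already visited; emitted from the visited side later
        todo.append(state)
        out_visited.append(state)
    while old_head is not None:
        out_visited.append(old_head)
        old_head = next(old, None)
    return todo, out_visited


runs = make_runs([5, 3, 5, 1, 9, 3, 7, 1], run_size=3)
todo, visited = merge_runs(runs, [3, 4, 9])
```

In the real program each run would be a compressed stream that is decoded incrementally, so only the heads of the runs need to be in memory at once.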

Saving space on the parent states

At this point the vast majority of the space used by this program is spent on storing the parent states, so that we can reconstruct the solution after finding it. They are unlikely to compress well, but is there maybe a CPU/memory tradeoff to be made?

What we need is a mapping from a state S' at depth D+1 to its parent state S at depth D. If we could iterate all possible parent states of S', we could simply check if any of them appear at depth D in our visited set. (We've already produced the visited set grouped by depth as a convenient byproduct when outputting the state/parent mappings from merge). Unfortunately that doesn't work for this problem; it's simply too hard to generate all the possible states S given S'. It'd probably work just fine for many other search problems though.

If we can only generate the state transitions forward, not backward, how about just doing that then? Let's iterate through all the states at depth D, and see what output states they have. If some state produces S' as an output, we've found a workable S. The issue with the plan is that it increases the total CPU usage of the program by 50%. (Not 100%, since on average we find S after looking at half the states of depth D).

So I don't like either of the extremes, but at least there is a CPU/memory tradeoff available there. Is there maybe a more palatable option somewhere in the middle? What I ended up doing was to not store the pair (S', S), but instead (S', H(S)), where H is an 8 bit hash function. To find an S given S', again iterate through all the states at depth D. But before doing anything else, compute the same hash. If the output doesn't match H(S), this isn't the state we're looking for, and we can just skip it. This optimization means doing the expensive re-computation for just 1 in 256 states, which is a negligible CPU increase, while cutting down the memory spent on storing the parent states from 8-10 bytes to 1 byte.
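The lookup can be sketched as follows. The hash function (CRC32 truncated to one byte), the byte-string state representation, and the toy `adjacent` function are all assumptions made for the example, not what the solver actually uses.

```python
import zlib


def h8(state: bytes) -> int:
    """8-bit hash of a state (CRC32 truncated to a byte; an arbitrary choice)."""
    return zlib.crc32(state) & 0xFF


def find_parent(child, parent_hash, depth_states, adjacent):
    """Recover a parent of `child` among the states of the previous depth."""
    for candidate in depth_states:
        if h8(candidate) != parent_hash:
            continue  # cheap rejection for roughly 255 out of 256 candidates
        if child in adjacent(candidate):  # the expensive re-expansion
            return candidate
    return None


def adjacent(state: bytes):
    # Toy transition function on one-byte states: n -> n + 1 and n -> 2 * n.
    n = state[0]
    return [bytes([(n + 1) % 256]), bytes([(n * 2) % 256])]


depth_states = [bytes([i]) for i in range(10, 20)]
child = bytes([30])
stored_hash = h8(bytes([15]))  # what would be stored next to the child
parent = find_parent(child, stored_hash, depth_states, adjacent)
```

A hash collision might make this return a different state than the one originally recorded, but since the result is verified by re-expansion it is still a valid parent at depth D, which is all that tracing back a solution needs.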

What didn't or might not work

The previous sections go through a sequence of high level optimizations that worked. There were other things that I tried that didn't work, or that I found in the literature but decided would not actually work in this particular case. Here's a non-exhaustive list.

At one point I was not recomputing the full visited set at every iteration. Instead it was kept as multiple sorted runs, and those runs were occasionally compacted. The benefit was fewer disk writes and less CPU spent on compression. The downside was more code complexity and a worse compression ratio. I originally thought this design made sense since in my setup writes were more expensive than reads. But in the end the compression ratio was worse by a factor of 2. The tradeoffs are non-obvious, but in the end I reverted back to the simpler form.

There is a little bit of research into executing huge breadth first searches for implicit graphs on secondary storage; a 2008 survey paper is a good starting point. As one might guess, the idea of doing the deduplication in a batch with sort+merge, on secondary storage, isn't novel. The surprising part is that it was apparently only discovered in 1993. That's pretty late! There are then some later proposals for secondary storage breadth first search that don't require a sorting step.

One of them was to map the states to integers, and to maintain an in-memory bitmap of the visited states. This is totally useless for my case, since the sizes of the encodable vs. actually reachable state spaces are so different. And I'm a bit doubtful about there being any interesting problems where this approach works.

The other viable sounding alternative is based on temporary hash tables. The visited states are stored unsorted in a file. Store the outputs from depth D in a hash table. Then iterate through the visited states, and look them up in the hash table. If the element is found in the hash table, remove it. After iterating through the whole file, only the non-duplicates remain. They can then be appended to the file, and used to initialize the todo list for the next iteration. If the number of outputs is so large that the hash table doesn't fit in memory, both the files and the hash tables can be partitioned using the same criteria (e.g. top bits of state), with each partition getting processed independently.
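As I understand the description, the core of the hash-based approach would look something like this sketch. It uses in-memory lists as stand-ins for the files and assumes integer states of a fixed bit width; the function names are made up.

```python
def dedup_with_hash_table(outputs, visited_file):
    """outputs: new states from one depth; visited_file: unsorted seen states."""
    candidates = set(outputs)          # temporary hash table for this depth
    for state in visited_file:         # one sequential pass over the file
        candidates.discard(state)      # seen before, so not new
    visited_file.extend(candidates)    # append the genuinely new states
    return list(candidates)            # todo list for the next iteration


def partition(states, bits, width):
    """Split states into 2**bits buckets by the top bits of a width-bit state."""
    parts = [[] for _ in range(1 << bits)]
    for state in states:
        parts[state >> (width - bits)].append(state)
    return parts


def dedup_partitioned(outputs, visited_parts, bits, width):
    """Process each partition independently with its own small hash table."""
    todo = []
    for out_part, vis_part in zip(partition(outputs, bits, width), visited_parts):
        todo.extend(dedup_with_hash_table(out_part, vis_part))
    return todo


visited = [4, 1, 7]
todo = dedup_with_hash_table([7, 2, 2, 9, 1], visited)
```

Nothing is ever sorted here, which is exactly why the visited file can't benefit from the row-to-row similarity that the compression scheme relies on.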

While there are benchmarks claiming the hash-based approach is roughly 30% faster than sort+merge, the benchmarks don't really seem to consider compression. I just don't see how giving up the compression gains could be worth it, so didn't experiment with these approaches at all.

The other relevant branch of research that seemed promising was database query optimization. The deduplication problem seems very much related to database joins, with exactly the same sort vs. hash dilemma. Obviously some of these findings should carry over to a search problem. The difference might be that the output of a database join is transient, while the outputs of a BFS deduplication persist for the rest of the computation. It feels like that changes the tradeoffs: it's not just about how to process one iteration most efficiently, it's also about having the outputs in the optimal format for the next iteration.

Conclusion

That concludes the things I learned from this project that seem generally applicable to other brute force search problems. These tricks combined to get the hardest puzzles of the game from an effective memory footprint of 50-100GB down to 500MB, with the program degrading gracefully if the problem exceeds available memory and spills to disk. It is also 50% faster than a naive hash table based state deduplication even for puzzles that fit into memory.

The next post will deal with optimizing grid-based spatial puzzle games in general, as well as some issues specific just to this particular game.

In the meanwhile, Snakebird is available at least on Steam, Google Play, and the App Store. I recommend it for anyone interested in a very hard but fair puzzle game.

Why PS4 downloads are so slow

jsnell@iki.fi — Sat, 19 Aug 2017 19:00:00 GMT

Game downloads on PS4 have a reputation of being very slow, with many people reporting downloads being an order of magnitude faster on Steam or Xbox. This had long been on my list of things to look into, but at a pretty low priority. After all, the PS4 operating system is based on a reasonably modern FreeBSD (9.0), so there should not be any crippling issues in the TCP stack. The implication is that the problem is something boring, like an inadequately dimensioned CDN.

But then I heard that people were successfully using local HTTP proxies as a workaround. It should be pretty rare for that to actually help with download speeds, which made this sound like a much more interesting problem.

This is going to be a long-winded technical post. If you're not interested in the details of the investigation but just want a recommendation on speeding up PS4 downloads, skip straight to the conclusions.

Background

Before running any experiments, it's good to have a mental model of how the thing we're testing works, and where the problems might be. If nothing else, it will guide the initial experiment design.

The speed of a steady-state TCP connection is basically defined by three numbers. The amount of data the client is willing to receive on a single round-trip (TCP receive window), the amount of data the server is willing to send on a single round-trip (TCP congestion window), and the round trip latency between the client and the server (RTT). To a first approximation, the connection speed will be:

    speed = min(rwin, cwin) / RTT

With this model, how could a proxy speed up the connection? Well, with a proxy the original connection will be split into two mostly independent parts; one connection between the client and the proxy, and another between the proxy and the server. The speed of the end-to-end connection will be determined by the slower of those two independent connections:

    speed_proxy_client = min(client rwin, proxy cwin) / client-proxy RTT
    speed_server_proxy = min(proxy rwin, server cwin) / proxy-server RTT
    speed = min(speed_proxy_client, speed_server_proxy)

With a local proxy the client-proxy RTT will be very low; that connection is almost guaranteed to be the faster one. The improvement will have to be from the server-proxy connection being somehow better than the direct client-server one. The RTT will not change, so there are just two options: either the client has a much smaller receive window than the proxy, or the client is somehow causing the server's congestion window to decrease. (E.g. the client is randomly dropping received packets, while the proxy isn't).
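To make the model concrete, here is a small sketch that plugs numbers into the formulas above. All the window sizes and RTTs are hypothetical values picked for illustration, not measurements.

```python
def tcp_speed(rwin_bytes, cwin_bytes, rtt_seconds):
    """Steady-state throughput in bytes/second under the simple window model."""
    return min(rwin_bytes, cwin_bytes) / rtt_seconds


def proxied_speed(client_rwin, proxy_cwin, client_proxy_rtt,
                  proxy_rwin, server_cwin, proxy_server_rtt):
    """End-to-end speed is the slower of the two independent connections."""
    return min(tcp_speed(client_rwin, proxy_cwin, client_proxy_rtt),
               tcp_speed(proxy_rwin, server_cwin, proxy_server_rtt))


KIB = 1024
MIB = 1024 * KIB

# Hypothetical scenario: a client with a small receive window, a proxy with
# a large one, a 50ms path to the server and a 1ms path to the local proxy.
direct = tcp_speed(rwin_bytes=128 * KIB, cwin_bytes=4 * MIB, rtt_seconds=0.050)
proxied = proxied_speed(client_rwin=128 * KIB, proxy_cwin=4 * MIB,
                        client_proxy_rtt=0.001,
                        proxy_rwin=4 * MIB, server_cwin=4 * MIB,
                        proxy_server_rtt=0.050)
speedup = proxied / direct
```

With these made-up numbers the small receive window stops mattering once it only has to cover the 1ms local hop, and the long-haul leg runs with the proxy's larger window instead.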

Out of these two theories, the receive window one should be much more likely, so we should concentrate on it first. But that just replaces our original question with a new one: why would the client's receive window be so low that it becomes a noticeable bottleneck? There's a fairly limited number of causes for low receive windows that I've seen in the wild, and they don't really seem to fit here.

Maybe the client doesn't support the TCP window scaling option, while the proxy does. Without window scaling, the receive window will be limited to 64kB. But since we know Sony started with a TCP stack that supports window scaling, they would have had to go out of their way to disable it. Slow downloads, for no benefit.
Maybe the actual downloader application is very slow. The operating system is supposed to have a certain amount of buffer space available for each connection. If the network is delivering data to the OS faster than the application is reading it, the buffer will start to fill up, and the OS will reduce the receive window as a form of back-pressure. But this can't be the reason; if the application is the bottleneck, it'll be a bottleneck with or without the proxy.
The operating system is trying to dynamically scale the receive window to match the actual network conditions, but something is going wrong. This would be interesting, so it's what we're hoping to find.

The initial theories are in place, let's get digging.

Experiment #1

For our first experiment, we'll start a PSN download on a baseline non-Slim PS4, firmware 4.73. The network connection of the PS4 is bridged through a Linux machine, where we can add latency to the network using tc netem. By varying the added latency, we should be able to find out two things: whether the receive window really is the bottleneck, and whether the receive window is being automatically scaled by the operating system.

This is what the client-server RTTs (measured from a packet capture using TCP timestamps) look like for the experimental period. Each dot represents 10 seconds of time for a single connection, with the Y axis showing the minimum RTT seen for that connection in those 10 seconds.

The next graph shows the amount of data sent by the server in one round trip in red, and the receive windows advertised by the client in blue.

First, since the blue dots are staying constantly at about 128kB, the operating system doesn't appear to be doing any kind of receive window scaling based on the RTT. (So much for that theory). Though at the very right end of the graph the receive window shoots out to 650kB, so it isn't totally fixed either.

Second, is the receive window the bottleneck here? If so, the blue dots would be close to the red dots. This is the case until about 10:50. And then mysteriously the bottleneck moves to the server.

So we didn't find quite what we were looking for, but there are a couple of very interesting things that are correlated with events on the PS4.

The download was in the foreground for the whole duration of the test. But that doesn't mean it was the only thing running on the machine. The Netflix app was still running in the background, completely idle [1]. When the background app was closed at 11:00, the receive window increased dramatically. This suggests a second experiment, where different applications are opened / closed / left running in the background.

The time where the receive window stops being the bottleneck is very close to the PS4 entering rest mode. That looks like another thing worth investigating. Unfortunately, that's not true, and rest mode is a red herring here. [2]

Experiment #2

Below is a graph of the receive windows for a second download, annotated with the timing of various noteworthy events.

The differences in receive windows at different times are striking. And more important, the changes in the receive windows correspond very well to specific things I did on the PS4.

When the download was started, the game Styx: Shards of Darkness was running in the background (just idling in the title screen). The download was limited by a receive window of under 7kB. This is an incredibly low value; it's basically going to cause the downloads to take 100 times longer than they should. And this was not a coincidence, whenever that game was running, the receive window would be that low.
Having an app running (e.g. Netflix, Spotify) limited the receive window to 128kB, for about a 5x reduction in potential download speed.
Moving apps, games, or the download window to the foreground or background didn't have any effect on the receive window.
Launching some other games (Horizon: Zero Dawn, Uncharted 4, Dreadnought) seemed to have the same effect as running an app.
Playing an online match in a networked game (Dreadnought) caused the receive window to be artificially limited to 7kB.
Playing around in a non-networked game (Horizon: Zero Dawn) had a very inconsistent effect on the receive window, with the effect seemingly depending on the intensity of gameplay. This looks like a genuine resource restriction (download process getting variable amounts of CPU), rather than an artificial limit.
I ran a speedtest at a time when downloads were limited to 7kB receive window. It got a decent receive window of over 400kB; the conclusion is that the artificial receive window limit appears to only apply to PSN downloads.
Putting the PS4 into rest mode had no effect.
Built-in features of the PS4 UI, like the web browser, do not count as apps.
When a game was started (causing the previously running game to be stopped automatically), the receive window could increase to 650kB for a very brief period of time. Basically it appears that the receive window gets unclamped when the old game stops, and then clamped again a few seconds later when the new game actually starts up.

I did a few more test runs, and all of them seemed to support the above findings. The only additional information from that testing is that the rest mode behavior was dependent on the PS4 settings. Originally I had it set up to suspend apps when in rest mode. If that setting was disabled, the apps would be closed when entering in rest mode, and the downloads would proceed at full speed.

A 7kB receive window will be absolutely crippling for any user. A 128kB window might be ok for users who have CDN servers very close by, or who don't have a particularly fast internet. For example at my location, a 128kB receive window would cap the downloads at about 35Mbps to 75Mbps depending on which CDN the DNS RNG happens to give me. The lowest two speed tiers for my ISP are 50Mbps and 200Mbps. So either the 128kB would not be a noticeable problem (50Mbps) or it'd mean that downloads are artificially limited to 25% speed (200Mbps).
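As a back-of-the-envelope check using the simple window / RTT model from the Background section, the numbers above can be turned around to see what round-trip times they imply, and what a 7kB window would allow at those same RTTs. This assumes that kB means 1024 bytes and that the receive window is the only limit, so the results are estimates rather than measurements.

```python
def implied_rtt(window_bytes, cap_bits_per_second):
    """RTT (seconds) at which a given window caps throughput at the given rate."""
    return window_bytes * 8 / cap_bits_per_second


def capped_mbps(window_bytes, rtt_seconds):
    """Throughput cap in Mbps for a given window and RTT."""
    return window_bytes * 8 / rtt_seconds / 1e6


KIB = 1024

rtt_far = implied_rtt(128 * KIB, 35e6)    # CDN behind the 35Mbps cap
rtt_near = implied_rtt(128 * KIB, 75e6)   # CDN behind the 75Mbps cap

slow_far = capped_mbps(7 * KIB, rtt_far)
slow_near = capped_mbps(7 * KIB, rtt_near)

print(f"implied RTTs: {rtt_far * 1000:.1f}ms and {rtt_near * 1000:.1f}ms")
print(f"7kB window cap: {slow_far:.1f}Mbps to {slow_near:.1f}Mbps")
```

Under these assumptions the two CDNs would be roughly 30ms and 14ms away, and a 7kB window would limit downloads to about 2 to 4Mbps.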

Conclusions

If any applications are running, the PS4 appears to change the settings for PSN store downloads, artificially restricting their speed. Closing the other applications will remove the limit. There are a few important details:

Just leaving the other applications running in the background will not help. The exact same limit is applied whether the download progress bar is in the foreground or not.
Putting the PS4 into rest mode might or might not help, depending on your system settings.
The artificial limit applies only to the PSN store downloads. It does not affect e.g. the built-in speedtest. This is why the speedtest might report much higher speeds than the actual downloads, even though both are delivered from the same CDN servers.
Not all applications are equal; most of them will cause the connections to slow down by up to a factor of 5. Some games will cause a difference of about a factor of 100. Some games will start off with the factor of 5, and then migrate to the factor of 100 once you leave the start menu and start playing.
The above limits are artificial. In addition to that, actively playing a game can cause game downloads to slow down. This appears to be due to a genuine lack of CPU resources (with the game understandably having top priority).

So if you're seeing slow downloads, just closing all the running applications might be worth a shot. (But it's obviously not guaranteed to help. There are other causes for slow downloads as well, this will just remove one potential bottleneck). To close the running applications, you'll need to long-press the PS button on the controller, and then select "Close applications" from the menu.

The PS4 doesn't make it very obvious exactly what programs are running. For games, the interaction model is that opening a new game closes the previously running one. This is not how other apps work; they remain in the background indefinitely until you explicitly close them.

And it gets worse than that. If your PS4 is configured to suspend any running apps when put to rest mode, you can seemingly power on the machine into a clean state, and still have a hidden background app that's causing the OS to limit your PSN download speeds.

This might explain some of the superstitions about this on the Internet. There are people who swear that putting the machine to rest mode helps with speeds, others who say it does nothing. Or how after every firmware update people will report increased download speeds. Odds are that nothing actually changed in the firmware; it's just that those people had done their first full reboot in a while, and finally had a system without a background app running.

Speculation

Those were the facts as I see them. Unfortunately this raises some new questions, which can't be answered experimentally. With no facts, there's no option except to speculate wildly!

Q: Is this an intentional feature? If so, what is its purpose?

Yes, it must be intentional. The receive window changes very rapidly when applications or games are opened/closed, but not for any other reason. It's not any kind of subtle operating system level behavior; it's most likely the PS4 UI explicitly manipulating the socket receive buffers.

But why? I think the idea here must be to not allow the network traffic of background downloads to take resources away from the foreground use of the PS4. For example if I'm playing an online shooter, it makes sense to harshly limit the background download speeds to make sure the game is getting ping times that are both low and predictable. So there's at least some point in that 7kB receive window limit in some circumstances.

It's harder to see what the point of the 128kB receive window limit for running any app is. A single game download from some random CDN isn't going to muscle out Netflix or Youtube... The only thing I can think of is that they're afraid that multiple simultaneous downloads, e.g. due to automatic updates, might cause problems for playing video. But even that seems like a stretch.

There's an alternate theory that this is due to some non-network resource constraints (e.g. CPU, memory, disk). I don't think that works. If the CPU or disk were the constraint, just having the appropriate priorities in place would automatically take care of this. If the download process gets starved of CPU or disk bandwidth due to a low priority, the receive buffer would fill up and the receive window would scale down dynamically, exactly when needed. And the amounts of RAM we're talking about here are minuscule on a machine with 8GB of RAM; less than a megabyte.

Q: Is this feature implemented well?

Oh dear God, no. It's hard to believe just how sloppy this implementation is.

The biggest problem is that the limits get applied based just on what games/applications are currently running. That's just insane; what matters should be which games/applications someone is currently using. Especially in a console UI, it's a totally reasonable expectation that the foreground application gets priority. If I've got the download progress bar in the foreground, the system had damn well better give that download priority. Not some application that was started a month ago, and hasn't been used since. Applying these limits in rest mode with suspended apps is beyond insane.

Second, these limits get applied per-connection. So if you've got a single download going, it'll get limited to 128kB of receive window. If you've got five downloads, they'll all get 128kB, for a total of 640kB. That means the efficiency of the "make sure downloads don't clog the network" policy depends purely on how many downloads are active. That's rubbish. This is all controlled on the application level, and the application knows how many downloads are active. If there really were an optimal static receive window X, it should just be split evenly across all the downloads.

Third, the core idea of applying a static receive window as a means of fighting bufferbloat is just fundamentally broken. Using the receive window as the rate limiting mechanism just means that the actual transfer rate will depend on the RTT (this is why a local proxy helps). For this kind of thing to work well, you can't have the rate limit depend on the RTT. You also can't just have somebody come up with a number once, and apply that limit to everyone. The limit needs to depend on the actual network conditions.

There are ways to detect how congested the downlink is in the client-side TCP stack. The proper fix would be to implement them, and adjust the receive window of low-priority background downloads if and only if congestion becomes an issue. That would actually be a pretty valuable feature for this kind of appliance. But I can kind of forgive this one; it's not an off the shelf feature, and maybe Sony doesn't employ any TCP kernel hackers.

Fourth, whatever method is being used to decide on whether a game is network-latency sensitive is broken. It's absurd that a demo of a single-player game idling in the initial title screen would cause the download speeds to be totally crippled. This really should be limited to actual multiplayer titles, and ideally just to periods where someone is actually playing the game online. Just having the game running should not be enough.

Q: How can this still be a problem, 4 years after launch?

I have no idea. Sony must know that the PSN download speeds have been a butt of jokes for years. It's probably the biggest complaint people have with the system. So it's hard to believe that nobody was ever given the task of figuring out why it's slow. And this is not rocket science; anyone bothering to look into it would find these problems in a day.

But it seems equally impossible that they know of the cause, but decided not to apply any of the trivial fixes to it. (Hell, it wouldn't even need to be a proper technical fix. It could just be a piece of text saying that downloads will work faster with all other apps closed).

So while it's possible to speculate in an informed manner about other things, this particular question will remain as an open mystery. Big companies don't always get things done very efficiently, eh?

Footnotes

[1] How idle? So idle that I hadn't even logged in, the app was in the login screen.

[2] To be specific, the slowdown is caused by the artificial latency changes. The PS4 downloads files in chunks, and each chunk can be served from a different CDN. The CDN that was being used from 10:51 to 11:00 was using a delay-based congestion control algorithm, and reacting to the extra latency by reducing the amount of data sent. The CDN used earlier in the connection was using a packet-loss based congestion control algorithm, and did not slow down despite seeing the latency change in exactly the same pattern.

A rating system for asymmetric multiplayer games

jsnell@iki.fi — Wed, 18 Nov 2015 16:00:00 GMT

Introduction

A couple of years ago I wrote a quick and dirty rating system for an online boardgame site I run. It wasn't particularly well thought out, but it did the job. Some discussion about the system made me revisit it, with two years of hindsight and orders of magnitude more data.

How well does the system actually work, and how predictive are the ratings? There are some obvious tweaks to the system. Would implementing them make things better or worse? Would anything be gained from switching to a more principled (but more complicated) approach? For this last bit, I used Microsoft's TrueSkill as the benchmark. It has some desirable properties and appears to be the gold standard of team based rating systems right now.

The code and the data are available on GitHub in my rating-eval repository.

The game - What properties of this game are relevant to a rating system?
The rating system - How does the current rating system on the site work?
Evaluating rating quality - What are valid metrics for evaluating the predictive quality of a rating system?
TrueSkill - What's TrueSkill, and how was this game translated to TrueSkill concepts?
Results - The evaluation results in tedious detail.
Conclusions - The tl;dr on the results.
Footnotes

The game

The game in question is Terra Mystica. I won't go into the exact mechanics of the game in this article, since that just doesn't matter. What does matter is the answer to one question: what makes the game such a unique snowflake that I can't just plop in a standard Elo or Glicko implementation and call it a day?

First, TM is a multiplayer game and indeed most commonly played as a 3-5 player game rather than a 2p one. This means that a pure two player rating system would not work. But there are well known ways of coercing a two player system to a multiplayer one, so that's not a major concern.

Second, TM is very asymmetric especially by the standards of the eurogame genre. The base game of Terra Mystica comes with 14 different factions, freely chosen by the players at the start of the game. The only restriction is that the factions come in 7 colors; picking a faction of a given color blocks the other faction of that color from the game. The first expansion added 6 more factions to the game. The factions can be very different from each other, with each of them having different special powers, building costs, resource production, and so on. Every faction also has a preference for different parts of the map, and will be on a different point on the symbiotic-competitive spectrum with different opposing factions.

Why would asymmetry matter for a rating system? Well, one reason is that asymmetry is a potential source of imbalance. Especially early on in the game's lifecycle it was not clear whether the different factions were balanced or not, and if not how unbalanced they were.

The arguments about this were complicated by there being a chicken and egg problem. Some people thought that statistics on the win rates or average scores for each faction were invalid. Clearly, the argument went, some factions were just doing well because they're more popular among good players (or the converse for factions doing badly due to being popular among bad players). And then there were people arguing the opposite! They'd say that you could not actually tell anything about how good a player was, because a high win rate or high average score for a player might just be due to them playing good factions.

The way to open up this deadlock is a rating system that can simultaneously determine the skill of players and the relative strength of factions, with a feedback system between the two.

Third, the first full expansion to the game didn't just introduce the new factions mentioned above. It also introduced two completely new maps. Now, Terra Mystica as a game is incredibly sensitive to the map design. I was part of playtesting one of the new maps, and it was kind of an infuriating process. The developer would make a tiny tweak, just swapping a couple of hexes around, and there would be a butterfly effect. So there would be an attempt to make Darklings a little worse with a change, and they're not actually affected but Halflings become borderline unplayable.

From a rating system view the question is then whether these three maps should be considered as separate games or the same one. (The game also has some random setup aspects; these setups can't be considered as separate games from a rating perspective since there's something like 2 million different configurations).

The rating system

My primary goal for the rating system was for it to be useful for predicting game outcomes [1]. And as mentioned in the previous section, it would need to compute both player and faction ratings. But in addition to these things, there were a couple of secondary goals.

A lot of board game sites use an unmodified Elo system with the standard constants, with the ratings updated after every match finishes. This can make the ratings absurdly volatile. It also encourages people not to play against opponents with much lower ratings. Both of these are driven by the way multiplayer games are less computable than two player games. You can lose a game essentially through no fault of your own, which won't happen in a two player game, and especially if you finish dead last in a five player game, that single game can erase what feels like 20 games of progress to a higher rating. It's no wonder that the players become very conservative and risk-averse.

Some players will even take advantage of this volatility and manipulate their ranking by changing the order in which their games finish. Do you have 10 games about to finish? Just stall in the 5 that you're doing well in, and rush the ones you're losing in. Your final rating will be significantly higher than if you alternated the good and bad games.

All of the above is bad. So a secondary goal was that the system shouldn't be quite that sensitive to the most recent games. The exception is players with very few games played; for those players you do want very high sensitivity so that they get roughly to their proper rating as soon as possible.

And second, I wanted to encourage players to pick the factions perceived as worse more often. So a player should be rewarded more / penalized less for winning with a bad faction than for winning with a good faction.

So, this is what I ended up with.

We start with the usual hack for converting a two player rating system to work with multiplayer games: treat each N player match as N * (N - 1) / 2 two player submatches (one for every pair of players), and apply the rating algorithm to those submatches.
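
To make the decomposition concrete, here's a minimal Python sketch (not code from the site, which is written in Perl) that lists the pairwise submatches of a single game, one for every pair of players. The player names are placeholders.

```python
from itertools import combinations

def submatches(players):
    """Return every unordered pair of players from one multiplayer match."""
    return list(combinations(players, 2))

# Placeholder player names, purely for illustration.
pairs = submatches(["alice", "bob", "carol", "dave"])
for a, b in pairs:
    print(a, "vs", b)
print(len(pairs))  # 6 submatches for a 4 player game
```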

The core of the system is the normal Elo equation, which computes an expected score (somewhere between 0 for a loss and 1 for a win) for both players based on the difference of their ratings:

  my $ep1 = 1 / (1 + 10**(-$diff / 400));
  my $ep2 = 1 / (1 + 10**($diff / 400));

Sorry, I lied! We don't actually use the difference of the ratings of the players. Instead we add up the rating of the player and the rating of the faction they are playing first, and only take the difference after that.

  my $p1_score = $p1->{score} + $f1->{score} * $fw;
  my $p2_score = $p2->{score} + $f2->{score} * $fw;
  my $diff = $p1_score - $p2_score;

What's the $fw variable there? It's a configurable faction weight, in case we want to run experiments with the faction choice being considered more or less important. There's one other thing $fw is used for. If a player drops out, we still want to compute a rating change for them (otherwise they'd just drop out of games they're about to lose to avoid the ranking penalty). But we really should not be penalizing the faction the player was using when they dropped out. It's not like the faction had any way of influencing that! So in the specific case of a dropout, $fw gets set to 0 for that submatch.

  if ($res->{a}{dropped} or $res->{b}{dropped}) {
      if ($settings->{ignore_dropped}) {
          next;
      } else {
          $fw = 0;
      }
  }

We then need the actual result of the game, in the same format as our expected results. So again it's 1 for win, 0 for loss, and now also 0.5 for a draw.

  if ($a_vp == $b_vp) {
      ($ap1, $ap2) = (0.5, 0.5);
  } elsif ($a_vp > $b_vp) {
      ($ap1, $ap2) = (1, 0);
  } else {
      ($ap1, $ap2) = (0, 1);
  }

The difference between the actual and expected results is then used to compute a rating change for both players. Let's return later to what the value of $pot actually is.

  my $p1_delta = $pot * ($ap1 - $ep1);
  my $p2_delta = $pot * ($ap2 - $ep2);
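
For readers who'd rather see the steps so far in one place, here's a hedged Python restatement of a single submatch update. It is a sketch and not the production Perl: all the ratings are made-up numbers, `pot=16` is just the first-iteration value discussed later, and the helper names are hypothetical.

```python
def expected_score(diff):
    """Elo expected score for the side whose combined rating lead is `diff`."""
    return 1 / (1 + 10 ** (-diff / 400))

def submatch_deltas(p1, f1, p2, f2, result1, pot=16, fw=1):
    """Rating changes for both players in one pairwise submatch.

    p1/p2 are player ratings, f1/f2 the ratings of their factions, and
    result1 is 1, 0.5 or 0 from player 1's point of view.
    """
    # Player and faction ratings are added up before taking the difference.
    diff = (p1 + f1 * fw) - (p2 + f2 * fw)
    ep1 = expected_score(diff)
    ep2 = expected_score(-diff)
    return pot * (result1 - ep1), pot * ((1 - result1) - ep2)

# Made-up ratings: the favourite (player 1) loses the submatch.
d1, d2 = submatch_deltas(1100, 50, 1000, -50, result1=0)
print(round(d1, 2), round(d2, 2))
```

With a combined rating lead of 200 points player 1 was expected to score about 0.76, so the upset costs them roughly 12 points and the opponent gains the same amount.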

As the last step of dealing with each submatch we apply the rating changes. There's some subtleties here. First, we don't necessarily do all the updates immediately but batch up the rating records and apply a batch in one go. (For example apply all the changes from a single game in one go rather than have the results of the first submatch affect the interpretation of the other submatches — but see the results section for more on that).

Second, until a player has played a minimum number of games (default 5) they won't have an effect on other players. Or to be clear, we update the rating of a player either if both players are new, or if the opponent is not new. (But not when the player is old and the opponent is new). This has the effect that new players will have a "shadow rating" computed for them, but will not affect the ratings of the opponents or factions.

The idea here is that we have very little idea of the strength of the new player. So it makes no sense to just assume they're an exactly average new player, and use that assumption to adjust the ratings of players for whom we already have a lot of better quality data.

  my $count = ($p1->{games} >= $settings->{min_games}) +
    ($p2->{games} >= $settings->{min_games});

  if ($p2->{games} >= $settings->{min_games} or !$count) {
      push @rating_changes, [$p1, $p1_delta];
  }
  if ($p1->{games} >= $settings->{min_games} or !$count) {
      push @rating_changes, [$p2, $p2_delta];
  }

Faction ratings only get updated if both players have played enough games. The faction weight factor is taken into account here as well.

  next if $count != 2;

  push @rating_changes, [$f1, $p1_delta * $fw];
  push @rating_changes, [$f2, $p2_delta * $fw];

The most dodgy part of what I implemented is that (unlike traditional Elo) this system doesn't run as a streaming process where the ratings are computed purely from old ratings and some number of finished games. Instead the ratings are computed iteratively, taking into account the full history every time the computation is done. There are obvious practical reasons for why you wouldn't want to do things that way, but I'm not expecting to ever have enough games for that to matter.

Every iteration works on exactly the same data. The same game results handled in the same order, using the same settings. However, the ratings are not reset between iterations. At the start of the first iteration everyone has a rating of 1000. At the start of the second iteration they'll have some other value. These different starting values will affect all the subsequent computations. It's kind of back-propagating the later results, such that they affect the algorithm's interpretation of earlier events.

  for (1..$settings->{iters}) {
    iterate_results @matches, %players, %factions, $_, $settings;
  }

There's another difference between the iterations, which is the iteration count being passed in as the 4th parameter. This is where the value for $pot comes from. With the default settings the progression goes 16, 4, 1.77. The later iterations have a smaller effect.

  my $pot = $settings->{pot_size} / $iter ** $settings->{iter_decay_exponent};
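
As a quick check of that progression, here's the same formula in Python. The values `pot_size=16` and `iter_decay_exponent=2` are assumptions inferred from the 16, 4, 1.77 sequence above, not settings copied from the code.

```python
def pot(iteration, pot_size=16, iter_decay_exponent=2):
    """Size of the rating pot for a given (1-based) iteration."""
    return pot_size / iteration ** iter_decay_exponent

pots = [pot(i) for i in (1, 2, 3)]
print([round(p, 2) for p in pots])  # [16.0, 4.0, 1.78]
print(round(sum(pots), 2))          # 21.78
```

The sum of the three pots is where a ceiling of about 22 points per result comes from.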

This also means that the later games have a smaller effect than you might initially think. Sure, every result can contribute up to 22 rating points to a player's rating. But in practice a chunk of the 1st iteration's 16 point rating change would have been undone by the time the 2nd iteration finishes and it's time to divvy up 4 points for that game. The reason is that the player will have a higher starting rating for the 2nd iteration, so every win will count for less and every loss will count for more.

For new players (who only have a few games played) this effect of the older games undoing part of the result of the first game will be almost non-existent. There's not very many of those old games around, after all. This is exactly what we want: for players with few games, every new game tells us a lot relative to what we knew at the start. Later games will however always matter more than the earlier ones, so it's also not the case that someone who has played hundreds of games can't shift their rating at all.

What's not considered by this system? This system only uses the final ranks of the game as input. It doesn't consider the in-game victory points, either as an absolute value or relative to the scores of the opponents. Does this make sense? Surely a win by 10 points should be worth more than a win by 1 point. The latter is basically a draw!

But this seems like a necessary restriction. People love seeing numbers go up, even if the numbers are in reality totally irrelevant. They love it so much that they'll try to optimize for the number going up. If you introduce something other than winning as a component in the rating system, some players will start optimizing for that other thing instead. That's what they're rewarded for, so it must be the right thing! And at that point the system as a whole must become less useful at predicting winning, and better at predicting that other thing.

So even if using more detailed in-game statistics as part of the rating computation would almost certainly provide more accurate statistics, it'd only work if the players are unaware of it.

Evaluating rating quality

The basic process I used for evaluating the rating quality was to first split all my data into two parts. The first 75% or so of the data was essentially a training set, used to compute ratings for all players and factions in the game. The output of the rating system should be such that the ratings of two entities A and B can be converted to a win probability - (0 if A is basically guaranteed to lose, 1 if they're basically guaranteed to win, but mostly clustered in the 0.25-0.75 range). We then compare these predictions against the actual output in the remaining 25%, the evaluation set, with a loss being 0, a win being 1, and a tie (rare but possible) being 0.5.

How should this comparison work? There's a few kinds of metrics we could compute. Some metrics are basically self-contained, completely independent of other prediction sets. "This prediction set scored 1500". Other metrics are derived from comparing two sets of predictions directly against each other. "A was better than B, 100 to 80", "A was better than C, 150 to 60", but this tells you nothing about how B and C would compare.

The rest of this section basically goes through my process of trying to find metrics that worked. It'll therefore stumble in and out of a couple of dead ends.

The simplest possible metric is just to sum up the absolute errors. If the prediction for a match is 0.8 and the actual result is 1, penalize the prediction set by 0.2 points. The closer to zero the score, the better the prediction set. This doesn't actually work. The problem is that it's suboptimal for a system to use its actual prediction; it should always round the prediction to the closest of 0 or 1.

Example: A and B are playing 5 matches. The rating system predicts an 80% win rate for A. And in fact A gets exactly that win rate. The absolute error would be: 4*(1 - 0.8) + 1*(0.8) = 1.6. What if we predict a 100% win rate instead? That'd mean: 4*(1 - 1) + 1*(1) = 1. The less accurate prediction was judged to be significantly better. That's just no good at all.

What you instead need is the sum of squares of errors (SSE). For the same example, the 80% prediction then produces 4*((1 - 0.8)**2) + 1*(0.8**2) = 0.8 while the 100% prediction gives 4*((1 - 1)**2) + 1*(1.0**2) = 1.
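
The two worked examples above are easy to reproduce. A small Python sketch, using the same 4 wins out of 5 scenario (the function names are just illustrative):

```python
def absolute_error(prediction, results):
    """Sum of absolute errors of one fixed prediction over a list of results."""
    return sum(abs(r - prediction) for r in results)

def squared_error(prediction, results):
    """Sum of squared errors (SSE) of one fixed prediction over a list of results."""
    return sum((r - prediction) ** 2 for r in results)

results = [1, 1, 1, 1, 0]  # A wins 4 matches out of 5

for p in (0.8, 1.0):
    print(p, round(absolute_error(p, results), 2), round(squared_error(p, results), 2))
```

The absolute error prefers the overconfident 100% prediction, while the SSE prefers the accurate 80% one.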

What about a direct comparison between two rating systems? Could we just count how many times each system was closer to the actual result?

Example: if one system predicts 70% win rate for A over B while the other predicts a 90% win rate, award the first system a point every time A loses and the second a point for every win. The system with the higher score is better. This fails for a very similar reason as the first method, as accurate predictions are penalized. In this example, if A wins 4 out of 5 matches the first system would get one point while the second one gets 4 points. But both predictions were actually equally far from the actual win rate of 80%.

How can this be fixed? Clearly the reward for the better prediction can't be fixed, but must somehow depend on the odds that the predictions are implying. That is, the predictions are essentially used to place bets. Actually doing a betting system is a bit tricky though, since betting is fundamentally a random process, but we'd like our metrics to be computed in a deterministic manner. After some doodling, I came up with the following method that's deterministic but still retains the core flavor of betting.

Let $e1 and $e2 be the win probability each system gives to player A, and $res be the actual result (0, 0.5, 1). If the predictions are the same, there's no bet to be made. But otherwise both players should be happy to make a bet using the implied odds of the midpoint of those predictions:

my $em = ($e1 + $e2) / 2;

The set that gave a higher prediction will then be adjusted by $res - $em points, while the other will be adjusted by $em - $res points. This has the effect that when the outcome is the one that both systems expected, the winner of the bet will be awarded a small number of points. While if the outcome is a surprise to both systems, the winner of the bet will get a larger amount. It will also be symmetrical: the outcome will be the same if you swap the players around (i.e. invert the predictions and the result).

Example: A and B are playing 5 matches, with A winning 4 games. System X predicts a win rate of 0.7 for A, system Y predicts 0.8, with an average of 0.75. Since Y's prediction was higher, it will receive 1 - 0.75 = 0.25 points for every game that A wins, but lose 0.75 points (0 - 0.75 = -0.75) for the ones A loses. The end result is that Y gains 4*0.25 - 0.75 = 0.25 points from these 5 matches (and X loses the same amount, since this is a zero sum process). This makes sense since Y's prediction was indeed more accurate.
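
Here's a minimal Python sketch of the betting metric as described above, run on the same X vs. Y example. The function name `bet` is hypothetical; the evaluation scripts in the repository may structure this differently.

```python
def bet(e1, e2, res):
    """Points gained by system 1 and system 2 for a single matchup.

    e1 and e2 are the win probabilities the systems give to player A,
    res is the actual result for A (0, 0.5 or 1).
    """
    if e1 == e2:
        return 0.0, 0.0  # identical predictions, no bet to be made
    em = (e1 + e2) / 2   # implied odds: midpoint of the two predictions
    gain_for_higher = res - em
    if e1 > e2:
        return gain_for_higher, -gain_for_higher
    return -gain_for_higher, gain_for_higher

results = [1, 1, 1, 1, 0]  # A wins 4 matches out of 5
x_total = sum(bet(0.7, 0.8, r)[0] for r in results)
y_total = sum(bet(0.7, 0.8, r)[1] for r in results)
print(round(x_total, 2), round(y_total, 2))  # -0.25 0.25
```

Swapping the players around (predictions 0.3 and 0.2, with the results inverted) gives the same totals, which is the symmetry property mentioned above.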

Finally, one more way to compare two rating systems is by only looking at matchups where the two systems produced split predictions; that is, one system gives a win probability of under 0.5 while the other gives a probability of over 0.5. In this metric we simply count which of the two picked the correct winner more often.

This system doesn't suffer from any tactical misprediction issues. But it means throwing away most of the data. It also makes the value judgement that the most important (only important?) part of the prediction space are the matches between players of almost equal skill. Whether that's the right call or not seems to depend on the ultimate purpose of the rating system. The needs of a matchmaking system are different from the needs of a system that tries to predict the results of games between arbitrary players.
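
A sketch of how the split prediction count could be computed, in Python. Two details are assumptions of mine rather than something stated above: ties in the actual result award no points, and a prediction of exactly 0.5 never counts as a split.

```python
def split_prediction_score(preds1, preds2, results):
    """Count correct winner picks for each system on matchups where they disagree."""
    score1 = score2 = 0
    for e1, e2, res in zip(preds1, preds2, results):
        # Only matchups where the systems are on opposite sides of 0.5 count.
        if not ((e1 < 0.5 < e2) or (e2 < 0.5 < e1)):
            continue
        if res == 0.5:
            continue  # assumption: a tie rewards neither system
        # System 1 is right if it favoured A and A won, or favoured B and B won.
        if (e1 > 0.5) == (res == 1):
            score1 += 1
        else:
            score2 += 1
    return score1, score2

# Made-up predictions: the third matchup is ignored since both systems favour A.
print(split_prediction_score([0.6, 0.4, 0.7, 0.45], [0.4, 0.6, 0.8, 0.55], [1, 1, 1, 0]))
```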

TrueSkill

TrueSkill is a rating system from Microsoft for use for Xbox Live online games. It's got three interesting properties. First, it deals natively with multiplayer matches. Second, it supports teams with multiple members. And third, it doesn't track skill as a single number but as a combination of two numbers: an estimate of the skill, and an uncertainty of the estimate.

The system is also very complicated compared to something like Elo. The best explanation of how it works is an epic blog post by Jeff Moser, who also made the first open source implementation in C#. I wasn't man enough to re-implement the algorithm from scratch, and used the Python TrueSkill implementation by Heungsub Lee.

For the purposes of this investigation, we'll represent each player + faction combination as a 2 player team, each game as a match between 3-5 such teams, and then just trust TrueSkill to do the right thing.

The other thing we need is deriving a win probability from two TrueSkill ratings (which, again, is a combination of a skill estimate and an uncertainty). This isn't completely trivial since it needs to take into account the possible skill distribution of all players (expressed as normal distributions) as well as the distribution for how different skill ranges affect the win probability.

Somewhat surprisingly, this doesn't appear to be addressed at all in the literature. I ended up writing the following based on a suggestion by Moser in one of the blog comments (linked above), but have to admit I didn't think too much about whether it's really correct.

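import math
from trueskill import BETA
from trueskill.backends import cdf

def win_probability(a, b):
    # a and b are teams: lists of TrueSkill ratings (here a player and a faction)
    deltaMu = sum([x.mu for x in a]) - sum([x.mu for x in b])
    sumSigma = sum([x.sigma ** 2 for x in a]) + sum([x.sigma ** 2 for x in b])
    playerCount = len(a) + len(b)
    denominator = math.sqrt(playerCount * (BETA * BETA) + sumSigma)
    return cdf(deltaMu / denominator)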

Results

This section will go through individual changes to the rating system starting from nothing, ending up with the rating system in its current form. We'll then look at potential extra features, before finishing with the comparison to TrueSkill. If you skipped over the previous section on evaluating rating quality, below is a quick summary of the three metrics that I'll use:

SSE: Sum of squares of errors. Lower is better.
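Betting: The predictions of the two systems are used to determine the odds for a bet. The system whose prediction was on the right side of those odds gains points, and the other system loses the same amount.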
Split predictions: Look only at cases where the two systems disagree on who is going to win a given pairwise matchup. Give a point to the system that predicts the winner correctly.

The process was to split the game data into two parts. A training data set with about 140k pairwise matchups was used to compute ratings for all players / factions. These ratings were then used to predict the results on the evaluation set of about 55k matchups. The split between training and evaluation sets was done using a cutoff date. The training set contained the games that finished before 2015-06-01, the evaluation set had the games that finished later.
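
The cutoff-based split is simple to express in code. A hedged Python sketch, where the `finished` field and the game records are hypothetical and a game finishing exactly on the cutoff date is assumed to land in the evaluation set:

```python
from datetime import date

CUTOFF = date(2015, 6, 1)

def split_by_cutoff(games, cutoff=CUTOFF):
    """Split games into (training, evaluation) by the date they finished."""
    training = [g for g in games if g["finished"] < cutoff]
    evaluation = [g for g in games if g["finished"] >= cutoff]
    return training, evaluation

# Made-up game records, purely for illustration.
games = [
    {"id": "g1", "finished": date(2015, 3, 14)},
    {"id": "g2", "finished": date(2015, 5, 31)},
    {"id": "g3", "finished": date(2015, 7, 2)},
]
training, evaluation = split_by_cutoff(games)
print(len(training), len(evaluation))  # 2 1
```

Splitting by date rather than at random means the ratings never get to peek at games that finished after the ones they are asked to predict.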

One final point on the test setup is that all the win predictions were done on pairwise submatches, rather than the match as a whole. That applies even in cases where the system used for computing the ratings from the training data was match-based rather than pairwise submatch-based.

A. The dummy rating system

Let's start with the stupidest possible system, which totally ignores the training set, and just predicts a 50% win rate in every pairwise match. On the 29996 matchups in the evaluation set, we get a SSE of 7386.25. That should be our absolute floor.

B. Normal Elo

Moving on, we'll use a normal Elo system with a K-factor of 24. The SSE drops to 6689, and on our "betting" metric B beats the dummy system A by 1658 to -1658. (There's no result on the split prediction test, since the dummy rated everything as 0.5. That metric needs one algorithm to give a result of over 0.5 and the other a result of under 0.5).

C. Normal Elo, minimum of 5 games

Then we introduce the restriction of users not having an effect on other players before they have played at least 5 games (as described in the rating system section). Compared to B, the results are inconsistent. There's a tiny improvement in SSE which drops to 6687. There's also a noticeable improvement in the betting metric, which C wins over B 152 to -152. We can now also get results on split predictions, where B is better by 390 to 370.

This feature seems unlikely to be worthwhile, but it's not harmful either and it's what's used in the current production implementation. So we'll carry on using it in the following test cases too.

D. Iterated Elo

This test uses the iteration model, with the default parameters as explained in the algorithm description: three iterations with K-factors of 16/4/1.77. SSE drops to 6612. More significantly, D wins the betting over C by 830 to -830, and the split predictions by 1355 to 1133.

E. Faction ratings

The next step is to compute faction ratings in parallel to the player ratings, to feed the faction ratings back into the player rating computations, and to take into account both the player and the faction ratings when predicting win probabilities. This is also the system I'm currently using, so it's kind of my "par" value. Any further changes would hopefully be improvements.

When faction ratings are mixed in, SSE drops by a lot to 6471. E wins the betting metric over D by 710 to -710, and split predictions by 2488 to 2110.

F. Per-map faction ratings

The faction ratings being global, rather than computed separately for each map, is a common criticism of the rating system from the player base. In this test each faction and map combination is treated as a distinct entity, with completely separate ratings. Doing this does indeed improve the results a bit. SSE is reduced to 6437. F also wins the betting over E by 383 to -383, and the split predictions by 1147 to 1033.

Making this change would however be tricky from a UI perspective. The rating UI wouldn't scale nicely to 60 "factions", but you can't really aggregate the results either. Some thought required on whether this change would be worth it.

G. Ignoring dropouts

The test driver skips any pairwise matchups where one or both players dropped out. What would happen if we did the same when computing ratings? That is, if either party of a pairwise matchup drops out, don't update the rating for either. This would indeed improve things a bit, which makes some sense. SSE drops to 6419. G also wins betting over F by 113 to -113, and split predictions won by 840 to 814.

This isn't an acceptable change though. Introducing it would immediately make players start dropping out of games they expect to lose. But it is interesting to see what the effect might be in a perfect world, where the rating system would not affect player behavior. [2]

H. Different faction weights

Using lower faction weights than 1 produces almost imperceptibly better results, and not even across the board (it generally makes two categories better, one category worse). This change also seems hard to justify from first principles, so it feels too much like curve fitting. So this is another idea not worth implementing.

I. Batched rating updates

One thing that was kind of dodgy in my original implementation is that the rating changes from a single multiplayer game were not applied in a single atomic unit. Instead we'd first compute the rating changes from a single pairwise submatch, apply those changes, then compute a rating change for the next pairwise submatch based on the already updated ratings. What if the algorithm instead first computed all the changes from that one match, and then applied all of them in one go?

This should make sense, but also not have all that big an impact. Neither of those expectations is true though. The SSE is significantly higher at 6523 (vs. 6437 for F), I loses the betting by -934 to 934, and the split decisions by 660 to 783.

I don't know why this change makes the system perform worse. The most likely reason is that batching the updates makes single events have larger effects on the ratings. That's not entirely satisfactory, since if that were the case just a slightly smaller K-factor should have a similar effect. But that's not what happens in practice.

J. TrueSkill without factions

I'll do all the TrueSkill comparisons to the "Per-map faction ratings" version of the system (F). The first variant to test is TrueSkill that ignores factions completely, both when computing the ratings and when making predictions on win probabilities. This (heavily crippled!) version of TrueSkill has a SSE of 6658, which is somewhere between B and D. J loses the betting against F by -732 to 732, and the split decisions by 2981 to 3500.

K. TrueSkill with factions

Next up we use TrueSkill as intended. Each faction is now treated as a TrueSkill player, and each player / faction combination is considered to be a two player TrueSkill team. This gives a much more respectable SSE of 6474 (still a bit off from the 6437 of F), loses betting by only -181 to 181, and loses the split predictions by 2128 to 2203.

L. TrueSkill with per-map factions

Treating each faction and map combination separately (i.e. changing from E to F) gave a good boost to the accuracy of the iterative algorithm. Would the same approach work for TrueSkill? The SSE does indeed drop to 6441. L still loses to F in the other two metrics: betting by -148 to 148, and the split predictions by a very tight margin of 1921 to 1935.

Again, just as in H, tiny and somewhat inconsistent improvements could be achieved by tweaking the faction weights a bit. Doing so would not change the big picture, and the same arguments against doing so are still valid.

Conclusions

When I did the experiments, I was kind of expecting the conclusion to be a classic tradeoff between complexity and great results vs. simple and "good enough". That turns out to not be the case; the two rating systems were very closely matched, and in fact the quick hack that's currently used seemed to measure marginally better than the matching TrueSkill version.

The current production version isn't quite at a local optimum though, there's one improvement that would be worth adding to it (separate ratings for each map + faction combination). But there doesn't seem to be any great urge to switch to a completely different system. That's a happy ending, since I'd much rather have 150 lines of my own code than 1500 lines written by someone else and that I don't entirely understand.

There are a couple of caveats. It's possible that the formula for going from sets of TrueSkill ratings to win probabilities is suboptimal, and skewing the results. The data set isn't huge and I can't compute a confidence interval of any kind, so it's possible that the results aren't actually significant (but on the other hand I'm happy even with a statistical tie).

Also, while here we just looked at how effective each rating system was at predicting outcomes, there are other criteria that matter in practice. For example, one thing my users are pretty vocally unhappy about is that they perceive the faction ratings to be too unstable. It seems likely that TrueSkill would do a better job there, since the faction ratings it produces have a low uncertainty (the factions have thousands or tens of thousands of games played), so new results would only cause tiny changes to them.

If anyone would like to test how their pet rating system works for this use case, the test data and the evaluation scripts are available.

Footnotes

[1] In the first draft, I wrote that this was "obviously" the goal. But on reflection, it's probably not obvious at all. After all, there are so many rating systems around that seem to have about zero predictive value. Maybe there's a bias towards people who play a lot, or there's no concept of opponent skill. Clearly these systems were designed with some goal in mind, but predictive power certainly wasn't it. Sirlin's delightfully cynical story on Starcraft 2's rating system might be relevant here.

[2] Now that I'm writing this article, it occurs to me that there might be a decent compromise solution: apply a rating penalty to the player who drops out, but don't give any reward to the winner. You'd lose the zero sum property, but that doesn't seem critical. The same is true of Elo variants that use variable K factors depending on number of games played, and they seem to work just fine.

Detecting cheaters in an asynchronous online game

jsnell@iki.fi — Wed, 22 Jul 2015 16:00:00 GMT

Introduction

This post is a description of some tools and data analysis I did for detecting players using multiple user accounts in an asynchronous online game. The code is available at GitHub.

A couple of months ago one of the players on my Online Terra Mystica site had some concerns that some of the players in the tournament were playing with multiple accounts. So I decided to do a bit of digging into the logs to see whether it was really happening or just paranoia.

Gathering and preprocessing the data

Before doing anything else, we need some data to work with. It happens that I store a log record for every move done in any game, mostly for debug purposes. That record contains a bunch of information, but the ones we're going to work with in this analysis are the following:

username
game
timestamp
IP address

I used the records for all games played in the period the tournament ran for (two months starting May 1st), but only looked at players who had entered the tournament (about 750).

The first order of business was to find some suspicious users. I defined this as two users making at least one move from the same IP address on the same day. There were a couple of surprising things. First, 230 players were "suspicious", which was a lot higher than I would have expected. Second, it wasn't just that a user happened to arrive from the same IP as a single other user. It could be as high as 10 other users. There must be a lot more IP address reuse and sharing going on than I was assuming. So this really won't work as anything except a first filter to reduce the search space a bit.
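As a rough illustration of that first filter, here is a minimal Python sketch. It is not the code from the repository; the record layout of `(username, timestamp, ip)` tuples and the name `suspicious_users` are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def suspicious_users(records):
    """Map each user to the other users seen on the same IP on the same day.

    `records` is an iterable of (username, unix_timestamp, ip) tuples; this
    layout is a hypothetical stand-in for the real log format.
    """
    users_by_ip_day = defaultdict(set)
    for username, timestamp, ip in records:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        users_by_ip_day[(ip, day)].add(username)

    shared = defaultdict(set)
    for users in users_by_ip_day.values():
        if len(users) > 1:
            # Everyone in this group shares an IP and a day with the others.
            for user in users:
                shared[user] |= users - {user}
    return dict(shared)
```

The size of each returned set is the "number of other users" figure mentioned above, which is what makes heavy IP sharing easy to spot.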

The next step is to process this data into something that can be used for automatically assessing the similarity of the access patterns of two users. The data in the current form makes it easy to determine whether two players did a move at roughly the same time. But there's also information in the times where no moves were made, as long as it was the turn of one or both of the players to move.

The transformation to get that data is simple: during processing of the input data we keep track of the timestamp T of the last move done in each game. When we see a new record for a game, we mark it as having been that user's turn from T to the timestamp of the new record. This is an approximation, since a game can be waiting for input from multiple players at the same time. But it should be good enough, since the only effect of getting this wrong is moving some samples from weak dissimilarity to no signal.

Finally, the analysis I had in mind wasn't going to work on a continuous scale, so the data is bucketed into 30 minute intervals [1].
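The turn tracking and bucketing could look roughly like the following Python sketch. The record layout, the function name `bucket_states`, and the choice to represent idle time as missing entries are all assumptions for illustration, not a description of the actual scripts.

```python
from collections import defaultdict

BUCKET_SECONDS = 30 * 60  # the (arbitrary) 30 minute bucket size


def bucket_states(records):
    """Return {user: {bucket_index: "moved" | "stalled"}}.

    `records` is a list of (username, game, unix_timestamp) tuples sorted by
    timestamp (an assumed layout). A bucket that is absent for a user means
    the user was idle.
    """
    last_move = {}                  # game -> timestamp T of its last move
    states = defaultdict(dict)
    for username, game, timestamp in records:
        bucket = timestamp // BUCKET_SECONDS
        if game in last_move:
            # Approximation from the text: it was this user's turn from T
            # until the new record.
            first = last_move[game] // BUCKET_SECONDS
            for waiting in range(first, bucket):
                states[username].setdefault(waiting, "stalled")
        states[username][bucket] = "moved"   # a move overrides "stalled"
        last_move[game] = timestamp
    return dict(states)
```

Using `setdefault` for the stalled buckets means a move in any game always wins over a stall in the same bucket, regardless of the order in which the records are seen.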

Quantifying similarity

The next step is to compute a similarity score for the access patterns of two accounts.

There are three states a player could be in for a given time segment, which we'll label "moved" (player moved at least once), "stalled" (player's turn in at least one game, but did not move) and "idle" (otherwise). There are 9 combinations of these states for two players, and depending on whether the combination supports or refutes similarity we assign a positive or negative score based on the following table:

	B moved	B stalled	B was idle
A moved	-10/10	-5	0
A stalled	-1	1	0
A was idle	0	0	0

Moves being done at the same time is a very strong signal, either positive or negative depending on whether the moves were done from the same location or not. Both players being stalled at the same time is a very weak signal (though if the players are stalled for a long time, it can add up since this score is computed once per time segment).

You might have noticed that the matrix is not symmetric. We treat A moving while B stalls differently from B moving while A stalls. This is done to deal with a common pattern where the main account makes moves from both a main computer and a mobile phone, while the sockpuppet makes moves only from the computer. These accounts will appear very similar on visual inspection, but would score as just mildly similar if both cases were treated as a strong negative signal. With an asymmetric scoring matrix, all the strong negative signals will accumulate for one user, leaving the other account with a high similarity score. [2]
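The scoring table translates almost directly into a lookup. This is a small Python sketch of it; the function name `segment_score` and the `same_ip` flag are names invented for this example.

```python
# Rows are A's state, columns are B's state, as in the table above.
SCORES = {
    ("moved", "stalled"): -5,
    ("stalled", "moved"): -1,
    ("stalled", "stalled"): 1,
}


def segment_score(a_state, b_state, same_ip=False):
    """Score one time segment for the ordered pair (A, B)."""
    if a_state == "moved" and b_state == "moved":
        # Strong signal either way: +10 from the same IP, -10 otherwise.
        return 10 if same_ip else -10
    # Every combination involving an idle player carries no signal.
    return SCORES.get((a_state, b_state), 0)
```

Because the pair is ordered, `segment_score("moved", "stalled")` and `segment_score("stalled", "moved")` give -5 and -1 respectively, which is the asymmetry described above.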

There's also a non-local signal that I'm using. In the case where both accounts moved from the same IP, the score is adjusted higher if both accounts had been stalled for a long time just before making those moves. The theory here is that it's a much stronger signal for two players to make a near-simultaneous move after say two days of unforced inactivity than after half an hour of inactivity, since long periods of inactivity tend to be rare, and it should be rarer yet for that period to be finished at exactly the same time for persons that are truly distinct.

These scores are then summed up and divided by a weight (essentially the sum of the absolute values of the scores, but there's a small initial weight so that account pairs require a decent number of samples before showing alarming levels of similarity). This -1 to 1 range is then further normalized to a 0 to 1 range, which I find more convenient to work with.
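A sketch of that normalization in Python follows. The value of `initial_weight` is a made-up placeholder, since the text only says that the initial weight is small.

```python
def similarity(segment_scores, initial_weight=50.0):
    """Normalize summed per-segment scores into a 0..1 similarity.

    `initial_weight` is a hypothetical value; it only needs to be positive
    so that pairs with few samples stay close to the neutral score of 0.5.
    """
    total = sum(segment_scores)
    weight = initial_weight + sum(abs(score) for score in segment_scores)
    raw = total / weight          # in the range -1..1
    return (raw + 1) / 2          # rescaled to 0..1
```

With these numbers, five matching same-IP moves only reach 0.75, while a thousand of them get to about 0.9975. That is the damping effect the initial weight is there for.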

A tiny bit of fudging in the above description is that to deal with quantization discontinuities (A makes a move at 16:29, B makes a move at 16:31), we actually look at a sliding window of three time segments rather than process them completely individually. Since time segments with moves are sparser than non-moves, this will further increase the effective weight of the top-left 10/-10 cell in the matrix. When picking the single cell in the table that gets scored, the priority is moved > stalled > idle, so a single matching (or mismatching) pair of moves might get scored up to three times.
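One way to read the sliding window is that each player's state for a segment becomes the strongest state seen in that segment and its two neighbours. The Python sketch below implements that reading; the dictionary representation and the name `windowed_state` are assumptions.

```python
PRIORITY = {"idle": 0, "stalled": 1, "moved": 2}


def windowed_state(states, segment):
    """Strongest state of one user in the three segments around `segment`.

    `states` maps segment index -> "moved" or "stalled"; missing segments
    count as idle (an assumed representation).
    """
    window = (states.get(s, "idle") for s in (segment - 1, segment, segment + 1))
    return max(window, key=PRIORITY.__getitem__)
```

For the 16:29 / 16:31 example, the two moves land in adjacent segments, but the windowed states of both players are "moved" in both of those segments, so the pair still hits the top-left cell.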

Visualization

In order to get some idea of whether the similarity measure is reasonable or not, we also need some way of visualizing the data of some subset of users. After an hour of trying to get something even remotely readable from R (which usually is pretty decent at visualizing data), I gave up in disgust and wrote a tiny Perl script to just generate an SVG file directly.

It simply has one wide column for each IP address running horizontally, and time running vertically. Each user has a different color for drawing their interactions with the site, as well as their own smaller strips of the full per-IP columns. A time segment where the user did a move is drawn as a square, a time segment where they stalled is drawn as a thin rectangle / line.

Here's an example from a hilarious clique of 8 separate accounts, mostly playing in near-perfect lockstep. How many distinct persons are involved? Click on the thumbnail to open the full image:

This visualization is not perfect. The biggest problem is that with more than about 20 addresses involved there's just too much horizontal scrolling involved on comfortable zoom levels. This would not be impossible to solve. Generally if there are large amounts of IPs involved most of them are completely ephemeral, and could somehow be folded together.

Evaluation

I looked at the similarity scores of some people I know in real life and suspected might be false positives (e.g. working in the same place, or living in the same place). These tended to be in the 0.4-0.6 range. I also looked at some accounts that I was essentially certain were being played by the same person based on other metadata. The scores for those were over 0.9. The latter seems like a reasonable threshold value for flagging particular pairs of accounts for more scrutiny.

When the 7th season of the tournament started, there was a rules change put in place explicitly to forbid playing with multiple accounts. There were no retroactive penalties of course, but looking at how people's behavior changed might provide us with some hints about whether this algorithm was at all on the right track.

I got emails from people who were horrified that they might be banned because both they and their SO played the game. All of these cases were well below the threshold. There were also people apologizing for playing with multiple accounts, saying they hadn't thought it was wrong. These cases tended to be at or above the threshold.

A decent way of visualizing the effect of the rule change is looking at the distribution of the similarity scores for season 6 vs season 7. The way to read these graphs is that the most similar pair of users in season 6 had a similarity of 0.99, the 80th most similar pair a score of 0.9. (Also remember that the scores are not symmetric, so every pair of users with any similarity shows up twice in this graph, once per direction). You can see the proportion of very suspicious accounts being cut in half from one season to the next while the rest of the graph has roughly the same shape:

And as some final behavioral metrics, there were 13 suspicious cliques of users in season 6. At least one player continued on to season 7 from every clique, but in 8 out of 13 cases the clique shrank down to just one player. After the rules change, players above the flagging threshold were about 3 times as likely to not join season 7 as the average player.

Conclusion

A comment I got when I gave a draft of this article to some friends was that it's stupid to write an article about exactly how any kind of anti-cheating setup works. All it does is allow cheaters to figure out exactly what they need to do to game the system. This is definitely true in the general case.

I'd like to hope that this isn't an issue in this case. It's a relatively small community, the stakes are low, and getting any advantage from cheating would be hard work since you'd need about a year of playing to work a sockpuppet player into the top divisions. And the workarounds you'd need are pretty obvious, not anything clever. My impression is that everyone who had been doing this was simply wanting to play in more games, and thought that it wasn't really forbidden. Trying to work around this kind of a detection system by connecting different sockpuppets through different proxies should produce some fairly heavy cognitive dissonance with the idea that this is an allowed practice :-)

Oh, and what about the user whose complaint triggered this little investigation? It turns out that everyone else playing in his league was squeaky clean. Unfortunately the setup of the tournament (with each division having twice as many leagues as the previous one) means that you'll always have pretty high skill differences in the lowest division. The lowest division will after all on average contain a third of the entire tournament's players, and might contain as many as half.

Footnotes

[1] Half an hour is totally arbitrary. I wonder if there's some good way of determining an appropriate time bucket size.

[2] Why not make it a weak symmetric signal then? Because the strong signal is great for genuinely distinct accounts. The penalties will tend to be distributed fairly equally between the two, correctly depressing the scores of both accounts.

A Monte Carlo simulation of Red7

jsnell@iki.fi — Mon, 30 Mar 2015 14:00:00 GMT

Red7 is a very clever little card game, and one of my favorite 2014 releases. But I have wondered about the density of meaningful decisions in the game. Sometimes it doesn't feel like you have all that much agency, and are just hanging on in the game with a single valid move every time it's your turn.

So here's some automated exploration of what a game of Red7 actually looks like from a statistical point of view. The method used here is a pure Monte Carlo simulation, with the players choosing randomly from the set of their valid moves.

Why a Monte Carlo simulation? I started out trying to build a full game tree for a given starting setup, but to my surprise the game tree is actually too large for that to be feasible: 2 weeks of computation for even a single two-player game, and that's after a lot of optimization. The branching factor is just much bigger than it feels like when playing the game.

The rules

(Skip this section if you're already familiar with the game. All you need to know is that we're using the advanced version of the game but without the optional special action rules.)

The rules of the game are very simple. There's a deck of 49 cards (7 colors, numbers 1-7 in each color). In the middle is a discard pile ("canvas"). The color of the topmost card of the discard pile determines the victory condition. You must be "winning" at the end of each turn you take, or you're out of the game.

There are three options to choose from on your turn. Play a card from your hand to the table in front of you (your "palette"), discard a card from your hand to the canvas, or first play a card and then discard a card. If you discard a card with a number higher than the number of cards in your palette, you get to draw a card.

The winning condition is determined based on the color of the canvas (i.e. top card in discard pile):

Red	Highest card
Orange	Most cards of the same number
Yellow	Most cards of the same color
Green	Most even cards
Blue	Most different colors
Indigo	Longest run of sequential numbers (e.g. 4/5/6)
Violet	Most cards with a number lower than 4

If two players are tied for the winning condition (e.g. the rule is green and both of them have three even cards in their palette), the winner is the player who had a higher card included in their card combination (cards that didn't contribute to the winning condition are ignored for the tie breaker). This is primarily based on the numeric value of the card. But if two cards have the same value, the one closer to red in the spectrum wins the tie (e.g. green 5 > indigo 5 > green 4).
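The tie-breaker ordering is just a two-part sort key: number first, color second. Here's a tiny Python sketch of it, independent of the bit-packed representation used in the implementation section; the `(color, number)` tuple layout and the name `card_rank` are assumptions for this example.

```python
# Colors in spectrum order, from the weakest tie-breaker to the strongest.
COLORS = ["violet", "indigo", "blue", "green", "yellow", "orange", "red"]


def card_rank(card):
    """Sort key for a (color, number) card: number first, then color."""
    color, number = card
    return (number, COLORS.index(color))


cards = [("green", 4), ("green", 5), ("indigo", 5)]
best = max(cards, key=card_rank)   # the green 5
```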

The implementation

(Ignore this section if you're not interested in the programming, and skip straight on to the results).

I suspect that every Common Lisp program will eventually evolve to using a clever bit-packing of fixnums as its primary data structure. That's the case here as well.

Cards

A card is an integer between 0 and 55 (inclusive). The low 3 bits are the color, with a 0 being a dummy color that's not used for anything, 1 for violet going all the way to 7 for red. The next 3 bits are the card's numeric value minus one (0-6). Note that with this representation determining the higher of two cards is simply a matter of making an integer comparison.

(deftype card () '(mod 56))

(defun card-color (card)
  (ldb (byte 3 0) card))

(defun card-value (card)
  (1+ (ash card -3)))

We'll also need a way to represent a set of cards, for a player's hand or palette. We're going to use a 56-bit integer for that, with bit X being 1 if the set contains card X.

(deftype card-set () '(unsigned-byte 56))

Adding and removing cards is simple. (Except how annoying is it that SETF LOGBITP is not specified in the standard?).

(defun remove-card (card card-set)
  (logandc2 card-set (ash 1 card)))

(defun add-card (card card-set)
  (logior card-set (ash 1 card)))

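;; Create a new set from a list of cards. ADD-CARD takes the card first
;; and the set second, so fold from the end starting with the empty set.
(defun make-card-set (cards)
  (reduce #'add-card cards :from-end t :initial-value 0))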

We'll also need to be able to iterate through all the cards in a set. This is most easily achieved by using INTEGER-LENGTH to find the highest bit currently set, executing the loop body, clearing out the highest bit, and carrying on.

(defmacro do-cards ((card card-set) &body body)
  (let ((modified-set (gensym)))
    `(loop with ,modified-set of-type card-set = ,card-set
           until (zerop ,modified-set)
           for ,card = (1- (integer-length ,modified-set))
           do (setf ,modified-set (remove-card ,card ,modified-set))
           do ,@body)))

Scoring

With these primitives we can then write a very fast function to determine who is currently winning the game. We'll base this evaluation function on scoring a combination of a palette + rule, and comparing the score that each player gets with the current rule. This is a much better way than trying to directly compare the palettes. If you're caching this evaluation function, you get a much higher cache hit rate when the cache key depends only on the state of one player rather than a combined state of two players. (I'm also pretty sure that given this data layout, computing a score will be faster than any kind of direct comparison).

Let's start off with the general structure, and fill in the details as functions under LABELS afterwards. So given a card-set and a color, we'll return a score for that set:

(defun card-set-score (card-set type)
  (labels (...)
    (ecase type
      (7 (red))
      (6 (orange))
      (5 (yellow))
      (4 (green))
      (3 (blue))
      (2 (indigo))
      (1 (violet)))))

Red (highest card) is trivial. We just find the highest card in the set with a call to INTEGER-LENGTH.

           (red ()
             (integer-length card-set))

For other rules we can make good use of the following helper function. It matches the set against a bitmask, and returns a score based on the number of bits that are set both in the set and the mask (main part of score) which we get with LOGCOUNT, as well as the highest bit set in both (the tiebreaker). Given this definition, most of the scoring types can be written in a very concise manner:

           (score-for-mask (mask)
             ;; MATCH-COUNT is the main score; the highest matching card
             ;; is the tiebreaker.
             (let* ((matching-cards (logand card-set mask))
                    (match-count (logcount matching-cards))
                    (best-matching-card (integer-length matching-cards)))
               (+ best-matching-card (* 64 match-count))))

For orange (cards of one number) we start with a bitmask that matches all bits corresponding to a card with the value 7. We compute the score for that mask, then shift the mask right by 8 bits such that it covers the cards with the value 6. Repeat 7 times, and find the maximum score. (We don't need to know which iteration produced the highest score, only what the score was).

           (orange ()
             (loop for mask = #xff000000000000 then (ash mask -8)
                   repeat 7
                   maximize (score-for-mask mask)))

Yellow (most cards of the same color) is very similar. We start off with a bitmask that matches all the red cards (so bit 55, 47, 39, etc) and compute the score. Then shift it right by one, such that the mask matches all orange cards instead. Again repeat 7 times and maximize.

           (yellow ()
             (loop for mask = #x80808080808080 then (ash mask -1)
                   repeat 7
                   maximize (score-for-mask mask)))

Green (most even cards) and violet (most cards under 4) are trivial; we can just score a single mask matching the even cards for green, all cards of value 1, 2 or 3 for violet.

           (green ()
             (score-for-mask #x00ff00ff00ff00))
           (violet ()
             (score-for-mask #x00000000ffffff))
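The mask constants above look a bit magical, but they fall straight out of the card encoding. As a sanity check, here's a short Python sketch (Python only because it makes for a quick throwaway check; the helper names `card` and `mask` are made up for this example) that rebuilds each mask from a predicate and compares it against the literal used in the Lisp code.

```python
def card(color, value):
    """Encode a card as in the post: color in the low 3 bits, value - 1 above."""
    return ((value - 1) << 3) | color


def mask(predicate):
    """Bitmask with bit X set for every card X that satisfies `predicate`."""
    bits = 0
    for value in range(1, 8):
        for color in range(8):      # includes the unused dummy color 0
            if predicate(color, value):
                bits |= 1 << card(color, value)
    return bits


SEVENS = mask(lambda color, value: value == 7)
REDS = mask(lambda color, value: color == 7)
EVENS = mask(lambda color, value: value % 2 == 0)
BELOW_FOUR = mask(lambda color, value: value < 4)

assert SEVENS == 0xFF000000000000        # starting mask for orange
assert REDS == 0x80808080808080          # starting mask for yellow
assert EVENS == 0x00FF00FF00FF00         # green
assert BELOW_FOUR == 0x00000000FFFFFF    # violet
# Shifting moves a mask to the next lower value (by 8) or color (by 1).
assert SEVENS >> 8 == mask(lambda color, value: value == 6)
assert REDS >> 1 == mask(lambda color, value: color == 6)
```

The value-based masks also cover the bits of the dummy color 0. That's harmless, since no card set ever contains those cards.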

Blue (most cards of different colors) is where we get into unintuitive territory. Let's start with the tiebreaker; it's obviously guaranteed that the highest card in the palette as a whole can be included in this winning set, so we can just use INTEGER-LENGTH on the whole set the same way we did for the red scoring rule.

To get the number of different colors, we will fold the card set multiple times. First we'll do a bitwise OR of the high 32 bits and the low 32 bits. Then we'll OR bits 0-15 of that result with bits 16-31. And finally one more OR of bits 0-7 with 8-15. The low 8 bits are now such that bit 7 is set if any of the "red" bits in the original were set, bit 6 if any of the "orange" bits, etc. We can then just use LOGCOUNT on that byte to get the number of colors present in the palette, and combine it together with the tiebreaker score computed above.

           (blue ()
             (let* ((palette card-set)
                    (best-card (integer-length palette)))
               (setf palette (logior palette (ash palette -32)))
               (setf palette (logior palette (ash palette -16)))
               (setf palette (logior palette (ash palette -8)))
               (+ best-card
                  (* 64 (logcount (ldb (byte 8 0) palette))))))
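To see the folding at work outside of Lisp, here's the color-counting part transcribed into a few lines of Python. This is only an illustrative sketch; the function name `color_count` is made up, and the tiebreaker part of the score is left out.

```python
def color_count(card_set):
    """Number of distinct colors in a 56-bit card set, via three folds."""
    folded = card_set
    folded |= folded >> 32   # fold the high half onto the low half
    folded |= folded >> 16
    folded |= folded >> 8
    # Bit N of the low byte is now set if any card of color N was present.
    return bin(folded & 0xFF).count("1")


# Red 7 (bit 55), red 1 (bit 7), green 3 (bit 20) and violet 5 (bit 33).
palette = (1 << 55) | (1 << 7) | (1 << 20) | (1 << 33)
assert color_count(palette) == 3
```

Every shift is a multiple of 8, so a card's color bits never move to a different position within a byte. That's what makes it safe to OR everything together.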

Finally, there's indigo (longest straight). There does not appear to be any clever bit manipulation trick to compute this quickly (if you can think of one, please let me know!). We need to iterate through the cards in order of descending value, ignore any consecutive cards with the same number, and reset our scoring computation when the straight gets interrupted by a missing number.

           (indigo ()
             (let ((prev nil)
                   (current-run-score 0)
                   (best-score 0))
               (declare (type (unsigned-byte 16) current-run-score best-score))
               (do-cards (card card-set)
                 (cond ((not prev)
                        (setf current-run-score card)
                        (setf prev card))
                       ((= (card-value card) (card-value prev)))
                       ((= (card-value card) (1- (card-value prev)))
                        (incf current-run-score 64)
                        (setf prev card))
                       (t
                        (setf current-run-score card)
                        (setf prev card)))
                 (setf best-score (max best-score current-run-score)))
               best-score))

Players

A player is defined as a normal structure, with the only oddity being that they form a circular linked list using the NEXT slot. This tends to be more convenient for iterating through players in turn order than keeping them stored in an external collection of some sort.

(defstruct (player)
  (id 0 :type (mod 5))
  eliminated
  (hand 0 :type card-set)
  (palette 0 :type card-set)
  (score-cache (make-array 16) :type (simple-vector 16))
  (next nil :type (or null player)))

The core operation of generating a list of valid moves is deciding whether the player is winning the game after a move is made. When doing this we'll end up repeatedly evaluating the scores for the same palettes over and over again. To speed this up, there's a minimal cache; for each player / rule combination we store both the last palette we evaluated for that rule, as well as the score.

(defun player-score (player rule)
  (declare (type (mod 8) rule))
  (let* ((palette (player-palette player))
         (cache (player-score-cache player))
         (cached-key (aref cache rule)))
    (if (eql cached-key palette)
        (aref cache (+ rule 8))
        (progn
          (setf (aref cache rule) palette)
          (setf (aref cache (+ rule 8))
                (card-set-score palette rule))))))

Given that way to score a player against a rule, we can then check whether the current player is winning the game with the rule.

(defun player-is-winning (player rule)
  (loop with orig-player = player
        with orig-score of-type fixnum = (player-score player rule)
        for player = (player-next orig-player) then (player-next player)
        until (eql player orig-player)
        do (when (>= (the fixnum (player-score player rule))
                     orig-score)
             (return-from player-is-winning nil)))
  t)

We can then generate all valid moves by iterating through all the PLAY, PLAY+DISCARD, and DISCARD combinations for the player's current state, and collecting the ones that result in the player winning.

(defun valid-moves (player current-rule)
  (let (valid-moves)
    (labels ((check-discard (play-card)
               (do-cards (discard-card (player-hand player))
                 (unless (or (eql play-card discard-card)
                             ;; Filter out cases where player discards a card
                             ;; without changing rule or gaining a new card.
                             (and (eql current-rule (card-color discard-card))
                                  (>= (logcount (player-palette player))
                                      (card-value discard-card))))
                   (when (player-is-winning player (card-color discard-card))
                     (push (cons (cons :play play-card)
                                 (cons :discard discard-card))
                           valid-moves)))))
             (check-plays ()
               (do-cards (play-card (player-hand player))
                 (setf (player-palette player)
                       (add-card play-card (player-palette player)))
                 (when (player-is-winning player current-rule)
                   (push (cons :play play-card) valid-moves))
                 (check-discard play-card)
                 (setf (player-palette player)
                       (remove-card play-card (player-palette player))))))
      (check-plays)
      (check-discard nil))
    valid-moves))

Other stuff

There's a little bit more code required to generate the scaffolding for a game, and to actually do the random walk through the game tree. None of that code is particularly interesting, nor are the INLINE or TYPE declarations that you'd need to sprinkle on the above code to make it fast. The full code is available on GitHub.

Performance

In the optimal case of trying to iterate through the whole game tree in a 2p game, the average cost of making a move is about 500 cycles, with my desktop doing 7 million moves per second. This is however amortizing the cost of computing the set of valid moves across all of those moves (since in a full search every valid move gets executed). If you're just doing a pure random walk with no backtracking, you'd get no amortization at all. That effect makes an order of magnitude difference.

But it's funny that the biggest profiler hotspot in the program is the PLAYER-SCORE function. Which, if you remember, will simply do an array lookup to get the previous cache key, compare it to the card-set that should be evaluated, and either return a previous result or call out to the real scoring function. The function does basically nothing, but it does nothing really often. When all of the things of substance are pretty fast as well, it's maybe not a surprise that the bottleneck ends up in a place like that.

Results

(Skip this section if you're not actually interested in the game, and just wanted to read some Common Lisp code).

The following results are computed from running simulations of 10k different initial setups, with 100k matches for each simulation with each player making random but valid moves. (So a total of one billion games). All plays were with 3 players, the only player count I consider worth playing.

As a sanity check, I ran a smaller simulation of 1000 initial setups where the players would not play a card + discard, if just playing that same card was sufficient to get into the lead without a discard. The results were very close to the large fully random simulation (e.g. the average game length was 14.6 instead of 14.1 turns, and the win percentage of the best turn order position was 39% rather than 40%).

Finally, an even smaller scale experiment had the AIs use move selection heuristics very similar to those I personally use when playing the game. Those results didn't differ materially from random play either.

Caveats

Unless stated otherwise, all of the numbers are from games with players making completely random moves. It is possible that the aggregate statistics are different when players consciously build toward palettes that are strong in multiple scoring rules, or strong in rules that they have a lot of cards in hand for.

The games are always played with the full deck, whereas in reality the deck slowly depletes from hand to hand as cards are moved to the scoring piles of players.

Starting player effect

One thing I was curious about is whether the starting player has an advantage, a disadvantage, or neither. It's not obvious, since there are effects both ways.

The case for a disadvantage: Running out of cards means losing the game, and all other things being equal the first player will also run out of cards first. Due to the way in which the player order is picked, the last player is also guaranteed to have the highest value starting card in their palette, giving them a leg up on winning future tiebreakers.

The case for an advantage: The earlier in turn order a player is, the fewer cards the opponents have in their palettes. It's much easier to pass two players with one card each, than two players with two cards each. And this effect continues throughout the game, so it should accumulate over time.

It turns out that at least with undirected random play there's a major disadvantage to being first. It could be that the effect is smaller when players are making "good" moves.

Position	Win rate
1st	27.20%
2nd	32.42%
3rd	40.37%

Number of possible moves

As mentioned above, the branching factor in the game was higher than I'd been expecting. There are cases where players have a lot more moves available than I would have expected.

The theoretical maximum number of options is 7 + 7 + 7 * 6 = 56, where a player can get in the lead either by discarding any of their cards, playing any of their cards, or with any combination of the two. This situation actually happened a total of 483986 times in 14 billion moves (about 0.003% of the time). A lot more common than I would have thought.

But of course we don't particularly care about a case that rare. The more common cases are more interesting. The following graph shows how often you have at least X moves available in the game.

For example, you can see that about a third of the time a player had 10 or more options to choose from. It appears that the game is nowhere near as constrained as I thought, even when playing without the special action rules.

Length of game

The average game lasted for 14.2 turns, which is perhaps less than I expected given 2 of those 14 turns were by definition a player just dropping out from the match.

There were some games that already ended on turn 4, which meant that only two cards were played in the game. That number was a mercifully low 0.01%. And while there were players who got eliminated before playing a card, there at least were no games ending in turn 2 or 3 even if that's theoretically possible. And a single game lasted all the way to turn 28.

The following graph shows how large a proportion of the games were still running on a given turn.

Effect of player decisions

The final question is about how strongly predetermined a single hand of Red7 is, and how much a player can affect it.

We've already established that at least with this skill level of play there's a very large starting player disadvantage, but is that an isolated issue or does the setup matter even more than that? In these simulations all players are by definition equally skilled. If the end result of the game is primarily determined by player skill, you'd thus expect them to have similar win rates from setup to setup. So let's graph the distribution of per-setup win rates for each starting position:

Now, this graph is a little abstract since we're looking at probabilities of probabilities. The way to read this is that across those 10000 starting setups, the most common win percentage for player 1 (red) across the 100000 games in a specific setup was around 15% (the peak of the red line is at around 0.15). You can see that the later players in turn order have a graph that's shifted further to the right, which is what you'd expect when they have a substantially higher win percentage. But you can also see that from any starting position you might get absolutely dismal win rates (near 0) or very high win rates (over 80%). The ridiculously high win rates (95%) appear to be purely reserved for the player last in turn order.

There were two setups where a player didn't manage to win even a single match out of 100000 (in both cases that was player 1). In 25% of the cases the player with the worst chance of winning a setup had a 10% win rate or lower, in 7% of the cases a win rate of 5% or lower. It does appear that within a single hand of Red7, luck plays a massive role.

Out of all of the questions we've been looking at, this is of course the one where the applicability of a purely random search strategy is the most questionable. If we're investigating the effect of player skill, how can results from the least skillful play imaginable be relevant? I'm sympathetic to that argument, but before buying into it I'd really like to understand the mechanism by which one player is supposed to disproportionately benefit from the random play.

Also... As mentioned earlier, I also tried extending the AIs to be smarter about selecting each move. This was not based on any kind of lookahead, but simply the kinds of heuristics I'd usually use myself when playing the game. If I can get into the lead either by playing a card or discarding a card (without drawing a new one to replace it), I'd rather play a card since that's going to be useful on future rounds. When choosing which of two cards to play, I'd usually prefer to play the one that adds strength to more different scoring rules.

Experiments with one AI player getting use of these kinds of heuristics while the others played completely randomly did not show a big effect: the changes in the win rate were on the order of 1-2 percentage points.

Future work

I might be done with this little project, but if I pick it up again there are a couple of obvious directions to take this. Implementing the optional special action rules would be nice. That's my preferred form of the game anyway.

The more interesting one is to extend the current system to be a full AI using the Monte Carlo Tree Search approach. This would allow generating statistics based on "good" play of the game, maybe provide information on what kinds of moves are in general successful, as well as give a more conclusive answer to the level of skill the game has.

The tricky bit with evolving this code to an MCTS is that the system in the current form would allow the MCTS to exploit knowledge of future random events and hidden information. It would need to randomize all card draws (currently deterministic), as well as swap the opponents' hands for random cards for the duration of the evaluation phase, and then swap the original deck and original hands back in for the move execution. That's going to slow down each individual move a lot, which is a problem when MCTS will intrinsically require computing several orders of magnitude more moves than a random walk.
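The hand-and-deck swap described above could look roughly like the following. This is a sketch under assumptions, not code from the simulator: the function name, the dictionary layout, and the idea of pooling "everything the searching player cannot see" are all illustrative choices.

```python
import random

def determinize(my_hand, opponent_hand_sizes, unseen_cards, rng):
    """Return a randomized guess at the hidden state for one evaluation.

    unseen_cards is everything the searching player cannot see: the real
    deck plus the real opponent hands, in any order. The inputs are not
    modified, so the real deck and hands remain available for executing
    the move that is finally chosen.
    """
    pool = list(unseen_cards)
    rng.shuffle(pool)
    hands = []
    for size in opponent_hand_sizes:
        hands.append(pool[:size])
        pool = pool[size:]
    return {"my_hand": list(my_hand), "opponent_hands": hands, "deck": pool}

rng = random.Random(1)
unseen = list(range(20))  # placeholder card ids
guess = determinize(["a", "b"], [3, 4], unseen, rng)
print([len(hand) for hand in guess["opponent_hands"]], len(guess["deck"]))
```

Each evaluation would call this again to get a fresh shuffle, which is where the extra per-move cost comes from.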

Command languages as game user interfaces

jsnell@iki.fi — Mon, 08 Dec 2014 12:00:00 GMT

In the previous post in this series, I promised to discuss in detail some of the positive and negative consequences of the less conventional design choices of my online Terra Mystica implementation. If you have no idea of what that is, reading at least the intro of that post might be a good idea. This post will just deal with one design choice, but it's the elephant in the room: the command language.

The canonical internal representation of a game in my TM implementation is as a sequence of rows, each describing some number of player actions specified in an ad hoc mini language, or administrative commands that change the game setup in some way (for example setting game options, or dropping a player from the game partway through). This is what it might look like:

yetis: action ACT4
cultists: upgrade E6 to TE
cultists: +FAV6
giants: Leech 3 from cultists
giants: pass BON4
yetis: Leech 2 from cultists
cultists: +WATER
dragonlords: Decline 2 from cultists
dragonlords: dig 1. build G6
yetis: send p to EARTH
cultists: action FAV6. +AIR
dragonlords: pass BON7
yetis: upgrade E7 to TE. +FAV11
giants: Leech 3 from yetis
dragonlords: Leech 2 from yetis
cultists: Leech 2 from yetis

That's a short excerpt from the middle of a random game. A full game generally runs for about 400 rows.

What do I mean by this being the canonical internal representation? Only a few parts of the game state are actually persisted separately in the DB; these are things that might almost qualify as metadata, such as whose turn it is to move, whether the game is still running, and what the final rankings of a finished game were. But in general the only way to find out the current state of the game is to evaluate the whole sequence of commands from start to finish. This is in fact done for almost every operation on the site (viewing a game, previewing a move, saving a move, viewing or editing the game in admin mode, and so on).
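To make the "evaluate the whole sequence" idea concrete, here is a minimal sketch of a replay loop. It assumes rows of the form "faction: command. command" as in the excerpt, and everything else is hypothetical: the function names, the apply_command callback, and the toy evaluator that merely counts commands. The real site is written in Perl and evaluates actual game rules.

```python
def parse_row(row):
    """Split 'faction: cmd. cmd' into (faction, [cmd, ...])."""
    faction, _, rest = row.partition(":")
    commands = [c.strip() for c in rest.split(".") if c.strip()]
    return faction.strip(), commands

def replay(rows, apply_command, initial_state):
    """Rebuild the current state by evaluating every row from the start."""
    state = initial_state
    for row in rows:
        faction, commands = parse_row(row)
        for command in commands:
            state = apply_command(state, faction, command)
    return state

# Toy evaluator standing in for the rules: count commands per faction.
def count_commands(state, faction, command):
    updated = dict(state)
    updated[faction] = updated.get(faction, 0) + 1
    return updated

log = [
    "dragonlords: dig 1. build G6",
    "yetis: send p to EARTH",
    "cultists: action FAV6. +AIR",
]
print(replay(log, count_commands, {}))
```

Nothing about the state is stored between calls; asking for the state after any prefix of the log is just a replay of that prefix.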

In addition to being the canonical internal representation, the command language is also the canonical user interface; the fundamental operation players do is enter new rows into the command sequence. Often this is done by writing the commands manually, though there are GUI shortcuts of one form or another available for almost all operations.

This might sound like a slightly insane way of doing things, but it does have some benefits as well. I've made several digital board game adaptations of varying levels of completeness over the years, used tens of other ones, and this solution hits the closest to my personal sweet spot.

A taxonomical diversion

Before discussing the fallout of this design decision in more detail, it's probably useful to do a quick tour of some of the main axes in the design space. (I'm of course just describing the extremes, while in the real world most examples would fall on a continuum).

First, there's the question of the interaction model which might be abstract or skeuomorphic. In a skeuomorphic design the player doing input on a computer would still be mimicking the actions of someone playing the game with physical pieces and no computer assistance.

In an abstract design the player would only input the parts of the move that are necessary to uniquely distinguish it from other possible moves, with any bookkeeping and mandatory intermediate steps being carried out automatically. Likewise in a skeuomorphic design the software provides information through the same methods as the original physical game, while an abstract design will automate some of the mechanical parsing of the game state. Or even just the question of using the graphical assets of the original game, generally optimized for sales, versus using digital-first assets optimized for clarity.

As an example of this axis, in the 18xx series of games a substantial amount of playtime is spent computing the exact routes of a number of trains on a complex rail network. I'm aware of three solutions that are actually in use, and there is a fourth plausible one, in order from least to most abstract:

The user manually decides on the routes, computes their values with no computer assistance, and those values are used with no validation. Examples: ps18xx, early versions of Rails.
The user enters valid routes through a user interface. The software computes the values of the routes, and distributes the income from the company appropriately. Example: rr18xx.
In games with requirements that all routes must be optimal, the software could compute an optimal route but only for the purpose of rejecting any manually computed suboptimal ones. Examples: None. (Though it's similar to what's done in the SlothNinja implementation of Indonesia, a game that probably counts as an honorary 18xx)
The software automatically finds an optimal set of routes and computes their values. Examples: The ancient DOS-based 1830 from Simtex, recent versions of Rails.

My own tastes run toward maximum abstraction; I've rarely if ever seen a digital boardgame conversion that needed to be more skeuomorphic. But this is not a universal view. There are definitely people who will refuse to play a conversion that does not use the same graphics as the physical version. Or who will strenuously argue against automatic finding of optimal routes in 18xx, on the basis that evaluating routes is a core skill in the game when making decisions about route building, and that skill can only be acquired by getting sufficient practice in manual route computation.

A second axis is the internal representation, which could be based on either log replay or stored state. In a log replay system the game is stored as a series of steps from the starting setup to the current state. In a stored state system the game is stored as the current values of all pieces of the game. How much money does every player have, which round is it right now, what's in this exact space on the map, and so on.

A third axis is the input model. Moves could be entered either through direct or indirect manipulation. In a system using direct manipulation, the player would for example see a graphical display of a map and be able to click or drag on a unit to enter a move for it. In an indirect system the player observes the game state in one place, and enters their moves using some completely unrelated system.

I think most digital boardgames use a direct input model, but there are also a fair number that have a menu-driven system of some sort. The only examples I know of that go a bit further with indirection by providing a command language are my ancient Paths of Glory mapper and the even older Diplomacy PBEM judges. If you have other examples, I'd love to hear of them.

Direct manipulation is often, but not always, linked to excessive skeuomorphism in the interaction model. For example I find it almost painful to play most Vassal modules, with their hyper-direct interaction model of dragging and dropping counters around, manually drawing cards from a deck or rolling dice. Digital boardgames are not the same medium as physical boardgames, and should play to their unique strengths. But these are in fact orthogonal concerns, and there's no reason why a direct manipulation model couldn't also provide useful input and computational abstractions.

Whew, so much for the theory. In this taxonomy Online Terra Mystica is pretty far toward the abstract end, and is fully in the log replay camp. While it has a half-hearted attempt at adding some direct manipulation concepts to the UI, it started off as an indirect system and deep inside that's what it is. It also chooses to merge the input format and the log format into one entity. So what does this mean?

Feature set

Perhaps the signature feature of the site is the planner. This tool allows the player to enter an arbitrarily long sequence of actions - all the way to the end of the game - and see what the effects would be. Are all the moves valid? Are there sufficient resources available to do all of this? Oh, I don't have enough resources? Well what if I do this on round 5, and delay that action to round 6. In cases where the plan fundamentally depends on the opponents doing something, it's possible for the plan to also contain arbitrary resource adjustments. And finally, since the command language supports comments, these plans can be properly documented so that when you return to them in a day or two, you can remember why you wanted to do these particular moves.

I think this feature is intrinsically linked to the command language as a user interface, and it might actually be unique. There are some games with other kinds of interfaces that allow you to play the game forward, and then undo / rewind / reload. But simply being able to play the game forward is not sufficient to make this a useful tool. It's only the ease of inserting, reordering and deleting moves that makes it possible to use this as a matter of course, rather than only under the most exceptional circumstances.
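The reason a text log makes planning cheap can be shown with a toy example. The rules here are invented purely for illustration (a single coin counter with "gain N" and "spend N" rows, and "#" starting a comment); only the shape matches the description above: replay the committed rows, then the plan, and report the first row that fails.

```python
class InvalidMove(Exception):
    pass

def apply_row(coins, row):
    """Toy rules: 'gain N' adds coins, 'spend N' removes them."""
    verb, amount = row.split()
    amount = int(amount)
    if verb == "gain":
        return coins + amount
    if verb == "spend":
        if amount > coins:
            raise InvalidMove(f"cannot spend {amount} with {coins} coins")
        return coins - amount
    raise InvalidMove(f"unknown command: {row}")

def check_plan(committed, plan):
    """Replay committed rows plus the plan; return (coins, error)."""
    coins = 0
    for row in committed:
        coins = apply_row(coins, row)
    for number, row in enumerate(plan, start=1):
        text = row.split("#", 1)[0].strip()  # drop comments
        if not text:
            continue
        try:
            coins = apply_row(coins, text)
        except InvalidMove as error:
            return coins, f"plan row {number}: {error}"
    return coins, None

committed = ["gain 5"]
print(check_plan(committed, ["spend 3  # round 5", "spend 4  # round 6"]))
# Fix the plan by inserting an assumed resource adjustment between the rows.
print(check_plan(committed, ["spend 3", "gain 2  # assumed income", "spend 4"]))
```

Inserting, reordering, or deleting a planned move is just a list edit followed by another replay, which is what makes this usable as a matter of course.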

A somewhat related feature is undo. Inflexibility in allowing moves to be taken back is the bane of many forms of digital boardgames. When playing a game face to face, most groups will generally allow at least some level of taking back moves. In some cases all moves are final immediately (this has always been the primary problem of the otherwise brilliant implementation of Brass at Order of the Hammer). In some other implementations there are distinct checkpoints, for example BGO's Through the Ages allows undoing back to the start of your full turn, but no other rollbacks (clicking 'finish turn' is final, as is any kind of action during an auction or war resolution). These two are, I believe, examples of undo being limited for design reasons. At rr18xx meanwhile rollbacks are possible until the previous action of each player. Here my understanding is that the overriding issue is technical, as the rollback is essentially a full restore to a previous database snapshot, and there are resource constraints on how many snapshots can be kept.

The solution Online TM takes to this is to grant the creator of the game arbitrary powers to edit the history at will, the admin mode. Not only can they undo the last move or couple of moves. If there was a mistake made three moves back, they can go and fix it (and they can fix it without forcing the intervening moves to be redone). This feature is fully tied to a log replay mode of operation. While more limited forms of undoing could be implemented as a reverse log replay from the end state or through state snapshots, this more complete form depends on the log being directly editable. And realistically the log also needs to be the input format; it would not be reasonable to expect the admin to be able to edit a more formal log representation correctly (whether the log format is XML, protocol buffers, JSON, or something else). But in the case where the log format and the move input system match, just playing the game has taught the game admin the necessary skills.

This is a very nice feature for friendly games. It does have downsides though, more on that later in the section on the social implications.

There's also a potential as yet unimplemented feature of pre-programmed actions, that people frequently ask for. "I know exactly what I want to do next turn, why can't I just pre-enter my move". This would be a pretty interesting thing for speeding up games, but to my mind would not be conducive to good play. Circumstances change, often in ways you did not anticipate at all. The only way this could be even remotely usable would be if the language was extended to have some kind of conditional execution. And that's a can of worms I'm not interested in opening, and I suspect also a bridge too far for 99% of my users.

It's worth noting that many of the above features are closely tied to a game with no randomness (or at most setup randomness) and no hidden information. As such their existence is something of an anti-feature, preventing other additions to the game.

For a non-hypothetical example, I'm currently thinking about how to implement the faction auction variant from the TM expansion. A full open auction in the beginning would be painfully slow. The most obvious, though still slightly imperfect, solution is a series of blind second price auctions. But this is not a good fit for the site's existing design. The problem is that the blind bid introduces momentary hidden information into the game, and it's possible for that information to leak through either the preview or admin modes. For example the admin could wait for everyone else to bid, peek into the log and see everyone else's bids, and then bid in such a way as to force the winner to pay the maximum amount.
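For readers unfamiliar with the mechanism, a sealed-bid second-price auction and the leak described above can be sketched in a few lines. The function name, the tie-breaking rule, and the player names are assumptions made for the example; this is not how the site resolves anything today.

```python
def second_price_auction(bids):
    """Resolve one sealed-bid second-price auction.

    bids maps player name to bid. The highest bidder wins and pays the
    second-highest bid (or nothing if they were the only bidder).
    Ties go to whoever appears first in the dict, an arbitrary choice.
    """
    if not bids:
        raise ValueError("no bids")
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1] if len(ranked) > 1 else 0
    return winner, price

honest = {"alice": 10, "bob": 4}
print(second_price_auction(honest))  # ('alice', 4)

# An admin who has peeked at the log bids just under the top bid.
leaked = dict(honest, admin=9)
print(second_price_auction(leaked))  # ('alice', 9)
```

The winner is unchanged in the second case, but the price has more than doubled, which is exactly the abuse that peeking at a replayable log would permit.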

UX

The most obvious UX consequence of using a command language is that it tends to be harder to learn. The following quote, said partly in jest, certainly contains a kernel of truth:

... has done a bang-up job providing a PBEM Terra Mystica experience that includes just enough extra layers of complexity via the interface and game administration tools to keep TM as confusing as ever, long after you master the actual game!

Non-natural languages are simply not a mode of human computer interaction that most people are comfortable with in this day and age. It actually continues to amaze me that I could get non-programmers to play using this implementation at all. Is it possible to evaluate how big a hurdle this has been for people? The best number I can come up with is that around 20% of the players who joined at least one game never finished even one game without dropping out. Note that these are players who have already jumped through hoops such as email validation during account registration. It's possible that there's some other issue besides the UI that's a problem for these players, but it does seem like the most likely candidate.

A smaller problem is that it essentially forces the introduction of a move preview. For those who haven't played the game, when entering moves you need to first enter the moves, then click 'preview', check that the results match what you want, and finally click 'save' to commit the moves. In a game that uses a direct manipulation paradigm, a preview could be skipped. But with a more obscure UI like here, it's absolutely essential since the move might not have had the intended effect. Whether it's doing the entirely wrong move, picking the wrong tile, building on the wrong location, etc. Even with a preview step somebody will request a rollback on average once or twice a game.

So why do I call this a problem? Because despite my best efforts, especially new players will frequently forget to 'save', leaving the game in a limbo state where they think they've done their move, until some other player gets impatient. (To mitigate this a little, the system will automatically do a 'preview' when using the GUI tools to generate the commands rather than type them. Unfortunately performance problems make it unfeasible to trigger continuous parsing + updates when typing).

A horrible mistake I made in the design of the language was the lack of (mandatory) turn delimiters. Originally my implementation treated each row as a complete turn. This caused more confusion than any other part of the command language. In the end I ended up writing a lot of very complicated code for automatically detecting the turn breaks in a command stream.

But that wasn't actually good enough; there are valid command streams where the splitting is ambiguous, e.g. with the tunneling ability of the dwarves, where transform E10. build E10 could be either one move or two. I had to make an arbitrary choice on that (basically the behavior now is greedy, as many commands as possible are stuffed into the same move). So I had to include the done command to allow players to disambiguate in the few cases where it's needed. This is still supremely confusing for people. All of this could have been avoided by taking this into account right at the start.
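The greedy policy and the done escape hatch can be illustrated with a small sketch. The real detection code is described as very complicated; this version hides all of that behind a caller-supplied fits predicate, and the "one build per move" rule used in the example is a made-up stand-in, not an actual Terra Mystica rule.

```python
def split_into_moves(commands, fits):
    """Greedily group commands into moves.

    fits(move, command) says whether command may join the move built so
    far; it stands in for the real rules. 'done' closes the current move
    early, even if the next command would have fit.
    """
    moves, current = [], []
    for command in commands:
        if command == "done":
            if current:
                moves.append(current)
            current = []
        elif current and not fits(current, command):
            moves.append(current)
            current = [command]
        else:
            current.append(command)
    if current:
        moves.append(current)
    return moves

# Hypothetical rule for illustration: a move holds at most one build.
def one_build_per_move(move, command):
    return not (command.startswith("build")
                and any(c.startswith("build") for c in move))

print(split_into_moves(["transform E10", "build E10"], one_build_per_move))
print(split_into_moves(["transform E10", "done", "build E10"], one_build_per_move))
```

Without done the two commands end up in a single move; with it, the player forces the split.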

Finally, one very surprising outcome is that having a compact vocabulary for game actions makes it much easier to display a useful player-readable log of what happened in the game. The typical user-visible log is structured as natural language, and so verbose as to be hard to read especially when trying to piece together the flow of the game after the fact. It's easy to see why that design choice is made, but it's not necessary when all players are almost by definition going to know how to read a more compact representation.

Likewise this makes it really easy to display a concise summary of what has happened in the game since the player last looked at it (done both in the notification emails and the 'recent moves' tab of games).

Social issues

The unlimited admin access to games has a dark side. Admin malfeasance is rare but I do get about one complaint a month about it. Sometimes these are games where the admin will change their moves after others have already taken moves, roll the game back by a huge amount, take over entirely for another player (for example forcibly passing them), apply different standards to allowing others to undo vs. doing it themselves, and so on.

This is the kind of drama that I really do not want to deal with, but the general solution is to just mark the game as unrated, and let the players sort out between themselves whether and how the game will continue. And it is a bit of a miracle that it hasn't yet become a more widespread problem, as one might expect to happen for the anonymity + internet combo. If it does ever become intolerable, the solution will almost certainly be to disable admin mode entirely for public games. The TM tournament has already shown that it's at least workable, even if people do occasionally get a little bit screwed by the 'no manual administration' policy.

One consequence of a command language is that everything needs to be named. The map needs to have a coordinate system, every component needs an identifier of some sort, and every interaction needs a short and snazzy name. Old school wargames will do this as a matter of course. Of course every hex has an id! Of course the cards are both numbered and uniquely titled! But not so much for eurogames.

The naming we ended up with on the site is far from optimal, and caused yet more drama due to non-online players feeling excluded from conversations. (If you want to know more, you can see an explanation for where the names came from, and why they won't change). That bit is unfortunate. But at least I actually find real value in having convenient shorthands available for everything, when discussing the game, whether when theorycrafting or conducting some tabletalk on IRC during a game.

Implementation issues

The obvious problem for a log replay system is performance. Replaying a full game, which is done for almost every operation, can take around 0.15 seconds in the current implementation, with no obvious low hanging fruit to fix. On the current traffic levels server load is not a problem, but I would start to get worried if usage increased by a factor of 10. As discussed above, there are features I'm unwilling to implement due to CPU load concerns. And it is actually causing real development pain for testing (see below).

It's hard to say exactly how much of the CPU load is related to command parsing, a step that could be avoided with the use of a more structured log format. Some crude profiling suggests that the parsing takes only 5-10% of the runtime, certainly nowhere near enough to warrant using a different format.

A rewrite in a language with higher performance implementations than Perl would almost certainly give a factor of 10 improvement on the actual game evaluation code, moving the bottlenecks to IO. But a full rewrite is not in the cards.

Another potential implementation worry is storage. The current DB size is about 250MB. Unlike CPU usage, this is a cost that accumulates over time. Out of that 250MB maybe 75% is used by the game logs. The logs, stored as a sequence of commands, are not a particularly efficient form of encoding the game data. Simple lossless compression could easily compress them by 80-90%. Luckily disk is cheap (this server still has 600GB free), so this should never become a real issue.
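If storage ever did become a concern, measuring what lossless compression buys on a given log is straightforward. The sketch below uses Python's zlib and a handful of rows from the excerpt earlier in the post; the function name is made up, and the repeated sample text is far more regular than a real game, so the number it prints says nothing about real logs.

```python
import zlib

def compression_saving(log_text):
    """Return the fraction of bytes saved by zlib on a game log."""
    raw = log_text.encode("utf-8")
    packed = zlib.compress(raw, 9)
    assert zlib.decompress(packed) == raw  # lossless round trip
    return 1 - len(packed) / len(raw)

rows = [
    "yetis: action ACT4",
    "cultists: upgrade E6 to TE",
    "giants: Leech 3 from cultists",
    "dragonlords: dig 1. build G6",
]
sample_log = "\n".join(rows * 25)  # artificially repetitive stand-in
print(f"{compression_saving(sample_log):.0%} saved")
```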

Another consequence of a log replay system is that any change in the game evaluation might break existing games. That change might be a bugfix for a place where the effect of a move was miscomputed, it might be extra validation to prevent illegal moves of some kind, cheating prevention, or something else entirely. This is not a theoretical possibility. For basically every single game evaluation change I make, there are already multiple affected games. No matter how elementary a rule is, somebody has already broken it.

Obviously in a stored state implementation changes like this don't matter. The current state is the current state no matter what. But in a log replay system you need to have some story on how to deal with retroactive changes. I can think of the following strategies:

Punt: Don't make any changes at all.
Ignore: Just make the change, and don't worry about games breaking or the results changing part way through.
Delete: Just delete any games that would be broken.
Fixups: Find all games where the old and new behavior differ, and change the appropriate logs in such a way that the results with the new log and version will be the same as the result with the original log and old version. This change could be manual or automated.
Versioning: Each game file carries a version number. When making a breaking change, keep both the original and new code paths, and choose one of the two based on the version number. Any newly created games use the new version number and get the fixes, existing games keep their original version number and the original behavior.
Positive options: Conditionalize the behavior on an option. Turn that option on for new games, as well as any existing games for which the new and old versions behave the same.
Negative options: Conditionalize the old behavior on an option. Turn that option on only for existing games where the results for old and new versions differ. Never turn the option on for newly created games.
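As an illustration of the last strategy, here is a minimal sketch of a "negative option" rollout. Everything in it is hypothetical: the option name, the toy rule (the old code silently clamped an overspend to zero, the new code rejects it), and the fixed starting balance. The point is only the shape: the old code path is kept behind an option, and that option is switched on solely for existing games whose results would otherwise change.

```python
LEGACY = "legacy-allow-overspend"  # hypothetical option name

def spend(coins, amount, options):
    """Spend coins under the new rule unless the legacy option is set."""
    if amount > coins:
        if LEGACY in options:
            return 0  # old behavior: silently clamp to zero
        raise ValueError("not enough coins")  # new behavior: reject
    return coins - amount

def replay(log, options):
    """Toy game: start with 10 coins and apply each spend in order."""
    coins = 10
    for amount in log:
        coins = spend(coins, amount, options)
    return coins

def migrate(log):
    """Pick options for an existing game when rolling out the new rule."""
    old_result = replay(log, {LEGACY})
    try:
        unchanged = replay(log, set()) == old_result
    except ValueError:
        unchanged = False
    return set() if unchanged else {LEGACY}

print(migrate([3, 4]))  # unaffected game: no option needed
print(migrate([8, 5]))  # game relied on the old behavior
```

Newly created games would simply start with an empty option set, so unaffected games never see the option at all.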

During the lifespan of the site I've used most of these at one time or another. The 'ignore' strategy was appropriate a couple of times (for changes where I decided that the new behavior was always acceptable, such as situations where a player had ended up overpaying for an action). The 'delete' strategy would be exceptional; the only situations where I used it were games that were aborted, and one case of a single game being completely unsalvageable due to bug abuse by a player. The 'fixup' strategy has the nice benefit that it avoids introducing a new code path, and was my default choice early on. But at this point it'd be an unacceptable amount of manual work, and it's not readily automatable, especially with the relatively freeform input from the command language. My next default was 'positive options', but after about 3-4 of those I switched to 'negative options'. Positive options had a slightly more complicated rollout procedure, and also permanently clutter up all games, confusing people. ("What's this strict-darkling-sh option?").

None of these options are good; in this instance a log replay model does introduce some major costs either to the developer (who has to do extra work) or the users (who have some games screwed up or completely lost).

But it's not all bad! A log replay model makes testing much easier. First, it'd be very easy to write test cases since there is a very natural serialization format for games already, the command language. I don't actually write explicit tests for TM, but for example at work we need an absurd amount of infrastructure for making it easy to write unit tests for TCP/IP packet handling. This kind of design gives the test cases for free. Likewise an Age of Steam implementation I was once doodling around with had lots of test cases, but even with the reasonably friendly format (protocol buffers) they were an absolute pain to write due to the boilerplate.

If I don't write unit tests, how do I test? Mostly by side by side testing; I have a small script that runs every single game in the database against both the new and the previous version. It munges the results a bit, removing known harmless diffs, and then displays any changes from game to game. I can then look at those games, and decide whether it's indicating some kind of a problem with my change, an expected result of my change, or a problem of some sort in the game. It also acts as a great regression test that prevents failures from creeping in, and is the source of data for finding the games that would be broken by a change, so that one of the fixes discussed in the previous section can be applied.
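The core of such a side by side run is small. The sketch below is not the actual script: the function and parameter names are invented, the two evaluators are toy stand-ins for the old and new versions of the game code, and is_harmless is a placeholder for whatever filtering of known harmless diffs is needed.

```python
def side_by_side(games, evaluate_old, evaluate_new,
                 is_harmless=lambda old, new: False):
    """Return {game_id: (old_result, new_result)} for games that differ."""
    changed = {}
    for game_id, log in games.items():
        old, new = evaluate_old(log), evaluate_new(log)
        if old != new and not is_harmless(old, new):
            changed[game_id] = (old, new)
    return changed

# Toy stand-ins: the "new version" refuses to let a total go negative.
def old_version(log):
    return sum(log)

def new_version(log):
    return max(sum(log), 0)

games = {"g1": [1, 2], "g2": [5, -9]}
print(side_by_side(games, old_version, new_version))  # {'g2': (-4, 0)}
```

The returned dictionary doubles as the list of games that need one of the retroactive-change strategies applied before the new version can go live.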

This has been one of my favorite forms of testing for a long time, and works tremendously well in a case like Online TM where we have access to all games ever played. Thinking specifically of digital boardgames, it's also a model that wouldn't work well without a replayable log. The only problem is, as alluded to above, the CPU usage. Right now a full diffgame run takes about 90 minutes of CPU time on a rather beefy machine. Even with parallelization it's not a fast feedback cycle. (Makes me kind of miss being able to just casually run a sxs test on a thousand machines).

Conclusion

I'm afraid this ended up longer than intended, despite only covering one design decision. It's also a design decision that I feel is overall a win. You'll have to wait for the next post for the embarrassing technical missteps.

A brief history of Online Terra Mystica

jsnell@iki.fi — Thu, 27 Nov 2014 21:00:00 GMT

What's this Online Terra Mystica thing?

For the last couple of years my main hobby hacking project (over a thousand commits, and probably an order of magnitude more time spent on it than all other non-work projects combined) has been an asynchronous multiplayer web implementation of the brilliant board game Terra Mystica (Feuerland Spiele, 2012). At the moment it's roughly 2/3 Perl, 1/3 Javascript, and uses Postgres as the data storage.

It's been a fairly successful project for something that was originally intended as a one-off. The usage statistics at the end of November 2014 are:

Almost 6000 registered users
About 1200 monthly active users (as in playing at least one game; not passive use like looking at the statistics pages).
14000 moves executed on a normal weekday (10000 on weekends)
16500 games either ongoing or finished.
Bi-monthly online TM tournament run by Daniel Åkerlund with 400+ players.
1038 commits as of this writing.

This was not supposed to be a general use program. It was originally a one night hack to help keep track of a hand-moderated play-by-forum game of TM, which was obviously headed for failure due to the massive number of errors people were making while describing their moves in natural language or when manually tracking their resources in the game.

From there the project snowballed, slowly gathering features including just about everything I ever marked in the TODO as being 'out of scope'. Since I often had only very limited amounts of time to work on this, and my expectation was always that the interest in the site would soon fizzle out, the project management method was to always get the maximum short-term bang for the buck.

A project whose direction is literally guided by 'what can I get done in the next two hours' is of course massively path dependent; the early decisions made with very little consideration had outsized influence on where the site ended up. Sometimes the expedient gambles on 'do the simplest possible thing' failed, and the results were just rubbish. At other times things ended up at a slightly odd local maximum. And in some rare cases the gamble turned out to produce wonderful and unexpected results.

Timeline

Future posts will discuss the actual lessons learned; what didn't work and what did work - both in the mechanics of programming and in the peculiarities of online boardgames. But in this one let's just have a look at the history of the site, how long it took for it to get features that one might consider absolutely necessary, and how amazingly bad user experience people are willing to put up with when it's the only way they can play their favorite game online.

Feel free to skip past the bulleted list if you get bored; it's still a bit long even if I include only changes I consider fairly major (indeed, a lot has to get filtered out given it's 1000+ commits).

2012

December - Early January: The smallest program that did anything useful related to a game. I'd enter moves into a text file and run the script to produce the final game state as JSON. This JSON was rendered to HTML + Canvas by some JavaScript code that was half ripped off from an old project. There was some minimal rules checking and automation, and support for only 5 out of the 14 factions in the game. Users of the current site might want to see the old look.

2013

January: A rudimentary dynamic web site, implemented simply as a wrapper CGI script around the JSON generator script. After that a clumsy web-based editor was added for game files (a textarea that could be used to edit specific files in a git repository, no authentication except for each game having a random 160 bit identifier as part of the URL). This allowed other people to moderate their own games, as long as I created a game for them and sent the link with the secret embedded. Players would post / email a natural language description of their move to the moderator, who would then enter the moves into the admin tool using the correct syntax. Amazingly some 20 games were run using this insane system, while by all rights the project should have died there.

This version of the software had automation for resolving the effects of most game events, but did very little validation to notice completely invalid moves.
February: Added an ability to easily rewind the game state back to any time in history, to help with post-game strategy analysis. Also added a way for players to enter their own moves (a textarea in the main game view, a preview button and a save button, and some verification to make sure they could only enter their own moves). Again there was no real authentication here, just links with an embedded faction token derived from the per-game secret key.
March: The hackiest email integration in the world: Store the email addresses of the players in the same text file with the commands. After a player had entered a move, the software would create a mailto: link with prefilled subject, content and receivers (the other players). The player would click on the mailto: link, the email would load up in their mailer (even GMail), and they'd press send.

Compute and display a VP projection on the last round assuming no further moves, to give players some idea of who is really winning.
April: I continued to resist adding any user management or authentication. But my friend Gareth wanted a better way to manage his ongoing games than a spreadsheet, and wrote a small App Engine site into which players entered their secret game URLs. His site then used my site's API to figure out which games the player needed to act in. And it went even a bit further, by embedding the move entry UI into the same app.

After a few weeks of using Gareth's site, I had to admit that he was totally right about this being required functionality. So I finally added a DB to the project for storing user accounts and game metadata, and a 'your games' list on the front page after login. It's also only at this point in the lifetime of the site where I added a UI for people to make new games. Until then every game was created by somebody asking for a new game via email.

Finally, this month also saw the addition of a statistics page on how often each faction was winning (since balance was a hot topic on the BGG forums of Terra Mystica right from the start), and soon after a list of achieved high scores for each faction and player count.
May: This month mostly introduced all kinds of stricter validation, as the reduced barrier to entry for playing was causing significantly more illegal moves to be entered (early on players were enthusiasts of the game and thus had good knowledge of the rules; at this point people started to learn the game through the site, which was quite scary).

The main new feature of the month was the 'planner', an alternate text entry box which could be used to enter commands arbitrarily far into the future, and check that the moves are valid and what kind of effect they have. This is useful for example for checking that you have sufficient resources for making certain moves without manual computation. Another use is leaving 'notes to self', so that the player doesn't need to re-evaluate the board for every single move. (Some people were suddenly playing tens of games at a time, so this was a real problem).
June-August: This time period saw only minor fixes and improvements from the user's point of view. There was a bit of infrastructure work behind the scenes, such as moving the actual game moves into the database, though they still remained just plaintext.
October: The mini expansion for TM was released at the Spiel fair in Essen. I implemented the new features the very next morning in the lobby of my hotel at Essen, with a ChromeBook, an ssh connection to the production server, and the world's worst WiFi. After some reflection I decided not to make the change visible to the public before getting back home to a more reliable work environment :-)
November: I finally made the site automatically send email notifications, rather than require players to jump through the fragile mailto: hoops to let other players know whose turn it is. Replacement of the mailto-style notification of moves also required the addition of an in-site chat feature for communication.
December: Another consequence of the real email support from the previous month was that players no longer needed to expose their email addresses to other players. This finally made it possible to allow players to create 'public games' that anyone can join, rather than only play with people with whom they've done some kind of an out-of-band email address exchange. (At this point 1500+ games had been started; it's amazing how far such a kludgy system could go).

At the time 25-30% of moves were being entered from smartphones or tablets. But the move entry interface was typing commands like 'convert 2pw to 2c. upgrade d3 to tp' into a text box. What's wrong with this picture? :-) This month we finally got a slightly friendlier UI, though the textual command representation still remained the canonical one.

The site finally got a ranking system: a multi-iteration version of the Elo algorithm, which computed not only player strengths but also faction strengths, and credited good results with the weaker factions more than good results with the strong ones.

Finally, in very late December I went on a big refactoring spree to move the game from CGI scripts to a more persistent application server (FCGI with Plack and CGI::PSGI, but no framework). Eradicating all global data and all modification of literal data structures was way too much work; those were not corners worth cutting in the first place.

The new UI went live a year from starting the project (almost exactly; from December 22nd 2012 to December 21st 2013), and is the point where I'd consider the site to be actually usable by mere mortals.
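The mailto: trick from March is simple enough to sketch. The following is a minimal illustration in Python, not the site's actual code: the function name `build_mailto` and the example addresses are made up for this sketch, and it only assumes the standard `mailto:` URL format, where the recipients go in the path and the subject and body are percent-encoded query fields.

```python
from urllib.parse import quote


def build_mailto(recipients, subject, body):
    """Build a mailto: URL with prefilled recipients, subject and body.

    Hypothetical helper for illustration only.
    """
    # Recipients are comma-separated; keep '@' readable, escape the rest.
    to = ",".join(quote(addr, safe="@") for addr in recipients)
    # Line breaks in a mailto: body are encoded as CRLF (%0D%0A).
    body = "\r\n".join(body.splitlines())
    # Percent-encode everything in the fields, including '&', '=' and spaces.
    query = "subject={}&body={}".format(quote(subject, safe=""),
                                        quote(body, safe=""))
    return "mailto:{}?{}".format(to, query)


if __name__ == "__main__":
    print(build_mailto(["a@example.com", "b@example.com"],
                       "TM: your move",
                       "Built a dwelling.\nYour turn!"))
```

The page only has to render this URL as a link; the player's own mail client does the sending, which is why no server-side mail setup was needed and also why the scheme was so fragile.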

2014

February: Support for variant maps, for testing parts of the upcoming Terra Mystica expansion for the designers. I also added a map editor that could import map definitions from Lode's TM AI, which the design team had been using for the map. The online playtest team proceeded to play 100 games with different map versions before the expansion finally went to print.
April: A bunch of work on the expansion, which was still being kept under wraps. So the support for the new final scoring types and four of the new factions was not visible to most users at this time.

The main user-visible change was automatically dropping players from games after a week of inactivity, to support the inaugural season of the online Terra Mystica tournament. People's irritation about others playing slowly had been constant ever since the addition of public games (95% of my games are private with a few separate groups of friends, so I'm pretty isolated from this myself). Unfortunately this change did not appear to help enough.

This month also saw the addition of individual profile pages, showing all kinds of statistics for each player (games started, finished, performance with given factions, performance and play counts against specific opponents, etc).
September: The next attempt at reducing the anguish caused by slow players was to allow setting shorter move timers than the default one week (from 12 hours to 14 days). Lots of people started 12 hour deadline games, and moved on to complaining about so many people dropping out. Sometimes you just can't win.
October: Public support for the two new expansion maps, as well as the new final scoring types.
November: Public support for all six new factions from the expansion, as well as the variable turn order variant.

I find it interesting that it really did basically take a year of real time (and maybe 2 months of hacking time) before the implementation was in a shape where I would've thought about publishing it. And there's no way I'd put that amount of time into a project like this up front. Usually these projects are active for a couple of weekends before getting abandoned; the fun parts are done but all the hard work of making it really usable remains.

In this case people were eager to use even the incredibly crude early versions, so I got over that hump very quickly. And at that point every incremental improvement to the site was affecting tens, hundreds, or thousands of people. This is of course always more motivating than working on polishing the perfect piece of software that nobody is using.

There were many architectural and design decisions made along the way that I ended up deeply regretting, and which cost me lots of time later on. But without all those early shortcuts there would've been no implementation at all. Easily the best example of Worse is Better that I've been personally involved with.
