<rss version='2.0'><channel><title>Juho Snellman's Weblog</title><link>https://www.snellman.net/blog/</link><description>Lisp, Perl Golf</description><item><title>Writing a procedural puzzle generator</title><link>https://www.snellman.net/blog/archive/2019-05-14-procedural-puzzle-generator/</link><description>
  &lt;p&gt;
    This blog post describes the level generator for my puzzle game
    &lt;a href=&#039;https://linjat.snellman.net&#039;&gt;Linjat&lt;/a&gt;. The post is
    standalone, but might be a bit easier to digest if you play
    through a few levels. The &lt;a href=&#039;https://github.com/jsnell/linjat/&#039;&gt;source code&lt;/a&gt; is available; anything discussed below is in
    &lt;code&gt;src/main.cc&lt;/code&gt;.
  &lt;/p&gt;

  &lt;p&gt;
    A rough outline of this post:
    &lt;ul&gt;
      &lt;li&gt;Linjat is a logic game of covering all the numbers and dots
        on a grid with lines.
      &lt;li&gt;The puzzles are procedurally generated by a combination of a
        solver, a generator, and an optimizer.
      &lt;li&gt;The &lt;a href=&#039;#solver&#039;&gt;solver&lt;/a&gt; tries to solve puzzles the
        way a human would, and assigns a score for how interesting
        a given puzzle is.
      &lt;li&gt;The &lt;a href=&#039;#generator&#039;&gt;puzzle generator&lt;/a&gt; is designed
        such that it&#039;s easy to change one part of the puzzle (the
        numbers) and have other parts of the puzzle (the dots) get
        re-organized such that the puzzle remains solvable.
      &lt;li&gt;A &lt;a href=&#039;#optimizer&#039;&gt;puzzle optimizer&lt;/a&gt; repeatedly
        solves levels and generates new variations from the most
        interesting ones that have been found so far.
    &lt;/ul&gt;
  &lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

  &lt;a name=&#039;rules&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;The rules&lt;/h3&gt;

  &lt;p&gt;
    To understand how the level generator works, you unfortunately
    must first know the rules of the game. Luckily the rules are very
    simple. The puzzle consists of a grid containing empty squares,
    numbers, and dots. Like this:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image5.png&#039;&gt;

  &lt;p&gt;
    The goal is to draw a vertical or horizontal line through each of
    the numbers, with three constraints:
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;The line going through a number must be of the same length
      as the number.
    &lt;li&gt;The lines can&#039;t cross.
    &lt;li&gt;All the dots need to be covered by a line.
  &lt;/ul&gt;

  &lt;p&gt;
    Like this:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image3.png&#039;&gt;

  &lt;p&gt;
    Whee! The game is all designed, the UI is implemented, now all I
    need are a few hundred good puzzles, and we&#039;re good to go. And for
    a game like this, there&#039;s really no point in trying to make those
    puzzles by hand. That&#039;s a job for a computer.
  &lt;/p&gt;

  &lt;a name=&#039;requirements&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;Requirements&lt;/h3&gt;

  &lt;p&gt;
    What makes for a good puzzle in this game? I tend to think of
    puzzle games as coming in two categories. There&#039;s the ones where
    you&#039;re exploring a complicated state space from the start to the
    end (something like &lt;a href=&#039;https://en.wikipedia.org/wiki/Sokoban&#039;&gt;Sokoban&lt;/a&gt;
    or &lt;a href=&#039;https://en.wikipedia.org/wiki/Rush_Hour_(puzzle)&#039;&gt;Rush Hour&lt;/a&gt;), and where it might not
    even be obvious exactly what states exist in the game. Then there are
    ones where all the states are known at the start, and you&#039;re
    slowly whittling the state space down by process of elimination
    (e.g. &lt;a href=&#039;https://en.wikipedia.org/wiki/Sudoku&#039;&gt;Sudoku&lt;/a&gt;
    or &lt;a href=&#039;https://en.wikipedia.org/wiki/Nonogram&#039;&gt;Picross&lt;/a&gt;).
    This game is clearly in the latter category.
  &lt;/p&gt;

  &lt;p&gt;
    Now, players have very different expectations for these two
    different kinds of puzzles. For this latter kind there&#039;s a very
    strong expectation that the puzzle is solvable just with
    deduction, and that there should never be a need for backtracking
    / guessing / trial and error. &lt;a id=&#039;fnref0&#039;&gt;[&lt;a href=&#039;#fn0&#039;&gt;0&lt;/a&gt;] &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;]
  &lt;/p&gt;

  &lt;p&gt;
    It&#039;s not enough to know if a puzzle can be solved with just
    logic. In addition to that we need to have some idea of how good
    the produced puzzles are. Otherwise most of the levels might be
    just trivial dross. In an ideal world this could also be used for
    building a smooth progression curve, where the levels get
    progressively harder as the player progresses through the game.
  &lt;/p&gt;

  &lt;a name=&#039;solver&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;The solver&lt;/h3&gt;

  &lt;p&gt;
    The first step to meeting the above requirements is a solver for
    the game that&#039;s optimized for this purpose. A backtracking
    brute-force solver will be fast and accurate at telling whether
    the puzzle is solvable, and could also be changed to determine
    whether the solution is unique. But it
    can&#039;t give any idea of how challenging the puzzle actually
    is, since that&#039;s not how a human would solve these puzzles.
    The solver needs to imitate humans.
  &lt;/p&gt;

  &lt;p&gt;
    How does a human solve this puzzle? There&#039;s a couple of obvious
    moves, which the tutorial teaches:
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;
      &lt;p&gt;
        If a dot can only be reached from one number, the line from
        that number should be extended to cover the dot. Here the dot
        can only be reached from the three, not the four:
      &lt;/p&gt;
      &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image4.png&#039;&gt;
      &lt;p&gt;Leading to:&lt;/p&gt;
      &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image1.png&#039;&gt;
    &lt;li&gt;
      &lt;p&gt;
        If the line doesn&#039;t fit in one orientation, it must be placed in the other orientation instead. In the above example the 4 can no longer be placed vertically, so we know it has to be horizontal. Like this:
      &lt;/p&gt;
      &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image2.png&#039;&gt;
    &lt;li&gt;
      &lt;p&gt;
        If a line of size X is known to be in a certain orientation
        and there isn&#039;t enough space to fit a line of X spaces on both
        sides, some of the squares in the middle must be covered. For
        example if in the above example the &amp;quot;4&amp;quot; had been a &amp;quot;3&amp;quot; instead,
        we wouldn&#039;t know whether it extended all the way to the right
        or to the left of the board. But we would know it must cover
        the two middle squares:
      &lt;/p&gt;
      &lt;img src=&#039;/blog/stc/images/procedural-puzzle/image6.png&#039;&gt;
  &lt;/ul&gt;
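
  &lt;p&gt;
    The third rule is easy to sketch in one dimension. This is only an
    illustration, not the code from &lt;code&gt;src/main.cc&lt;/code&gt;: the
    function name &lt;code&gt;forced_cells&lt;/code&gt; and the representation (a
    run of free squares indexed from 0) are assumptions made for the example.
  &lt;/p&gt;

&lt;pre&gt;
```python
def forced_cells(span, pos, length):
    """Squares of a 1-D run that every legal placement of the line covers.

    span   -- number of free squares in the run, indexed 0..span-1
    pos    -- index of the number; the line must pass through it
    length -- the value of the number, i.e. the required line length
    """
    first_start = max(0, pos - length + 1)   # leftmost legal start
    last_start = min(pos, span - length)     # rightmost legal start
    if first_start > last_start:
        return None                          # the line does not fit at all
    # A square is forced if both the leftmost and the rightmost
    # placement cover it.
    return list(range(last_start, first_start + length))
```
&lt;/pre&gt;

  &lt;p&gt;
    For a row of four free squares with a 3 on the second one,
    &lt;code&gt;forced_cells(4, 1, 3)&lt;/code&gt; returns &lt;code&gt;[1, 2]&lt;/code&gt;,
    the two middle squares.
  &lt;/p&gt;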

  &lt;p&gt;
    This kind of thinking is the meat and potatoes of the game. You
    figure a way to extend one line a little bit, make that move, and
    then inspect the board again since that hopefully gave you
    the information to make a new deduction elsewhere. Writing a
    solver that follows these rules would be enough to determine
    if a human &lt;i&gt;could&lt;/i&gt; solve the puzzle without backtracking.
  &lt;/p&gt;

  &lt;p&gt;
    It doesn&#039;t really say anything about how hard or interesting the
    level is though. In addition to the solvability, we need to
    somehow quantify the difficulty.
  &lt;/p&gt;

  &lt;p&gt;
    The obvious first idea for a scoring function is that a puzzle
    that takes more moves to finish is the harder one. That&#039;s probably
    a good metric in other games, but in this one the number of valid
    moves that the player has at any one time is probably more
    important. If there are 10 possible deductions a player could
    make, they&#039;ll find one of those very quickly. If there&#039;s only one
    valid move, it&#039;ll take longer.
  &lt;/p&gt;

  &lt;p&gt;
    So as a first approximation you want the solution tree to be deep
    and narrow: there&#039;s a long dependency chain of moves from start to
    finish, and at any one time there are only a few ways of moving
    forward on the chain. &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]
  &lt;/p&gt;

  &lt;p&gt;
    How do you figure out the width and depth of the tree? Just
    solving the puzzle once and evaluating the produced tree doesn&#039;t
    give a precise answer. The exact order in which you make the moves
    will end up affecting the shape of the tree. You&#039;d need to look at
    all the possible solutions, and do something like optimize for the
    best worst-case. Now, I&#039;m no stranger to &lt;a href=&#039;https://www.snellman.net/blog/archive/2018-07-23-optimizing-breadth-first-search/&#039;&gt;brute-forcing
      puzzle game search graphs&lt;/a&gt;, but for this project I wanted a
    single-pass solver rather than any kind of exhaustive search.
    Due to the optimization phase, the goal was for the solver runtime to be
    measured in microseconds rather than seconds.
  &lt;/p&gt;

  &lt;p&gt;
    So an exhaustive search was out. Instead my solver doesn&#039;t actually make
    one move at a time, but solves the puzzle by layers: given
    a state, find all valid moves that could be made. Then apply all
    of those moves at once. Finally start over from the new state. The
    number of layers and the maximum number of moves ever found in a
    single layer are then used as proxies for the depth and the width
    of the search tree as a whole.
  &lt;/p&gt;
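
  &lt;p&gt;
    As a rough sketch of the layer idea, assume the puzzle has already been
    boiled down to a dependency map: each move lists the moves that must
    be made before it becomes available. That map, and the name
    &lt;code&gt;solve_in_layers&lt;/code&gt;, are hypothetical; the real solver
    discovers moves by applying deduction rules to the board.
  &lt;/p&gt;

&lt;pre&gt;
```python
def solve_in_layers(deps):
    """deps maps each move to the set of moves that must be made before it.

    Returns (depth, width, solved): the number of layers, the size of the
    largest layer, and whether every move was eventually made.
    """
    done = set()
    depth = 0
    width = 0
    while len(done) != len(deps):
        # Find every move that is valid in the current state...
        layer = {move for move, needs in deps.items()
                 if move not in done and needs.issubset(done)}
        if not layer:
            return depth, width, False   # stuck: no deduction applies
        # ...and apply all of them at once.
        done |= layer
        depth += 1
        width = max(width, len(layer))
    return depth, width, True
```
&lt;/pre&gt;

  &lt;p&gt;
    With four moves where a and b are free, c needs a, and d needs both b
    and c, this finds three layers with a maximum width of two.
  &lt;/p&gt;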

  &lt;p&gt;
    Here&#039;s what the solution for one of the harder puzzles looks like
    with this model (click on the thumbnail to expand). Dotted lines are
    the lines that were extended on that solver layer, solid ones
    didn&#039;t change. Green lines are of the right length, red are not
    yet complete.
  &lt;/p&gt;

  &lt;a href=&#039;/blog/stc/images/procedural-puzzle/solver.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;/blog/stc/images/procedural-puzzle/solver.png&#039; width=&#039;700&#039;&gt;&lt;/img&gt;
  &lt;/a&gt;

  &lt;p&gt;
    The next problem is that not all moves a player makes are created
    equal. What was listed at the start of this section is really just
    common sense. Here&#039;s an example of a more complicated deduction rule,
    which would require some more thought to find. Consider a board like:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/rule-square.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    The dots at C and D can only be covered by the 5 and the middle 4
    (and neither piece can cover both of them at the same time). This
    means that the middle 4 needs to cover one of the two, and thus
    can&#039;t be used to cover A. Instead A has to be covered with the
    lower left 4.
  &lt;/p&gt;

  &lt;p&gt;
    It&#039;d clearly be silly to treat this chain of deductions the same
    as a one-step &amp;quot;this dot can only be reached from that number&amp;quot;. Can
    these more complex rules just be weighted more heavily in the
    scoring function?  Unfortunately not with the layer-based solver,
    since it&#039;s not guaranteed to find the lowest-cost solution. It&#039;s not
    just a theoretical concern, in practice it&#039;s pretty common for a part
    of the board to be solvable in either a single complex deduction
    or a chain of several much simpler moves. The layer-based solver
    basically finds the shortest path, not the cheapest one, and that
    can&#039;t just be fixed in the scoring function.
  &lt;/p&gt;

  &lt;p&gt;
    The method I ended up using was to change the solver such that
    each layer consists of only one kind of deduction. The algorithm
    goes through the deduction rules in a rough order of
    difficulty. If a rule finds any moves, they&#039;re applied and the
    iteration is over, and the next iteration starts the list over
    from the beginning.
  &lt;/p&gt;

  &lt;p&gt;
    The solution is then scored by assigning each layer a cost based
    on the single rule used for it. This is still not guaranteed to
    find the cheapest solution, but with a good selection of weights
    it&#039;ll at least not find an expensive solution if a cheap solution
    exists.
  &lt;/p&gt;
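
  &lt;p&gt;
    The loop described above could look roughly like this. It is a sketch
    under assumptions, not the actual implementation: the rule
    interface (&lt;code&gt;find_moves&lt;/code&gt; returning a list of
    state-transforming functions) and the costs passed in are made up for
    the example.
  &lt;/p&gt;

&lt;pre&gt;
```python
def solve_with_priorities(state, rules, is_solved):
    """Solve one rule kind per layer and return the total cost.

    rules is a list of (cost, find_moves) pairs ordered from the easiest
    deduction to the hardest. find_moves(state) returns a list of
    functions that each take a state and return the updated state.
    Returns None if the solver gets stuck.
    """
    total = 0
    while not is_solved(state):
        for cost, find_moves in rules:
            moves = find_moves(state)
            if moves:
                for move in moves:
                    state = move(state)
                total += cost      # the layer is charged for its one rule
                break              # start over from the easiest rule
        else:
            return None            # no rule found a move
    return total
```
&lt;/pre&gt;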

  &lt;p&gt;
    It also seems to map out pretty well to how humans solve the
    puzzle. You look for the gimmes first, and only start thinking
    hard once there are no easy moves.
  &lt;/p&gt;

  &lt;a name=&#039;generator&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;The generator&lt;/h3&gt;

  &lt;p&gt;
    The previous section took care of figuring out if a level is any
    good or not. But that alone isn&#039;t enough, you also need to somehow
    generate levels for the solver to score. It&#039;s quite unlikely that
    a randomly generated level would be solvable, let alone
    interesting.
  &lt;/p&gt;

  &lt;p&gt;
    The key idea (which is by no means novel) is to interleave the
    solver and the generator. Let&#039;s start with a puzzle that&#039;s
    probably unsolvable, consisting just of numbers 2-5 placed in
    random locations on the grid:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/add-dots-start.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    The solver runs until it can&#039;t make any more progress:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/add-dots-blocked.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    The generator then adds some information to the puzzle, in the
    form of a dot, and continues solving.
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/add-dots-one.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    In this case that one added dot is not enough to allow the solver to make
    any progress. So the generator will keep on adding more dots until
    the solver is happy:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/add-dots-more.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    And then the solver resumes normal operation:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/add-dots-resume.png&#039;&gt;&lt;/img&gt;

  &lt;p&gt;
    This process continues either until the puzzle is solved or there
    is no more information to add (i.e. every space that&#039;s reachable
    from a number is covered by a dot).
  &lt;/p&gt;

  &lt;p&gt;
    This method works only if the new information that&#039;s being added
    can&#039;t invalidate any of the previously made deductions. That would
    be tough to do when adding numbers to the grid &lt;a id=&#039;fnref3&#039;&gt;[&lt;a href=&#039;#fn3&#039;&gt;3&lt;/a&gt;].
    But adding new dots to the board has that property,
    at least given the deduction rules I&#039;m using in this program.
  &lt;/p&gt;
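
  &lt;p&gt;
    A minimal sketch of the interleaving, with the solver and the dot
    selection passed in as hypothetical callables
    (&lt;code&gt;solve&lt;/code&gt; and &lt;code&gt;free_squares&lt;/code&gt; are names
    invented here). For simplicity the sketch re-runs the solver from
    scratch after each dot instead of resuming it; because added dots
    never invalidate earlier deductions, the result is the same.
  &lt;/p&gt;

&lt;pre&gt;
```python
def generate(numbers, solve, free_squares):
    """Add dots until the solver can finish the puzzle.

    solve(numbers, dots) returns True if the solver completes the puzzle.
    free_squares(numbers, dots) returns the squares where a dot could
    still be added, best candidate first.
    Returns the set of dots, or None if no amount of dots helps.
    """
    dots = set()
    while not solve(numbers, dots):
        options = free_squares(numbers, dots)
        if not options:
            return None            # no more information to add
        dots.add(options[0])
    return dots
```
&lt;/pre&gt;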

  &lt;p&gt;
    Where should the algorithm add the dots? What I ended up doing was
    to add them in the empty space that could have been covered by the
    most lines in the starting state, so each dot tends to give as
    little information as possible. There is no attempt to add it
    specifically to a location where it&#039;ll be useful in advancing the
    puzzle at the point where the solver got stuck. This produces a
    pretty neat effect where most of the dots will be totally useless
    at the start of the puzzle, which makes the puzzle seem harder
    than it is. There are all these apparent moves you could make, but
    somehow none of them quite work out. The puzzle generator ends up
    being a bit of a jerk.
  &lt;/p&gt;
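
  &lt;p&gt;
    One way to express that selection, as a sketch only: count for every
    empty square how many numbers could reach it in the starting state,
    then pick the dot-free square with the highest count. The data layout,
    the function names, and the coordinate-based tie-break are assumptions
    for the example, and reachability is simplified to a straight line
    that stops at the edge of the grid or at another number.
  &lt;/p&gt;

&lt;pre&gt;
```python
from collections import Counter

def reach_counts(numbers, width, height):
    """numbers maps (row, col) to the value of the number on that square.

    Returns a Counter mapping each empty square to how many numbers could
    cover it with a straight line that does not pass through another number.
    """
    counts = Counter()
    for (row, col), value in numbers.items():
        for d_row, d_col in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            r, c = row, col
            for _ in range(value - 1):
                r, c = r + d_row, c + d_col
                inside = r in range(height) and c in range(width)
                if not inside or (r, c) in numbers:
                    break
                counts[(r, c)] += 1
    return counts

def pick_dot(numbers, dots, width, height):
    """Pick the dot-free empty square reachable from the most numbers."""
    counts = reach_counts(numbers, width, height)
    options = [sq for sq in counts if sq not in dots]
    # Ties are broken by coordinates so the result is deterministic.
    return max(options, key=lambda sq: (counts[sq], sq), default=None)
```
&lt;/pre&gt;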

  &lt;p&gt;
    This process will not always produce a solution, but it&#039;s pretty
    fast (on the order of 50-100 microseconds) so it can just be repeated a
    bunch of times until it generates a level. Unfortunately it&#039;ll
    generally produce a mediocre puzzle. There are too many obvious
    moves right at the start, the board gets filled in very quickly
    and the solution tree is quite shallow.
  &lt;/p&gt;

  &lt;a name=&#039;optimizer&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;The optimizer&lt;/h3&gt;

  &lt;p&gt;
    The above process produced a mediocre puzzle. In the final stage,
    we use that level as a seed for an optimization process. The
    process works as follows.
  &lt;/p&gt;

  &lt;p&gt;
    The optimizer sets up a pool of up to 10 puzzle variants. The pool
    is initialized with the newly generated random puzzle. On each
    iteration, the optimizer selects one puzzle from the pool and
    mutates it.
  &lt;/p&gt;

  &lt;p&gt;
    The mutation removes all the dots, and then changes the numbers a
    bit (e.g. reduce/increase the value of a randomly selected number,
    or move a number to a different location on the grid). It might be
    possible to apply multiple mutations to the board in one go. We then
    run the solver in the special level-generation mode described in
    the previous section. This adds enough dots to the puzzle to make
    it solvable again.
  &lt;/p&gt;

  &lt;p&gt;
    After that, we run the solver again, this time in the normal
    mode. During this run, the solver keeps track of a) the depth of
    the solution tree, b) how often each of the various kinds of rules
    was needed, c) how wide the solution tree was at times. The puzzle
    is scored based on the above criteria. The scoring function will
    basically prefer deep and narrow solutions, and at higher
    difficulty levels also rewards puzzles that require use of one or
    more of the advanced deduction rules.
  &lt;/p&gt;

  &lt;p&gt;
    The new puzzle is then added to the pool. If the pool ever
    contains more than 10 puzzles, the worst one is discarded.
  &lt;/p&gt;
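
  &lt;p&gt;
    Put together, the pool could be sketched like this. The
    &lt;code&gt;mutate&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; callables stand in
    for the mutation plus re-generation step and for the scoring run of
    the solver; their signatures, and the uniform random choice of a
    parent, are assumptions rather than a description of the real code.
  &lt;/p&gt;

&lt;pre&gt;
```python
import random

def optimize(seed, mutate, score, iterations, pool_size=10, rng=None):
    """Hill-climb with a small pool of the best puzzles seen so far.

    mutate(puzzle, rng) returns a new solvable puzzle, or None on failure.
    score(puzzle) returns a number; higher means more interesting.
    """
    if rng is None:
        rng = random.Random()
    pool = [(score(seed), seed)]
    for _ in range(iterations):
        _, parent = rng.choice(pool)
        child = mutate(parent, rng)
        if child is None:
            continue                      # generation failed, try again
        pool.append((score(child), child))
        if len(pool) > pool_size:
            # Discard the worst puzzle in the pool.
            pool.remove(min(pool, key=lambda entry: entry[0]))
    return max(pool, key=lambda entry: entry[0])[1]
```
&lt;/pre&gt;

  &lt;p&gt;
    Since only the worst entry is ever discarded, the best score in the
    pool never decreases from one iteration to the next.
  &lt;/p&gt;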

  &lt;p&gt;
    This process is repeated a number of times (anything from 10k to 50k
    iterations seemed to be fine). After that, the version of the puzzle with the
    highest score is saved into the puzzle&#039;s level database. This is
    what the progress of the best puzzle looks like through
    one optimization run:
  &lt;/p&gt;

  &lt;a href=&#039;/blog/stc/images/procedural-puzzle/progress-opt.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;/blog/stc/images/procedural-puzzle/progress-opt.png&#039; width=&#039;700&#039;&gt;&lt;/img&gt;
  &lt;/a&gt;

  &lt;p&gt;
    I tried a few other ways of structuring the optimization as
    well. One version used simulated annealing, the others were
    genetic algorithms with different crossover operations. None of
    these performed as well as the naive pool of hill-climbers.
  &lt;/p&gt;

  &lt;a name=&#039;unique&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;Unique single solution&lt;/h3&gt;

  &lt;p&gt;
    There&#039;s an interesting complication that arises when the puzzle
    has a single unique solution. Is it valid for the player to assume
    that&#039;s the case, and make deductions based on that? Is it fair for
    the puzzle generator to assume that the player will do so?
  &lt;/p&gt;

  &lt;p&gt;
    In a post on HN, I mentioned four options for how to deal with
    this:
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;State the &amp;quot;only a single solution&amp;quot; up front, and
      make the puzzle generator generate levels that require this
      form of deduction. This sucks, since it&#039;ll make the rules far more
      complicated to understand. And it&#039;s also exactly the kind of
      detail people would forget.
    &lt;li&gt;Don&#039;t guarantee a single solution: have potentially multiple
      solutions, and accept any of them. This doesn&#039;t really solve
      the problem, it just moves it around.
    &lt;li&gt;Punt, and just assume this is a very rare event that won&#039;t
      matter in practice. (This was the original implementation.)
    &lt;li&gt;Change the puzzle generator such that it doesn&#039;t generate
      puzzles where knowing that the solution is unique helps.
      (Probably the right thing to do, but also extra work.)
  &lt;/ul&gt;

  &lt;p&gt;
    I originally went with the third option, and that was a horrible
    mistake. It turns out that I&#039;d only considered one way in which
    the uniqueness of the solution leaks information, and that&#039;s
    indeed pretty rare. But there are others, and one was present in
    basically every level I&#039;d generated, and often kind of trivialized
    the solution. So in May 2019 I updated the Hard and Expert mode
    levels to go with the last option instead.
  &lt;/p&gt;

  &lt;p&gt;
    The most annoying case is the 2 with the dotted line in the
    following board:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/uncontested.png&#039;&gt;

  &lt;p&gt;
    Why could a sneaky player make that deduction? The 2 can cover
    any of the 4 adjacent squares. None of them have any dots, so they
    don&#039;t necessarily need to be covered by anything. And the square
    that&#039;s downwards doesn&#039;t have any overlap with other pieces. If
    there&#039;s a single solution, it has to be the case that other pieces
    cover the other three squares, and the 2 covers the downwards square.
  &lt;/p&gt;

  &lt;p&gt;
    The solution is to add some dots when these cases are detected,
    like this:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/ambiguate.png&#039;&gt;

  &lt;p&gt;
    Another common case was the dotted 2 on this board:
  &lt;/p&gt;

  &lt;img src=&#039;/blog/stc/images/procedural-puzzle/unique.png&#039;&gt;

  &lt;p&gt;
    Nothing distinguishes the squares to the left and up of the 2.
    Neither has a dot, and neither is reachable from any other number.
    Any solution where the 2 covers the upward square would have a
    matching solution where it covers the leftward square instead, and
    vice versa. If there&#039;s a single unique solution, it can&#039;t be either
    and thus the 2 must cover the downward square instead.
  &lt;/p&gt;

  &lt;p&gt;
    This kind of case I just solved by the &amp;quot;if it hurts, just don&#039;t do
    it&amp;quot; method. I.e. having the solver use this rule very early on in
    the priority list, and assigning these moves a large negative
    weight. Puzzles with this kind of property will mostly end up
    discarded by the optimizer, and the few that make it through will
    be discarded when doing the final level selection for the
    published game.
  &lt;/p&gt;
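
  &lt;p&gt;
    Detecting the second case could be sketched as below. The function
    name and its inputs are hypothetical, in particular the
    &lt;code&gt;reach&lt;/code&gt; map from a square to the set of numbers
    that can reach it; the actual rule in the solver may well be
    structured differently.
  &lt;/p&gt;

&lt;pre&gt;
```python
def interchangeable_squares(two, neighbours, dots, reach):
    """Squares next to a 2 that nothing on the board tells apart.

    two        -- (row, col) of the 2
    neighbours -- empty squares adjacent to it
    dots       -- set of squares holding a dot
    reach      -- maps a square to the set of numbers that can reach it
    """
    free = [sq for sq in neighbours
            if sq not in dots and reach.get(sq, set()) == {two}]
    # A single such square is not a symmetry; two or more are.
    return free if len(free) >= 2 else []
```
&lt;/pre&gt;

  &lt;p&gt;
    A non-empty result is what would trigger the large negative weight.
  &lt;/p&gt;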

  &lt;p&gt;
    This is not an exhaustive list, I found a lot of other
    unique-solution rules when adversarially play-testing. But most of
    them felt like they were rare and difficult enough to find that
    they&#039;re not really shortcuts. If somebody solves a puzzle using
    that kind of deduction, I&#039;m not going to begrudge them that.
  &lt;/p&gt;

  &lt;h3&gt;Conclusion&lt;/h3&gt;

  &lt;p&gt;
    The game was originally designed as an experiment for procedural
    puzzle generation. The game design and the generator go hand in
    hand, so the exact techniques won&#039;t be directly applicable to
    existing games.
  &lt;/p&gt;

  &lt;p&gt;
    The part I can&#039;t answer is whether putting this much effort into
    the procedural generation was worth it. The feedback from
    players has been pretty inconsistent when it comes to the level design.
    A common theme for positive comments has been about how the
    puzzles always feel like there&#039;s a clever gotcha in there.
    The most common complaint has been that there&#039;s not
    enough of a difficulty gradient in the game.
  &lt;/p&gt;

  &lt;p&gt;
    I have a couple of other puzzle games in an embryonic stage, and
    felt good enough about this generator that I&#039;d probably at least
    try similar procedural generation methods for those too. One
    thing I&#039;d definitely do differently the next time around is
    to do adversarial playtesting from the start.
  &lt;/p&gt;

  &lt;h3&gt;Footnotes&lt;/h3&gt;

  &lt;div class=footnotes&gt;
    &lt;p&gt;
      &lt;a id=&#039;fn0&#039;&gt;[&lt;a href=&#039;#fnref0&#039;&gt;0&lt;/a&gt;] Or at least that&#039;s what I believed. But when I observed a
        bunch of players in person, about half of them just
        made guesses and then iterated on those guesses. Oh, well.
    &lt;/p&gt;
    &lt;p&gt;
      &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] Anyone reading this should also read
        &lt;a href=&#039;https://magnushoff.com/minesweeper/&#039;&gt;Solving Minesweeper
          and making it better&lt;/a&gt; by Magnus Hoff, which has a fascinating
        twist on the perceived need for puzzle games with hidden information
        to be guaranteed solvable.
    &lt;/p&gt;
    &lt;p&gt;
      &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] Just to be clear, this depth / narrowness of the tree is a
        metric that I thought was meaningful to this puzzle, not something
        that&#039;s going to be applicable to all or even most puzzles. For
        example there&#039;s a &lt;a href=&#039;https://web.archive.org/web/20130703141244/http://www.thinkfun.com/microsite/rushhour/creating2500challenges&#039;&gt;good argument&lt;/a&gt;
        to be made that a Rush Hour puzzle
        is interesting if there are multiple paths to the solution
        of almost but not quite the same length. But that&#039;s because Rush
        Hour is a game of finding the shortest solution, not just some
        solution.
    &lt;/p&gt;
    &lt;p&gt;
      &lt;a id=&#039;fn3&#039;&gt;[&lt;a href=&#039;#fnref3&#039;&gt;3&lt;/a&gt;] With the exception of adding 1s. The first version of the puzzle
        didn&#039;t have the dots, and the plan was to have the generator add 1s
        when it needed to add more information. But that felt a little too
        constrained.
    &lt;/p&gt;
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>GAMES</category><pubDate>Tue, 14 May 2019 15:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2019-05-14-procedural-puzzle-generator/</guid></item><item><title>Optimizing a breadth-first search</title><link>https://www.snellman.net/blog/archive/2018-07-23-optimizing-breadth-first-search/</link><description>
&lt;img src=&#039;/blog/stc/images/sb-thumb.png&#039; style=&#039;float: right; margin: 16px&#039;&gt;

&lt;p&gt;
  A couple of months ago I finally had to admit I wasn&#039;t smart enough to
  solve a few of the  levels in &lt;a href=&#039;http://snakebird.noumenongames.com/&#039;&gt;
    Snakebird&lt;/a&gt;, a puzzle game.
  The only way to salvage
  some pride was to write a solver, and pretend that writing
  a program to do the solving is basically as good as having solved
  the problem myself. The C++ code for the resulting program
  is &lt;a href=&#039;https://github.com/jsnell/snakebird&#039;&gt;on Github&lt;/a&gt;.
  Most of what&#039;s discussed in the post is implemented in
  &lt;a href=&#039;https://github.com/jsnell/snakebird/blob/master/src/search.h&#039;&gt;
     search.h&lt;/a&gt; and
  &lt;a href=&#039;https://github.com/jsnell/snakebird/blob/master/src/compress.h&#039;&gt;
    compress.h&lt;/a&gt;. This post deals mainly with optimizing a
  breadth-first search that&#039;s estimated to use 50-100GB of memory to
  run on a memory budget of 4GB.
&lt;/p&gt;

&lt;p&gt;
  There will be a follow up post that deals with the specifics of the game.
  For this post, all you need to know is
  that I could not see any good alternatives to the brute force
  approach, since none of the usual tricks worked. There are a lot of states
  since there are multiple movable or pushable objects, and the shape of
  some of them matters and changes during the game.
  There were no viable conservative
  heuristics for algorithms like A* to narrow down the search
  space. The search graph was directed and implicit, so
  searching both forward and backward simultaneously was not possible.
  And a single move could cause the state to change in a lot of unrelated
  ways, so nothing like &lt;a href=&#039;https://en.wikipedia.org/wiki/Zobrist_hashing&#039;&gt;
    Zobrist hashing&lt;/a&gt; was going to be viable.
&lt;/p&gt;

&lt;p&gt;
  A back of the envelope calculation suggested that the biggest
  puzzle was going to have on the order of 10 billion states after
  eliminating all symmetries. Even after packing the state
  representation as tightly as possible, the state size was on the
  order of 8-10 bytes depending on the puzzle. 100GB of memory would
  be trivial at work, but this was my home machine with 16GB of
  RAM. And since Chrome needs 12GB of that, my actual memory budget
  was more like 4GB. Anything in excess of that would have to go to
  disk (the spinning rust kind).
&lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
  How do we fit 100GB of data into 4GB of RAM?  Either a) the states
  would need to be compressed to 1/20th of their original already
  optimized size, b) the algorithm would need to be able to
  efficiently page state to disk and back, c) a combination of the
  above, or d) I should buy more RAM or rent a big VM for a few
  days. Option D was out of the question due to being boring. Options
  A and C seemed out of the question after a proof of concept with
  gzip: a 50MB blob of states compressed to about 35MB. That&#039;s about 7
  bytes per state, while my budget was more like 0.4 bytes per
  state. So option B it was, even though a breadth-first search looks
  pretty hostile to secondary storage.
&lt;/p&gt;
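
&lt;p&gt;
  The arithmetic behind those numbers, spelled out as a quick sketch. It
  uses only the figures quoted above; the variable names, and taking 10
  bytes as the state size for the gzip estimate, are choices made for the
  example.
&lt;/p&gt;

&lt;pre&gt;
```python
states = 10 * 10**9            # about 10 billion states
budget = 4 * 10**9             # 4GB of RAM

# Uncompressed size of the visited set for 8 and 10 byte states.
for state_bytes in (8, 10):
    total_gb = states * state_bytes / 10**9
    print(f"{state_bytes} bytes/state: {total_gb:.0f}GB")

budget_per_state = budget / states     # bytes available per state
gzip_ratio = 35 / 50                   # 50MB of states compressed to 35MB
gzip_per_state = 10 * gzip_ratio       # roughly 7 bytes for 10-byte states

print(f"budget: {budget_per_state} bytes/state")
print(f"gzip:   about {gzip_per_state:.0f} bytes/state")
```
&lt;/pre&gt;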

&lt;h2&gt;Table of contents&lt;/h2&gt;

&lt;p&gt;
  This is a somewhat long post, so here&#039;s a brief overview of the
  sections ahead:
&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&#039;#textbook-bfs&#039;&gt;A textbook BFS&lt;/a&gt; - What&#039;s the normal formulation
    of breadth-first search like, and why is it not suitable for storing
    parts of the state on disk?
  &lt;li&gt;&lt;a href=&#039;#sort-merge&#039;&gt;A sort + merge BFS&lt;/a&gt; - Changing the algorithm
    to efficiently do deduplications in batches.
  &lt;li&gt;&lt;a href=&#039;#compression&#039;&gt;Compression&lt;/a&gt; - Reducing the memory
    use by 100x with a combination of off-the-shelf and custom compression.
  &lt;li&gt;&lt;a href=&#039;#cheated&#039;&gt;Oh no, I&#039;ve cheated!&lt;/a&gt; - The first
    few sections glossed over something; it&#039;s not enough to know there
    is a solution, we need to know what the solution is.
    In this section the basic algorithm is updated to carry around
    enough data to reconstruct a solution from the final state.
  &lt;li&gt;&lt;a href=&#039;#sort-merge-multiple-outputs&#039;&gt;Sort + merge with multiple outputs&lt;/a&gt; -
    Keeping more state totally negates the compression gains. The
    sort + merge algorithm needs to be updated to keep two outputs:
    one that compresses well used during the search, and another
    that&#039;s just used to reconstruct the solution after one is found.
  &lt;li&gt;&lt;a href=&#039;#swapping&#039;&gt;Swapping&lt;/a&gt; - Swapping on Linux sucks
    even more than I thought.
  &lt;li&gt;&lt;a href=&#039;#compression-before-merging&#039;&gt;Compressing new states before merging&lt;/a&gt; - So far the memory optimizations have just been concerned with the visited
    set. But it turns out that the list of newly generated states is much
    larger than one might think. This section shows a scheme for representing
    the new states more efficiently.
  &lt;li&gt;&lt;a href=&#039;#parent-states&#039;&gt;Saving space on the parent states&lt;/a&gt; -
    Investigate some CPU/memory tradeoffs for reconstructing the solution
    at the end.
  &lt;li&gt;&lt;a href=&#039;#did-not-work&#039;&gt;What didn&#039;t or might not work&lt;/a&gt; -
    Some things that looked promising but I ended up reverting, and
    others that research suggested would work but my intuition said
    wouldn&#039;t for this case.
&lt;/ul&gt;

&lt;a id=&#039;textbook-bfs&#039;&gt;&lt;/a&gt;
&lt;h2&gt;A textbook BFS&lt;/h2&gt;

&lt;p&gt;
  So what does a breadth-first search look like, and why would it be
  disk-unfriendly? Before this little project I&#039;d only ever seen
  variants of the textbook formulation, something like this:
&lt;/p&gt;

&lt;pre&gt;
from collections import deque

def bfs(graph, start, end):
    visited = {start}
    todo = deque([start])
    while todo:
        node = todo.popleft()
        if node == end:
            return True
        # adjacent() produces the neighbors of node in the graph.
        for kid in adjacent(node):
            if kid not in visited:
                visited.add(kid)
                todo.append(kid)
    return False
&lt;/pre&gt;

&lt;p&gt;
  As the program produces new candidate nodes, each node is checked
  against a hash table of already visited nodes. If it&#039;s already
  present in the hash table, we ignore the node. Otherwise it&#039;s added
  both to the queue and the hash table. Sometimes the &#039;visited&#039;
  information is carried in the nodes rather than in a side-table; but
  that&#039;s a dodgy optimization to start with, and totally impossible
  when the graph is implicit rather than explicit.
&lt;/p&gt;

&lt;p&gt;
  Why is a hash table problematic? Because hash tables will tend to
  have a totally random memory access pattern. If they don&#039;t, it&#039;s a
  bad hash function and the hash table will probably perform terribly
  due to collisions. This random access pattern can cause performance
  issues even when the data fits in memory: an access to a huge
  hash table is pretty likely to cause both a cache and TLB miss. But
  if a significant chunk of the data is actually on disk rather than
  in memory? It&#039;d be disastrous: something on the order of 10ms per
  lookup.
&lt;/p&gt;

&lt;p&gt; With 10G unique states we&#039;d be looking at about four months of
  waiting for disk IO just for the hash table accesses. That can&#039;t
  work; the problem absolutely needs to be transformed such that the
  program can process big batches of data in one go.  &lt;/p&gt;

&lt;a id=&#039;sort-merge&#039;&gt;&lt;/a&gt;
&lt;h2&gt;A sort + merge BFS&lt;/h2&gt;

&lt;p&gt;
  If we wanted to batch the data access as much as possible, what would
  be the maximum achievable coarseness? Since the program can&#039;t know which nodes
  to process on depth layer N+1 before layer N has been fully processed, it seems
  obvious that we have to do our deduplication of states at least once
  per depth.
&lt;/p&gt;

&lt;p&gt;
  Dealing with a whole layer at one time allows ditching hash tables,
  and representing the visited set and the new states as sorted
  streams of some sort (e.g. file streams, arrays, lists). We can
  trivially find the new visited set with a set union on the streams, and
  equally trivially find the todo set with a set difference.
&lt;/p&gt;

&lt;p&gt;
  The two set operations can be combined to work on a single pass
  through both streams. Basically peek into both streams, process the
  smaller element, and then advance the stream that the element came
  from (or both streams if the elements at the head were equal).  In
  either case, add the element to the new visited set. When advancing
  just the stream of new states, also add the element to the new todo
  set:
&lt;/p&gt;

&lt;pre&gt;
def bfs(graph, start, end):
    visited = Stream()
    todo = Stream()
    visited.add(start)
    todo.add(start)
    # Stop once a whole layer produces no unseen states.
    while todo:
        new = []
        for node in todo:
            if node == end:
                return True
            for kid in adjacent(node):
                new.append(kid)
        new_stream = Stream()
        # Sort and deduplicate the whole layer in one batch.
        for node in sorted(set(new)):
            new_stream.add(node)
        todo, visited = merge_sorted_streams(new_stream, visited)
    return False

# Merges sorted streams new and visited. Return a sorted stream of
# elements that were just present in new, and another sorted
# stream containing the elements that were present in either or
# both of new and visited.
def merge_sorted_streams(new, visited):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            if visited.peek() == new.peek():
                out_visited.add(visited.pop())
                new.pop()
            elif visited.peek() &lt; new.peek():
                out_visited.add(visited.pop())
            elif visited.peek() &gt; new.peek():
                out_todo.add(new.peek())
                out_visited.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek())
            out_visited.add(new.pop())
    return out_todo, out_visited
&lt;/pre&gt;

&lt;p&gt;
  The data access pattern is now perfectly linear and predictable;
  there are no random accesses at all during the merge. Disk latency
  thus becomes irrelevant, and the only thing that matters is
  throughput.
&lt;/p&gt;
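
&lt;p&gt;
  For concreteness, here&#039;s a minimal runnable sketch of the same idea,
  with plain sorted Python lists standing in for the streams. The names
  &lt;code&gt;merge_sorted&lt;/code&gt;, &lt;code&gt;bfs_sort_merge&lt;/code&gt; and the toy
  &lt;code&gt;adjacent&lt;/code&gt; function are made up for illustration; it
  only shows the single linear pass over both inputs, not the on-disk
  representation.
&lt;/p&gt;

```python
def merge_sorted(new, visited):
    """Merge two sorted, duplicate-free lists in a single linear pass.

    Returns (todo, merged): todo holds the elements only present in new,
    merged holds the union of both inputs. Both outputs are sorted.
    """
    todo, merged = [], []
    i, j = 0, 0
    while i != len(new) or j != len(visited):
        if j == len(visited) or (i != len(new) and visited[j] > new[i]):
            # Head of new is the smaller one: a state not seen before.
            todo.append(new[i])
            merged.append(new[i])
            i += 1
        elif i == len(new) or new[i] > visited[j]:
            # Head of visited is the smaller one: just copy it over.
            merged.append(visited[j])
            j += 1
        else:
            # Equal heads: a duplicate, keep one copy and advance both.
            merged.append(visited[j])
            i += 1
            j += 1
    return todo, merged


def bfs_sort_merge(start, end, adjacent):
    """Return True if end is reachable from start."""
    visited = [start]
    todo = [start]
    while todo:
        new = []
        for node in todo:
            if node == end:
                return True
            new.extend(adjacent(node))
        # Deduplicate the whole layer in one batch.
        todo, visited = merge_sorted(sorted(set(new)), visited)
    return False


def adjacent(n):
    # Toy implicit graph on small integers: add 3 or double, capped at 20.
    return [m for m in (n + 3, n * 2) if 20 >= m]


print(bfs_sort_merge(1, 20, adjacent))
print(bfs_sort_merge(1, 0, adjacent))
```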

&lt;p&gt;
  What does the theoretical performance look like with the simplified
  data distribution of 100 depth levels and 100M states per depth?
  The average state will be both read and written 50 times. That&#039;s
  10 bytes/state * 5G states * 50 = 2.5TB. My hard drive can supposedly
  read and write at a sustained
  100MB/s, which would mean (2 * 2.5TB) / (100MB/s) =~ 50k seconds
  =~ 13 hours spent on the IO. That&#039;s a couple of orders of magnitude
  better than the earlier four month estimate!
&lt;/p&gt;
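
&lt;p&gt;
  The same back of the envelope estimate, spelled out as a few lines of
  Python. The constants are just the simplified figures from the
  paragraph above, not measurements.
&lt;/p&gt;

```python
BYTES_PER_STATE = 10
STATES = 5 * 10**9                  # 5G states, as in the estimate above
PASSES = 50                         # times the average state is read and written
DISK_BYTES_PER_SEC = 100 * 10**6    # 100MB/s sustained

bytes_one_way = BYTES_PER_STATE * STATES * PASSES
seconds = 2 * bytes_one_way / DISK_BYTES_PER_SEC   # reads plus writes
hours = seconds / 3600

print(bytes_one_way / 10**12, "TB each way")
print(round(seconds), "seconds, or about", round(hours, 1), "hours")
```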

&lt;p&gt;
  It&#039;s worth noting that this simplistic model is not considering the
  size of the newly generated states. Before the merge step, they need
  to be kept in-memory for the sorting + deduplication. We&#039;ll look
  closer at that in a later section.
&lt;/p&gt;

&lt;a id=&#039;compression&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Compression&lt;/h2&gt;

&lt;p&gt;
  In the introduction I mentioned that compressing the states didn&#039;t
  look very promising in the initial experiments, with a 30% compression
  ratio. But after the above algorithm change the states are now ordered.
  That should be a lot easier to compress.
&lt;/p&gt;

&lt;p&gt;
  To test this theory, I used zstd on a puzzle of 14.6M states, with each
  state being 8 bytes. After the sorting they compressed to an average of
  1.4 bytes per state. That seems like a solid improvement. Not quite
  enough to run the whole program in memory, but it could plausibly
  cut the disk IO to just a couple of hours.
&lt;/p&gt;

&lt;p&gt;
  Is there any way to do better than a state of the art general
  purpose compression algorithm, if you know something about the
  structure of the data? Almost certainly. One good example is the PNG
  format. Technically the compression is just a standard Deflate
  pass. But rather than compress the raw image data, the image is
  first transformed
  using &lt;a href=&#039;https://www.w3.org/TR/PNG-Filters.html&#039;&gt;PNG filters&lt;/a&gt;.
  A PNG filter is basically a formula for predicting the value of a
  byte in the raw data from the value of the same byte on the
  previous row and/or the same byte of the previous pixel. For example the
  &#039;up&#039; filter transforms each byte by subtracting the previous row&#039;s
  value from it during compression, and doing the inverse when
  decompressing. Given the kinds of images PNG is meant for, the
  result will probably mostly consist of zeroes or numbers close to
  zero. Deflate can compress these far better than the raw data.
&lt;/p&gt;
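
&lt;p&gt;
  As a rough illustration of the &#039;up&#039; filter, here is a simplified
  sketch that works on a list of equal-length rows. It leaves out the
  parts of real PNG encoding that don&#039;t matter here (such as the
  per-row filter type byte), and the function names are invented for
  this example.
&lt;/p&gt;

```python
def up_filter(rows):
    """Replace each byte with its difference from the byte above (mod 256)."""
    out = []
    prev = bytes(len(rows[0]))  # the row above the first row counts as zeroes
    for row in rows:
        out.append(bytes((cur - up) % 256 for cur, up in zip(row, prev)))
        prev = row
    return out


def up_unfilter(filtered):
    """Inverse of up_filter: add the reconstructed previous row back."""
    out = []
    prev = bytes(len(filtered[0]))
    for row in filtered:
        prev = bytes((delta + up) % 256 for delta, up in zip(row, prev))
        out.append(prev)
    return out


rows = [bytes([10, 20, 30, 40]),
        bytes([10, 21, 30, 40]),
        bytes([11, 21, 30, 41])]
filtered = up_filter(rows)
for row in filtered:
    print(list(row))   # mostly zeroes after the first row
```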

&lt;p&gt;
  Can we apply a similar idea to the state records of the BFS? Seems
  like it should be possible. Just like in PNGs, there&#039;s a fixed row
  size, and we&#039;d expect adjacent rows to be very similar. The first
  tries with a subtraction/addition filter followed by zstd resulted
  in another 40% improvement in compression ratios: 0.87 bytes per
  state. The filtering operations are trivial, so this was basically
  free from a CPU consumption point of view.
&lt;/p&gt;

&lt;p&gt;
  It wasn&#039;t clear if one could do a lot better than that, or whether
  this was a practical limit. In image data there&#039;s a reasonable
  expectation of similarity between adjacent bytes of the same row.
  For the state data that&#039;s not true. But actually slightly more
  sophisticated filters could still improve on that number. The
  one I ended up using worked like this:
&lt;/p&gt;

&lt;p&gt;
  Let&#039;s assume we have adjacent rows R1 = [1, 2, 3, 4] and R2 = [1, 2,
  6, 4].  When outputting R2, we compare each byte to the same byte on
  the previous row, with a 0 for match and 1 for mismatch: diff = [0,
  0, 1, 0]. We then emit that bitmap encoded as a VarInt, followed by
  just the bytes that did not match the previous row. In this example, the
  two bytes &#039;0b00000100 6&#039;. This filter
  alone compressed the benchmark to 2.2 bytes / state. But combining
  this filter + zstd got it down to 0.42 bytes / state. Or to put it
  another way, that&#039;s 3.36 bits per state, which is just a little bit
  over what the back of the envelope calculation suggested was needed
  to fit in RAM.
&lt;/p&gt;
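
&lt;p&gt;
  A sketch of that row filter in Python follows. Two things are
  assumptions rather than details from the actual program: the VarInt
  is taken to be the common LEB128-style encoding (7 payload bits per
  byte), and bit i of the bitmap is taken to stand for byte i of the
  row, which matches the &#039;0b00000100 6&#039; example above. The function
  names are made up.
&lt;/p&gt;

```python
def encode_varint(n):
    """LEB128-style VarInt: 7 payload bits per byte, high bit means more follows."""
    out = bytearray()
    while n >= 128:
        out.append(n % 128 + 128)
        n //= 128
    out.append(n)
    return bytes(out)


def decode_varint(buf, pos):
    """Return (value, position just past the VarInt)."""
    value, scale = 0, 1
    while True:
        byte = buf[pos]
        pos += 1
        value += (byte % 128) * scale
        scale *= 128
        if 128 > byte:
            return value, pos


def encode_row(prev, row):
    """Emit a mismatch bitmap followed by only the bytes that changed."""
    bitmap = 0
    changed = bytearray()
    for i, (a, b) in enumerate(zip(prev, row)):
        if a != b:
            bitmap += 2 ** i      # bit i set: byte i differs from the row above
            changed.append(b)
    return encode_varint(bitmap) + bytes(changed)


def decode_row(prev, buf, pos=0):
    """Rebuild one row from the previous row. Return (row, new position)."""
    bitmap, pos = decode_varint(buf, pos)
    row = bytearray(prev)
    for i in range(len(prev)):
        if (bitmap // 2 ** i) % 2:
            row[i] = buf[pos]
            pos += 1
    return bytes(row), pos


r1 = bytes([1, 2, 3, 4])
r2 = bytes([1, 2, 6, 4])
encoded = encode_row(r1, r2)
print(list(encoded))   # [4, 6]
```

&lt;p&gt;
  In a real pipeline the concatenated output of all rows would then be
  fed to zstd; that step is omitted here.
&lt;/p&gt;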

&lt;p&gt;
  In practice the compression ratios improve as the sorted sets get
  more dense. Once the search gets to a point where memory starts
  becoming an issue, the compression ratios can get a lot better than
  that. The largest problem turned out to have 4.6G distinct visited states in
  the end. These states took 405MB when sorted and compressed with the
  above scheme. That&#039;s &lt;b&gt;0.7 bits per state&lt;/b&gt;. The compression and
  decompression end up taking about 25% of the program&#039;s CPU time,
  but that seems like a great tradeoff for cutting memory use to 1/100th.
&lt;/p&gt;

&lt;p&gt;
  The filter above does feel a bit wasteful due to the VarInt
  header on every row. It seems like it should be easy to improve on it with very
  little extra cost in CPU or complexity. I tried a bunch of other
  variants that transposed the data to a column-major order, or wrote
  the bitmasks in bigger blocks, etc. These variants invariably got much
  better compression ratio by themselves, but then didn&#039;t do as well
  when the output of the filter was compressed with zstd. It wasn&#039;t
  just due to some quirk of zstd either, the results were similar with
  gzip and bzip2. I don&#039;t have any great theories on why this
  particular encoding ended up compressing much better than the
  alternatives.
&lt;/p&gt;

&lt;p&gt;
  Another mystery is that the compression ratio ended up far better when the
  data was sorted little-endian rather than big-endian. I initially thought
  it was due to the little-endian sort ending up with more leading zeros
  on the VarInt-encoded bitmask. But this difference persisted even for
  filters that didn&#039;t have such dependencies.
&lt;/p&gt;

&lt;p&gt;
  (There&#039;s a lot of research on compressing sorted sets of integers,
  since they&#039;re a basic building block of search engines. I didn&#039;t find
  a lot on compressing sorted fixed-size records though, and didn&#039;t want
  to start jumping through the hoops of representing my data as arbitrary
  precision integers.)
&lt;/p&gt;

&lt;a id=&#039;cheated&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Oh no, I&#039;ve cheated!&lt;/h2&gt;

&lt;p&gt;
  You might have noticed that the above pseudocode implementations of
  BFS were only returning a boolean for solution found / not found.
  That&#039;s not very useful. For most purposes you need to be
  able to produce a list of the exact steps of the solution, not
  just state that a solution exists.
&lt;/p&gt;

&lt;p&gt;
  On the surface the solution is easy. Rather than collect sets of
  states, collect mappings from states to a parent state. Then after
  finding a solution, just trace back the list of parent states from
  the end to the start. For the hash table based solution, it&#039;d be
  something like:
&lt;/p&gt;

&lt;pre&gt;
from collections import deque

def bfs(graph, start, end):
    # Maps each visited state to the state it was first reached from.
    visited = {start: None}
    todo = deque([start])
    while todo:
        node = todo.popleft()
        if node == end:
            return trace_solution(node, visited)
        for kid in adjacent(node):
            if kid not in visited:
                visited[kid] = node
                todo.append(kid)
    return None

def trace_solution(state, visited):
    # Follow the parent links back to the start, whose parent is None.
    if state is None:
        return []
    return trace_solution(visited[state], visited) + [state]
&lt;/pre&gt;

&lt;p&gt;
  Unfortunately this will totally kill the compression gains
  from the last section, for two reasons. First, the core assumption was
  that adjacent rows would be very similar. That was true when we just
  looked at the states themselves. But there is no reason to believe
  that&#039;s going to be true for the parent states; they&#039;re effectively
  random data. Second, the sort + merge solution has to read and write
  back all seen states on each iteration. To maintain the state / parent
  state mapping, we&#039;d also have to read and write all this badly
  compressing data to disk on each iteration.
&lt;/p&gt;

&lt;a id=&#039;sort-merge-multiple-outputs&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Sort + merge with multiple outputs&lt;/h2&gt;

&lt;p&gt;
  The program only needs the state/parent mappings at the very end,
  when tracing back the solution. We can thus maintain two data
  structures in parallel. &#039;Visited&#039; is still the set of visited
  states, and gets recomputed during the merge just like before.
  &#039;Parents&#039; is a mostly sorted list of state/parent pairs, which
  doesn&#039;t get rewritten. Instead the new states + their parents get
  appended to &#039;parents&#039; after each merge operation.
&lt;/p&gt;

&lt;pre&gt;
def bfs(graph, start, end):
    parents = Stream()
    visited = Stream()
    todo = Stream()
    parents.add((start, None))
    visited.add(start)
    todo.add(start)
    # Stop once a whole layer produces no unseen states.
    while todo:
        new = []
        for node in todo:
            if node == end:
                return trace_solution(node, parents)
            for kid in adjacent(node):
                # Remember which state each new state came from.
                new.append((kid, node))
        new_stream = Stream()
        # Sort by state, keeping a single (state, parent) pair per state.
        for pair in sorted(dict(new).items()):
            new_stream.add(pair)
        todo, visited = merge_sorted_streams(new_stream, visited, parents)
    return None

# Merges sorted streams new and visited. New contains pairs of
# key + value (just the keys are compared), visited contains just
# keys.
#
# Returns a sorted stream of keys that were just present in new,
# another sorted stream containing the keys that were present in either or
# both of new and visited. Also adds the keys + values to the parents
# stream for keys that were only present in new.
def merge_sorted_streams(new, visited, parents):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            visited_head = visited.peek()
            new_head = new.peek()[0]
            if visited_head == new_head:
                out_visited.add(visited.pop())
                new.pop()
            elif visited_head &lt; new_head:
                out_visited.add(visited.pop())
            elif visited_head &gt; new_head:
                out_todo.add(new_head)
                out_visited.add(new_head)
                parents.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek()[0])
            out_visited.add(new.peek()[0])
            parents.add(new.pop())
    return out_todo, out_visited
&lt;/pre&gt;

&lt;p&gt;
This gives us the best of both worlds from a runtime and working
set perspective, but does mean using more secondary storage. A
separate copy of the visited states grouped by depth turns out
to also be useful later on for other reasons.
&lt;/p&gt;
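
&lt;p&gt;
  Here is a small runnable sketch of the two-output version, again with
  Python lists in place of streams. Everything about it is a toy: the
  function names and the &lt;code&gt;adjacent&lt;/code&gt; graph are invented, and
  the final trace uses a dict over the parents list, where a real
  implementation would presumably scan the stored pairs instead.
&lt;/p&gt;

```python
def merge_with_parents(new, visited, parents):
    """new: sorted list of (state, parent) pairs with unique states.
    visited: sorted list of states. parents: append-only list of pairs.

    Returns (todo, merged). Pairs for states not seen before are
    appended to parents, which is never rewritten.
    """
    todo, merged = [], []
    i, j = 0, 0
    while i != len(new) or j != len(visited):
        if j == len(visited) or (i != len(new) and visited[j] > new[i][0]):
            todo.append(new[i][0])
            merged.append(new[i][0])
            parents.append(new[i])
            i += 1
        else:
            if i != len(new) and new[i][0] == visited[j]:
                i += 1          # duplicate: drop the new pair
            merged.append(visited[j])
            j += 1
    return todo, merged


def trace_solution(state, parents):
    lookup = dict(parents)      # fine for a toy, too big for the real thing
    path = []
    while state is not None:
        path.append(state)
        state = lookup[state]
    return path[::-1]


def bfs_with_parents(start, end, adjacent):
    parents = [(start, None)]
    visited = [start]
    todo = [start]
    while todo:
        new = {}
        for node in todo:
            if node == end:
                return trace_solution(end, parents)
            for kid in adjacent(node):
                new.setdefault(kid, node)   # keep one parent per state
        todo, visited = merge_with_parents(sorted(new.items()), visited, parents)
    return None


def adjacent(n):
    # Toy implicit graph on small integers: add 3 or double, capped at 20.
    return [m for m in (n + 3, n * 2) if 20 >= m]


print(bfs_with_parents(1, 20, adjacent))
```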

&lt;a id=&#039;swapping&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Swapping&lt;/h2&gt;

&lt;p&gt;
  Another detail ignored in the snippets of pseudocode is that there
  is no explicit code for disk IO, just an abstract interface
  Stream. The Stream might be a file stream or an in-memory array, but
  we&#039;ve been ignoring that implementation detail. Instead the
  pseudocode is concerned with having a memory access pattern that
  would be disk friendly. In a perfect world that&#039;d be enough, and the
  virtual memory subsystem of the OS would take care of the rest.
&lt;/p&gt;

&lt;p&gt;
  At least with Linux that doesn&#039;t seem to be the case. At one point
  (before the working set had been shrunk to fit in memory) I&#039;d gotten
  the program to run in about 11 hours when the data was stored mostly
  on disk. I then switched the program to use anonymous pages instead
  of file-backed ones, and set up sufficient swap on the same
  disk. After three days the program had gotten a quarter of the way
  through, and was still getting slower over time. My optimistic
  estimate was that it&#039;d finish in 20 days.
&lt;/p&gt;

&lt;p&gt;
  Just to be clear, this was exactly the same code and &lt;i&gt;exactly the
  same access pattern&lt;/i&gt;. The only thing that changed was whether the
  memory was backed by an explicit on-disk file or by swap. It&#039;s
  pretty much axiomatic that swapping tends to totally destroy
  performance on Linux, whereas normal file IO doesn&#039;t. I&#039;d
  always assumed it was due to programs having the gall to treat RAM
  as something to be randomly accessed. But that wasn&#039;t the case here.
&lt;/p&gt;

&lt;p&gt;
  Turns out that file-backed and anonymous pages are not treated
  identically by the VM subsystem after all. They&#039;re kept in separate
  LRU caches with different expiration policies, and they also appear
  to have different readahead / prefetching properties.
&lt;/p&gt;

&lt;p&gt;
  So now I know: Linux swapping will probably not work well even under
  optimal circumstances. If parts of the address space are likely to
  be paged out for a while, it&#039;s better to arrange manually for them to
  be file-backed than to trust swap. I did it by implementing a custom
  vector class that started off as a purely in-memory implementation, and
  after a size threshold is exceeded switches to mmap on an unlinked
  temporary file.
&lt;/p&gt;
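
&lt;p&gt;
  The actual vector class isn&#039;t shown here, but the general shape of
  the trick can be sketched in Python. The class below is hypothetical
  (name, interface and growth policy are all made up): it keeps bytes in
  memory until a threshold is crossed, and then moves them to an mmap of
  a temporary file. On POSIX systems &lt;code&gt;tempfile.TemporaryFile&lt;/code&gt;
  gives a file with no visible name, which plays the role of the
  unlinked file.
&lt;/p&gt;

```python
import mmap
import tempfile


class SpillBuffer:
    """Append-only byte buffer that starts in memory and switches to a
    file-backed mmap once it grows past a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.size = 0
        self.data = bytearray()   # in-memory storage to begin with
        self.file = None

    def append(self, chunk):
        end = self.size + len(chunk)
        if self.file is None and end > self.threshold:
            self._spill()
        if self.file is None:
            self.data += chunk
        else:
            if end > len(self.data):
                self._grow(max(end, 2 * len(self.data)))
            self.data[self.size:end] = chunk
        self.size = end

    def _spill(self):
        # Copy what we have so far into a freshly mapped temporary file.
        old = bytes(self.data)
        capacity = max(len(old), 1)   # an empty file cannot be mapped
        self.file = tempfile.TemporaryFile()
        self.file.truncate(capacity)
        self.data = mmap.mmap(self.file.fileno(), capacity)
        self.data[:len(old)] = old

    def _grow(self, capacity):
        # Extend the file, then map it again at the larger size.
        self.data.flush()
        self.data.close()
        self.file.truncate(capacity)
        self.data = mmap.mmap(self.file.fileno(), capacity)

    def read(self, start, end):
        return bytes(self.data[start:end])


buf = SpillBuffer(threshold=16)
for _ in range(3):
    buf.append(b"abcdefgh")
print(buf.file is not None, buf.read(0, buf.size))
```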

&lt;a id=&#039;compression-before-merging&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Compressing new states before merging&lt;/h2&gt;

&lt;p&gt;
  In the simplified performance model the assumption was that there
  would be 100M new states per depth. That turned out not to be too
  far off reality (the most difficult puzzle peaked at about 150M
  unique new states from one depth layer). But it&#039;s also not the right
  thing to measure; the working set before the merge isn&#039;t related to
  just the unique states, but all the states that were output for this
  iteration. This measure peaks at 880M output states / depth. These 880M
  states a) need to be accessed with a random access pattern for the sorting,
  b) can&#039;t be compressed efficiently due to not being sorted, and c)
  need to be stored along with the parent state. That&#039;s a roughly 16GB
  working set.
&lt;/p&gt;

&lt;p&gt;
  The obvious solution would be to use some form of external sorting.
  Just write all the states to disk, do an external sort, do a
  deduplication, and then execute the merge just as before. This is
  the solution I went with first, but while it mostly solved problem
  A, it did nothing for B and C.
&lt;/p&gt;

&lt;p&gt;
  The alternative I ended up with was to collect the states into an
  in-memory array. If the array grows too large (e.g. more than 100M
  elements), it&#039;s sorted, deduplicated and compressed. This gives us
  a bunch of sorted runs of states, with no duplicates inside the run
  but potentially some between the runs. The code for merging the
  new and visited states is fundamentally the same; it&#039;s still based
  on walking through the streams in lockstep. The only change is that
  instead of walking through just the two streams, there&#039;s a separate
  stream for each of the sorted runs of new states.
&lt;/p&gt;

&lt;p&gt;
  The compression ratios for these 100M state runs are of course not
  quite as good as for compressing the set of all visited states. But
  even so, it cuts down both the working set and the disk IO
  requirements by a ton. There&#039;s a little bit of extra CPU from having
  to maintain a priority queue of streams, but it was still a great
  tradeoff.
&lt;/p&gt;
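
&lt;p&gt;
  A minimal sketch of the sorted-runs idea, leaving out the compression
  entirely. It leans on Python&#039;s &lt;code&gt;heapq.merge&lt;/code&gt; for the
  priority queue of streams; the helper names and the tiny buffer limit
  are made up for the example.
&lt;/p&gt;

```python
import heapq


def flush_run(buffer, runs):
    """Sort and deduplicate the in-memory buffer into a new run."""
    if buffer:
        runs.append(sorted(set(buffer)))
        buffer.clear()


def unique_merge(runs):
    """Walk all sorted runs in lockstep, yielding each state once."""
    last = object()   # sentinel that never equals a real state
    for state in heapq.merge(*runs):
        if state != last:
            yield state
            last = state


MAX_BUFFER = 4   # tiny for the example; the text uses something like 100M
buffer, runs = [], []
for state in [5, 1, 5, 9, 3, 1, 7, 9, 2, 5]:
    buffer.append(state)
    if len(buffer) >= MAX_BUFFER:
        flush_run(buffer, runs)
flush_run(buffer, runs)

print(runs)
print(list(unique_merge(runs)))
```

&lt;p&gt;
  The output of &lt;code&gt;unique_merge&lt;/code&gt; is a single sorted,
  duplicate-free stream, so it can be fed to the same kind of merge
  against the visited set as before.
&lt;/p&gt;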

&lt;a id=&#039;parent-states&#039;&gt;&lt;/a&gt;
&lt;h2&gt;Saving space on the parent states&lt;/h2&gt;

&lt;p&gt;
  At this point the vast majority of the space used by this program is
  spent on storing the parent states, so that we can reconstruct the
  solution after finding it. They are unlikely to compress well, but
  is there maybe a CPU/memory tradeoff to be made?
&lt;/p&gt;

&lt;p&gt;
  What we need is a mapping from a state S&#039; at depth D+1 to its parent
  state S at depth D. If we could iterate all possible parent states
  of S&#039;, we could simply check if any of them appear at depth D in our
  visited set. (We&#039;ve already produced the visited set grouped by
  depth as a convenient byproduct when outputting the state/parent
  mappings from merge). Unfortunately that doesn&#039;t work for this
  problem; it&#039;s simply too hard to generate all the possible states S
  given S&#039;. It&#039;d probably work just fine for many other search problems
  though.
&lt;/p&gt;

&lt;p&gt;
  If we can only generate the state transitions forward, not backward,
  how about just doing that then? Let&#039;s iterate through all the states at
  depth D, and see what output states they have. If some state produces S&#039;
  as an output, we&#039;ve found a workable S. The issue with the plan is that
  it increases the total CPU usage of the program by 50%. (Not 100%, since
  on average we find S after looking at half the states of depth D).
&lt;/p&gt;

&lt;p&gt;
  So I don&#039;t like either of the extremes, but at least there is a
  CPU/memory tradeoff available there. Is there maybe a more palatable
  option somewhere in the middle? What I ended up doing was to not
  store the pair (S&#039;, S), but instead (S&#039;, H(S)), where H is an 8 bit
  hash function. To find an S given S&#039;, again iterate through all the
  states at depth D. But before doing anything else, compute the same
  hash. If the output doesn&#039;t match H(S), this isn&#039;t the state we&#039;re
  looking for, and we can just skip it. This optimization means doing
  the expensive re-computation for just 1/256 states, which is a
  negligible CPU increase, while cutting down the memory spent
  for storing the parent states from 8-10 bytes to 1 byte.
&lt;/p&gt;
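
&lt;p&gt;
  In code, the lookup could look roughly like the sketch below. The hash
  function here is an arbitrary multiplicative hash picked for the
  example, not the one the program uses, and &lt;code&gt;find_parent&lt;/code&gt;
  and &lt;code&gt;adjacent&lt;/code&gt; are hypothetical names. Note that it
  returns the first state that both matches the hash and produces the
  child, which is all that&#039;s needed to reconstruct some valid
  solution.
&lt;/p&gt;

```python
def h8(state):
    """Toy 8 bit hash of an integer state; any cheap hash would do."""
    return (state * 2654435761) % 2**32 // 2**24


def find_parent(child, parent_hash, depth_states, adjacent):
    """Find a state at the previous depth that produces child."""
    for candidate in depth_states:
        if h8(candidate) != parent_hash:
            continue    # cheap rejection, no need to expand this state
        if child in adjacent(candidate):
            return candidate
    return None


def adjacent(n):
    # Toy implicit graph on small integers: add 3 or double, capped at 20.
    return [m for m in (n + 3, n * 2) if 20 >= m]


# State 10 was reached from 7, so only (10, h8(7)) was stored.
stored_hash = h8(7)
print(find_parent(10, stored_hash, [5, 7, 8], adjacent))
```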

&lt;a id=&#039;did-not-work&#039;&gt;&lt;/a&gt;
&lt;h2&gt;What didn&#039;t or might not work&lt;/h2&gt;

&lt;p&gt;
  The previous sections go through a sequence of high level
  optimizations that worked. There were other things that I tried
  that didn&#039;t work, or that I found in the literature but decided
  would not actually work in this particular case. Here&#039;s a non-exhaustive
  list.
&lt;/p&gt;

&lt;p&gt;
  At one point I was not recomputing the full visited set at every
  iteration. Instead it was kept as multiple sorted runs, and those
  runs were occasionally compacted. The benefit was fewer disk writes
  and less CPU spent on compression. The downside was more code
  complexity and a worse compression ratio. I originally thought this
  design made sense since in my setup writes were more expensive than
  reads. But in the end the compression ratio was worse by a factor of
  2. The tradeoffs are non-obvious, but in the end I reverted back to
  the simpler form.
&lt;/p&gt;

&lt;p&gt;
  There is a little bit of research done into executing huge breadth
  first searches for implicit graphs on secondary storage;
  a &lt;a href=&#039;https://www.cs.helsinki.fi/u/bmmalone/heuristic-search-fall-2013/Korf2008.pdf&#039;&gt;2008 survey paper&lt;/a&gt; is a good starting point. As one might
  guess, the idea of doing the deduplication in a batch with
  sort+merge, on secondary storage, isn&#039;t novel. The surprising part is that it was
  apparently only discovered in 1993. That&#039;s pretty late! There
  are then some later proposals for secondary storage breadth first
  search that don&#039;t require a sorting step.
&lt;/p&gt;

&lt;p&gt;
  One of them was to map the states to integers, and to maintain an
  in-memory bitmap of the visited states. This is totally useless for
  my case, since the sizes of the encodable vs. actually reachable
  state spaces are so different. And I&#039;m a bit doubtful about there
  being any interesting problems where this approach works.
&lt;/p&gt;

&lt;p&gt;
  The other viable sounding alternative is based on temporary hash tables.
  The visited states are stored unsorted in a file. Store the outputs from
  depth D in a hash table. Then iterate through the visited states, and
  look them up in the hash table. If the element is found in the hash table,
  remove it. After iterating through the whole file, only the non-duplicates
  remain. They can then be appended to the file, and used to initialize the
  todo list for the next iteration. If the number of outputs is so large that
  the hash table doesn&#039;t fit in memory, both the files and the hash tables
  can be partitioned using the same criteria (e.g. top bits of state), with
  each partition getting processed independently.
&lt;/p&gt;
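
&lt;p&gt;
  For comparison with the sort + merge code, the unpartitioned form of
  that scheme is only a few lines. This is a sketch based purely on the
  description above, with a Python set as the temporary hash table and a
  list standing in for the unsorted file.
&lt;/p&gt;

```python
def dedup_with_hash_table(outputs, visited_file):
    """outputs: states generated at this depth, possibly with duplicates.
    visited_file: unsorted list standing in for the on-disk file.
    """
    candidates = set(outputs)            # the temporary hash table
    for state in visited_file:           # one sequential pass over the file
        candidates.discard(state)
    fresh = sorted(candidates)           # sorted only to make the demo stable
    visited_file.extend(fresh)           # append the non-duplicates
    return fresh                         # seeds the next todo list


visited_file = [7, 1, 4]
todo = dedup_with_hash_table([4, 9, 2, 9, 7], visited_file)
print(todo, visited_file)
```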

&lt;p&gt;
  While there are &lt;a href=&#039;https://pdfs.semanticscholar.org/d9b5/ca0e84ebf8566c34cf218aba1789af6d3111.pdf&#039;&gt;benchmarks&lt;/a&gt; claiming the hash-based approach is
  roughly 30% faster than sort+merge, the benchmarks don&#039;t really seem to
  consider compression. I just don&#039;t see how giving up the compression gains
  could be worth it, so didn&#039;t experiment with these approaches at all.
&lt;/p&gt;

&lt;p&gt;
  The other relevant branch of research that seemed promising was
  database query optimization. The deduplication problem seems very
  much related to database joins, with exactly the same
  &lt;a href=&#039;https://15721.courses.cs.cmu.edu/spring2018/papers/19-hashjoins/schuh-sigmod2016.pdf&#039;&gt;sort vs. hash&lt;/a&gt; &lt;a href=&#039;http://www.vldb.org/pvldb/vol7/p85-balkesen.pdf&#039;&gt;dilemma&lt;/a&gt;. Obviously some of these findings should
  carry over to a search problem. The difference
  might be that the output of a database join is transient, while the
  outputs of a BFS deduplication persist for the rest of the computation.
  It feels like that changes the tradeoffs: it&#039;s not just about how to
  process one iteration most efficiently, it&#039;s also about having the
  outputs in the optimal format for the next iteration.
&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;
  That concludes the things I learned from this project that seem
  generally applicable to other brute force search problems. These
  tricks combined to get the hardest puzzles of the game from
  an effective memory footprint of 50-100GB down to 500MB, while degrading
  gracefully if the problem exceeds available memory and spills to
  disk. It is also
  50% faster than a naive hash table based state deduplication
  even for puzzles that fit into memory.
&lt;/p&gt;

&lt;p&gt;
  The next post will deal with optimizing grid-based spatial puzzle
  games in general, as well as some issues specific just to this
  particular game.
&lt;/p&gt;

&lt;p&gt;
  In the meantime, Snakebird is available at least on &lt;a href=&#039;https://store.steampowered.com/app/357300/Snakebird/&#039;&gt;Steam&lt;/a&gt;,
  &lt;a href=&#039;https://play.google.com/store/apps/details?id=com.NoumenonGames.SnakeBird_Touch&amp;hl=en&#039;&gt;Google Play&lt;/a&gt;, and the &lt;a href=&#039;https://itunes.apple.com/us/app/snakebird/id1087075743?mt=8&#039;&gt;App Store&lt;/a&gt;. I recommend it for anyone interested
  in a very hard but fair puzzle game.
&lt;/p&gt;
</description><author>jsnell@iki.fi</author><category>GAMES</category><pubDate>Mon, 23 Jul 2018 16:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2018-07-23-optimizing-breadth-first-search/</guid></item><item><title>Why PS4 downloads are so slow</title><link>https://www.snellman.net/blog/archive/2017-08-19-slow-ps4-downloads/</link><description>
  &lt;p&gt;
    Game downloads on PS4 have a reputation of being very slow, with many people
    reporting downloads being an order of magnitude faster on Steam or
    Xbox. This had long been on my list of things to look into, but at
    a pretty low priority.  After all, the PS4 operating system is
    based on a reasonably modern FreeBSD (9.0), so there should not be
    any crippling issues in the TCP stack. The implication is that the
    problem is something boring, like an inadequately dimensioned CDN.
  &lt;/p&gt;

  &lt;p&gt;
    But then I heard that people were successfully using local HTTP
    proxies as a workaround. It should be pretty rare for that to
    actually help with download speeds, which made this sound like a
    much more interesting problem.
  &lt;/p&gt;

&lt;read-more&gt;&lt;/read-more&gt;

  &lt;p&gt;
    This is going to be a long-winded technical post.  If you&#039;re not
    interested in the details of the investigation but just want a
    recommendation on speeding up PS4 downloads, skip straight to the
    &lt;a href=&#039;#conclusions&#039;&gt;conclusions&lt;/a&gt;.
  &lt;/p&gt;

  &lt;h3&gt;Background&lt;/h3&gt;

  &lt;p&gt;
    Before running any experiments, it&#039;s good to have a mental model
    of how the thing we&#039;re testing works, and where the problems might
    be. If nothing else, it will guide the initial experiment design.
  &lt;/p&gt;

  &lt;p&gt;
    The speed of a steady-state TCP connection is basically defined by
    three numbers. The amount of data the client is willing to receive on
    a single round-trip (TCP receive window), the amount of data the
    server is willing to send on a single round-trip (TCP congestion
    window), and the round trip latency between the client and the server (RTT).
    To a first approximation, the connection speed will be:
  &lt;/p&gt;

  &lt;pre&gt;
    speed = min(rwin, cwin) / RTT
&lt;/pre&gt;

  &lt;p&gt;
    With this model, how could a proxy speed up the connection?  Well,
    with a proxy the original connection will be split into two mostly
    independent parts; one connection between the client and the
    proxy, and another between the proxy and the server. The speed of
    the end-to-end connection will be determined by the slower of
    those two independent connections:
  &lt;/p&gt;

  &lt;pre&gt;
    speed_proxy_client = min(client rwin, proxy cwin) / client-proxy RTT
    speed_server_proxy = min(proxy rwin, server cwin) / proxy-server RTT
    speed = min(speed_proxy_client, speed_server_proxy)
&lt;/pre&gt;

  &lt;p&gt;
    With a local proxy the client-proxy RTT will be very low; that
    connection is almost guaranteed to be the faster one. The
    improvement will have to be from the server-proxy connection being
    somehow better than the direct client-server one. The RTT will not
    change, so there are just two options: either the client has a
    much smaller receive window than the proxy, or the client is
    somehow causing the server&#039;s congestion window to
    decrease. (E.g. the client is randomly dropping received packets,
    while the proxy isn&#039;t).
  &lt;/p&gt;
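
  &lt;p&gt;
    To get a feel for the model, here it is as a couple of Python
    functions. The window sizes and RTTs plugged in below are made-up
    round numbers chosen to show the receive window theory; they are
    not measurements.
  &lt;/p&gt;

```python
def tcp_speed(rwin, cwin, rtt):
    """Steady-state throughput in bytes/s; windows in bytes, RTT in seconds."""
    return min(rwin, cwin) / rtt


def proxied_speed(client_rwin, proxy_cwin, client_proxy_rtt,
                  proxy_rwin, server_cwin, proxy_server_rtt):
    """End-to-end speed is set by the slower of the two connections."""
    return min(tcp_speed(client_rwin, proxy_cwin, client_proxy_rtt),
               tcp_speed(proxy_rwin, server_cwin, proxy_server_rtt))


KB = 1024
# Hypothetical client with a 64kB receive window, 100ms from the server.
direct = tcp_speed(rwin=64 * KB, cwin=4096 * KB, rtt=0.1)
# Same client behind a local proxy (1ms away) with a 4MB receive window.
proxied = proxied_speed(client_rwin=64 * KB, proxy_cwin=4096 * KB,
                        client_proxy_rtt=0.001,
                        proxy_rwin=4096 * KB, server_cwin=4096 * KB,
                        proxy_server_rtt=0.1)

print(round(direct / KB), "kB/s direct")
print(round(proxied / KB), "kB/s via proxy")
```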

  &lt;p&gt;
    Out of these two theories, the receive window one should be much
    more likely, so we should concentrate on it first. But that just
    replaces our original question with a new one: why would the
    client&#039;s receive window be so low that it becomes a noticeable
    bottleneck? There&#039;s a fairly limited number of causes for low
    receive windows that I&#039;ve seen in the wild, and they don&#039;t really
    seem to fit here.
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; Maybe the client doesn&#039;t support the TCP window scaling option,
      while the proxy does. Without window scaling, the receive window
      will be limited to 64kB. But since we know Sony started with a
      TCP stack that supports window scaling, they would have had to
      go out of their way to disable it. Slow downloads, for no benefit.
    &lt;li&gt; Maybe the actual downloader application is very slow. The operating
      system is supposed to have a certain amount of buffer space available
      for each connection. If the network is delivering data to the OS
      faster than the application is reading it, the buffer will start to
      fill up, and the OS will reduce the receive window as a form
      of back-pressure. But this can&#039;t be the reason; if the application
      is the bottleneck, it&#039;ll be a bottleneck with or without the
      proxy.
    &lt;li&gt; The operating system is trying to dynamically scale the
      receive window to match the actual network conditions, but
      something is going wrong. This would be interesting, so it&#039;s
      what we&#039;re hoping to find.
  &lt;/ul&gt;

  &lt;p&gt;
    The initial theories are in place, let&#039;s get digging.
  &lt;/p&gt;

  &lt;h3&gt;Experiment #1&lt;/h3&gt;

  &lt;p&gt;
    For our first experiment, we&#039;ll start a PSN download on a baseline
    non-Slim PS4, firmware 4.73. The network connection of the PS4 is
    bridged through a Linux machine, where we can add latency to the
    network using &lt;code&gt;tc netem&lt;/code&gt;. By varying the added latency,
    we should be able to find out two things: whether the receive
    window really is the bottleneck, and whether the receive window
    is being automatically scaled by the operating system.
  &lt;/p&gt;

  &lt;p&gt;
    This is what the client-server RTTs (measured from a packet
    capture using TCP timestamps) look like for the experimental
    period. Each dot represents 10 seconds of time for a single
    connection, with the Y axis showing the minimum RTT seen for that
    connection in those 10 seconds.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-rtt-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-rtt-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    The next graph shows the amount of data sent by the server in one
    round trip in red, and the receive windows advertised by the
    client in blue.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-win-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl1-win-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    First, since the blue dots are staying constantly at about 128kB,
    the operating system doesn&#039;t appear to be doing any kind of
    receive window scaling based on the RTT. (So much for that
    theory). Though at the very right end of the graph the receive
    window shoots up to 650kB, so it isn&#039;t totally
    fixed either.
  &lt;/p&gt;

  &lt;p&gt;
    Second, is the receive window the bottleneck here? If so, the
    blue dots would be close to the red dots. This is the case
    until about 10:50. And then mysteriously the bottleneck moves to
    the server.
  &lt;/p&gt;

  &lt;p&gt;
    So we didn&#039;t find quite what we were looking for, but there are a
    couple of very interesting things that are correlated with events
    on the PS4.
  &lt;/p&gt;

  &lt;p&gt;The download was in the foreground for the whole duration of the
    test. But that doesn&#039;t mean it was the only thing running on the
    machine. The Netflix app was still running in the background,
    completely idle &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;]. When the background app was
    closed at 11:00, the receive window increased dramatically. This
    suggests a second experiment, where different applications are
    opened / closed / left running in the background.
  &lt;/p&gt;

  &lt;p&gt;
    The time where the receive window stops being the bottleneck is
    very close to the PS4 entering rest mode. That looks like another
    thing worth investigating. Unfortunately it turns out not to be the
    cause; rest mode is a red herring here. &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]
  &lt;/p&gt;

  &lt;h3&gt;Experiment #2&lt;/h3&gt;

  &lt;p&gt;
    Below is a graph of the receive windows for a second download,
    annotated with the timing of various noteworthy events.
  &lt;/p&gt;

  &lt;a href=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl2-rwin-full.png&#039; target=&#039;_blank&#039;&gt;
    &lt;img src=&#039;https://www.snellman.net/blog/stc/images/ps4-dl/dl2-rwin-thumb.png&#039;&gt;
  &lt;/a&gt;

  &lt;p&gt;
    The differences in receive windows at different times are
    striking. And more important, the changes in the receive
    windows correspond very well to specific things I did on
    the PS4.
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; When the download was started, the game Styx: Shards of
      Darkness was running in the background (just idling in the title
      screen). The download was limited by a receive window of under
      7kB. This is an incredibly low value; it&#039;s basically going to
      cause the downloads to take &lt;b&gt;100 times longer than they should&lt;/b&gt;.
      And this was not a coincidence; whenever that game
      was running, the receive window would be that low.
    &lt;li&gt; Having an app running (e.g. Netflix, Spotify) limited the
      receive window to 128kB, for about a 5x reduction in potential
      download speed.
    &lt;li&gt; Moving apps, games, or the download window to the foreground
      or background didn&#039;t have any effect on the receive window.
    &lt;li&gt; Launching some other games (Horizon: Zero Dawn, Uncharted 4,
      Dreadnought) seemed to have the same effect as running an app.
    &lt;li&gt; Playing an online match in a networked game (Dreadnought) caused the
      receive window to be artificially limited to 7kB.
    &lt;li&gt; Playing around in a non-networked game (Horizon: Zero Dawn)
      had a very inconsistent effect on the receive window, with the
      effect seemingly depending on the intensity of gameplay. This
      looks like a genuine resource restriction (download process
      getting variable amounts of CPU), rather than an artificial
      limit.
    &lt;li&gt; I ran a speedtest at a time when downloads were limited to
      7kB receive window. It got a decent receive window of over
      400kB; the conclusion is that the artificial receive window
      limit appears to only apply to PSN downloads.
    &lt;li&gt; Putting the PS4 into rest mode had no effect.
    &lt;li&gt; Built-in features of the PS4 UI, like the web browser,
      do not count as apps.
    &lt;li&gt; When a game was started (causing the previously running game
      to be stopped automatically), the receive window could increase
      to 650kB for a very brief period of time. Basically it appears
      that the receive window gets unclamped when the old game stops,
      and then clamped again a few seconds later when the new game
      actually starts up.
  &lt;/ul&gt;

  &lt;p&gt;
    I did a few more test runs, and all of them seemed
    to support the above findings. The only additional information
    from that testing is that the rest mode behavior was dependent
    on the PS4 settings. Originally I had it set up to suspend apps
    when in rest mode. If that setting was disabled, the apps would
    be closed when entering rest mode, and the downloads would
    proceed at full speed.
  &lt;/p&gt;

  &lt;p&gt;A 7kB receive window will be absolutely crippling for any user.
     A 128kB window might be ok for users who have CDN servers very
     close by, or who don&#039;t have a particularly fast internet connection. For
     example at my location, a 128kB receive window would cap the downloads at about
     35Mbps to 75Mbps depending on which CDN the DNS RNG happens to give me.
     The lowest two speed tiers for my ISP are 50Mbps and 200Mbps.
     So either the 128kB would not be a noticeable problem (50Mbps) or it&#039;d mean
     that downloads are artificially limited to 25% speed (200Mbps).
  &lt;/p&gt;
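
  &lt;p&gt;
    To make the arithmetic concrete: a fixed window caps throughput at
    window / RTT. The sketch below inverts that relation to show which
    round-trip times the 35Mbps and 75Mbps figures correspond to. The RTT
    values are derived from the formula rather than measured, and the
    calculation assumes 1kB = 1024 bytes.
  &lt;/p&gt;

&lt;pre&gt;
```python
def cap_mbps(window_bytes, rtt_seconds):
    """Throughput ceiling in Mbps for a fixed receive window."""
    return window_bytes * 8 / rtt_seconds / 1e6


def implied_rtt_ms(window_bytes, mbps):
    """RTT in milliseconds at which the window caps speed at mbps."""
    return window_bytes * 8 / (mbps * 1e6) * 1000


WINDOW = 128 * 1024  # assuming 1kB = 1024 bytes

rtt_slow_cdn = implied_rtt_ms(WINDOW, 35)   # roughly 30 ms
rtt_fast_cdn = implied_rtt_ms(WINDOW, 75)   # roughly 14 ms
```
&lt;/pre&gt;

  &lt;p&gt;
    By the same formula, a 128kB window would only stop being the
    bottleneck on a 200Mbps line at an RTT of a little over 5 ms.
  &lt;/p&gt;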

  &lt;a name=&#039;conclusions&#039;&gt;&lt;/a&gt;
  &lt;h3&gt;Conclusions&lt;/h3&gt;

  &lt;p&gt;If any applications are running, the PS4 appears to change the
    settings for PSN store downloads, artificially restricting their
    speed. Closing the other applications will remove the limit. There
    are a few important details:
  &lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt; Just leaving the other applications running in the background will
      &lt;b&gt;not help&lt;/b&gt;. The exact same limit is applied whether the download
      progress bar is in the foreground or not.
    &lt;li&gt; Putting the PS4 into rest mode might or might not help,
      depending on your system settings.
    &lt;li&gt;The artificial limit applies only to the PSN store downloads.
      It does &lt;b&gt;not&lt;/b&gt; affect e.g. the built-in speedtest. This
      is why the speedtest might report much higher speeds than the
      actual downloads, even though both are delivered from the same
      CDN servers.
    &lt;li&gt; Not all applications are equal; most of them will cause the
      connections to slow down by up to a factor of 5. Some
      games will cause a difference of about a factor of 100. Some
      games will start off with the factor of 5, and then migrate to
      the factor of 100 once you leave the start menu and start playing.
    &lt;li&gt; The above limits are artificial. In addition to that,
      actively playing a game can cause game downloads to slow down.
      This appears to be due to a genuine lack of CPU resources (with
      the game understandably having top priority).
  &lt;/ul&gt;

  &lt;p&gt;
    So if you&#039;re seeing slow downloads, just closing all the running
    applications might be worth a shot. (But it&#039;s obviously not
    guaranteed to help. There are other causes for slow downloads as
    well, this will just remove one potential bottleneck).
    To close the running applications, you&#039;ll need to
    long-press the PS button on the controller, and then select &amp;quot;Close
    applications&amp;quot; from the menu.
  &lt;/p&gt;

  &lt;p&gt;
    The PS4 doesn&#039;t make it very obvious exactly what programs are
    running. For games, the interaction model is that opening a new
    game closes the previously running one. This is not how other apps
    work; they remain in the background indefinitely until you
    explicitly close them.
  &lt;/p&gt;

  &lt;p&gt;
    And it gets worse than that. If your PS4 is configured to
    suspend any running apps when put to rest mode, you can seemingly
    power on the machine into a clean state, and still have a hidden
    background app that&#039;s causing the OS to limit your PSN download
    speeds.
  &lt;/p&gt;

  &lt;p&gt;
    This might explain some of the superstitions about this on the
    Internet. There are people who swear that putting the machine to
    rest mode helps with speeds, others who say it does nothing. Or
    how after every firmware update people will report increased
    download speeds. Odds are that nothing actually changed in the
    firmware; it&#039;s just that those people had done their first full
    reboot in a while, and finally had a system without a background
    app running.
  &lt;/p&gt;

  &lt;h3&gt;Speculation&lt;/h3&gt;

  &lt;p&gt;
    Those were the facts as I see them. Unfortunately this raises some
    new questions, which can&#039;t be answered experimentally. With no
    facts, there&#039;s no option except to speculate wildly!
  &lt;/p&gt;

  &lt;p&gt;&lt;b&gt;Q: Is this an intentional feature? If so, what is its purpose?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    Yes, it must be intentional. The receive window changes very
    rapidly when applications or games are opened/closed, but not for
    any other reason. It&#039;s not any kind of subtle operating system
    level behavior; it&#039;s most likely the PS4 UI explicitly
    manipulating the socket receive buffers.
  &lt;/p&gt;
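
  &lt;p&gt;
    For reference, this is roughly what clamping a socket receive buffer
    looks like with the standard BSD sockets API, shown here via Python. It
    only illustrates the general mechanism; nothing is known here about how
    the PS4 actually does it, and the 7kB value is simply the figure observed
    above. Operating systems are free to round or adjust the requested size,
    so the value read back may differ from the request.
  &lt;/p&gt;

&lt;pre&gt;
```python
import socket

# Request a small receive buffer on a not-yet-connected TCP socket; the
# advertised receive window cannot grow beyond what the buffer can hold.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 7 * 1024)

# The OS may round or adjust the request (Linux, for example, doubles
# it to leave room for bookkeeping), so read the effective value back.
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sock.close()
```
&lt;/pre&gt;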

  &lt;p&gt;
    But why? I think the idea here must be to not allow the network
    traffic of background downloads to take resources away from the
    foreground use of the PS4. For example if I&#039;m playing an online
    shooter, it makes sense to harshly limit the background download
    speeds to make sure the game is getting ping times that are
    both low and predictable. So there&#039;s at least some point in that
    7kB receive window limit in some circumstances.
  &lt;/p&gt;

  &lt;p&gt;
    It&#039;s harder to see what the point of the 128kB receive window
    limit for running any app is. A single game download from some
    random CDN isn&#039;t going to muscle out Netflix or Youtube... The
    only thing I can think of is that they&#039;re afraid that multiple
    simultaneous downloads, e.g. due to automatic updates, might cause
    problems for playing video. But even that seems like a stretch.
  &lt;/p&gt;

  &lt;p&gt;
    There&#039;s an alternate theory that this is due to some non-network
    resource constraints (e.g. CPU, memory, disk). I don&#039;t think that
    works. If the CPU or disk were the constraint, just having the
    appropriate priorities in place would automatically take care of
    this. If the download process gets starved of CPU or disk
    bandwidth due to a low priority, the receive buffer would fill up
    and the receive window would scale down dynamically, exactly when
    needed. And the amounts of RAM we&#039;re talking about here are
    minuscule on a machine with 8GB of RAM; less than a megabyte.
  &lt;/p&gt;

  &lt;p&gt;&lt;b&gt;Q: Is this feature implemented well?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    Oh dear God, no. It&#039;s hard to believe just how sloppy this
    implementation is.
  &lt;/p&gt;

  &lt;p&gt;
    The biggest problem is that the limits get applied based just on
    what games/applications are currently running.  That&#039;s just
    insane; what matters should be which games/applications someone is
    currently using. Especially in a console UI, it&#039;s a totally
    reasonable expectation that the foreground application gets
    priority. If I&#039;ve got the download progress bar in the foreground,
    the system had damn well better give that download priority. Not some
    application that was started a month ago, and hasn&#039;t been used
    since. Applying these limits in rest mode with suspended
    apps is beyond insane.
  &lt;/p&gt;

  &lt;p&gt;
    Second, these limits get applied per-connection.  So if you&#039;ve got
    a single download going, it&#039;ll get limited to 128kB of receive
    window. If you&#039;ve got five downloads, they&#039;ll all get 128kB, for a
    total of 640kB. That means the effectiveness of the &amp;quot;make sure downloads
    don&#039;t clog the network&amp;quot; policy depends purely on how many downloads
    are active. That&#039;s rubbish. This is all controlled on the
    application level, and the application knows how many downloads
    are active. If there really were an optimal static receive window
    X, it should just be split evenly across all the downloads.
  &lt;/p&gt;
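
  &lt;p&gt;
    A sketch of the even split suggested above, next to the observed
    per-connection behavior. The function name, the choice of 128kB as the
    budget X, and the integer-division rounding are assumptions made for
    illustration.
  &lt;/p&gt;

&lt;pre&gt;
```python
def per_download_window(total_window, active_downloads):
    """Split one total receive-window budget evenly across downloads."""
    if active_downloads == 0:
        return total_window
    return total_window // active_downloads


KB = 1024
# Per-connection limit as observed: the total grows with each download.
observed_total = 5 * 128 * KB
# Even split: the total stays at the chosen budget X (here 128kB).
split_each = per_download_window(128 * KB, 5)
split_total = split_each * 5
```
&lt;/pre&gt;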

  &lt;p&gt;
    Third, the core idea of applying a static receive window as a
    means of fighting bufferbloat is just fundamentally broken.
    Using the receive window as the rate limiting mechanism just
    means that the actual transfer rate will depend on the RTT
    (this is why a local proxy helps). For this kind of thing to
    work well, you can&#039;t have the rate limit depend on the
    RTT. You also can&#039;t just have somebody come up with a number
    once, and apply that limit to everyone. The limit needs to
    depend on the actual network conditions.
  &lt;/p&gt;

   &lt;p&gt;
    There are ways to detect how congested the downlink is in the
    client-side TCP stack. The proper fix would be to implement them,
    and adjust the receive window of low-priority background downloads
    if and only if congestion becomes an issue. That would actually be
    a pretty valuable feature for this kind of appliance. But I can
    kind of forgive this one; it&#039;s not an off the shelf feature, and
    maybe Sony doesn&#039;t employ any TCP kernel hackers.
  &lt;/p&gt;
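
  &lt;p&gt;
    One signal that can be used for this is the RTT itself: if the round
    trip time rises well above the lowest RTT seen on the connection, a queue
    is probably building up somewhere on the path. The sketch below is a
    deliberately simplified, hypothetical controller along those lines. The
    threshold, step sizes, limits and names are all made up, and a real
    implementation would live in the TCP stack rather than in application
    code.
  &lt;/p&gt;

&lt;pre&gt;
```python
def next_window(window, rtt, min_rtt,
                floor=16 * 1024, ceiling=1024 * 1024):
    """Pick the next receive window for a low-priority download.

    Shrink when the RTT has risen well above the lowest RTT seen
    (a hint that queues are building up), otherwise grow slowly.
    All constants here are illustrative assumptions.
    """
    if rtt > 1.5 * min_rtt:
        return max(floor, window // 2)
    return min(ceiling, window + 16 * 1024)
```
&lt;/pre&gt;

  &lt;p&gt;
    Unlike a static clamp, a rule of this shape leaves the download alone
    on an uncongested link regardless of how far away the CDN is.
  &lt;/p&gt;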

  &lt;p&gt;
    Fourth, whatever method is being used to decide on whether a game
    is network-latency sensitive is broken. It&#039;s absurd that a demo of
    a single-player game idling in the initial title screen would
    cause the download speeds to be totally crippled. This really
    should be limited to actual multiplayer titles, and ideally just
    to periods where someone is actually playing the game online.
    Just having the game running should not be enough.
  &lt;/p&gt;

  &lt;p&gt;&lt;b&gt;Q: How can this still be a problem, 4 years after launch?&lt;/b&gt;&lt;/p&gt;

  &lt;p&gt;
    I have no idea. Sony must know that the PSN download speeds have
    been the butt of jokes for years. It&#039;s probably the biggest
    complaint people have with the system. So it&#039;s hard to believe
    that nobody was ever given the task of figuring out why it&#039;s
    slow. And this is not rocket science; anyone bothering to look
    into it would find these problems in a day.&lt;/p&gt;

  &lt;p&gt;
    But it seems equally impossible that they know of the cause, but
    decided not to apply any of the trivial fixes to it. (Hell, it
    wouldn&#039;t even need to be a proper technical fix. It could just be
    a piece of text saying that downloads will work faster with all
    other apps closed).
  &lt;/p&gt;

  &lt;p&gt;
    So while it&#039;s possible to speculate in an informed manner about
    other things, this particular question will remain as an open
    mystery.  Big companies don&#039;t always get things done very
    efficiently, eh?
  &lt;/p&gt;

  &lt;h3&gt;Footnotes&lt;/h3&gt;

  &lt;div class=footnotes&gt;

    &lt;p&gt; &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;]
      How idle? So idle that I hadn&#039;t even logged in, the app
        was in the login screen.
    &lt;/p&gt;

    &lt;p&gt;
      &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] To be specific, the
        slowdown is caused by the artificial latency changes. The PS4
        downloads files in chunks, and each chunk can be served from a
        different CDN. The CDN that was being used from 10:51 to 11:00
        was using a delay-based congestion control algorithm, and
        reacting to the extra latency by reducing the amount of data
        sent. The CDN used earlier in the connection was using a
        packet-loss based congestion control algorithm, and did not
        slow down despite seeing the latency change in exactly the same
        pattern.
    &lt;/p&gt;
  &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>NETWORKING</category><category>GAMES</category><pubDate>Sat, 19 Aug 2017 19:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2017-08-19-slow-ps4-downloads/</guid></item><item><title>A rating system for asymmetric multiplayer games</title><link>https://www.snellman.net/blog/archive/2015-11-18-rating-system-for-asymmetric-multiplayer-games/</link><description>
      &lt;h2&gt;Introduction&lt;/h2&gt;

      &lt;p&gt;
        A couple of years ago I wrote a quick and dirty rating system
        for an &lt;a href=&#039;http://terra.snellman.net/&#039;&gt;online boardgame
        site&lt;/a&gt; I run. It wasn&#039;t particularly well thought out, but
        it did the job. Some discussion about the system made me
        revisit it, with two years of hindsight and orders of
        magnitude more data.

      &lt;p&gt;
        How well does the system actually work, and how predictive are
        the ratings? There are some obvious tweaks to the system
        &amp;mdash; would implementing them make things better or worse?
        Would anything be gained from switching to a more principled
        (but more complicated) approach? For this last bit, I used
        Microsoft&#039;s &lt;a href=&#039;http://research.microsoft.com/en-us/projects/trueskill/&#039;&gt;TrueSkill&lt;/a&gt;
        as the benchmark. It has some desirable properties and
        appears to be the gold standard of team based rating systems right now.

      &lt;p&gt;
        The code and the data are available on GitHub in my
        &lt;a href=&#039;https://github.com/jsnell/rating-eval&#039;&gt;rating-eval repository&lt;/a&gt;.
      &lt;/p&gt;

      &lt;read-more&gt;&lt;/read-more&gt;

      &lt;h2&gt;Table of contents&lt;/h2&gt;

      &lt;ul&gt;
        &lt;li&gt;&lt;a href=&#039;#game&#039;&gt;The game&lt;/a&gt; - What properties of this
          game are relevant to a rating system?
        &lt;li&gt;&lt;a href=&#039;#rating-system&#039;&gt;The rating system&lt;/a&gt; - How does the
          current rating system on the site work?
        &lt;li&gt;&lt;a href=&#039;#evaluating&#039;&gt;Evaluating rating quality&lt;/a&gt; - What
          are valid metrics for evaluating the predictive quality of
          a rating system?
        &lt;li&gt;&lt;a href=&#039;#trueskill&#039;&gt;TrueSkill&lt;/a&gt; - What&#039;s TrueSkill, and
          how was this game translated to TrueSkill concepts?
        &lt;li&gt;&lt;a href=&#039;#results&#039;&gt;Results&lt;/a&gt; - The evaluation results in
          tedious detail.
        &lt;li&gt;&lt;a href=&#039;#conclusions&#039;&gt;Conclusions&lt;/a&gt; - The tl;dr on
          the results.
        &lt;li&gt;&lt;a href=&#039;#footnotes&#039;&gt;Footnotes&lt;/a&gt;
      &lt;/ul&gt;

      &lt;a id=&#039;game&#039;&gt;&lt;/a&gt;
      &lt;h2&gt;The game&lt;/h2&gt;

      &lt;p&gt;
        The game in question
        is &lt;a href=&#039;https://boardgamegeek.com/boardgame/120677/terra-mystica&#039;&gt;Terra
        Mystica&lt;/a&gt;. I won&#039;t go into the exact mechanics of the game
        in this article, since that just doesn&#039;t matter. What does
        matter is the question: what makes the game such a unique snowflake
        that I can&#039;t just plop in a
        standard &lt;a href=&#039;https://en.wikipedia.org/wiki/Elo_rating_system&#039;&gt;Elo&lt;/a&gt;
        or &lt;a href=&#039;https://en.wikipedia.org/wiki/Glicko_rating_system&#039;&gt;Glicko&lt;/a&gt;
        implementation and call it a day?

      &lt;p&gt;
        First, TM is a &lt;b&gt;multiplayer&lt;/b&gt; game and indeed most
        commonly played as a 3-5 player game rather than a 2p
        one. This means that a pure two player rating system would not
        work. But there are well known ways of coercing a two player
        system to a multiplayer one, so that&#039;s not a major concern.

      &lt;p&gt;
        Second, TM is very &lt;b&gt;asymmetric&lt;/b&gt;, especially by the standards of
        the &lt;a href=&#039;https://boardgamegeek.com/wiki/page/Eurogame&#039;&gt;eurogame&lt;/a&gt;
        genre. The base game of Terra Mystica comes
        with &lt;a href=&#039;http://www.terra-mystica-spiel.de/en/voelker.php&#039;&gt;
        14 different factions&lt;/a&gt;, freely chosen by the players at the
        start of the game. The only restriction is that the factions
        come in 7 colors; picking a faction of a given color blocks
        the other faction of that color from the game. The first
        expansion added 6 more factions to the game. The factions can
        be very different from each other, with each of them having
        different special powers, building costs, resource production,
        and so on. Every faction also has a preference for different
        parts of the map, and will be on a different point on the
        symbiotic-competitive spectrum with different opposing
        factions.

      &lt;p&gt;
        Why would asymmetry matter for a rating system? Well,
        one reason is that asymmetry is a potential source of
        imbalance.  Especially early on in the game&#039;s lifecycle it was
        not clear whether the different factions were balanced or not,
        and if not how unbalanced they were.

      &lt;p&gt;
        The arguments about this were complicated by there being a
        chicken and egg problem. Some people thought that statistics
        on the win rates or average scores for each faction were
        invalid. Clearly, the argument went, some factions were just
        doing well because they&#039;re more popular among good players (or
        the converse for factions doing badly due to being popular
        among bad players). And then there were people arguing the
        opposite! They&#039;d say that you could not actually tell anything
        about how good a player was, because a high win rate or high
        average score for a player might just be due to them
        playing good factions.

      &lt;p&gt;
        The way to open up this deadlock is a rating system that can
        simultaneously determine the skill of players and the
        relative strength of factions, with a feedback system between
        the two.

      &lt;p&gt;
        &lt;img src=&#039;/blog/stc/images/tm-maps.png&#039; style=&#039;float: left; padding-right: 2ex;&#039;&gt;

      &lt;p&gt;
        Third, the first full expansion to the game didn&#039;t just introduce
        the new factions mentioned above. It also introduced two
        completely &lt;b&gt;new maps&lt;/b&gt;. Now, Terra Mystica as a game is
        incredibly sensitive to the map design. I was part of
        playtesting one of the new maps, and it was kind of an
        infuriating process. The developer would make a tiny tweak,
        just swapping a couple of hexes around, and there would be a
        butterfly effect. So there would be an attempt to make
        Darklings a little worse with a change, and they&#039;re not
        actually affected but Halflings become borderline unplayable.

      &lt;p&gt;
        From a rating system view the question is then whether these
        three maps should be considered as separate games or the same
        one. (The game also has some random setup aspects; these
        setups can&#039;t be considered as separate games from a rating
        perspective since there&#039;s something like 2 million different
        configurations).

      &lt;a id=&#039;rating-system&#039;&gt;&lt;/a&gt;
      &lt;h2&gt;The rating system&lt;/h2&gt;

      &lt;p&gt;
        My primary goal for the rating system was for it to be useful
        for predicting game outcomes &lt;a id=&#039;fnref1&#039;&gt;[&lt;a href=&#039;#fn1&#039;&gt;1&lt;/a&gt;]. And as mentioned in the
        previous section, it would need to compute both player and
        faction ratings. But in addition to these things, there were
        two secondary goals.

      &lt;p&gt;A lot of board game sites use an unmodified Elo system with
        the standard constants, with the ratings updated after every
        match finishes. This can make the ratings absurdly
        volatile. It also encourages people not to play against opponents
        with much lower ratings. Both of these are driven by the way
        multiplayer games are less computable than two player
        games. You can lose a game essentially through no fault of
        your own, which won&#039;t happen in a two player game, and
        especially if you finish dead last in a five player game, that
        single game can erase what feels like 20 games of progress to
        a higher rating. It&#039;s no wonder that the players become very
        conservative and risk-averse.

      &lt;p&gt;
        Some players will even take advantage of this volatility and
        manipulate their ranking by changing the order in which their
        games finish. Do you have 10 games about to finish? Just
        stall in the 5 that you&#039;re doing well in, and rush the ones
        you&#039;re losing in. Your final rating will be significantly
        higher than if you alternated the good and bad games.
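
      &lt;p&gt;
        The effect is easy to reproduce with a toy simulation. The sketch
        below uses plain two player Elo against opponents fixed at 1500,
        with K=32 and one update per finished game; those numbers are
        assumptions for illustration, not the parameters of any particular
        site.
      &lt;/p&gt;

&lt;pre&gt;
```python
def expected(rating, opponent):
    """Standard Elo expected score of rating against opponent."""
    return 1 / (1 + 10 ** ((opponent - rating) / 400))


def final_rating(results, rating=1500.0, opponent=1500.0, k=32):
    """Apply results (1 = win, 0 = loss) in order, one update per game."""
    for actual in results:
        rating += k * (actual - expected(rating, opponent))
    return rating


# Five wins and five losses in both cases; only the order differs.
losses_first = final_rating([0] * 5 + [1] * 5)
alternating = final_rating([0, 1] * 5)
```
&lt;/pre&gt;

      &lt;p&gt;
        With identical results, finishing the losses first ends on a higher
        rating than alternating, because each update is computed against
        the rating as it stands at that moment.
      &lt;/p&gt;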

      &lt;p&gt;
        All of the above is bad. So a secondary goal was that the
        system shouldn&#039;t be quite that sensitive to the most recent
        games.  The exception is players with very few games played;
        for those players you do want very high sensitivity so that
        they get roughly to their proper rating as soon as possible.

      &lt;p&gt;
        And second, I wanted to encourage players to pick the factions
        perceived as worse more often. So a player should be rewarded
        more / penalized less for winning with a bad faction than for
        winning with a good faction.

      &lt;p&gt;
        So, this is what I ended up with.

      &lt;p&gt;
        We start with the usual hack for converting a two player
        rating system to work with multiplayer games: treat
        each &lt;code&gt;N&lt;/code&gt; player match as &lt;code&gt;(N/2) * (N-1)&lt;/code&gt;
        two player submatches, and apply the rating algorithm to those
        submatches.
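
      &lt;p&gt;
        Pairing every player against every other player once gives
        N * (N-1) / 2 submatches. A minimal Python sketch of that expansion,
        with placeholder player names:
      &lt;/p&gt;

&lt;pre&gt;
```python
from itertools import combinations


def submatches(players):
    """All unordered pairs of players in one multiplayer match."""
    return list(combinations(players, 2))


# A four player game yields (4/2) * (4-1) = 6 two player submatches.
pairs = submatches(["A", "B", "C", "D"])
```
&lt;/pre&gt;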

      &lt;p&gt;
        The core of the system is the normal Elo equation, which
        computes an expected score (somewhere between 0 for a loss and
        1 for a win) for both players based on the difference of their
        ratings:
      &lt;/p&gt;

&lt;pre&gt;
  my $ep1 = 1 / (1 + 10**(-$diff / 400));
  my $ep2 = 1 / (1 + 10**($diff / 400));
&lt;/pre&gt;

      &lt;p&gt;
        Sorry, I lied! We don&#039;t actually use the difference of the
        ratings of the players. Instead we add up the rating of the
        player and the rating of the faction they are playing first,
        and only take the difference after that.
      &lt;/p&gt;

&lt;pre&gt;
  my $p1_score = $p1-&gt;{score} + $f1-&gt;{score} * $fw;
  my $p2_score = $p2-&gt;{score} + $f2-&gt;{score} * $fw;
  my $diff = $p1_score - $p2_score;
&lt;/pre&gt;

      &lt;p&gt;
        What&#039;s the &lt;code&gt;$fw&lt;/code&gt; variable there? It&#039;s a
        configurable faction weight, in case we want to run
        experiments with the faction choice being considered more or
        less important. There&#039;s one other thing &lt;code&gt;$fw&lt;/code&gt; is
        used for. If a player drops out, we still want to compute a
        rating change for them (otherwise they&#039;d just drop out of
        games they&#039;re about to lose to avoid the ranking penalty). But
        we really should not be penalizing the faction the player was
        using when they dropped out. It&#039;s not like the faction had any
        way of influencing that! So in the specific case of a
        dropout, &lt;code&gt;$fw&lt;/code&gt; gets set to 0 for that submatch.
      &lt;/p&gt;

&lt;pre&gt;
  if ($res-&gt;{a}{dropped} or $res-&gt;{b}{dropped}) {
      if ($settings-&gt;{ignore_dropped}) {
          next;
      } else {
          $fw = 0;
      }
  }
&lt;/pre&gt;

      &lt;p&gt;
        We then need the actual result of the game, in the same format
        as our expected results. So again it&#039;s 1 for win, 0 for loss,
        and now also 0.5 for a draw.
      &lt;/p&gt;

&lt;pre&gt;
  if ($a_vp == $b_vp) {
      ($ap1, $ap2) = (0.5, 0.5);
  } elsif ($a_vp &gt; $b_vp) {
      ($ap1, $ap2) = (1, 0);
  } else {
      ($ap1, $ap2) = (0, 1);
  }
&lt;/pre&gt;

      &lt;p&gt;
        The difference between the actual and expected scores is then
        used to compute a rating change for both players. Let&#039;s return
        later to what the value of &lt;code&gt;$pot&lt;/code&gt; actually is.
      &lt;/p&gt;

&lt;pre&gt;
  my $p1_delta = $pot * ($ap1 - $ep1);
  my $p2_delta = $pot * ($ap2 - $ep2);
&lt;/pre&gt;
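
      &lt;p&gt;
        Put together, the result mapping and the delta computation might
        look like this in Python. The &lt;code&gt;expected&lt;/code&gt; function here is
        the standard Elo logistic curve with a scale of 400, which is an
        assumption made for this sketch and may differ from the expected
        result formula used earlier in this post.
      &lt;/p&gt;

```python
def expected(diff, scale=400.0):
    # Standard Elo logistic curve; the scale of 400 is an assumption here.
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))


def actual(a_vp, b_vp):
    # 1 for a win, 0 for a loss, 0.5 for a draw, as (side a, side b).
    if a_vp == b_vp:
        return 0.5, 0.5
    return (1.0, 0.0) if a_vp > b_vp else (0.0, 1.0)


def deltas(diff, a_vp, b_vp, pot):
    ep1 = expected(diff)
    ep2 = 1.0 - ep1
    ap1, ap2 = actual(a_vp, b_vp)
    return pot * (ap1 - ep1), pot * (ap2 - ep2)


# Evenly matched sides, side a wins 120 VP to 100, pot of 16.
print(deltas(0.0, 120, 100, 16.0))  # (8.0, -8.0)
```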

      &lt;p&gt;
        As the last step of dealing with each submatch we apply the
        rating changes. There are some subtleties here. First, we don&#039;t
        necessarily do all the updates immediately but batch up the
        rating records and apply a batch in one go. (For example apply
        all the changes from a single game in one go rather than have
        the results of the first submatch affect the interpretation of
        the other submatches &amp;mdash; but see the results section for
        more on that.)

      &lt;p&gt;
        Second, until a player has played a minimum number of games
        (default 5) they won&#039;t have an effect on other players. Or to
        be clear, we update the rating of a player either if both
        players are new, or if the opponent is not new. (But not when
        the player is old and the opponent is new). This has the
        effect that new players will have a &amp;quot;shadow rating&amp;quot;
        computed for them, but will not affect the ratings of the
        opponents or factions.

      &lt;p&gt;
        The idea here is that we have very little idea of the strength
        of the new player. So it makes no sense to just assume they&#039;re
        an exactly average new player, and use that assumption to
        adjust the ratings of players for whom we already have a lot
        of better quality data.

&lt;pre&gt;
  my $count = ($p1-&gt;{games} &gt;= $settings-&gt;{min_games}) +
    ($p2-&gt;{games} &gt;= $settings-&gt;{min_games});

  if ($p2-&gt;{games} &gt;= $settings-&gt;{min_games} or !$count) {
      push @rating_changes, [$p1, $p1_delta];
  }
  if ($p1-&gt;{games} &gt;= $settings-&gt;{min_games} or !$count) {
      push @rating_changes, [$p2, $p2_delta];
  }
&lt;/pre&gt;
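
      &lt;p&gt;
        The rule is easier to see as a truth table. Here&#039;s a small Python
        restatement of the Perl above; the function name is made up for
        illustration, and the default of 5 is the minimum game count
        mentioned earlier.
      &lt;/p&gt;

```python
def who_gets_updated(p1_games, p2_games, min_games=5):
    """Return (update_p1, update_p2) for one pairwise submatch."""
    p1_established = p1_games >= min_games
    p2_established = p2_games >= min_games
    both_new = not p1_established and not p2_established
    # A player is updated if the opponent is established, or both are new.
    return (p2_established or both_new, p1_established or both_new)


print(who_gets_updated(2, 3))    # (True, True): both new
print(who_gets_updated(2, 50))   # (True, False): only the new player moves
print(who_gets_updated(50, 60))  # (True, True): both established
```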

      &lt;p&gt;
        Faction ratings only get updated if both players have played
        enough games. The faction weight factor is taken into account
        here as well.

&lt;pre&gt;
  next if $count != 2;

  push @rating_changes, [$f1, $pot * $p1_delta * $fw];
  push @rating_changes, [$f2, $pot * $p2_delta * $fw];
&lt;/pre&gt;

      &lt;p&gt;
        The most dodgy part of what I implemented is that (unlike
        traditional Elo) this system doesn&#039;t run as a streaming
        process where the ratings are computed purely from old ratings
        and some number of finished games. Instead the ratings are
        computed iteratively, taking into account the full
        history every time the computation is done. There are obvious
        practical reasons for why you wouldn&#039;t want to do things that
        way, but I&#039;m not expecting to ever have enough games for that
        to matter.

      &lt;p&gt;
        Every iteration works on exactly the same data. The same game
        results handled in the same order, using the same
        settings. However, the ratings are not reset between
        iterations. At the start of the first iteration everyone has a
        rating of 1000. At the start of the second iteration they&#039;ll
        have some other value. These different starting values will
        affect all the subsequent computations. It&#039;s kind of
        back-propagating the later results, such that they affect the
        algorithm&#039;s interpretation of earlier events.

&lt;pre&gt;
  for (1..$settings-&gt;{iters}) {
    iterate_results @matches, %players, %factions, $_, $settings;
  }
&lt;/pre&gt;

      &lt;p&gt;
        There&#039;s another difference between the iterations, which is the
        iteration count being passed in as the 4th parameter. This is where
        the value for &lt;code&gt;$pot&lt;/code&gt; comes from. With the default settings
        the progression goes 16, 4, 1.77. The later iterations have a smaller
        effect.

&lt;pre&gt;
  my $pot = $settings-&gt;{pot_size} / $iter ** $settings-&gt;{iter_decay_exponent};
&lt;/pre&gt;
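
      &lt;p&gt;
        To make the schedule concrete, here is the same formula in Python.
        The defaults of 16 for the pot size and 2 for the decay exponent are
        inferred from the 16, 4, 1.77 progression rather than taken from the
        actual settings, so treat them as assumptions. (16/9 is 1.777...,
        which rounds to 1.78.)
      &lt;/p&gt;

```python
def pot(iteration, pot_size=16.0, iter_decay_exponent=2.0):
    # Defaults are inferred from the 16, 4, 1.77 progression; treat them
    # as assumptions rather than the actual configuration.
    return pot_size / iteration ** iter_decay_exponent


pots = [pot(i) for i in (1, 2, 3)]
print([round(p, 2) for p in pots])  # [16.0, 4.0, 1.78]
print(round(sum(pots), 2))          # 21.78
```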

      &lt;p&gt;
        This also means that the later games have a smaller effect
        than you might initially think. Sure, every result can
        contribute up to 22 rating points to a player&#039;s rating. But in
        practice a chunk of the 1st iteration&#039;s 16 point rating change
        would have been undone by the time the 2nd iteration finishes
        and it&#039;s time to divvy up 4 points for that game. The reason
        is that the player will have a higher starting rating for the
        2nd iteration, so every win will count for less and every loss
        will count for more.

      &lt;p&gt;
        For new players (who have only a few games played) this effect
        of the older games undoing part of the result of the first
        game will be almost non-existent. There aren&#039;t very many of
        those old games around, after all. This is exactly what we
        want: for players with few games every new game tells us a lot
        relative to what we knew at the start. Later games will
        however always matter more than the earlier ones, so it&#039;s also
        not the case that the rating of someone who has played
        hundreds of games can&#039;t shift at all.

      &lt;p&gt;
        What&#039;s not considered by this system? This system only uses
        the final ranks of the game as input. It doesn&#039;t consider
        the in-game victory points, either as an absolute value or
        relative to the scores of the opponents. Does this make sense?
        Surely a win by 10 points should be worth more than a win by 1
        point. The latter is basically a draw!

      &lt;p&gt;
        But this seems like a necessary restriction. People love
        seeing numbers go up, even if the numbers are in reality
        totally irrelevant. They love it so much that they&#039;ll try to
        optimize for the number going up. If you introduce something
        other than winning as a component in the rating system, some
        players will start optimizing for that other thing
        instead. That&#039;s what they&#039;re rewarded for, so it must be the
        right thing! And at that point the system as a whole must
        become less useful at predicting winning, and better at
        predicting that other thing.

      &lt;p&gt;
        So even if using more detailed in-game statistics as part of
        the rating computation would almost certainly provide more
        accurate statistics, it&#039;d only work if the players are
        unaware of it.

      &lt;a id=&#039;evaluating&#039;&gt;&lt;/a&gt;
      &lt;h2&gt;Evaluating rating quality&lt;/h2&gt;

      &lt;p&gt;
        The basic process I used for evaluating the rating quality was
        to first split all my data into two parts. The first 75% or so
        of the data was essentially a training set, used to compute
        ratings for all players and factions in the game. The output
        of the rating system should be such that the ratings of two
        entities A and B can be converted to a win probability - (0 if
        A is basically guaranteed to lose, 1 if they&#039;re basically
        guaranteed to win, but mostly clustered in the 0.25-0.75
        range). We then compare these predictions against the actual
        output in the remaining 25%, the evaluation set, with a loss
        being 0, a win being 1, and a tie (rare but possible) being
        0.5.

      &lt;p&gt;
        How should this comparison work? There&#039;s a few kinds of
        metrics we could compute. Some metrics are basically
        self-contained, completely independent of other prediction
        sets. &amp;quot;This prediction set scored 1500&amp;quot;. Other
        metrics are derived from comparing two sets of predictions
        directly against each other. &amp;quot;A was better than B, 100 to
        80&amp;quot;, &amp;quot;A was better than C, 150 to 60&amp;quot;, but this
        tells you nothing about how B and C would compare.

      &lt;p&gt;
        The rest of this section basically goes through my process of
        trying to find metrics that worked. It&#039;ll therefore stumble in
        and out of a couple of dead ends.

      &lt;p&gt;
        The simplest possible metric is just to &lt;b&gt;sum up the absolute
        errors&lt;/b&gt;. If the prediction for a match is 0.8 and the actual
        result is 1, penalize the prediction set by 0.2 points. The
        closer to zero the score, the better the prediction set. This
        doesn&#039;t actually work. The problem is that it&#039;s suboptimal for
        a system to use its actual prediction; it should always round
        the prediction to the closest of 0 or 1.

      &lt;p&gt;
        &lt;b&gt;Example&lt;/b&gt;: A and B are playing 5 matches. The rating system predicts
        an 80% win rate for A. And in fact A gets exactly that win rate.
        The absolute error would be: &lt;code&gt;4*(1 - 0.8) + 1*(0.8) = 1.6&lt;/code&gt;.
        What if we predict a 100% win rate instead? That&#039;d mean:
        &lt;code&gt;4*(1 - 1) + 1*(1) = 1&lt;/code&gt;. The less accurate prediction
        was judged to be significantly better. That&#039;s just no good at all.

      &lt;p&gt;
        What you instead need is the &lt;b&gt;sum of squares of errors&lt;/b&gt;
        (SSE). For the same example, the 80% prediction then produces
        &lt;code&gt;4*((1 - 0.8)**2) + 1*(0.8**2) = 0.8&lt;/code&gt; while the 100%
        prediction gives &lt;code&gt;4*((1 - 1)**2) + 1*(1.0**2) = 1&lt;/code&gt;.
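
      &lt;p&gt;
        The two worked examples can be checked with a few lines of Python.
        The helper names are made up for this sketch; the numbers are the
        ones from the example (A wins 4 matches out of 5).
      &lt;/p&gt;

```python
def abs_error(pred, results):
    return sum(abs(r - pred) for r in results)


def sse(pred, results):
    return sum((r - pred) ** 2 for r in results)


results = [1, 1, 1, 1, 0]  # A wins 4 out of 5
for pred in (0.8, 1.0):
    print(pred, round(abs_error(pred, results), 2), round(sse(pred, results), 2))
# 0.8 1.6 0.8
# 1.0 1.0 1.0
```

      &lt;p&gt;
        With absolute errors the overconfident 100% prediction scores
        better (1.0 vs. 1.6), while with squared errors the honest 80%
        prediction does (0.8 vs. 1.0).
      &lt;/p&gt;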

      &lt;p&gt;
        What about a direct comparison between two rating systems?
        Could we just &lt;b&gt;count&lt;/b&gt; how many times each system was
        closer to the actual result?

      &lt;p&gt;
        &lt;b&gt;Example&lt;/b&gt;: if one system predicts 70% win rate for A over B
        while the other predicts a 90% win rate, award the first
        system a point every time A loses and the second a point for
        every win. The system with the higher score is better. This
        fails for a very similar reason as the first method, as
        accurate predictions are penalized. In this example, if A wins
        4 out of 5 matches the first system would get one point while
        the second one gets 4 points. But both predictions were
        actually equally far from the actual win rate of 80%.

      &lt;p&gt;
        How can this be fixed? Clearly the reward for the better
        prediction can&#039;t be fixed, but must somehow depend on the odds
        that the predictions are implying. That is, the predictions
        are essentially used to place &lt;b&gt;bets&lt;/b&gt;. Actually doing a
        betting system is a bit tricky though, since betting is
        fundamentally a random process, but we&#039;d like our metrics to
        be computed in a deterministic manner. After some doodling, I
        came up with the following method that&#039;s deterministic but
        still retains the core flavor of betting.

      &lt;p&gt;
        Let &lt;code&gt;$e1&lt;/code&gt; and &lt;code&gt;$e2&lt;/code&gt; be the win
        probability each system gives to player A,
        and &lt;code&gt;$res&lt;/code&gt; be the actual result (0, 0.5, 1). If the
        predictions are the same, there&#039;s no bet to be made. But
        otherwise both systems should be happy to make a bet using the
        implied odds of the midpoint of those predictions:

&lt;pre&gt;
my $em = ($e1 + $e2) / 2;
&lt;/pre&gt;

      &lt;p&gt;
        The set that gave a higher prediction will then be adjusted
        by &lt;code&gt;$res - $em&lt;/code&gt; points, while the other will be
        adjusted by &lt;code&gt;$em - $res&lt;/code&gt; points. This has the
        effect that when the outcome is the one that both systems
        expected, the winner of the bet will be awarded a small number
        of points. While if the outcome is a surprise to both systems,
        the winner of the bet will get a larger amount. It will also
        be symmetrical: the outcome will be the same if you swap the
        players around (i.e. invert the predictions and the result).

      &lt;p&gt;
        &lt;b&gt;Example&lt;/b&gt;: A and B are playing 5 matches, with A winning
        4 games. System X predicts a win rate of 0.7 for A, system Y
        predicts 0.8, with an average of 0.75. Since Y&#039;s prediction
        was higher, it will receive &lt;code&gt;1 - 0.75 =
        0.25&lt;/code&gt; points for every game that A wins, but get &lt;code&gt;
        0 - 0.75 = -0.75&lt;/code&gt; points for the ones A loses. The end
        result is that Y gains &lt;code&gt;4*0.25 - 0.75 = 0.25&lt;/code&gt;
        points from these 5 matches (and X loses the same amount, since
        this is a zero sum process). This makes sense since Y&#039;s prediction
        was indeed more accurate.
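
      &lt;p&gt;
        Here is the betting metric as a short Python sketch, reproducing the
        example above. The function name and the tuple return value are
        choices made for this illustration, not necessarily how the
        evaluation scripts are structured.
      &lt;/p&gt;

```python
def bet(e1, e2, res):
    """Return (points for system 1, points for system 2) for one matchup.

    e1 and e2 are the two predicted win probabilities for player A,
    res is the actual result for A (0, 0.5 or 1).
    """
    if e1 == e2:
        return 0.0, 0.0  # identical predictions: no bet
    em = (e1 + e2) / 2
    gain = res - em  # what the higher prediction wins (or loses)
    return (gain, -gain) if e1 > e2 else (-gain, gain)


# System X predicts 0.7, system Y predicts 0.8, A wins 4 games out of 5.
x_total = y_total = 0.0
for res in (1, 1, 1, 1, 0):
    x, y = bet(0.7, 0.8, res)
    x_total += x
    y_total += y
print(round(x_total, 2), round(y_total, 2))  # -0.25 0.25
```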

      &lt;p&gt;
        Finally, one more way to compare two rating systems is by only
        looking at matchups where the two systems produced
        &lt;b&gt;split predictions&lt;/b&gt;; that is, one system gives a win
        probability of under 0.5 while the other gives a probability
        of over 0.5. In this metric we simply count which of the two
        picked the correct winner more often.

      &lt;p&gt;
        This system doesn&#039;t suffer from any tactical misprediction
        issues. But it means throwing away most of the data. It also
        makes the value judgement that the most important (only
        important?) part of the prediction space are the matches
        between players of almost equal skill. Whether that&#039;s the
        right call or not seems to depend on the ultimate purpose
        of the rating system. The needs of a matchmaking system are
        different from the needs of a system that tries to predict
        the results of games between arbitrary players.
      &lt;/p&gt;
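
      &lt;p&gt;
        A minimal Python sketch of the split prediction count follows. Two
        details are assumptions, since the text doesn&#039;t spell them out:
        draws award no point to either system, and a prediction of exactly
        0.5 never counts as a split.
      &lt;/p&gt;

```python
def split_score(preds1, preds2, results):
    """Count correct picks on matchups where the two systems disagree."""
    score1 = score2 = 0
    for e1, e2, res in zip(preds1, preds2, results):
        if (e1 - 0.5) * (e2 - 0.5) >= 0 or res == 0.5:
            continue  # same side of 0.5 (or a draw): not counted
        # Exactly one system favoured the actual winner.
        if (e1 > 0.5) == (res == 1):
            score1 += 1
        else:
            score2 += 1
    return score1, score2


print(split_score([0.6, 0.4, 0.7], [0.45, 0.55, 0.8], [1, 1, 0]))  # (1, 1)
```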

      &lt;a id=&#039;trueskill&#039;&gt;&lt;/a&gt;
      &lt;h2&gt;TrueSkill&lt;/h2&gt;

      &lt;p&gt;
        &lt;a href=&#039;http://research.microsoft.com/en-us/projects/trueskill/&#039;&gt;TrueSkill&lt;/a&gt;
        is a rating system from Microsoft for use for Xbox Live online
        games. It&#039;s got three interesting properties. First, it deals
        natively with multiplayer matches. Second, it supports teams
        with multiple members. And third, it doesn&#039;t track skill as a
        single number but as a combination of two numbers: an estimate
        of the skill, and an uncertainty of the estimate.

      &lt;p&gt;
        The system is also very complicated compared to something like
        Elo. The best explanation of how it works
        is &lt;a href=&#039;http://www.moserware.com/2010/03/computing-your-skill.html&#039;&gt;an
        epic blog post by Jeff Moser&lt;/a&gt;, who also made the first open
        source implementation in C#. I wasn&#039;t man enough to
        re-implement the algorithm from scratch, and used
        the &lt;a href=&#039;https://github.com/sublee/trueskill&#039;&gt;Python
        TrueSkill implementation by Heungsub Lee&lt;/a&gt;.

      &lt;p&gt;
        For the purposes of this investigation, we&#039;ll represent each
        player + faction combination as a 2 player team, each game as
        a match between 3-5 such teams, and then just trust TrueSkill
        to do the right thing.

      &lt;p&gt;
        The other thing we need is deriving a win probability from two
        TrueSkill ratings (which, again, is a combination of a skill
        estimate and an uncertainty). This isn&#039;t completely trivial
        since it needs to take into account the possible skill
        distribution of all players (expressed as normal
        distributions) as well as the distribution for how different
        skill ranges affect the win probability.

      &lt;p&gt;
        Somewhat surprisingly, this doesn&#039;t appear to be addressed at
        all in the literature. I ended up writing the following based
        on a suggestion by Moser in one of the blog comments (linked
        above), but have to admit I didn&#039;t think too much about
        whether it&#039;s really correct.
&lt;pre&gt;
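import math

from trueskill import BETA          # assumed import: the library default
from trueskill.backends import cdf  # assumed import: standard normal CDF

# a and b are teams, i.e. lists of TrueSkill Rating objects.
def win_probability(a, b):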
    deltaMu = sum([x.mu for x in a]) - sum([x.mu for x in b])
    sumSigma = sum([x.sigma ** 2 for x in a]) + sum([x.sigma ** 2 for x in b])
    playerCount = len(a) + len(b)
    denominator = math.sqrt(playerCount * (BETA * BETA) + sumSigma)
    return cdf(deltaMu / denominator)
&lt;/pre&gt;
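
      &lt;p&gt;
        The same formula can also be tried out without the TrueSkill library,
        using only the Python standard library. In this sketch the ratings
        are plain (mu, sigma) tuples, the normal CDF is built from
        &lt;code&gt;math.erf&lt;/code&gt;, and &lt;code&gt;BETA&lt;/code&gt; is set to
        TrueSkill&#039;s default of 25/6. All of those are assumptions made for
        the illustration, as are the example ratings.
      &lt;/p&gt;

```python
import math

BETA = 25.0 / 6.0  # TrueSkill's default beta; an assumption about the setup


def cdf(x):
    # Cumulative distribution function of the standard normal.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def win_probability(a, b):
    # a and b are teams: lists of (mu, sigma) tuples.
    delta_mu = sum(mu for mu, _ in a) - sum(mu for mu, _ in b)
    sum_sigma = sum(sigma ** 2 for _, sigma in a + b)
    player_count = len(a) + len(b)
    denominator = math.sqrt(player_count * BETA * BETA + sum_sigma)
    return cdf(delta_mu / denominator)


strong = [(30.0, 2.0), (26.0, 1.0)]  # player + faction
weak = [(24.0, 3.0), (25.0, 1.0)]
print(round(win_probability(strong, weak), 3))
```

      &lt;p&gt;
        Two sanity checks that hold for this formula: identical teams get
        exactly 0.5, and swapping the teams gives one minus the original
        probability.
      &lt;/p&gt;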

      &lt;a id=&#039;results&#039;&gt;&lt;/a&gt;
      &lt;h2&gt;Results&lt;/h2&gt;

      &lt;p&gt;
        This section will go through individual changes to the rating
        system starting from nothing, ending up with the rating system
        in its current form. We&#039;ll then look at potential extra features,
        before finishing with the comparison to TrueSkill. If you skipped
        over the previous section on evaluating rating quality, below
        is a quick summary of the three metrics that I&#039;ll use:

      &lt;p&gt;
        &lt;ul&gt;
          &lt;li&gt;SSE: Sum of squares of errors
          &lt;li&gt;Betting: The predictions of the two systems are used to
            determine the odds for a bet, which is then settled using
            the actual result.
          &lt;li&gt;Split predictions: Look only at cases where the two
            systems disagree on who is going to win a given pairwise
            matchup. Give a point to the system that predicts the
            winner correctly.
        &lt;/ul&gt;

      &lt;p&gt;
        The process was to split the game data into two parts. A
        training data set with about 140k pairwise matchups was used
        to compute ratings for all players / factions. These ratings
        were then used to predict the results on the evaluation set of
        about 55k matchups. The split between training and evaluation
        sets was done using a cutoff date. The training set contained
        the games that finished before 2015-06-01, the evaluation set
        had the games that finished later.
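
      &lt;p&gt;
        A date-based split like this is only a couple of lines. In the
        sketch below the record layout (a finish date plus a game id) is
        invented for illustration, and putting games that finished exactly
        on the cutoff date into the evaluation set is an assumption.
      &lt;/p&gt;

```python
from datetime import date

CUTOFF = date(2015, 6, 1)


def split_games(games):
    """Split (finished_date, game_id) pairs into training and evaluation."""
    training = [g for d, g in games if CUTOFF > d]
    evaluation = [g for d, g in games if d >= CUTOFF]
    return training, evaluation


games = [
    (date(2015, 3, 14), 1),
    (date(2015, 6, 1), 2),
    (date(2015, 9, 30), 3),
]
print(split_games(games))  # ([1], [2, 3])
```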

      &lt;p&gt;
        One final point on the test setup is that all the win
        predictions were done on pairwise submatches, rather than the
        match as a whole. That applies even in cases where the system
        used for computing the ratings from the training data was
        match-based rather than pairwise submatch-based.

      &lt;h3&gt;A. The dummy rating system&lt;/h3&gt;

      &lt;p&gt;
      Let&#039;s start with the stupidest possible system, which totally
      ignores the training set, and just predicts a 50% win rate in
      every pairwise match. On the 29996 matchups in the evaluation
      set, we get a SSE of 7386.25. That should be our absolute floor
      for quality: any real system should get a lower SSE.
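
      &lt;p&gt;
        For intuition on that number: a constant 0.5 prediction costs
        exactly 0.25 for every decisive matchup and nothing for a draw. The
        Python sketch below (helper name made up) shows this, along with the
        7499 you would get if all 29996 matchups were decisive. The reported
        value being slightly lower than that is what you&#039;d expect with a
        small number of draws in the data.
      &lt;/p&gt;

```python
def dummy_sse(results):
    # Every prediction is 0.5, whatever the matchup.
    return sum((r - 0.5) ** 2 for r in results)


print(dummy_sse([1, 0, 1, 0.5]))  # 0.75
print(29996 * 0.25)               # 7499.0
```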

      &lt;h3&gt;B. Normal Elo&lt;/h3&gt;

      &lt;p&gt;
      Moving on, we&#039;ll use a normal Elo system with a K-factor of 24.
      The SSE drops to 6689, and on our &amp;quot;betting&amp;quot; metric &lt;b&gt;B&lt;/b&gt;
      beats the dummy system &lt;b&gt;A&lt;/b&gt; by 1658 to -1658. (There&#039;s no
      result on the split prediction test, since the dummy rated
      everything as 0.5. That metric needs one algorithm to give a
      result of over 0.5 and the other a result of under 0.5).

      &lt;h3&gt;C. Normal Elo, minimum of 5 games&lt;/h3&gt;

      &lt;p&gt;
        Then we introduce the restriction of users not having an
        effect on other players before they have played at least 5
        games (as described in the rating system section). Compared
        to &lt;b&gt;B&lt;/b&gt;, the results are inconsistent. There&#039;s a tiny
        improvement in SSE which drops to 6687. There&#039;s also a
        noticeable improvement in the betting metric, which &lt;b&gt;C&lt;/b&gt;
        wins over &lt;b&gt;B&lt;/b&gt; 152 to -152. We can now also get results on
        split predictions, where &lt;b&gt;B&lt;/b&gt; is better by 390 to 370.

      &lt;p&gt;
        This feature seems unlikely to be worthwhile, but it&#039;s not
        harmful either and it&#039;s what&#039;s used in the current production
        implementation. So we&#039;ll carry on using it in the following test
        cases too.

      &lt;h3&gt;D. Iterated Elo&lt;/h3&gt;

      &lt;p&gt;
        This test uses the iteration model, with the default
        parameters as explained in the algorithm description: three
        iterations with K-factors of 16/4/1.77. SSE drops to 6612.
        More significantly, &lt;b&gt;D&lt;/b&gt; wins the betting over &lt;b&gt;C&lt;/b&gt;
        by 830 to -830, and the split predictions by 1355 to 1133.

      &lt;h3&gt;E. Faction ratings&lt;/h3&gt;

      &lt;p&gt;
        The next step is to compute faction ratings in parallel to the
        player ratings, to feed the faction ratings back into the
        player rating computations, and to take into account both the
        player and the faction ratings when predicting win
        probabilities. This is also the system I&#039;m currently using, so
        it&#039;s kind of my &amp;quot;par&amp;quot; value. Any further changes would
        hopefully be improvements.

      &lt;p&gt;
        When faction ratings are mixed in, SSE drops by a lot to
        6471. &lt;b&gt;E&lt;/b&gt; wins the betting metric over &lt;b&gt;D&lt;/b&gt; by 710 to
        -710, and split predictions by 2488 to 2110.

      &lt;h3&gt;F. Per-map faction ratings&lt;/h3&gt;

      &lt;p&gt;
        The faction ratings being global, rather than computed
        separately for each map, is a common criticism of the rating
        system from the player base. In this test each faction and map
        combination is treated as a distinct entity, with completely
        separate ratings. Doing this does indeed improve the results a
        bit. SSE is reduced to 6437. &lt;b&gt;F&lt;/b&gt; also wins the betting
        over &lt;b&gt;E&lt;/b&gt; by 383 to -383, and the split predictions by
        1147 to 1033.

      &lt;p&gt;
        Making this change would however be tricky from a UI
        perspective. The rating UI wouldn&#039;t scale nicely to 60
        &amp;quot;factions&amp;quot;, but you can&#039;t really aggregate the
        results either. Some thought required on whether this change
        would be worth it.

      &lt;h3&gt;G. Ignoring dropouts&lt;/h3&gt;

      &lt;p&gt;
        The test driver skips any pairwise matchups where one or both
        players dropped out. What would happen if we did the same when
        computing ratings? That is, if either party of a pairwise
        matchup drops out, don&#039;t update the rating for either. This
        would indeed improve things a bit, which makes some sense. SSE
        drops to 6419. &lt;b&gt;G&lt;/b&gt; also wins betting over &lt;b&gt;F&lt;/b&gt; by 113
        to -113, and wins the split predictions by 840 to 814.

      &lt;p&gt;
        This isn&#039;t an acceptable change though. Introducing it would
        immediately make players start dropping out of games they
        expect to lose. But it is interesting to see what the effect
        might be in a perfect world, where the rating system would not
        affect player behavior. &lt;a id=&#039;fnref2&#039;&gt;[&lt;a href=&#039;#fn2&#039;&gt;2&lt;/a&gt;]

      &lt;h3&gt;H. Different faction weights&lt;/h3&gt;

      &lt;p&gt;
        Using lower faction weights than 1 produces almost
        imperceptibly better results, and not even across the board
        (it generally makes two categories better, one category
        worse).  This change also seems hard to justify from first
        principles, so it feels too much like curve fitting. So this
        is another idea not worth implementing.

      &lt;h3&gt;I. Batched rating updates&lt;/h3&gt;

      &lt;p&gt;
        One thing that was kind of dodgy in my original implementation
        is that the rating changes from a single multiplayer game were
        not applied in a single atomic unit. Instead we&#039;d first
        compute the rating changes from a single pairwise submatch,
        apply those changes, then compute a rating change for the next
        pairwise submatch &lt;i&gt;based on the already updated ratings&lt;/i&gt;.
        What if the algorithm instead first computed all the changes
        from that one match, and then applied all of them in one go?
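
      &lt;p&gt;
        The difference between the two update orders can be sketched in
        Python. The Elo-style &lt;code&gt;expected&lt;/code&gt; curve with a scale of
        400, the pot of 16, and the function names are all assumptions made
        for this illustration; it only shows the mechanics, not the
        production code.
      &lt;/p&gt;

```python
from itertools import combinations


def expected(diff, scale=400.0):
    # Standard Elo curve; the scale is an assumption for this sketch.
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))


def rate_game(ratings, ranking, pot=16.0, batched=False):
    """Update ratings for one game; ranking lists players, winner first."""
    ratings = dict(ratings)
    pending = []
    for winner, loser in combinations(ranking, 2):
        delta = pot * (1.0 - expected(ratings[winner] - ratings[loser]))
        if batched:
            pending.append((winner, loser, delta))
        else:
            # Sequential: later submatches see the updated ratings.
            ratings[winner] += delta
            ratings[loser] -= delta
    for winner, loser, delta in pending:
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


start = {"a": 1000.0, "b": 1000.0, "c": 1000.0}
print(rate_game(start, ["a", "b", "c"]))
print(rate_game(start, ["a", "b", "c"], batched=True))
```

      &lt;p&gt;
        In the batched run every submatch is judged against the starting
        ratings, so the winner of a three player game between equals gains
        the full 8 points twice. In the sequential run the second submatch
        already sees the winner&#039;s raised rating, so the second gain is a
        bit smaller.
      &lt;/p&gt;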

      &lt;p&gt;
        This should make sense, but also not have all that big an
        impact. Neither of those expectations is true though. The SSE
        is significantly higher at 6523 (vs. 6437
        for &lt;b&gt;F&lt;/b&gt;), &lt;b&gt;I&lt;/b&gt; loses the betting by -934 to 934, and
        the split predictions by 660 to 783.

      &lt;p&gt;
        I don&#039;t know why this change makes the system perform
        worse. The most likely reason is that batching the updates
        makes single events have larger effects on the ratings.
        That&#039;s not entirely satisfactory, since if that were the case
        just a slightly smaller K-factor should have a similar
        effect. But that&#039;s not what happens in practice.

      &lt;h3&gt;J. TrueSkill without factions&lt;/h3&gt;

      &lt;p&gt;
        I&#039;ll do all the TrueSkill comparisons to the &amp;quot;Per-map
        faction ratings&amp;quot; version of the system (&lt;b&gt;F&lt;/b&gt;). The
        first variant to test is TrueSkill that ignores factions
        completely, both when computing the ratings and when making
        predictions on win probabilities. This (heavily crippled!)
        version of TrueSkill has a SSE of 6658, which is somewhere
        between &lt;b&gt;B&lt;/b&gt; and &lt;b&gt;D&lt;/b&gt;. &lt;b&gt;J&lt;/b&gt; loses the betting
        against &lt;b&gt;F&lt;/b&gt; by -732 to 732, and the split predictions by
        2981 to 3500.

      &lt;h3&gt;K. TrueSkill with factions&lt;/h3&gt;

      &lt;p&gt;
        Next up we use TrueSkill as intended. Each faction is now
        treated as a TrueSkill player, and each player / faction
        combination is considered to be a two player TrueSkill
        team. This gives a much more respectable SSE of 6474 (still a
        bit off from the 6437 of &lt;b&gt;F&lt;/b&gt;), loses betting by only -181
        to 181, and loses the split predictions by 2128 to 2203.

      &lt;h3&gt;L. TrueSkill with per-map factions&lt;/h3&gt;

      &lt;p&gt;
        Treating each faction and map combination separately
        (i.e. changing from &lt;b&gt;E&lt;/b&gt; to &lt;b&gt;F&lt;/b&gt;) gave a good boost to
        the accuracy of the iterative algorithm. Would the same
        approach work for TrueSkill? The SSE does indeed drop to 6441.
        &lt;b&gt;L&lt;/b&gt; still loses to &lt;b&gt;F&lt;/b&gt; in the other two metrics:
        betting by -148 to 148, and the split predictions by a very
        tight margin of 1921 to 1935.

      &lt;p&gt;
        Again, just as in &lt;b&gt;H&lt;/b&gt;, tiny and somewhat inconsistent
        improvements could be achieved by tweaking the faction weights
        a bit. Doing so would not change the big picture, and the same
        arguments against doing so are still valid.

      &lt;a id=&#039;conclusions&#039;&gt;&lt;/a&gt;
      &lt;h2&gt;Conclusions&lt;/h2&gt;

      &lt;p&gt;
        When I did the experiments, I was kind of expecting the
        conclusion to be a classic tradeoff between complexity and
        great results vs. simple and &amp;quot;good enough&amp;quot;. That
        turns out to not be the case; the two rating systems were very
        closely matched, and in fact the quick hack that&#039;s currently
        used seemed to measure marginally better than the matching
        TrueSkill version.

      &lt;p&gt;
        The current production version isn&#039;t quite at a local optimum
        though; there&#039;s one improvement that would be worth adding to
        it (separate ratings for each map + faction combination). But
        there doesn&#039;t seem to be any great urge to switch to a
        completely different system. That&#039;s a happy ending, since I&#039;d
        much rather have 150 lines of my own code than 1500 lines
        written by someone else and that I don&#039;t entirely understand.

      &lt;p&gt;
        There are a couple of caveats. It&#039;s possible that the formula
        for going from sets of TrueSkill ratings to win probabilities
        is suboptimal, and skewing the results. The data set isn&#039;t
        huge and I can&#039;t compute a confidence interval of any kind, so
        it&#039;s possible that the results aren&#039;t actually significant
        (but on the other hand I&#039;m happy even with a statistical tie).

      &lt;p&gt;
        Also, while here we just looked at how effective each rating
        system was at predicting outcomes, there are other criteria
        that matter in practice. For example, one thing my users are
        pretty vocally unhappy about is that they perceive the faction
        ratings to be too unstable. It seems likely that TrueSkill
        would do a better job there, since the faction ratings it
        produces have a low uncertainty (the factions have thousands
        or tens of thousands of games played), so new results would
        only cause tiny changes to them.

      &lt;p&gt;
        If anyone would like to test how their pet rating system works
        for this use case, the test data and the evaluation scripts are
        &lt;a href=&#039;https://github.com/jsnell/rating-eval&#039;&gt;available&lt;/a&gt;.

      &lt;a id=&#039;footnotes&#039;&gt;&lt;/a&gt;
      &lt;h2&gt;Footnotes&lt;/h2&gt;
      &lt;div class=&#039;footnotes&#039;&gt;
      &lt;p&gt;
        &lt;a id=&#039;fn1&#039;&gt;[&lt;a href=&#039;#fnref1&#039;&gt;1&lt;/a&gt;] In the first draft, I wrote that this was &amp;quot;obviously&amp;quot; the
        goal.  But on reflection, it&#039;s probably not obvious at
        all. After all, there are so many rating systems around that
        seem to have about zero predictive value. Maybe there&#039;s a bias
        towards people who play a lot, or there&#039;s no concept of
        opponent skill. Clearly these systems were designed with some
        goal in mind, but predictive power certainly wasn&#039;t it.
        Sirlin&#039;s delightfully cynical
        &lt;a href=&#039;http://sirlingames.squarespace.com/blog/2010/7/24/analyzing-starcraft-2s-ranking-system.html&#039;&gt;story
          on Starcraft 2&#039;s rating system&lt;/a&gt; might be relevant here.
      &lt;p&gt;
        &lt;a id=&#039;fn2&#039;&gt;[&lt;a href=&#039;#fnref2&#039;&gt;2&lt;/a&gt;] Now that I&#039;m writing this article, it occurs to me that
        there might be a decent compromise solution: apply a rating
        penalty to the player who drops out, but don&#039;t give any reward
        to the winner. You&#039;d lose the zero sum property, but that
        doesn&#039;t seem critical. The same is true of Elo variants that
        use variable K factors depending on number of games played, and
        they seem to work just fine.
    &lt;/div&gt;
</description><author>jsnell@iki.fi</author><category>GAMES</category><pubDate>Wed, 18 Nov 2015 16:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-11-18-rating-system-for-asymmetric-multiplayer-games/</guid></item><item><title>Detecting cheaters in an asynchronous online game</title><link>https://www.snellman.net/blog/archive/2015-07-22-cheater-detection-in-async-online-game/</link><description>
&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;
This post is a description of some tools and data analysis I did for
detecting players using multiple user accounts in an asynchronous
online
game. The &lt;a href=&#039;https://github.com/jsnell/cheat-detector&#039;&gt;code&lt;/a&gt;
is available on GitHub.

&lt;p&gt;
A couple of months ago one of the players on
my &lt;a href=&#039;http://terra.snellman.net/&#039;&gt;Online Terra Mystica&lt;/a&gt; site
had some concerns that some of the
players in the
&lt;a href=&#039;http://tmtour.org/&#039;&gt;tournament&lt;/a&gt; were playing with multiple
accounts. So I decided to do a bit of digging into the logs to see
whether it was really happening or just paranoia.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h2&gt;Gathering and preprocessing the data&lt;/h2&gt;

&lt;p&gt;
Before doing anything else, we need some data to work with. It happens
that I store a log record for every move done in any game, mostly for
debug purposes. That record contains a bunch of information, but the
ones we&#039;re going to work with in this analysis are the following:

&lt;ul&gt;
  &lt;li&gt; username
  &lt;li&gt; game
  &lt;li&gt; timestamp
  &lt;li&gt; IP address
&lt;/ul&gt;

&lt;p&gt;
I used the records for all games played in the period the tournament ran
for (two months starting May 1st), but only looked at players who had
entered the tournament (about 750).

&lt;p&gt;
The first order of business was to find some suspicious users. I
defined this as two users making at least one move from the same IP
address on the same day. There were a couple of surprising
things. First, 230 players were &amp;quot;suspicious&amp;quot;, which was a
lot higher than I would have expected. Second, it wasn&#039;t just that a
user happened to arrive from the same IP as a single other user. It
could be as high as 10 other users. There must be a lot more IP
address reuse and sharing going on than I was assuming. So this
really won&#039;t work as anything except a first filter to reduce the
search space a bit.
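
&lt;p&gt;
As a rough illustration of that first filter, here&#039;s a minimal Python
sketch (not the actual tool, which lives in the repository linked above).
The record layout of (username, game, Unix timestamp, IP address) tuples
and the name &lt;code&gt;suspicious_pairs&lt;/code&gt; are assumptions made for
the example, as is treating &amp;quot;same day&amp;quot; as the same UTC calendar day.

&lt;pre&gt;
```python
from collections import defaultdict
from datetime import datetime, timezone
from itertools import combinations


def suspicious_pairs(records):
    """Return pairs of users who moved from the same IP on the same day.

    records is an iterable of (username, game, unix_timestamp, ip) tuples;
    this layout is an assumption for the sketch.
    """
    users_by_ip_and_day = defaultdict(set)
    for username, _game, timestamp, ip in records:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        users_by_ip_and_day[(ip, day)].add(username)

    pairs = set()
    for users in users_by_ip_and_day.values():
        # Every unordered pair of distinct users sharing an (IP, day) key.
        pairs.update(combinations(sorted(users), 2))
    return pairs
```
&lt;/pre&gt;

&lt;p&gt;
A filter like this is cheap but, as noted above, far too noisy to act on
by itself.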

&lt;p&gt;
The next step is to process this data into something that can be used
for automatically assessing the similarity of the access patterns of
two users. The data in the current form makes it easy to determine
whether two players did a move at roughly the same time. But there&#039;s
also information in the times where no moves were made, as long as it
was the turn of one or both of the players to move.

&lt;p&gt;
The transformation to get that data is simple: during processing of
the input data we keep track of the timestamp T of the last move done
in each game. When we see a new record for a game, we mark it as having
been that user&#039;s turn from T to the timestamp of the new record. This
is an approximation, since a game can be waiting for input from
multiple players at the same time. But it should be good enough, since
the only effect of getting this wrong is moving some samples from weak
dissimilarity to no signal.

&lt;p&gt;
Finally, the analysis I had in mind wasn&#039;t going to work on a
continuous scale, so the data is bucketed into 30 minute intervals
&lt;a href=&#039;#fn1&#039; id=&#039;fnref1&#039;&gt;[1]&lt;/a&gt;.
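
&lt;p&gt;
The transformation and the bucketing could be sketched roughly like this
in Python. The tuple layout, the integer state codes and the name
&lt;code&gt;bucket_states&lt;/code&gt; are all assumptions for the example; only
the 30 minute bucket size and the &amp;quot;last move in this game&amp;quot;
approximation come from the description above.

&lt;pre&gt;
```python
from collections import defaultdict

BUCKET_SECONDS = 30 * 60

# Hypothetical state codes, ordered so that max() prefers the stronger one.
IDLE = 0
STALLED = 1
MOVED = 2


def bucket_states(records):
    """Return {username: {bucket_index: state}}; missing buckets are idle.

    records is an iterable of (username, game, unix_timestamp, ip) tuples.
    """
    states = defaultdict(dict)
    last_move_in_game = {}
    for username, game, timestamp, _ip in sorted(records, key=lambda r: r[2]):
        bucket = timestamp // BUCKET_SECONDS
        previous = last_move_in_game.get(game)
        if previous is not None:
            # Approximation: the whole wait since the last move in this
            # game is attributed to the user who eventually moved.
            for waiting in range(previous // BUCKET_SECONDS, bucket):
                current = states[username].get(waiting, IDLE)
                states[username][waiting] = max(current, STALLED)
        states[username][bucket] = MOVED
        last_move_in_game[game] = timestamp
    return states
```
&lt;/pre&gt;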

&lt;h2&gt;Quantifying similarity&lt;/h2&gt;

&lt;p&gt;
The next step is to compute a similarity score for the access
patterns of two accounts.

&lt;p&gt;
There&#039;s three states a player could be in for a given time segment,
which we&#039;ll label &amp;quot;moved&amp;quot; (player moved at least once),
&amp;quot;stalled&amp;quot; (player&#039;s turn in at least one game, but did not
move) and &amp;quot;idle&amp;quot; (otherwise). There&#039;s 9 combinations of
these states for two players, and depending on whether the
combination supports or refutes similarity we assign a positive or
negative score based on the following table:

&lt;p&gt;
&lt;table style=&#039;padding-left: 5ex; text-align: right;&#039;&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;td&gt;B moved&lt;td&gt;B stalled&lt;td&gt;B was idle
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;A moved &lt;td&gt; -10/10 &lt;td&gt; -5 &lt;td&gt; 0
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;A stalled &lt;td&gt; -1 &lt;td&gt; 1 &lt;td&gt; 0
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;A was idle &lt;td&gt; 0 &lt;td&gt; 0 &lt;td&gt; 0
  &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;
Moves being done at the same time is a very strong signal, either
positive or negative depending on whether the moves were done from the
same location or not. Both players being stalled at the same time is a
very weak signal (though if the players are stalled for a long time,
it can add up since this score is computed once per time segment).

&lt;p&gt;
You might have noticed that the matrix is not symmetric. We treat A
moving while B stalls differently from B moving while A stalls. This is
done to deal with a common pattern where the main account makes moves
from both a main computer and a mobile phone, while the sockpuppet
makes moves only from the computer.  These accounts will appear very
similar on visual inspection but just mildly similar if both cases
were treated as a strong negative signal. With an asymmetric scoring
matrix, all the strong negative signals will accumulate for one user,
leaving the other account with a high similarity score. &lt;a href=&#039;#fn2&#039;
id=&#039;fnref2&#039;&gt;[2]&lt;/a&gt;
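
&lt;p&gt;
Written out as code, the scoring table might look something like the
following Python sketch. The state names and the
&lt;code&gt;segment_score&lt;/code&gt; function are made up for the example; the
numbers are the ones from the table above.

&lt;pre&gt;
```python
MOVED, STALLED, IDLE = "moved", "stalled", "idle"

# (state of A, state of B) -> score. Anything not listed scores 0.
# Note the asymmetry between (MOVED, STALLED) and (STALLED, MOVED).
SCORES = {
    (MOVED, STALLED): -5,
    (STALLED, MOVED): -1,
    (STALLED, STALLED): 1,
}


def segment_score(state_a, state_b, same_ip=False):
    """Score one time segment for the ordered pair of accounts (A, B)."""
    if state_a == MOVED and state_b == MOVED:
        # Simultaneous moves are the strong signal, in either direction.
        return 10 if same_ip else -10
    return SCORES.get((state_a, state_b), 0)
```
&lt;/pre&gt;

&lt;p&gt;
Because the table is asymmetric, the score for the ordered pair (A, B) is
in general not the same as the score for (B, A).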

&lt;p&gt;
There&#039;s also a non-local signal that I&#039;m using. In the case where both
accounts moved from the same IP, the score is adjusted higher if both
accounts had been stalled for a long time just before making those
moves. The theory here is that it&#039;s a much stronger signal for two
players to make a near-simultaneous move after say two days of
unforced inactivity than after half an hour of inactivity, since long
periods of inactivity tend to be rare, and it should be rarer yet for
that period to be finished at exactly the same time for persons that
are truly distinct.

&lt;p&gt;
These scores are then summed up and divided by a weight (essentially the
sum of the absolute values of the scores, plus a small initial
weight so that account pairs require a decent number of samples before
showing alarming levels of similarity). The resulting -1 to 1 range is then
further normalized to a 0 to 1 range, which I find more
convenient to work with.
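
&lt;p&gt;
A minimal Python sketch of that normalization, under the assumption that
the per-segment scores have already been collected into a list. The
function name and the default initial weight of 20 are invented for the
example; the real value isn&#039;t given here.

&lt;pre&gt;
```python
def similarity(scores, initial_weight=20):
    """Map a list of per-segment scores to a similarity between 0 and 1."""
    total = sum(scores)
    # The initial weight damps the result when there are only a few samples.
    weight = initial_weight + sum(abs(score) for score in scores)
    # total / weight lies between -1 and 1; shift and scale it to 0..1.
    return (total / weight + 1) / 2
```
&lt;/pre&gt;

&lt;p&gt;
With no samples at all this gives a neutral 0.5, and a handful of strong
positive samples can&#039;t push the result all the way to 1.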

&lt;p&gt;
A tiny bit of fudging in the above description is that to deal with
quantization discontinuities (A makes a move at 16:29, B makes a move
at 16:31), we actually look at a sliding window of three time segments
rather than process them completely individually. Since time segments
with moves are sparser than non-moves, this will further increase the
effective weight of the top-left 10/-10 cell in the matrix. When picking
the single cell in the table that gets scored, moved &amp;gt; stalled &amp;gt;
idle, so a single matching (or mismatching) pair of moves might get
scored up to three times.
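
&lt;p&gt;
The window logic could be sketched as follows in Python. The dictionary
representation (bucket index to state name, with missing buckets meaning
idle) and the name &lt;code&gt;window_state&lt;/code&gt; are assumptions for the
example.

&lt;pre&gt;
```python
# Precedence when collapsing a window into one state: moved, stalled, idle.
RANK = {"idle": 0, "stalled": 1, "moved": 2}


def window_state(states, bucket):
    """Strongest state of one account in the three segments around bucket."""
    window = [states.get(b, "idle") for b in (bucket - 1, bucket, bucket + 1)]
    return max(window, key=RANK.get)
```
&lt;/pre&gt;

&lt;p&gt;
With 30 minute buckets, a move at 16:29 lands in bucket 32 and a move at
16:31 in bucket 33, but the window around either bucket sees both as
&amp;quot;moved&amp;quot;.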

&lt;h2&gt;Visualization&lt;/h2&gt;

&lt;p&gt;
In order to get some idea of whether the similarity measure is
reasonable or not, we also need some way of visualizing the data of
some subset of users. After an hour of trying to get something even
remotely readable from R (which usually is pretty decent at
visualizing data), I gave up in disgust and wrote a tiny Perl
script to generate an SVG file directly.

&lt;p&gt;
It simply has one wide column for each IP address running
horizontally, and time running vertically. Each user has a different
color for drawing their interactions with the site, as well as their
own smaller strips of the full per-IP columns. A time segment where
the user did a move is drawn as a square, a time segment where they
stalled is drawn as a thin rectangle / line.

&lt;p&gt;
Here&#039;s an example from a hilarious clique of 8 separate accounts,
mostly playing in near-perfect lockstep. How many distinct persons are
involved? Click on the thumbnail to open the full image:

&lt;p&gt;
&lt;a href=&#039;/blog/stc/images/cheat-detector/bigclique.svg&#039; target=&#039;blank&#039;&gt;
&lt;img src=&#039;/blog/stc/images/cheat-detector/bigclique-thumb.png&#039;&gt;
&lt;/a&gt;

&lt;p&gt;
This visualization is not perfect. The biggest problem is that with
more than about 20 addresses involved there&#039;s just too much horizontal
scrolling involved on comfortable zoom levels. This would not be
impossible to solve. Generally if there are large amounts of IPs
involved most of them are completely ephemeral, and could somehow be
folded together.

&lt;h2&gt;Evaluation&lt;/h2&gt;

&lt;p&gt;
I looked at the similarity scores of some people I know in real life
and suspected might be false positives (e.g. working in the same
place, or living in the same place). These tended to be in the 0.4-0.6
range. I also looked at some accounts that I was essentially certain
were being played by the same person based on other metadata. The
scores for those were over 0.9. The latter seems like a reasonable
threshold value for flagging particular pairs of accounts for more
scrutiny.

&lt;p&gt;
When the 7th season of the tournament started, there was a rules
change put in place explicitly to forbid playing with multiple
accounts. There were no retroactive penalties of course, but looking
at how people&#039;s behavior changed might provide us with some hints
about whether this algorithm was at all on the right track.

&lt;p&gt;
I got emails from people who were horrified that they might be banned
because both they and their SO played the game. All of these cases
were well below the threshold. There were also people apologizing for
playing with multiple accounts, saying they hadn&#039;t thought it was
wrong. These cases tended to be at or above the threshold.

&lt;p&gt;
A decent way of visualizing the effect of the rule change is looking
at the distribution of the similarity scores for season 6 vs season
7. The way to read these graphs is that the most similar pair of users
in season 6 had a similarity of 0.99, the 80th most similar pair a
score of 0.9. (Also remember that the scores are not symmetric, so
every pair of users with any similarity shows up twice in this graph,
once per direction). You can see the proportion of very suspicious
accounts being cut in half from one season to the next while the rest
of the graph has roughly the same shape:

&lt;p&gt;
&lt;img src=&#039;/blog/stc/images/cheat-detector/scores-season6.png&#039;&gt;
&lt;img src=&#039;/blog/stc/images/cheat-detector/scores-season7.png&#039;&gt;

&lt;p&gt;
And as some final behavioral metrics, there were 13 suspicious cliques
of users in season 6. At least one player continued on to season 7
from every clique, but in 8 out of 13 cases the clique shrank down to
just one player. After the rules change, players above the flagging
threshold were about 3 times as likely to not join season 7 as the
average player.
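
&lt;p&gt;
How exactly accounts get grouped into cliques isn&#039;t spelled out above.
One plausible way, shown here purely as an assumption, is to treat every
pair at or above the flagging threshold as an edge and take the connected
components. The function name and the input format (a dictionary from
ordered account pairs to similarity scores) are hypothetical.

&lt;pre&gt;
```python
from collections import defaultdict


def cliques(pair_scores, threshold=0.9):
    """Group accounts linked by any pair scoring at or above threshold."""
    neighbours = defaultdict(set)
    for (a, b), score in pair_scores.items():
        if score >= threshold:
            # Scores are directional; one flagged direction links both.
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen = set()
    groups = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this account.
        group = set()
        stack = [start]
        while stack:
            user = stack.pop()
            if user in group:
                continue
            group.add(user)
            stack.extend(neighbours[user] - group)
        seen |= group
        groups.append(group)
    return groups
```
&lt;/pre&gt;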

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;
A comment I got when I gave a draft of this article to some friends
was that it&#039;s stupid to write an article about exactly how any
kind of anti-cheating setup works. All it does is allow cheaters to
figure out exactly what they need to do to game the system. This is
definitely true in the general case.

&lt;p&gt;
I&#039;d like to hope that this isn&#039;t an issue in this case. It&#039;s a
relatively small community, the stakes are low, getting any advantage
from cheating would be hard work since you&#039;d need about a year of
playing to work a sockpuppet player into the top divisions. And the
workarounds you&#039;d need are pretty obvious, not anything clever. My
impression is that everyone who had been doing this was simply wanting
to play in more games, and thought that it wasn&#039;t really
forbidden. Trying to work around this kind of a detection system by
connecting different sockpuppets through different proxies should
produce some fairly heavy cognitive dissonance with the idea that this
is an allowed practice :-)

&lt;p&gt;
Oh, and what about the user whose complaint triggered this little
investigation? It turns out that everyone else playing in his league
was squeaky clean. Unfortunately the setup of the tournament (with
each division having twice as many leagues as the previous one) means
that you&#039;ll always have pretty high skill differences in the lowest
division. The lowest division will after all on average contain a
third of the entire tournament&#039;s players, and might contain as many as
half.

&lt;h2&gt;Footnotes&lt;/h2&gt;

&lt;p&gt;
&lt;a href=&#039;#fnref1&#039; id=&#039;fn1&#039;&gt;[1]&lt;/a&gt; Half an hour is totally
arbitrary. I wonder if there&#039;s some good way of determining an
appropriate time bucket size.

&lt;p&gt;
&lt;a href=&#039;#fnref2&#039; id=&#039;fn2&#039;&gt;[2]&lt;/a&gt; Why not make it a weak symmetric
signal then? Because the strong signal is great for genuinely distinct
accounts. The penalties will tend to be distributed fairly equally
between the two, correctly depressing the scores of both accounts.

</description><author>jsnell@iki.fi</author><category>GAMES</category><pubDate>Wed, 22 Jul 2015 16:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-07-22-cheater-detection-in-async-online-game/</guid></item><item><title>A Monte Carlo simulation of Red7</title><link>https://www.snellman.net/blog/archive/2015-03-30-monte-carlo-red7/</link><description>
&lt;p&gt;
&lt;a href=&#039;https://boardgamegeek.com/boardgame/161417/red7&#039;&gt;Red7&lt;/a&gt; is
a very clever little card game, and one of my favorite 2014 releases.
But I have wondered
about the density of meaningful decisions in the game. Sometimes it
doesn&#039;t feel like you have all that much agency, and are just hanging
on in the game with a single valid move every time it&#039;s your turn.

&lt;p&gt;
So here&#039;s some automated exploration of what a game of Red7 actually
looks like from a statistical point of view. The method used here is a
pure Monte Carlo simulation, with the players choosing randomly from
the set of their valid moves.

&lt;p&gt;
Why a Monte Carlo simulation? I started out trying to search the full game tree for
a given starting setup, but to my surprise the game tree is actually too
large for that to be feasible: 2 weeks of computation for a
single two-player game, even after a lot of optimization. The branching factor
is just much bigger than it feels like when playing the game.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;h2&gt;The rules&lt;/h2&gt;

&lt;p&gt;
(Skip this section if you&#039;re already familiar with the game. All you need
to know is that we&#039;re using the advanced version of the game but without
the optional special action rules.)

&lt;p&gt;
The rules of the game are very simple. There&#039;s a deck of 49 cards
(7 colors, numbers 1-7 in each color). In the middle is a discard
pile (&amp;quot;canvas&amp;quot;). The color of the topmost card of the discard pile determines
the victory condition. You must be &amp;quot;winning&amp;quot; at the end of each turn
you take, or you&#039;re out of the game.

&lt;p&gt;
There are three options to choose from on your turn. Play a card from your
hand to the table in front of you (your &amp;quot;palette&amp;quot;), discard a card
from your hand to the canvas, or first play a card and then discard a
card. If you discard a card with a number higher than the number of
cards in your palette, you get to draw a card.

&lt;p&gt;
The winning condition is determined based on the color of the canvas
(i.e. top card in discard pile):

&lt;p&gt;
&lt;table&gt;
&lt;tr&gt;&lt;td style=&#039;background-color: red&#039;&gt;Red &lt;td&gt;Highest card
&lt;tr&gt;&lt;td style=&#039;background-color: orange&#039;&gt;Orange &lt;td&gt;Most cards of the same number
&lt;tr&gt;&lt;td style=&#039;background-color: yellow&#039; &gt;Yellow &lt;td&gt;Most cards of the same color
&lt;tr&gt;&lt;td style=&#039;background-color: green&#039; &gt;Green &lt;td&gt;Most even cards
&lt;tr&gt;&lt;td style=&#039;background-color: blue; color: white&#039; &gt;Blue &lt;td&gt;Most different colors
&lt;tr&gt;&lt;td style=&#039;background-color: indigo; color: white&#039;&gt;Indigo &lt;td&gt;Longest run of sequential numbers (e.g. 4/5/6)
&lt;tr&gt;&lt;td style=&#039;background-color: violet&#039;&gt;Violet &lt;td&gt;Most cards with a number lower than 4
&lt;/table&gt;

&lt;p&gt;
If two players are tied for the winning condition (e.g. the rule is
green and both of them have three even cards in their palette), the
winner is the player who had a higher card included in their
card combination (cards that didn&#039;t contribute to the winning
condition are ignored for the tie breaker). This is primarily based
on the numeric value of the card. But if two cards have the same
value, the one closer to red in the spectrum wins the tie (e.g. green
5 &amp;gt; indigo 5 &amp;gt; green 4).

&lt;h2&gt;The implementation&lt;/h2&gt;

&lt;p&gt;
(Ignore this section if you&#039;re not interested in the programming, and
skip straight on to the results).

&lt;p&gt;
I suspect that every Common Lisp program will eventually evolve to
using a clever bit-packing of fixnums as its primary data structure.
That&#039;s the case here as well.

&lt;h3&gt;Cards&lt;/h3&gt;

&lt;p&gt;
A card is an integer between 0 and 55 (inclusive). The low 3 bits are
the color, with a 0 being a dummy color that&#039;s not used for anything, 1
for violet going all the way to 7 for red. The next 3 bits are the
card&#039;s numeric value minus one (0-6). Note that with this representation
determining the higher of two cards is simply a matter of
making an integer comparison.

&lt;pre&gt;
(deftype card () &#039;(mod 56))

(defun card-color (card)
  (ldb (byte 3 0) card))

(defun card-value (card)
  (1+ (ash card -3)))
&lt;/pre&gt;
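
&lt;p&gt;
As a quick cross-check of the encoding, here&#039;s the same layout sketched
in Python rather than Lisp. The helper names and the color name list are
made up for the example; the bit layout is the one described above.

&lt;pre&gt;
```python
# Index in this list is the color code stored in the low 3 bits.
COLORS = ["(unused)", "violet", "indigo", "blue",
          "green", "yellow", "orange", "red"]


def make_card(value, color):
    """Encode a card: value minus one in the high bits, color in the low 3."""
    return (value - 1) * 8 + COLORS.index(color)


def card_color(card):
    return COLORS[card % 8]


def card_value(card):
    return card // 8 + 1
```
&lt;/pre&gt;

&lt;p&gt;
With this layout the tie breaker example from the rules (green 5 beats
indigo 5, which beats green 4) falls out of a plain integer comparison of
the encoded cards.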

&lt;p&gt;
We&#039;ll also need a way to represent a set of cards, for a player&#039;s
hand or palette. We&#039;re going to use a 56-bit integer for that, with
bit X being 1 if the set contains card X.

&lt;pre&gt;
(deftype card-set () &#039;(unsigned-byte 56))
&lt;/pre&gt;

&lt;p&gt;
Adding and removing cards is simple. (Except how annoying is it that
SETF LOGBITP is not specified in the standard?).

&lt;pre&gt;
(defun remove-card (card card-set)
  (logandc2 card-set (ash 1 card)))

(defun add-card (card card-set)
  (logior card-set (ash 1 card)))

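;; Create a new set from a list of cards. REDUCE needs an initial empty
;; set, and :FROM-END T makes it call (add-card card set) in that order.
(defun make-card-set (cards)
  (reduce #&#039;add-card cards :initial-value 0 :from-end t))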
&lt;/pre&gt;

&lt;p&gt;
We&#039;ll also need to be able to iterate through all the cards in a set.
This is most easily achieved by using INTEGER-LENGTH to find the highest
bit currently set, executing the loop body, clearing out the highest bit,
and carrying on.

&lt;pre&gt;
(defmacro do-cards ((card card-set) &amp;body body)
  (let ((modified-set (gensym)))
    `(loop with ,modified-set of-type card-set = ,card-set
           until (zerop ,modified-set)
           for ,card = (1- (integer-length ,modified-set))
           do (setf ,modified-set (remove-card ,card ,modified-set))
           do ,@body)))
&lt;/pre&gt;

&lt;h3&gt;Scoring&lt;/h3&gt;

&lt;p&gt;
With these primitives we can then write a very fast function to
determine who is currently winning the game. We&#039;ll base this
evaluation function on scoring a combination of a palette + rule, and
comparing the score that each player gets with the current rule. This
is a much better way than trying to directly compare the palettes.
If you&#039;re caching this evaluation function, you get a much higher
cache hit rate when the cache key depends only on the state of one
player rather than a combined state of two players.
(I&#039;m also pretty sure that given this data layout, computing a score
will be faster than any kind of direct comparison).

&lt;p&gt;
Let&#039;s start off with the general structure, and fill in the details as
functions under LABELS afterwards. So given a card-set and a color,
we&#039;ll return a score for that set:

&lt;pre&gt;
(defun card-set-score (card-set type)
  (labels (...)
    (ecase type
      (7 (red))
      (6 (orange))
      (5 (yellow))
      (4 (green))
      (3 (blue))
      (2 (indigo))
      (1 (violet)))))
&lt;/pre&gt;

&lt;p&gt;
Red (highest card) is trivial. We just find the highest card in the set
with a call to INTEGER-LENGTH.

&lt;pre&gt;
           (red ()
             (integer-length card-set))
&lt;/pre&gt;

&lt;p&gt;
For other rules we can make good use of the following helper function. It
matches the set against a bitmask, and returns a score based on the
number of bits that are set both in the set and the mask (main part of
score) which we get with LOGCOUNT, as well as the highest bit set in
both (the tiebreaker). Given this definition, most of the scoring
types can be written in a very concise manner:

&lt;pre&gt;
           (score-for-mask (mask)
             (let* ((matching-cards (logand card-set mask))
                    (match-count (logcount matching-cards))
                    (best-matching-card (integer-length matching-cards)))
               (+ best-matching-card (* 64 match-count))))
&lt;/pre&gt;

&lt;p&gt;
For orange (cards of one number) we start with a bitmask that matches
all bits corresponding to a card with the value 7. We compute the score
for that mask, then shift the mask right by 8 bits such that it covers
the cards with the value 6. Repeat 7 times, and find the maximum score.
(We don&#039;t need to know which iteration produced the highest score, only
what the score was).

&lt;pre&gt;
           (orange ()
             (loop for mask = #xff000000000000 then (ash mask -8)
                   repeat 7
                   maximize (score-for-mask mask)))
&lt;/pre&gt;

&lt;p&gt;
Yellow (most cards with the same color) is very similar. We start off
with a bitmask that matches all the red cards (so bit 55, 47, 39, etc)
and compute the score. Then shift it right by one, such that the mask
matches all orange cards instead. Again repeat 7 times and maximize.

&lt;pre&gt;
           (yellow ()
             (loop for mask = #x80808080808080 then (ash mask -1)
                   repeat 7
                   maximize (score-for-mask mask)))
&lt;/pre&gt;

&lt;p&gt;
Green (most even cards) and violet (most cards under 4) are trivial;
we can just score a single mask matching the even cards for green,
all cards of value 1, 2 or 3 for violet.

&lt;pre&gt;
           (green ()
             (score-for-mask #x00ff00ff00ff00))
           (violet ()
             (score-for-mask #x00000000ffffff))
&lt;/pre&gt;

&lt;p&gt;
Blue (most cards of different colors) is where we get into unintuitive
territory. Let&#039;s start with the tiebreaker; it&#039;s obviously guaranteed
that the highest card in the palette as a whole can be included in this
winning set, so we can just use INTEGER-LENGTH on the whole set the
same way we did for the red scoring rule.

&lt;p&gt;
To get the number of different colors, we will fold the cardset multiple
times. First we&#039;ll do a bitwise OR of the high 32 bits and the low 32
bits. Then we&#039;ll OR bits 0-15 of that result with bits 16-31. And
finally one more OR of bits 0-7 with 8-15. The low 8 bits are now such
that bit 7 is set if any of the &amp;quot;red&amp;quot; bits in the original were set,
bit 6 if any of the &amp;quot;orange&amp;quot; bits, etc. We can then just use LOGCOUNT on
that byte to get the number of colors present in the palette, and combine
it together with the tiebreaker score computed above.

&lt;pre&gt;
           (blue ()
             (let* ((palette card-set)
                    (best-card (integer-length palette)))
               (setf palette (logior palette (ash palette -32)))
               (setf palette (logior palette (ash palette -16)))
               (setf palette (logior palette (ash palette -8)))
               (+ best-card
                  (* 64 (logcount (ldb (byte 8 0) palette))))))
&lt;/pre&gt;
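
&lt;p&gt;
To make the folding a bit more concrete, here&#039;s the same computation
sketched in Python, with a worked example. The function names are made up
for this sketch; the logic is meant to mirror the Lisp above, not replace
it.

&lt;pre&gt;
```python
def make_card(value, color_code):
    """Encode a card; color_code runs from 1 (violet) to 7 (red)."""
    return (value - 1) * 8 + color_code


def blue_score(card_set):
    """Score a palette bitset under the blue rule (most different colors)."""
    best_card = card_set.bit_length()  # counterpart of INTEGER-LENGTH
    folded = card_set
    folded |= folded >> 32
    folded |= folded >> 16
    folded |= folded >> 8
    # The low byte now has one bit per color present in the palette.
    colors = bin(folded % 256).count("1")
    return best_card + 64 * colors


# Palette: red 7, green 5 and green 2, so two different colors.
palette = 0
for card in (make_card(7, 7), make_card(5, 4), make_card(2, 4)):
    palette |= 1 << card
```
&lt;/pre&gt;

&lt;p&gt;
For that palette the fold leaves bits 7 (red) and 4 (green) set in the low
byte, and the highest card is bit 55, so the score comes out as
56 + 64 * 2 = 184.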

&lt;p&gt;
Finally, there&#039;s indigo (longest straight). There does not appear
to be any clever bit manipulation trick to compute this quickly
(if you can think of one, please let me know!). We need to iterate through the
cards in order of descending value, ignore any consecutive cards with
the same number, and reset our scoring computation when the straight
gets interrupted by a missing number.

&lt;pre&gt;
           (indigo ()
             (let ((prev nil)
                   (current-run-score 0)
                   (best-score 0))
               (declare (type (unsigned-byte 16) current-run-score best-score))
               (do-cards (card card-set)
                 (cond ((not prev)
                        (setf current-run-score card)
                        (setf prev card))
                       ((= (card-value card) (card-value prev)))
                       ((= (card-value card) (1- (card-value prev)))
                        (incf current-run-score 64)
                        (setf prev card))
                       (t
                        (setf current-run-score card)
                        (setf prev card)))
                 (setf best-score (max best-score current-run-score)))
               best-score))
&lt;/pre&gt;

&lt;h3&gt;Players&lt;/h3&gt;

&lt;p&gt;
A player is defined as a normal structure, with the only oddity being
that they form a circular linked list using the NEXT slot. This tends
to be more convenient for iterating through players in turn order
than keeping them stored in an external collection of some sort.

&lt;pre&gt;
(defstruct (player)
  (id 0 :type (mod 5))
  eliminated
  (hand 0 :type card-set)
  (palette 0 :type card-set)
  (score-cache (make-array 16) :type (simple-vector 16))
  (next nil :type (or null player)))
&lt;/pre&gt;

&lt;p&gt;
The core operation of generating a list of valid moves is deciding
whether the player is winning the game after a move is
made. When doing this we&#039;ll end up repeatedly evaluating the scores for
the same palettes over and over again. To speed this up, there&#039;s a
minimal cache; for each player / rule combination we store both the
last palette we evaluated for that rule, as well as the score.

&lt;pre&gt;
(defun player-score (player rule)
  (declare (type (mod 8) rule))
  (let* ((palette (player-palette player))
         (cache (player-score-cache player))
         (cached-key (aref cache rule)))
    (if (eql cached-key palette)
        (aref cache (+ rule 8))
        (progn
          (setf (aref cache rule) palette)
          (setf (aref cache (+ rule 8))
                (card-set-score palette rule))))))
&lt;/pre&gt;

&lt;p&gt;
Given that way to score a player against a rule, we can then check whether
the current player is winning the game with the rule.

&lt;pre&gt;
(defun player-is-winning (player rule)
  (loop with orig-player = player
        with orig-score of-type fixnum = (player-score player rule)
        for player = (player-next orig-player) then (player-next player)
        until (eql player orig-player)
        do (when (&gt;= (the fixnum (player-score player rule))
                     orig-score)
             (return-from player-is-winning nil)))
  t)
&lt;/pre&gt;

&lt;p&gt;
We can then generate all valid moves by iterating through all the
PLAY, PLAY+DISCARD, and DISCARD combinations for the player&#039;s current
state, and collecting the ones that result in the player winning.

&lt;pre&gt;
(defun valid-moves (player current-rule)
  (let (valid-moves)
    (labels ((check-discard (play-card)
               (do-cards (discard-card (player-hand player))
                 (unless (or (eql play-card discard-card)
                             ;; Filter out cases where player discards a card
                             ;; without changing rule or gaining a new card.
                             (and (eql current-rule (card-color discard-card))
                                  (&gt;= (logcount (player-palette player))
                                      (card-value discard-card))))
                   (when (player-is-winning player (card-color discard-card))
                     (push (cons (cons :play play-card)
                                 (cons :discard discard-card))
                           valid-moves)))))
             (check-plays ()
               (do-cards (play-card (player-hand player))
                 (setf (player-palette player)
                       (add-card play-card (player-palette player)))
                 (when (player-is-winning player current-rule)
                   (push (cons :play play-card) valid-moves))
                 (check-discard play-card)
                 (setf (player-palette player)
                       (remove-card play-card (player-palette player))))))
      (check-plays)
      (check-discard nil))
    valid-moves))
&lt;/pre&gt;

&lt;h3&gt;Other stuff&lt;/h3&gt;

&lt;p&gt;
There&#039;s a little bit more code required to generate the scaffolding
for a game, and to actually do the random walk through the game
tree. None of that code is particularly interesting, nor are the
INLINE or TYPE declarations that you&#039;d need to sprinkle on the above
code to make it fast. The &lt;a
href=&#039;https://github.com/jsnell/red7&#039;&gt;full code&lt;/a&gt; is available on GitHub.

&lt;h3&gt;Performance&lt;/h3&gt;

&lt;p&gt;
In the optimal case of trying to iterate through the whole game tree
in a 2p game, the average cost of making a move is about 500 cycles,
with my desktop doing 7 million moves per second. This is however
amortizing the cost of computing the set of valid moves across all of
those moves (since in a full search every valid move gets
executed). If you&#039;re just doing a pure random walk with no
backtracking, you&#039;d get no amortization at all. That effect makes an
order of magnitude difference.

&lt;p&gt;
But it&#039;s funny that the biggest profiler hotspot in the program is the
PLAYER-SCORE function. Which, if you remember, will simply do an array
lookup to get the previous cache key, compare it to the card-set that
should be evaluated, and either return a previous result or call out
to the real scoring function. The function does basically nothing, but
it does nothing really often. When all of the things of substance are
pretty fast as well, it&#039;s maybe not a surprise that the bottleneck ends
up in a place like that.

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;
(Skip this section if you&#039;re not actually interested in the game, and
just wanted to read some Common Lisp code).

&lt;p&gt;
The following results are computed from running simulations of 10k
different initial setups, with 100k matches for each simulation with
each player making random but valid moves. (So a total of one billion
games). All games were played with 3 players, the only player count I
consider worth playing.

&lt;p&gt;
As a sanity check, I ran a smaller simulation of 1000 initial setups
where the players would not play a card + discard, if just playing that
same card was sufficient to get into the lead without a discard. The
results were very close to the large fully random simulation
(e.g. the average game length was 14.6 instead of 14.1 turns, and
the win percentage of the best turn order position was 39% rather than
40%).

&lt;p&gt;
Finally, an even smaller scale experiment had the AIs use move
selection heuristics very similar to those I personally use when
playing the game. Those results didn&#039;t differ materially from random
play either.

&lt;h3&gt;Caveats&lt;/h3&gt;

&lt;p&gt;
Unless stated otherwise, all of the numbers are from games with
players making completely random moves. It is possible that the
aggregate statistics are different when players consciously build
toward palettes that are strong in multiple scoring rules, or strong
in rules that they have a lot of cards in hand for.

&lt;p&gt;
The games are always played with the full deck, whereas in a real game
the deck slowly depletes from hand to hand as cards are moved to the
scoring piles of players.

&lt;h3&gt;Starting player effect&lt;/h3&gt;

&lt;p&gt;
One thing I was curious about is whether the starting player has an
advantage, a disadvantage, or neither. It&#039;s not obvious, since there
are effects both ways.

&lt;p&gt;
The case for a disadvantage: Running out of cards means losing the
game, and all other things being equal the first player will also
run out of cards first. Due to the way in which the player order is
picked, the last player is also guaranteed to have the highest value
starting card in their palette giving them a leg up on winning future
tiebreakers.

&lt;p&gt;
The case for an advantage: The earlier in turn order a player is, the
fewer cards the opponents have in their palettes. It&#039;s much easier to
pass two players with one card each, than two players with two cards
each. And this effect continues throughout the game, so it should
accumulate over time.

&lt;p&gt;
It turns out that at least with undirected random play there&#039;s a major
disadvantage to being first. It could be that the effect is smaller
when players are making &amp;quot;good&amp;quot; moves.

&lt;p&gt;
&lt;table&gt;
&lt;tr&gt;&lt;td&gt;Position&lt;td&gt;Win rate
&lt;tr&gt;&lt;td&gt;1st&lt;td&gt; 27.20%
&lt;tr&gt;&lt;td&gt;2nd&lt;td&gt; 32.42%
&lt;tr&gt;&lt;td&gt;3rd&lt;td&gt; 40.37%
&lt;/table&gt;

&lt;h3&gt;Number of possible moves&lt;/h3&gt;

&lt;p&gt;
As mentioned above, the branching factor in the game was higher
than I&#039;d been expecting. There are cases where players have a lot
more moves available than I would have expected.

&lt;p&gt;
The theoretical maximum number of options is 7 + 7 + 7 * 6 = 56, where
a player can get in the lead either by discarding any of their cards,
playing any of their cards, or with a combination of the two. This
situation actually happened a total of 483986 times in 14 billion
moves (about 0.003% of the time). A lot more common than I would have thought.

&lt;p&gt;
But of course we don&#039;t particularly care about a case that rare. The
more common cases are more interesting. The following graph shows how
often you have at least &lt;i&gt;X&lt;/i&gt; moves available in the game.

&lt;p&gt;
&lt;img src=&#039;/blog/stc/images/red7/cumulative.png&#039;&gt;

&lt;p&gt;
For example, you can see that about a third of the time a player had
10 or more options to choose from. It appears that the game is nowhere
near as constrained as I thought, even when playing without the special
action rules.
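
&lt;p&gt;
The theoretical maximum quoted earlier in this section generalizes to
any hand size. The snippet below is only a sketch of that arithmetic;
the function name is made up for illustration and is not part of the
simulator.

```python
def max_options(hand_size):
    """Upper bound on the number of distinct moves with hand_size cards in hand."""
    discard_only = hand_size                         # discard any one card
    play_only = hand_size                            # play any one card
    play_and_discard = hand_size * (hand_size - 1)   # play one card, discard another
    return discard_only + play_only + play_and_discard


for cards in (7, 5, 3, 1):
    print(cards, max_options(cards))
```

&lt;p&gt;
For a full hand of seven cards this gives the 56 options mentioned
above. It is an upper bound only, since in practice most of those
moves will not put the player in the lead.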

&lt;h3&gt;Length of game&lt;/h3&gt;

&lt;p&gt;
The average game lasted for 14.2 turns, which is perhaps less than I
expected given 2 of those 14 turns were by definition a player just
dropping out from the match.

&lt;p&gt;
There were some games that already ended on turn 4, which meant that
only two cards were played in the game. That number was a mercifully
low 0.01%. And while there were players who got eliminated before playing
a card, there at least were no games ending on turn 2 or 3, even though that&#039;s
theoretically possible. And a single game lasted all the way to turn
28.

&lt;p&gt;
The following graph shows how large a proportion of the games were still
running on a given turn.

&lt;p&gt;
&lt;img src=&#039;/blog/stc/images/red7/gamelength.png&#039;&gt;


&lt;h3&gt;Effect of player decisions&lt;/h3&gt;

&lt;p&gt;
The final question is about how strongly predetermined a single hand of
Red7 is, and how much a player can affect it.

&lt;p&gt;
We&#039;ve already established that at least with this skill level of play
there&#039;s a very large turn order effect, but is that an isolated
issue or does the setup matter even more than that? In these simulations
all players are by definition equally skilled. If the end result of the
game is primarily determined by player skill, you&#039;d thus expect them
to have similar win rates from game to game. So let&#039;s graph the
distribution of per-setup win rates for each starting position:

&lt;p&gt;
&lt;img src=&#039;/blog/stc/images/red7/winrates-density.png&#039;&gt;

&lt;p&gt;
Now, this graph is a little abstract since we&#039;re looking at
probabilities of probabilities. The way to read this is that across
those 10000 starting setups, the most common win percentage for
player 1 (red) across the 100000 games in a specific setup was around
15% (the peak of the red line is at around 0.15). You can see that
the later players in turn order have a graph that&#039;s shifted further
to the right, which is what you&#039;d expect when they have a
substantially higher win percentage. But you can also see that from
any starting position you might get absolutely dismal win rates
(near 0) or very high win rates (over 80%). The ridiculously high
win rates (95%) appear to be purely reserved for the player last
in turn order.

&lt;p&gt;
There were two setups where a player didn&#039;t manage to win even a single
match out of 100000 (in both cases that was player 1). In 25% of the
cases the player with the worst chance of winning a setup had a 10% win
rate or lower, in 7% of the cases a win rate of 5% or lower. It does
appear that within a single hand of Red7, luck plays a massive role.

&lt;p&gt;
Out of all of the questions we&#039;ve been looking at, this is of course
the one where the applicability of a purely random search strategy is
the most questionable. If we&#039;re investigating the effect of player
skill, how can results from the least skillful play imaginable be
relevant? I&#039;m sympathetic to that argument, but before buying into
it I&#039;d really like to understand the mechanism by which one player is
supposed to disproportionately benefit from the random play.

&lt;p&gt;
Also... As mentioned earlier, I also tried extending the AIs to be
smarter about selecting each move. This was not based on any kind of
lookahead, but simply the kinds of heuristics I&#039;d usually use myself
when playing the game. If I can get into the lead either by playing a
card or discarding a card (without drawing a new one to replace it),
I&#039;d rather play a card since that&#039;s going to be useful on future
rounds. When choosing which of two cards to play, I&#039;d usually prefer
to play the one that adds strength to more different scoring rules.

&lt;p&gt;
Experiments with one AI player getting use of these kinds of
heuristics while the others played completely randomly did not show a
big effect; the changes in the win rate were on the order of 1-2
percentage points.

&lt;h2&gt;Future work&lt;/h2&gt;

&lt;p&gt;
I might be done with this little project, but if I pick it up again
there are a couple of obvious directions to take this. Implementing the
optional special action rules would be nice. That&#039;s my preferred form
of the game anyway.

&lt;p&gt;
The more interesting one is to extend the current system to be a full
AI using the Monte Carlo Tree Search approach. This would allow
generating statistics based on &amp;quot;good&amp;quot; play of the game, maybe provide
information on what kinds of moves are in general successful, as well
as give a more conclusive answer to the level of skill the game has.

&lt;p&gt;
The tricky bit with evolving this code to an MCTS is that the system
in the current form would allow the MCTS to exploit knowledge of
future random events and hidden information. It would need to
randomize all card draws (currently deterministic), as well as swap the
opponents&#039; hands for random cards for the duration of the evaluation
phase, and then swap the original deck and original hands back in for
the move execution. That&#039;s going to slow down each individual move
a lot, which is a problem when MCTS will intrinsically require computing
several orders of magnitude more moves than a random walk.
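
&lt;p&gt;
The hand and deck swapping described above could look roughly like the
following sketch. It is not taken from the simulator: the data layout
(a dict of hands plus a deck list) and the function name are
assumptions, and instead of swapping the originals back in afterwards
it builds randomized copies that can simply be thrown away.

```python
import random


def determinize(hands, deck, observer, rng):
    """Return (hands, deck) copies with everything hidden from observer re-dealt.

    hands: dict mapping player name to a list of cards (hypothetical layout)
    deck:  list of cards in draw order
    """
    # Cards the observer cannot see: the draw pile plus every opponent hand.
    unseen = list(deck)
    for player, cards in hands.items():
        if player != observer:
            unseen.extend(cards)
    rng.shuffle(unseen)

    # Deal each opponent as many cards as they really hold, and use
    # whatever is left over as the new draw order.
    new_hands = {observer: list(hands[observer])}
    for player, cards in hands.items():
        if player != observer:
            count = len(cards)
            new_hands[player] = unseen[:count]
            unseen = unseen[count:]
    return new_hands, unseen


hands = {"p1": ["R7", "B3"], "p2": ["G5", "Y1"], "p3": ["V2", "O6"]}
deck = ["R1", "R2", "B4"]
sim_hands, sim_deck = determinize(hands, deck, "p1", random.Random(1))
print(sim_hands, sim_deck)
```

&lt;p&gt;
The evaluation phase would then run on the copies, while the actual
move is executed against the untouched originals. The copying is
exactly the kind of per-move overhead that the paragraph above worries
about.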
</description><author>jsnell@iki.fi</author><category>GAMES</category><category>LISP</category><pubDate>Mon, 30 Mar 2015 14:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2015-03-30-monte-carlo-red7/</guid></item><item><title>Command languages as game user interfaces</title><link>https://www.snellman.net/blog/archive/2014-12-08-command-languages-as-game-ui/</link><description>
&lt;p&gt;
In
the &lt;a href=&#039;https://www.snellman.net/blog/archive/2014-11-27-history-of-online-terra-mystica/&#039;&gt;previous
post&lt;/a&gt; in this series, I promised to discuss in detail some of the
positive and negative consequences of the less conventional design
choices of my &lt;a href=&#039;http://terra.snellman.net/&#039;&gt;online Terra
Mystica implementation&lt;/a&gt;. If you have no idea of what that is,
reading at least the intro of that post might be a good idea. This
post will just deal with one design choice, but it&#039;s the elephant in
the room: the command language.

&lt;p&gt;
The canonical internal representation of a game in my TM
implementation is as a sequence of rows, each describing some number
of player actions specified in
an &lt;a href=&#039;http://terra.snellman.net/usage/#sec-3.3&#039;&gt;ad hoc mini
language&lt;/a&gt;, or administrative commands that change the game setup
in some way (for example setting game options, or dropping a
player from the game partway through). This is what it might look like:

&lt;pre style=&#039;background-color: white&#039;&gt;
&lt;span style=&#039;background-color: #e0f0ff&#039;&gt;yetis&lt;/span&gt;: action ACT4
&lt;span style=&#039;background-color: #b08040&#039;&gt;cultists&lt;/span&gt;: upgrade E6 to TE
&lt;span style=&#039;background-color: #b08040&#039;&gt;cultists&lt;/span&gt;: +FAV6
&lt;span style=&#039;background-color: #f08080&#039;&gt;giants&lt;/span&gt;: Leech 3 from &lt;span style=&#039;background-color: #b08040&#039;&gt;cultists&lt;/span&gt;
&lt;span style=&#039;background-color: #f08080&#039;&gt;giants&lt;/span&gt;: pass BON4
&lt;span style=&#039;background-color: #e0f0ff&#039;&gt;yetis&lt;/span&gt;: Leech 2 from &lt;span style=&#039;background-color: #b08040&#039;&gt;cultists&lt;/span&gt;
&lt;span style=&#039;background-color: #b08040&#039;&gt;cultists&lt;/span&gt;: +WATER
&lt;span style=&#039;background-color: #f0c060&#039;&gt;dragonlords&lt;/span&gt;: Decline 2 from &lt;span style=&#039;background-color: #b08040&#039;&gt;cultists&lt;/span&gt;
&lt;span style=&#039;background-color: #f0c060&#039;&gt;dragonlords&lt;/span&gt;: dig 1. build G6
&lt;span style=&#039;background-color: #e0f0ff&#039;&gt;yetis&lt;/span&gt;: send p to EARTH
&lt;span style=&#039;background-color: #b08040&#039;&gt;cultists&lt;/span&gt;: action FAV6. +AIR
&lt;span style=&#039;background-color: #f0c060&#039;&gt;dragonlords&lt;/span&gt;: pass BON7
&lt;span style=&#039;background-color: #e0f0ff&#039;&gt;yetis&lt;/span&gt;: upgrade E7 to TE. +FAV11
&lt;span style=&#039;background-color: #f08080&#039;&gt;giants&lt;/span&gt;: Leech 3 from &lt;span style=&#039;background-color: #e0f0ff&#039;&gt;yetis&lt;/span&gt;
&lt;span style=&#039;background-color: #f0c060&#039;&gt;dragonlords&lt;/span&gt;: Leech 2 from &lt;span style=&#039;background-color: #e0f0ff&#039;&gt;yetis&lt;/span&gt;
&lt;span style=&#039;background-color: #b08040&#039;&gt;cultists&lt;/span&gt;: Leech 2 from &lt;span style=&#039;background-color: #e0f0ff&#039;&gt;yetis&lt;/span&gt;
&lt;/pre&gt;

&lt;p&gt;
That&#039;s a short excerpt from the middle of a random game. A full game
generally runs for about 400 rows.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
What do I mean by this being the canonical internal representation?
Only a few parts of the game state are actually persisted separately
in the DB; these are things that might almost qualify as metadata,
such as whose turn it is to move, whether the game is still running, and what
the final rankings of a finished game were. But in general the only
way to find out the current state of the game is to evaluate the whole
sequence of commands from start to finish. This is in fact done for
&lt;i&gt;almost every operation on the site&lt;/i&gt; (viewing a game, previewing a move,
saving a move, viewing or editing the game in admin mode, and
so on).

&lt;p&gt;
In addition to being the canonical internal representation, the
command language is also the canonical user interface; the fundamental
operation players do is enter new rows into the command
sequence. Often this is done by writing the commands manually, though
there are GUI shortcuts of one form or another available for almost
all operations.

&lt;p&gt;
This might sound like a slightly insane way of doing things, but it
does have some benefits as well. I&#039;ve made several digital board game
adaptations of varying levels of completeness over the years, used
dozens of other ones, and this solution hits the closest to my personal
sweet spot.

&lt;h3&gt;A taxonomical diversion&lt;/h3&gt;

&lt;p&gt;
Before discussing the fallout of this design decision in more detail,
it&#039;s probably useful to do a quick tour of some of the main axes in
the design space. (I&#039;m of course just describing the extremes, while
in the real world most examples would fall on a continuum).

&lt;p&gt;
First, there&#039;s the question of the interaction model which might be
&lt;b&gt;abstract&lt;/b&gt;
or &lt;b&gt;&lt;a href=&#039;http://en.wikipedia.org/wiki/Skeuomorph&#039;&gt;skeuomorphic&lt;/a&gt;&lt;/b&gt;. In
a skeuomorphic design the player doing input on a computer would still
be mimicking the actions of someone playing the game with physical
pieces and no computer assistance.

&lt;p&gt;
In an abstract design the player
would only input the parts of the move that are necessary to uniquely
distinguish it from other possible moves, with any bookkeeping and
mandatory intermediate steps being carried out automatically. Likewise
in a skeuomorphic design the software provides information through the
same methods as the original physical game, while an abstract design
will automate some of the mechanical parsing of the game state. Or
even just the question of using the graphical assets of the original
game, generally optimized for sales, versus using digital-first assets
optimized for clarity.

&lt;p&gt;
As an example of this axis, in
the &lt;a href=&#039;http://en.wikipedia.org/wiki/18XX&#039;&gt;18xx&lt;/a&gt; series of
games a substantial amount of playtime is spent computing the exact
routes of a number of trains on a complex rail network. I&#039;m aware of
three solutions that are actually in use, and there is a fourth
plausible one, in order from least to most abstract:

&lt;ul&gt;
&lt;li&gt; The user manually decides on the routes, computes their values with no computer assistance, and those values are used with no validation. Examples: ps18xx, early versions of &lt;a href=&#039;http://rails.sourceforge.net/&#039;&gt;Rails&lt;/a&gt;.
&lt;li&gt; The user enters valid routes through a user interface. The software computes the values of the routes, and distributes the income from the company appropriately. Example: &lt;a href=&#039;http://www.rr18xx.com/&#039;&gt;rr18xx&lt;/a&gt;.
&lt;li&gt; In games with requirements that all routes must be optimal, the software could compute an optimal route but only for the purpose of rejecting any manually computed suboptimal ones. Examples: None. (Though it&#039;s similar to what&#039;s done in the SlothNinja implementation of &lt;a href=&#039;http://www.slothninja.com/&#039;&gt;Indonesia&lt;/a&gt;, a game that probably counts as an honorary 18xx)
&lt;li&gt; The software automatically finds an optimal set of routes and computes their values. Examples: The ancient DOS-based &lt;a href=&#039;http://www.mikkosgameblog.com/2010/08/simtex-1830/&#039;&gt;1830 from Simtex&lt;/a&gt;, recent versions of Rails.
&lt;/ul&gt;

&lt;p&gt;
My own tastes run toward maximum abstraction; I&#039;ve rarely if ever seen
a digital boardgame conversion that needed to be more skeuomorphic.
But this is not a universal view. There are definitely people who will
refuse to play a conversion that does not use the same graphics as the
physical version. Or who will strenuously argue against automatic
finding of optimal routes in 18xx, on the basis that evaluating
routes is a core skill in the game when making decisions about route
building, and that skill can only be acquired by getting sufficient
practice in manual route computation.

&lt;p&gt;
A second axis is the internal representation, which could be based on
either &lt;b&gt;log replay&lt;/b&gt; or &lt;b&gt;stored state&lt;/b&gt;. In a log replay
system the game is stored as a series of steps from the starting setup to
the current state. In a stored state system the game is stored as the
current values of all pieces of the game. How much money does every
player have, which round is it right now, what&#039;s in this exact space
on the map, and so on.
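
&lt;p&gt;
As a rough illustration of the difference, here is a toy sketch. The
command format (a player name followed by a signed resource change) is
invented for the example and is far simpler than the real Terra
Mystica command language.

```python
def apply_command(state, command):
    """Apply one toy command such as "yetis: +3" and return the new state."""
    player, action = [part.strip() for part in command.split(":", 1)]
    new_state = dict(state)
    new_state[player] = new_state.get(player, 0) + int(action)
    return new_state


def replay(log, initial_state=None):
    """Log replay: derive the current state by evaluating every command in order."""
    state = dict(initial_state or {})
    for command in log:
        state = apply_command(state, command)
    return state


# Log replay keeps the steps and recomputes the state on demand.
log = ["yetis: +3", "giants: +2", "yetis: -1"]

# Stored state keeps only the current values, with no history behind them.
stored_state = {"yetis": 2, "giants": 2}

print(replay(log) == stored_state)
```

&lt;p&gt;
Both representations describe the same position here. The difference
is in what can be done with them afterwards: the log can be inspected,
edited and re-evaluated, while the stored state can only be read or
overwritten.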

&lt;p&gt;
A third axis is the input model. Moves could be entered either through
&lt;b&gt;direct&lt;/b&gt; or &lt;b&gt;indirect&lt;/b&gt; manipulation. In a system using
direct manipulation, the player would for example see a graphical
display of a map and be able to click or drag on a unit to enter a move
for it. In an indirect system the player observes the game state in
one place, and enters their moves using some completely unrelated
system.

&lt;p&gt;
I think most digital boardgames use a direct input model, but there
are also a fair number that have a menu-driven system of some sort.
The only examples I know of that go a bit further with indirection by
providing a command language are
my &lt;a href=&#039;https://www.snellman.net/software/pogmap/&#039;&gt;ancient Paths
of Glory mapper&lt;/a&gt; and the even
older &lt;a href=&#039;http://en.wikipedia.org/wiki/Internet_Diplomacy#Email_judges&#039;&gt;Diplomacy
PBEM judges&lt;/a&gt;. If you have other examples, I&#039;d love to hear of them.

&lt;p&gt;
Direct manipulation is often, but not always, linked to excessive
skeuomorphism in the interaction model. For example I find it almost
painful to play most Vassal modules, with their hyper-direct
interaction model of dragging and dropping counters around, manually
drawing cards from a deck or rolling dice. Digital boardgames are not
the same medium as physical boardgames, and should play to their unique
strengths. But these are in fact orthogonal concerns, and there&#039;s no
reason why a direct manipulation model couldn&#039;t also provide
useful input and computational abstractions.

&lt;p&gt;
Whew, so much for the theory. In this taxonomy Online Terra Mystica is
pretty far toward the abstract end, and is fully in the log replay
camp. While it has a half-hearted attempt at adding some direct
manipulation concepts to the UI, it started off as an indirect system
and deep inside that&#039;s what it is. It also chooses to merge the input
format and the log format into one entity. So what does this mean?

&lt;h3&gt;Feature set&lt;/h3&gt;

&lt;p&gt;
Perhaps the signature feature of the site is the &lt;b&gt;planner&lt;/b&gt;. This
tool allows the player to enter an arbitrarily long sequence of
actions - all the way to the end of the game - and see what the
effects would be. Are all the moves valid? Are there sufficient
resources available to do all of this? Oh, I don&#039;t have enough
resources? Well what if I do this on round 5, and delay that action to
round 6. In cases where the plan fundamentally depends on the
opponents doing something, it&#039;s possible for the plan to also contain
arbitrary resource adjustments.  And finally, since the command
language supports comments, these plans can be properly documented so
that when you return to them in a day or two, you can remember why
you wanted to do these particular moves.

&lt;p&gt;
I think this feature is intrinsically linked to the command language as
a user interface, and it might actually be unique. There are some
games with other kinds of interfaces that allow you to play the game
forward, and then undo / rewind / reload. But simply being able to
play the game forward is not sufficient to make this a useful
tool. It&#039;s only the ease of inserting, reordering and deleting moves
that makes it possible to use this as a matter of course, rather than
only under the most exceptional circumstances.

&lt;p&gt;
A somewhat related feature is &lt;b&gt;undo&lt;/b&gt;. Inflexibility in allowing
moves to be taken back is the bane of many forms of digital
boardgames.  When playing a game face to face, most groups will
generally allow at least some level of taking back moves. In some
implementations all moves are final immediately (this has always been the
primary problem of the otherwise brilliant implementation
of &lt;a href=&#039;http://brass.orderofthehammer.com/&#039;&gt;Brass at Order of the
Hammer&lt;/a&gt;). In some other implementations there are distinct
checkpoints, for
example &lt;a href=&#039;http://www.boardgaming-online.com/&#039;&gt;BGO&#039;s Through the
Ages&lt;/a&gt; allows undoing back to the start of your full turn, but no
other rollbacks (clicking &#039;finish turn&#039; is final, as is any kind of
action during an auction or war resolution). These two are, I believe,
examples of undo being limited for design reasons. At rr18xx meanwhile
rollbacks are possible until the previous action of each player. Here
my understanding is that the overriding issue is technical, as the
rollback is essentially a full restore to a previous database
snapshot, and there are resource constraints on how many snapshots can
be kept.

&lt;p&gt;
The solution Online TM takes to this is to grant the creator of the
game arbitrary powers to edit the history at will, the &lt;b&gt;admin
mode&lt;/b&gt;. Not only can they undo the last move or couple of moves. If
there was a mistake made three moves back, they can go and fix it (and
they can fix it without forcing the intervening moves to be
redone). This feature is fully tied to a log replay mode of
operation. While more limited forms of undoing could be implemented as
a reverse log replay from the end state or through state snapshots,
this more complete form depends on the log being directly editable.
And realistically the log also needs to be the input format; it would
not be reasonable to expect the admin to be able to edit a more
formal log representation correctly (whether the log format is XML,
protocol buffers, JSON, or something else). But in the case where the
log format and the move input system match, just playing the game
has taught the game admin the necessary skills.

&lt;p&gt;
This is a very nice feature for friendly games. It does have downsides
though, more on that later in the section on the social implications.

&lt;p&gt;
There&#039;s also a potential as yet unimplemented feature
of &lt;b&gt;pre-programmed actions&lt;/b&gt;, that people frequently ask for.
&amp;quot;I know exactly what I want to do next turn, why can&#039;t I just
pre-enter my move&amp;quot;. This would be a pretty interesting thing for
speeding up games, but to my mind would not be conducive to good
play. Circumstances change, often in ways you did not anticipate at
all.  The only way this could be even remotely usable would be if the
language was extended to have some kind of conditional execution. And
that&#039;s a can of worms I&#039;m not interested in opening, and I suspect also a
bridge too far for 99% of my users.

&lt;p&gt;
It&#039;s worth noting that many of the above features are closely tied to
a game with no randomness (or at most setup randomness) and no
hidden information. As such their existence is something of an
anti-feature, preventing other additions to the game.

&lt;p&gt;
For a non-hypothetical example, I&#039;m currently thinking about how to
implement the &lt;b&gt;faction auction&lt;/b&gt; variant from the TM expansion. A
full open auction in the beginning would be painfully slow. The most
obvious, though still slightly imperfect, solution is a series
of &lt;a href=&#039;http://en.wikipedia.org/wiki/Vickrey_auction&#039;&gt;blind second
price auctions&lt;/a&gt;. But this is not a good fit for the site&#039;s existing
design. The problem is that the blind bid introduces momentary hidden
information into the game, and it&#039;s possible for that information to
leak through either the preview or admin modes. For example the admin
could wait for everyone else to bid, peek into the log and see
everyone else&#039;s bids, and then bid in such a way as to force the
winner to pay the maximum amount.
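
&lt;p&gt;
To make the leak concrete, here is a minimal sketch of how a single
sealed-bid second price auction resolves. None of this exists on the
site; the function name, the tie-breaking rule and the zero price for
an unopposed bidder are all assumptions made for the example.

```python
def second_price_auction(bids):
    """Resolve a sealed-bid second price auction.

    bids: dict mapping bidder name to bid amount. Ties are broken by
    bidder name, which is an arbitrary choice for this sketch.
    Returns a (winner, price_paid) tuple.
    """
    if not bids:
        raise ValueError("need at least one bid")
    ranked = sorted(bids.items(), key=lambda item: (-item[1], item[0]))
    winner = ranked[0][0]
    # The winner pays the runner-up bid, or nothing if unopposed.
    price = ranked[1][1] if len(ranked) != 1 else 0
    return winner, price


honest = {"yetis": 5, "giants": 8, "cultists": 3}
print(second_price_auction(honest))

# An admin playing cultists who has peeked at the other bids can bid
# just under the top bid to push up the price the winner pays.
peeked = {"yetis": 5, "giants": 8, "cultists": 7}
print(second_price_auction(peeked))
```

&lt;p&gt;
In the honest case the winner pays the second highest bid of 5. Once
one bidder can see the others before bidding, that price can be pushed
to 7 at no risk to the peeking player.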

&lt;h3&gt;UX&lt;/h3&gt;

&lt;p&gt;
The most obvious UX consequence of using a command language is that
it tends to be &lt;b&gt;harder to learn&lt;/b&gt;. The following quote, said partly
in jest, certainly contains a kernel of truth:

&lt;blockquote&gt;
... has done a bang-up job providing a PBEM Terra Mystica experience that includes just enough extra layers of complexity via the interface and game administration tools to keep TM as confusing as ever, long after you master the actual game!
&lt;/blockquote&gt;

&lt;p&gt;
Non-natural languages are simply not a mode of human computer
interaction that most people are comfortable with in this day and age.
It actually continues to amaze me that I could get non-programmers to
play using this implementation at all. Is it possible to evaluate how
big a hurdle this has been for people? The best number I can come up
with is that around 20% of the players who joined at least one game
never finished even one game without dropping out. Note that these
are players who have already jumped through hoops such as email
validation during account registration. It&#039;s possible that there&#039;s
some other issue beside the UI that&#039;s a problem for these players,
but it does seem like the most likely candidate.

&lt;p&gt;
A smaller problem is that it essentially forces the introduction of a
&lt;b&gt;move preview&lt;/b&gt;. For those who haven&#039;t played the game, when entering
moves you need to first enter the moves, then click &#039;preview&#039;, check
that the results match what you want, and finally click &#039;save&#039; to
commit the moves. In a game that uses a direct manipulation paradigm,
a preview could be skipped. But with a more obscure UI like here, it&#039;s
absolutely essential since the move might not have had the intended
effect. Whether it&#039;s doing the entirely wrong move, picking the wrong
tile, building on the wrong location, etc. Even with a preview step
somebody will request a rollback on average once or twice a game.

&lt;p&gt;
So why do I call this a problem? Because despite my best efforts,
especially new players will frequently forget to &#039;save&#039;, leaving the
game in a limbo state where they think they&#039;ve done their move, until
some other player gets impatient. (To mitigate this a little, the
system will automatically do a &#039;preview&#039; when using the GUI tools to
generate the commands rather than type them.  Unfortunately
performance problems make it unfeasible to trigger continuous parsing
+ updates when typing).

&lt;p&gt;
A horrible mistake I made in the design of the language was the lack
of (mandatory) &lt;b&gt;turn delimiters&lt;/b&gt;. Originally my implementation treated
each row as a complete turn. This caused more confusion than any other
part of the command language. I ended up writing a lot of
very complicated code for automatically detecting the turn breaks in
a command stream.

&lt;p&gt;
But that wasn&#039;t actually good enough, since there are valid command streams
where the splitting is ambiguous. One example is the tunneling ability of dwarves, where
&lt;code&gt;transform E10. build E10&lt;/code&gt; can be read either as a single move or as two separate ones. I had to make an arbitrary
choice on that (basically the behavior now is greedy, as many commands
as possible are stuffed into the same move). So I had to include the
&lt;code&gt;done&lt;/code&gt; command to allow players to disambiguate in the few
cases where it&#039;s needed. This is still supremely confusing for
people. All of this could have been avoided by taking this into account
right at the start.
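
&lt;p&gt;
A heavily simplified sketch of greedy splitting with an explicit
&lt;code&gt;done&lt;/code&gt; is shown below. The real detection code is much
more involved; the rule used here (a move holds at most one command
from a small set of main actions) is a made-up stand-in chosen only to
show the greedy behavior.

```python
# Toy rule, not the real one: at most one of these verbs per move.
MAIN_ACTIONS = {"build", "upgrade", "pass"}


def split_moves(commands):
    """Greedily group a flat list of commands into moves.

    A new move starts only when the current one cannot take the next
    command (it already holds a main action) or when an explicit
    "done" forces the break.
    """
    moves, current, has_main = [], [], False
    for command in commands:
        verb = command.split()[0]
        if verb == "done":
            if current:
                moves.append(current)
            current, has_main = [], False
            continue
        is_main = verb in MAIN_ACTIONS
        if is_main and has_main:
            moves.append(current)
            current, has_main = [], False
        current.append(command)
        has_main = has_main or is_main
    if current:
        moves.append(current)
    return moves


print(split_moves(["transform E10", "build E10"]))
print(split_moves(["transform E10", "done", "build E10"]))
```

&lt;p&gt;
Without &lt;code&gt;done&lt;/code&gt; the two commands end up in the same move;
with it, the player gets two separate moves.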

&lt;p&gt;
Finally, one very surprising outcome is that having a compact
vocabulary for game actions makes it much easier to display
a &lt;b&gt;useful player-readable log&lt;/b&gt; of what happened in the game. The
typical user-visible log is structured as natural language, and so
verbose as to be hard to read especially when trying to piece together
the flow of the game after the fact. It&#039;s easy to see why that design
choice is made, but it&#039;s not necessary when all players are almost by
definition going to know how to read a more compact representation.

&lt;p&gt;
Likewise this makes it really easy to display a concise summary of
what has happened in the game since the player last looked at it (done
both in the notification emails and the &#039;recent moves&#039; tab of games).

&lt;h3&gt;Social issues&lt;/h3&gt;

&lt;p&gt;
The unlimited admin access to games has a dark side. &lt;b&gt;Admin
malfeasance&lt;/b&gt; is rare but I do get about one complaint a month about
it. Sometimes these are games where the admin will change their moves
after others have already taken moves, roll the game back by a huge
amount, take over entirely for another player (for example by forcibly
passing them), apply different standards to allowing others to undo
vs. doing it themselves, and so on.

&lt;p&gt;
This is the kind of drama that I really do not want to deal with, but
the general solution is to just mark the game as unrated, and let the
players sort out between themselves whether and how the game will
continue. And it is a bit of a miracle that it hasn&#039;t yet become a
more widespread problem,
as &lt;a href=&#039;http://www.penny-arcade.com/comic/2004/03/19&#039;&gt;one might
expect to happen for the anonymity + internet combo&lt;/a&gt;. If it does
ever become intolerable, the solution will almost certainly be to
disable admin mode entirely for public games. The TM tournament has
already shown that it&#039;s at least workable, even if people do occasionally
get a little bit screwed by the &#039;no manual administration&#039; policy.

&lt;p&gt;
One consequence of a command language is that everything needs to be
named. The map needs to have a coordinate system, every component
needs an identifier of some sort, and every interaction needs a short
and snazzy name. Old school wargames will do this as a matter of
course. Of course every hex has an id! Of course the cards are both
numbered and uniquely titled! But not so much for eurogames.

&lt;p&gt;
The naming we ended up with on the site is far from optimal, and
caused yet more drama due to non-online players feeling excluded from
conversations. (If you want to know more, you can see an explanation
for &lt;a href=&#039;http://boardgamegeek.com/article/17066276#17066276&#039;&gt;where
the names came from, and why they won&#039;t change&lt;/a&gt;). That bit is
unfortunate. But at least I actually find real value in having
convenient shorthands available for everything, when discussing the
game, whether when theorycrafting or conducting some tabletalk on IRC
during a game.

&lt;h3&gt;Implementation issues&lt;/h3&gt;

&lt;p&gt;
The obvious problem for a log replay system
is &lt;b&gt;performance&lt;/b&gt;. Replaying a full game, which is done for almost
every operation, can take around 0.15 seconds in the current
implementation, with no obvious low hanging fruit to fix. On the
current traffic levels server load is not a problem, but I would start
to get worried if usage increased by a factor of 10. As discussed
above, there are features I&#039;m unwilling to implement due to CPU load
concerns. And it is actually causing real development pain for testing
(see below).

&lt;p&gt;
It&#039;s hard to say exactly how much of the CPU load is related to
command parsing, a step that could be avoided with the use of a
more structured log format. Some crude profiling suggests that the
parsing takes only 5-10% of the runtime, certainly nowhere near enough
to warrant using a different format.

&lt;p&gt;
A rewrite in a language with higher performance implementations than Perl
would almost certainly give a factor of 10 improvement on the actual
game evaluation code, moving the bottlenecks to IO. But a full rewrite
is not in the cards.

&lt;p&gt;
Another potential implementation worry is &lt;b&gt;storage&lt;/b&gt;. The current
DB size is about 250MB. Unlike CPU usage, this is a cost that
accumulates over time. Out of that 250MB maybe 75% is used by the game
logs. The logs, stored as a sequence of commands, are not a
particularly efficient form of encoding the game data. Simple lossless
compression could easily compress them by 80-90%.  Luckily disk is
cheap (this server still has 600GB free), so this should never become
a real issue.
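
&lt;p&gt;
As a rough illustration of the compression point, here is a small Python
sketch using the standard zlib module. It is not code from the site (which
is Perl), and the sample log is made up in the style of the command
language, so the number it prints says nothing about how well the real
logs would compress.

&lt;pre&gt;
```python
import zlib


def compression_saving(log_text):
    """Return the fraction of bytes saved by zlib-compressing a game log."""
    raw = log_text.encode("utf-8")
    packed = zlib.compress(raw, 9)
    # Lossless means the original log must round-trip exactly.
    assert zlib.decompress(packed) == raw
    return 1 - len(packed) / len(raw)


# Hypothetical log: the same few command shapes repeated many times.
sample_log = "\n".join(
    "convert 2pw to 2c. upgrade d3 to tp" if i % 2 == 0 else "pass bon4"
    for i in range(200)
)
saving = compression_saving(sample_log)
print("saved {:.0%} of the bytes".format(saving))
```
&lt;/pre&gt;

&lt;p&gt;
Command logs repeat the same handful of verbs and identifiers over and
over, which is exactly the kind of input general purpose compressors
handle well.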

&lt;p&gt;
Another consequence of a log replay system is that any change in the
game evaluation might &lt;b&gt;break existing games&lt;/b&gt;. That change might be a
bugfix for a place where the effect of a move was miscomputed, it
might be extra validation to prevent illegal moves of some kind,
cheating prevention, or something else entirely. This is not a
theoretical possibility. For basically every single game evaluation
change I make, there are already multiple affected games. No matter how
elementary a rule is, somebody has already broken it.

&lt;p&gt;
Obviously in a stored state implementation changes like this don&#039;t
matter. The current state is the current state no matter what. But in
a log replay system you need to have some story on how to deal with
retroactive changes. I can think of the following strategies:

&lt;ul&gt;
&lt;li&gt; Punt: Don&#039;t make any changes at all.
&lt;li&gt; Ignore: Just make the change, and don&#039;t worry about games breaking or the results changing part way through.
&lt;li&gt; Delete: Just delete any games that would be broken.
&lt;li&gt; Fixups: Find all games where the old and new behavior differ, and
change the appropriate logs in such a way that the results with the
new log and version will be the same as the result with the original log
and old version. This change could be manual or automated.
&lt;li&gt; Versioning: Each game file carries a version number. When making
a breaking change, keep both the original and new code paths, and choose
one of the two based on the version number. Any newly created games use
the new version number and get the fixes, existing games keep their original
version number and the original behavior.
&lt;li&gt; Positive options: Conditionalize the behavior on an option. Turn that option on for new games, as well as any existing games for which the new and old versions behave the same.
&lt;li&gt; Negative options: Conditionalize the old behavior on an option. Turn that option on only for existing games where the results for old and new versions differ. Never turn the option on for newly created games.
&lt;/ul&gt;
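
&lt;p&gt;
To make the &#039;negative options&#039; strategy from the list above concrete,
here is a minimal Python sketch. Everything in it is invented for
illustration: the payment rule, the option name
&lt;code&gt;allow-overdraft&lt;/code&gt;, the starting funds and the log format. The
real site is written in Perl and its rules are far more involved.

&lt;pre&gt;
```python
def replay(commands, options=frozenset()):
    """Replay a log of (player, cost) payments and return the final balances.

    Hypothetical rule change: paying more than you have used to be
    accepted, and is now rejected. Games that already depend on the old
    behavior carry the negative option "allow-overdraft".
    """
    balances = {}
    for player, cost in commands:
        balance = balances.get(player, 10)  # assumed starting funds
        if cost > balance and "allow-overdraft" not in options:
            raise ValueError("{} cannot pay {}".format(player, cost))
        balances[player] = balance - cost
    return balances


# An existing game whose log breaks the new rule keeps replaying
# unchanged, because it was flagged with the option during rollout.
legacy_log = [("alice", 7), ("alice", 5)]
legacy_result = replay(legacy_log, {"allow-overdraft"})

# A newly created game never gets the option, so it gets the fix.
try:
    replay(legacy_log)
    new_game_rejected = False
except ValueError:
    new_game_rejected = True
```
&lt;/pre&gt;

&lt;p&gt;
With a positive option the condition would be inverted: the strict check
would only run when the option is present, and the option would have to be
set on every new game and on every old game that is unaffected.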

&lt;p&gt;
During the lifespan of the site I&#039;ve used most of these at one time or
another. The &#039;ignore&#039; strategy was appropriate a couple of times (for
changes where I decided that the new behavior was always
acceptable, such as situations where a player had ended up overpaying
for an action). The &#039;delete&#039; strategy is exceptional; the only
situations where I used it were games that were aborted, and one case
of a single game being completely unsalvageable due to bug abuse by a
player. The &#039;fixup&#039; strategy has the nice benefit that it avoids
introducing a new code path, and was my default choice early on. But
at this point it&#039;d be an unacceptable amount of manual work, and it&#039;s
not readily automatable, especially with the relatively freeform input
from the command language. My next default was &#039;positive options&#039;, but
after about 3-4 of those I switched to &#039;negative options&#039;. Positive
options had a slightly more complicated rollout procedure, and they also
permanently clutter up all games, confusing people. (&amp;quot;What&#039;s this
&lt;code&gt;strict-darkling-sh&lt;/code&gt; option?&amp;quot;).

&lt;p&gt;
None of these options are good. In this instance a log replay model
does introduce some major costs, either to the developer (who has to do
extra work) or the users (who have some games screwed up or completely
lost).

&lt;p&gt;
But it&#039;s not all bad! A log replay model makes &lt;b&gt;testing&lt;/b&gt; much
easier. First, it&#039;d be very easy to write test cases since there is a
very natural serialization format for games already, the command
language. I don&#039;t actually write explicit tests for TM, but for
example at work we need an absurd amount of infrastructure for making it
easy to write unit tests for TCP/IP packet handling. This kind of
design gives the test cases for free. Likewise an Age of Steam
implementation I was once doodling around with had lots of test cases,
but even with the reasonably friendly format (protocol buffers) they
were an absolute pain to write due to the boilerplate.

&lt;p&gt;
If I don&#039;t write unit tests, how do I test? Mostly by &lt;b&gt;side by side
testing&lt;/b&gt;; I have
a &lt;a href=&#039;https://github.com/jsnell/terra-mystica/blob/master/src/diffgame.pl&#039;&gt;small
script&lt;/a&gt; that runs every single game in the database against both
the new and the previous version. It munges the results a bit removing
known harmless diffs, and then displays any changes from game to
game. I can then look at those games, and decide whether it&#039;s
indicating some kind of a problem with my change, an expected result
of my change, or a problem of some sort in the game. It also acts as
a great regression test that prevents failures from creeping in, and
is the source of data for finding the games that would be broken by
a change, so that one of the strategies discussed in the previous section
can be applied.
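
&lt;p&gt;
The core loop of a side by side test like this can be sketched in a few
lines of Python. This is not the actual &lt;code&gt;diffgame&lt;/code&gt; script (which
is Perl); the function names, the shape of the game store and the
&lt;code&gt;normalize&lt;/code&gt; hook for dropping known harmless diffs are all
assumptions made for the example.

&lt;pre&gt;
```python
def side_by_side(games, old_eval, new_eval, normalize=lambda result: result):
    """Replay every stored log with both versions and collect differences.

    games maps a game id to its command log; old_eval and new_eval are the
    two versions of the evaluator; normalize strips known harmless diffs.
    """
    changed = {}
    for game_id, log in sorted(games.items()):
        before = normalize(old_eval(log))
        after = normalize(new_eval(log))
        if before != after:
            changed[game_id] = (before, after)
    return changed


# Toy evaluators: the "new" version stops counting empty commands.
games = {"g1": ["build", "pass"], "g2": ["build", "", "pass"]}
old_version = lambda log: {"moves": len(log)}
new_version = lambda log: {"moves": len([c for c in log if c])}
report = side_by_side(games, old_version, new_version)
print(report)
```
&lt;/pre&gt;

&lt;p&gt;
Only the games where the two versions disagree end up in the report, so
the output is both the regression signal and the list of games that need
one of the strategies above.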

&lt;p&gt;
This has been one of my favorite forms of testing for a long time, and
works tremendously well in a case like Online TM where we have access
to all games ever played. Thinking specifically of digital boardgames,
it&#039;s also a model that wouldn&#039;t work well without a replayable log.
The only problem is, as alluded to above, the CPU usage. Right now a
full &lt;code&gt;diffgame&lt;/code&gt; run takes about 90 minutes of CPU time on a
rather beefy machine. Even with parallelization it&#039;s not a fast
feedback cycle. (Makes me kind of miss being able to just casually run
a sxs test on a thousand machines).

&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;
I&#039;m afraid this ended up longer than intended, despite only covering
one design decision. It&#039;s also a design decision that I feel is
overall a win. You&#039;ll have to wait for the next post for the
embarrassing technical missteps.

</description><author>jsnell@iki.fi</author><category>GAMES</category><category>PERL</category><pubDate>Mon, 08 Dec 2014 12:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2014-12-08-command-languages-as-game-ui/</guid></item><item><title>A brief history of Online Terra Mystica</title><link>https://www.snellman.net/blog/archive/2014-11-27-history-of-online-terra-mystica/</link><description>
&lt;h3&gt;What&#039;s this Online Terra Mystica thing?&lt;/h3&gt;
&lt;p&gt;
For the last couple of years my main hobby hacking project (over a
thousand commits, and probably an order of magnitude more time spent
on it than all other non-work projects combined) has been an &lt;a
href=&#039;http://terra.snellman.net/&#039;&gt;asynchronous
multiplayer web implementation&lt;/a&gt; of the brilliant board game
&lt;a href=&#039;http://boardgamegeek.com/boardgame/120677/terra-mystica&#039;&gt;Terra Mystica&lt;/a&gt; (&lt;a href=&#039;http://www.feuerland-spiele.de/en/&#039;&gt;Feuerland Spiele&lt;/a&gt;, 2012).
At the moment it&#039;s roughly 2/3 Perl, 1/3 Javascript, and uses
Postgres as the data storage.

&lt;p&gt;
It&#039;s been a fairly successful project for something that was
originally intended as a one-off. The usage statistics at the end
of November 2014 are:

&lt;ul&gt;
&lt;li&gt; Almost 6000 registered users
&lt;li&gt; About 1200 monthly active users (as in playing at least one game; not
passive use like looking at the statistics pages).
&lt;li&gt; 14000 moves executed on a normal weekday (10000 on weekends)
&lt;li&gt; 16500 games either ongoing or finished.
&lt;li&gt; Bi-monthly &lt;a href=&#039;http://tmtour.org&#039;&gt;online TM tournament&lt;/a&gt; run by
  Daniel &amp;Aring;kerlund with 400+ players.
&lt;li&gt; &lt;a href=&#039;https://github.com/jsnell/terra-mystica&#039;&gt;1038 commits&lt;/a&gt; as of this writing.
&lt;/ul&gt;

&lt;a href=&#039;http://terra.snellman.net/game/135test&#039;&gt;&lt;img src=&#039;/blog/stc/images/tm-thumb.png&#039; style=&#039;float: right; padding: 10px;&#039;&gt;&lt;/a&gt;

&lt;p&gt;
This was not supposed to be a general use
program. It was originally a one night hack to help keep track of
a hand-moderated play-by-forum game of TM, which was obviously
headed for failure due to the massive number of errors people were
making while describing their moves in natural language or when manually
tracking their resources in the game.

&lt;read-more&gt;&lt;/read-more&gt;

&lt;p&gt;
From there the project snowballed, slowly gathering features
including just about everything I ever marked in the TODO as being
&#039;out of scope&#039;. Since I often had only very limited amounts of time to
work on this, and my expectation was always that the interest in the
site would soon fizzle out, the project management method was to always
get the maximum short-term bang for the buck.

&lt;p&gt;A project whose direction is literally guided by &#039;what can I get
done in the next two hours&#039; is of course massively path dependent; the early
decisions made with very little consideration had outsized influence
on where the site ended up. Sometimes the expedient gambles on &#039;do the
simplest possible thing&#039; failed, and the results were just rubbish. At
other times things ended up at a slightly odd local maximum. And in
some rare cases the gamble turned out to produce wonderful
and unexpected results.

&lt;h3&gt;Timeline&lt;/h3&gt;


&lt;p&gt;Future posts will discuss the actual lessons
learned; what didn&#039;t work and what did work - both in the mechanics of
programming and in the peculiarities of online boardgames. But in
this one let&#039;s just have a look at the history of the site, how long
it took for it to get features that one might consider absolutely
necessary, and how amazingly bad a user experience people are willing to
put up with when it&#039;s the only way they can play their favorite game
online.

&lt;p&gt;Feel free to skip past the bulleted list if you get bored; it&#039;s
still a bit long even if I include only changes I consider fairly
major (indeed, a lot has to get filtered out given it&#039;s 1000+ commits).

&lt;p&gt;
  &lt;b&gt;2012&lt;/b&gt;
  &lt;ul&gt;
    &lt;li&gt;&lt;b&gt;December - Early January&lt;/b&gt;: The smallest program that did anything useful related to a game. I&#039;d enter moves into a text file and run the script to produce the final game state as JSON. This JSON was rendered to HTML + Canvas by some Javascript code that was half ripped off from an old project. There was some minimal rules checking and automation, and support for only 5 out of the 14 factions in the game. Users of the current site might want to see the &lt;a href=&#039;https://www.snellman.net/tmp/tm/1/&#039;&gt;old look&lt;/a&gt;.
  &lt;/ul&gt;
  &lt;b&gt;2013&lt;/b&gt;
  &lt;ul&gt;
    &lt;li&gt;&lt;b&gt;January&lt;/b&gt;: A rudimentary dynamic web site, implemented simply as a wrapper CGI script around the JSON generator script. After that a clumsy web-based editor was added for game files (a textarea that could be used to edit specific files in a git repository, no authentication except for each game having a random 160 bit identifier as part of the URL). This allowed other people to moderate their own games, as long as I created a game for them and sent the link with the secret embedded. Players would post / email a natural language description of their move to the moderator, who would then enter the moves into the admin tool using the correct syntax. Amazingly some 20 games were run using this insane system, while by all rights the project should have died there.
      &lt;br&gt;&lt;br&gt;This version of the software had automation for resolving the effects of most game events, but did very little validation to notice completely invalid moves.
    &lt;li&gt;&lt;b&gt;February&lt;/b&gt;: Added an ability to easily rewind the game state back to any time in history, to help with post-game strategy analysis. Also added a way for players to enter their own moves (a textarea in the main game view, a preview button and a save button, and some verification to make sure they could only enter their own moves). Again there was no real authentication here, just links with an embedded faction token derived from the per-game secret key.
    &lt;li&gt;&lt;b&gt;March&lt;/b&gt;: The hackiest email integration in the world: Store the email addresses of the players in the same text file with the commands. After a player had entered a move, the software would create a mailto: link with prefilled subject, content and receivers (the other players). The player would click on the mailto: link, the email would load up in their mailer (even GMail), and they&#039;d press send.
      &lt;br&gt;&lt;br&gt;Compute and display a VP projection on the last round assuming no further moves, to give players some idea of who is really winning.
    &lt;li&gt;&lt;b&gt;April&lt;/b&gt;:
      I continued to resist adding any user management or authentication. But my friend Gareth wanted a better way to manage his ongoing games than a spreadsheet, and wrote a small App Engine site into which players entered their secret game URLs. His site then used my site&#039;s API to figure out which games the player needed to act in. And it went even a bit further, by embedding the move entry UI into the same app.
      &lt;br&gt;&lt;br&gt;After a few weeks of using Gareth&#039;s site, I had to admit that he was totally right about this being required functionality. So I finally added a DB to the project for storing user accounts and game metadata, and a &#039;your games&#039; list on the front page after login. It&#039;s also only at this point in the lifetime of the site where I added a UI for people to make new games. Until then every game was created by somebody asking for a new game via email.
      &lt;br&gt;&lt;br&gt;Finally, this month also saw the addition of a statistics page on how often each faction was winning (since balance was a hot topic on the BGG forums of Terra Mystica right from the start), and soon after a list of achieved high scores for each faction and player count.
    &lt;li&gt;&lt;b&gt;May&lt;/b&gt;: This month mostly introduced all kinds of stricter validation, as the reduced barrier to entry for playing was causing significantly more illegal moves to be entered (early on players were enthusiasts of the game and thus had good knowledge of the rules; at this point people started to learn the game through the site, which was quite scary).
      &lt;br&gt;&lt;br&gt;The main new feature of the month was the &#039;planner&#039;, an alternate text entry box which could be used to enter commands arbitrarily far into the future, and check that the moves are valid and what kind of effect they have. This is useful for example for checking that you have sufficient resources for making certain moves without manual computation. Another use is leaving &#039;notes to self&#039;, so that the player doesn&#039;t need to re-evaluate the board for every single move. (Some people were suddenly playing tens of games at a time, so this was a real problem).
    &lt;li&gt;&lt;b&gt;June-August&lt;/b&gt;: This time period saw only minor fixes and improvements from the user&#039;s point of view. There was a bit of infrastructure work behind the scenes, such as moving the actual game moves into the database, though they still remained just plaintext.
    &lt;li&gt;&lt;b&gt;October&lt;/b&gt;:
      The mini expansion for TM was released at the Spiel fair in Essen. I implemented the new features the very next morning in the lobby of my hotel at Essen, with a ChromeBook, an ssh connection to the production server, and the world&#039;s worst WiFi. After some reflection I decided not to make the change visible to the public before getting back home and a more reliable work environment :-)
    &lt;li&gt;&lt;b&gt;November&lt;/b&gt;:
      I finally made the site automatically send email notifications, rather than require players to jump through the fragile mailto: hoops to let other players know whose turn it is. Replacement of the mailto-style notification of moves also required the addition of an in-site chat feature for communication.
    &lt;li&gt;&lt;b&gt;December&lt;/b&gt;:
      Another consequence of the real email support from the previous month was that players no longer needed to expose their email addresses to other players. This finally made it possible to allow players to create &#039;public games&#039; that anyone can join, rather than only play with people with whom they&#039;ve done some kind of an out-of-band email address exchange. (At this point 1500+ games had been started; amazing how far such a kludgy system could go).
      &lt;br&gt;&lt;br&gt;At the time 25-30% of moves were being entered from
      smartphones or tablets. But the move entry interface was typing
      commands like &lt;code&gt;&#039;convert 2pw to 2c. upgrade d3 to tp&#039;&lt;/code&gt;
      into a text box. What&#039;s
      wrong with this picture? :-) This month we finally got a
      slightly friendlier UI, though the textual command representation
      still remained the canonical one.
      &lt;br&gt;&lt;br&gt;
      The site finally got a ranking system: a multi-iteration version
      of the Elo algorithm, which computed not only player strengths but
      also faction strengths, and credited good results with the weaker
      factions more than good results with the strong ones.
      &lt;br&gt;&lt;br&gt;
        Finally, in very late December I went on a big refactoring
        spree to move the game from CGI scripts to a more persistent
        application server (FCGI with Plack and CGI::PSGI, but no
        framework). Eradicating all global data and all
        modification of literal data structures was way too much work;
        those were not corners worth cutting in the first place.
      &lt;br&gt;&lt;br&gt;The new UI went live a year from starting the project
      (almost exactly; from December 22nd 2012 to December 21st 2013),
      and is the point where I&#039;d consider the site to be actually usable
      by mere mortals.
  &lt;/ul&gt;
&lt;b&gt;2014&lt;/b&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;b&gt;February&lt;/b&gt;: Support for variant maps, for testing parts of
    the upcoming Terra Mystica expansion for the designers. I also added
    a map editor that could import map definitions from
    &lt;a href=&#039;http://lodev.org/tmai/&#039;&gt;Lode&#039;s TM AI&lt;/a&gt;, which the
    design team had been using for the map. The online playtest team
    proceeded to play 100 games with different map versions
    before the expansion finally went to print.
  &lt;li&gt;&lt;b&gt;April&lt;/b&gt;: A bunch of work on the expansion, which was still
    being kept under wraps. So the support for the new final scoring types and
    four of the new factions was not visible to most users at this time.
    &lt;br&gt;&lt;br&gt;
    The main user-visible change was automatically dropping players from
    games after a week of inactivity, to support the inaugural season of
    the online Terra Mystica tournament. People&#039;s irritation about others
    playing slowly had been constant ever since the addition of public
    games (95% of my games are private with a few separate groups of
    friends, so I&#039;m pretty isolated from this myself). Unfortunately
    this change did not appear to help enough.
    &lt;br&gt;&lt;br&gt;This month also saw the addition of individual profile pages,
    showing all kinds of statistics for each player (games started,
    finished, performance with given factions, performance and play
    counts against specific opponents, etc).
  &lt;li&gt;&lt;b&gt;September&lt;/b&gt;: The next attempt at reducing the anguish caused
    by slow players was to allow setting shorter move timers than the
    default one week (from 12 hours to 14 days). Lots of people started
    12 hour deadline games, and moved on to complaining about so many
    people dropping out. Sometimes you just can&#039;t win.
  &lt;li&gt;&lt;b&gt;October&lt;/b&gt;: Public support for the two new expansion maps,
    as well as the new final scoring types.
  &lt;li&gt;&lt;b&gt;November&lt;/b&gt;: Public support for all six new factions from
    the expansion, as well as the variable turn order variant.
&lt;/ul&gt;

&lt;p&gt;
I find it interesting that it really did basically take a year of
real time (and maybe 2 months of hacking time) before the
implementation was in a shape where I would&#039;ve thought about
publishing it. And there&#039;s no way I&#039;d put that amount of time into a
project like this up front. Usually these projects are active for a
couple of weekends before getting abandoned; fun parts are done but
all the hard work of making it really usable remains.

&lt;p&gt;In this case people were eager to use even the incredibly crude
early versions, so I got over that hump very quickly. And at that point
every incremental improvement to the site was affecting tens, hundreds,
or thousands of people. This is of course always more motivating than
working on polishing the perfect piece of software that nobody is
using.

&lt;p&gt;There were many architectural and design decisions made along the
way that I ended up deeply regretting, and which cost me lots of time
later on. But without all those early shortcuts there would&#039;ve been no
implementation at all. Easily the best example of
&lt;a href=&#039;http://www.jwz.org/doc/worse-is-better.html&#039;&gt;Worse is Better&lt;/a&gt;
that I&#039;ve been personally involved with.
</description><author>jsnell@iki.fi</author><category>GAMES</category><category>PERL</category><pubDate>Thu, 27 Nov 2014 21:00:00 GMT</pubDate><guid permaurl='true'>https://www.snellman.net/blog/archive/2014-11-27-history-of-online-terra-mystica/</guid></item></channel></rss>