Introduction

A couple of years ago I wrote a quick and dirty rating system for an online boardgame site I run. It wasn't particularly well thought out, but it did the job. Some discussion about the system made me revisit it, with two years of hindsight and orders of magnitude more data.

How well does the system actually work, and how predictive are the ratings? There are some obvious tweaks to the system — would implementing them make things better or worse? Would anything be gained from switching to a more principled (but more complicated) approach? For this last bit, I used Microsoft's TrueSkill as the benchmark. It has some desirable properties and appears to be the gold standard of team-based rating systems right now.

The code and the data are available on GitHub in my rating-eval repository.

Table of contents

  • The game - What properties of this game are relevant to a rating system?
  • The rating system - How does the current rating system on the site work?
  • Evaluating rating quality - What are valid metrics for evaluating the predictive quality of a rating system?
  • TrueSkill - What's TrueSkill, and how was this game translated to TrueSkill concepts?
  • Results - The evaluation results in tedious detail.
  • Conclusions - The tl;dr on the results.
  • Footnotes

The game

The game in question is Terra Mystica. I won't go into the exact mechanics of the game in this article, since that just doesn't matter. What does matter is the answer: What makes the game such a unique snowflake that I can't just plop in a standard Elo or Glicko implementation and call it a day?

First, TM is a multiplayer game and indeed most commonly played as a 3-5 player game rather than a 2p one. This means that a pure two player rating system would not work. But there are well known ways of coercing a two player system to a multiplayer one, so that's not a major concern.

Second, TM is very asymmetric especially by the standards of the eurogame genre. The base game of Terra Mystica comes with 14 different factions, freely chosen by the players at the start of the game. The only restriction is that the factions come in 7 colors; picking a faction of a given color blocks the other faction of that color from the game. The first expansion added 6 more factions to the game. The factions can be very different from each other, with each of them having different special powers, building costs, resource production, and so on. Every faction also has a preference for different parts of the map, and will be on a different point on the symbiotic-competitive spectrum with different opposing factions.

Why would asymmetry matter for a rating system? Well, one reason is that asymmetry is a potential source of imbalance. Especially early on in the game's lifecycle it was not clear whether the different factions were balanced or not, and if not how unbalanced they were.

The arguments about this were complicated by there being a chicken and egg problem. Some people thought that statistics on the win rates or average scores for each faction were invalid. Clearly, the argument went, some factions were just doing well because they were more popular among good players (or the converse for factions doing badly due to being popular among bad players). And then there were people arguing the opposite! They'd say that you could not actually tell anything about how good a player was, because a high win rate or high average score for a player might just be due to them playing good factions.

The way to open up this deadlock is a rating system that can simultaneously determine the skill of players and the relative strength of factions, with a feedback system between the two.

Third, the first full expansion to the game didn't just introduce the new factions mentioned above. It also introduced two completely new maps. Now, Terra Mystica as a game is incredibly sensitive to the map design. I was part of playtesting one of the new maps, and it was kind of an infuriating process. The developer would make a tiny tweak, just swapping a couple of hexes around, and there would be a butterfly effect. There would be an attempt to make Darklings a little worse with a change, and they'd end up not actually affected while Halflings became borderline unplayable.

From a rating system view the question is then whether these three maps should be considered as separate games or the same one. (The game also has some random setup aspects; these setups can't be considered as separate games from a rating perspective since there's something like 2 million different configurations).

The rating system

My primary goal for the rating system was for it to be useful for predicting game outcomes [1]. And as mentioned in the previous section, it would need to compute both player and faction ratings. But in addition to these things, there were a couple of secondary goals.

A lot of board game sites use an unmodified Elo system with the standard constants, with the ratings updated after every match finishes. This can make the ratings absurdly volatile. It also encourages people not to play against opponents with much lower ratings. Both problems are driven by the way the outcome of a multiplayer game is far less under any single player's control than that of a two player game. You can lose a game essentially through no fault of your own, which won't happen in a two player game. And especially if you finish dead last in a five player game, that single game can erase what feels like 20 games of progress toward a higher rating. It's no wonder that the players become very conservative and risk-averse.

Some players will even take advantage of this volatility and manipulate their ranking by changing the order in which their games finish. Do you have 10 games about to finish? Just stall in the 5 that you're doing well in, and rush the ones you're losing in. Your final rating will be significantly higher than if you alternated the good and bad games.

All of the above is bad. So a secondary goal was that the system shouldn't be quite that sensitive to the most recent games. The exception is players with very few games played; for those players you do want very high sensitivity, so that they get roughly to their proper rating as soon as possible.

The other secondary goal was to encourage players to pick the factions perceived as worse more often. So a player should be rewarded more / penalized less for winning with a bad faction than for winning with a good one.

So, this is what I ended up with.

We start with the usual hack for converting a two player rating system to work with multiplayer games: treat each N player match as N*(N-1)/2 two player submatches (one per pair of players), and apply the rating algorithm to those submatches.
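
As an illustration, here's what that decomposition looks like in Python (a sketch, not the actual site code):

from itertools import combinations

def pairwise_submatches(players):
    # combinations() yields each unordered pair of players exactly once,
    # so an N player match produces N*(N-1)/2 two player submatches.
    return list(combinations(players, 2))

print(pairwise_submatches(["A", "B", "C", "D"]))
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]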

The core of the system is the normal Elo equation, which computes an expected score (somewhere between 0 for a loss and 1 for a win) for both players based on the difference of their ratings:

  my $ep1 = 1 / (1 + 10**(-$diff / 400));
  my $ep2 = 1 / (1 + 10**($diff / 400));

Sorry, I lied! We don't actually use the difference of the ratings of the players. Instead we add up the rating of the player and the rating of the faction they are playing first, and only take the difference after that.

  my $p1_score = $p1->{score} + $f1->{score} * $fw;
  my $p2_score = $p2->{score} + $f2->{score} * $fw;
  my $diff = $p1_score - $p2_score;

What's the $fw variable there? It's a configurable faction weight, in case we want to run experiments with the faction choice being considered more or less important. There's one other thing $fw is used for. If a player drops out, we still want to compute a rating change for them (otherwise they'd just drop out of games they're about to lose to avoid the ranking penalty). But we really should not be penalizing the faction the player was using when they dropped out. It's not like the faction had any way of influencing that! So in the specific case of a dropout, $fw gets set to 0 for that submatch.

  if ($res->{a}{dropped} or $res->{b}{dropped}) {
      if ($settings->{ignore_dropped}) {
          next;
      } else {
          $fw = 0;
      }
  }

We then need the actual result of the game, in the same format as our expected results. So again it's 1 for win, 0 for loss, and now also 0.5 for a draw.

  if ($a_vp == $b_vp) {
      ($ap1, $ap2) = (0.5, 0.5);
  } elsif ($a_vp > $b_vp) {
      ($ap1, $ap2) = (1, 0);
  } else {
      ($ap1, $ap2) = (0, 1);
  }

The difference between the actual and expected results is then used to compute a rating change for both players. Let's return later to what the value of $pot actually is.

  my $p1_delta = $pot * ($ap1 - $ep1);
  my $p2_delta = $pot * ($ap2 - $ep2);

As the last step of dealing with each submatch we apply the rating changes. There are some subtleties here. First, we don't necessarily do all the updates immediately, but batch up the rating records and apply a batch in one go. (For example, apply all the changes from a single game in one go rather than having the results of the first submatch affect the interpretation of the other submatches; see the results section for more on that.)

Second, until a player has played a minimum number of games (default 5) they won't have an effect on other players. To be precise, we update the rating of a player either if both players are new, or if the opponent is not new (but not when the player is old and the opponent is new). This has the effect that new players will have a "shadow rating" computed for them, but will not affect the ratings of their opponents or of the factions.

The idea here is that we have very little idea of the strength of the new player. So it makes no sense to just assume they're an exactly average new player, and use that assumption to adjust the ratings of players for whom we already have a lot of better quality data.

  my $count = ($p1->{games} >= $settings->{min_games}) +
    ($p2->{games} >= $settings->{min_games});

  if ($p2->{games} >= $settings->{min_games} or !$count) {
      push @rating_changes, [$p1, $p1_delta];
  }
  if ($p1->{games} >= $settings->{min_games} or !$count) {
      push @rating_changes, [$p2, $p2_delta];
  }

Faction ratings only get updated if both players have played enough games. The faction weight factor is taken into account here as well.

  next if $count != 2;

  push @rating_changes, [$f1, $pot * $p1_delta * $fw];
  push @rating_changes, [$f2, $pot * $p2_delta * $fw];

The most dodgy part of what I implemented is that (unlike traditional Elo) this system doesn't run as a streaming process where the ratings are computed purely from the old ratings and some number of finished games. Instead the ratings are computed iteratively, taking into account the full history every time the computation is done. There are obvious practical reasons why you wouldn't want to do things that way, but I'm not expecting to ever have enough games for that to matter.

Every iteration works on exactly the same data. The same game results handled in the same order, using the same settings. However, the ratings are not reset between iterations. At the start of the first iteration everyone has a rating of 1000. At the start of the second iteration they'll have some other value. These different starting values will affect all the subsequent computations. It's kind of back-propagating the later results, such that they affect the algorithm's interpretation of earlier events.

  for (1..$settings->{iters}) {
    iterate_results @matches, %players, %factions, $_, $settings;
  }

There's another difference between the iterations, which is the iteration count being passed in as the 4th parameter. This is where the value for $pot comes from. With the default settings the progression goes 16, 4, 1.77. The later iterations have a smaller effect.

  my $pot = $settings->{pot_size} / $iter ** $settings->{iter_decay_exponent};

This also means that the later games have a smaller effect than you might initially think. Sure, every result can contribute up to 22 rating points to a player's rating. But in practice a chunk of the 1st iteration's 16 point rating change would have been undone by the time the 2nd iteration finishes and it's time to divvy up 4 points for that game. The reason is that the player will have a higher starting rating for the 2nd iteration, so every win will count for less and every loss will count for more.

For new players (who only have a few games played) this effect of the later iterations undoing part of the earlier results will be almost non-existent. There aren't very many of those old games around, after all. This is exactly what we want: for players with few games, every new game tells us a lot relative to what we knew at the start. Later games will however always matter more than earlier ones, so it's also not the case that the rating of someone who has played hundreds of games can't shift at all.

What's not considered by this system? It only uses the final ranks of the game as input. It doesn't consider the in-game victory points, either as an absolute value or relative to the scores of the opponents. Does this make sense? Surely a win by 10 points should be worth more than a win by 1 point. The latter is basically a draw!

But this seems like a necessary restriction. People love seeing numbers go up, even if the numbers are in reality totally irrelevant. They love it so much that they'll try to optimize for the number going up. If you introduce something other than winning as a component in the rating system, some players will start optimizing for that other thing instead. That's what they're rewarded for, so it must be the right thing! And at that point the system as a whole becomes less useful at predicting winning, and better at predicting that other thing.

So even if using more detailed in-game statistics as part of the rating computation would almost certainly produce more accurate predictions, it'd only work if the players were unaware of it.

Evaluating rating quality

The basic process I used for evaluating the rating quality was to first split all my data into two parts. The first 75% or so of the data was essentially a training set, used to compute ratings for all players and factions in the game. The output of the rating system should be such that the ratings of two entities A and B can be converted to a win probability (0 if A is basically guaranteed to lose, 1 if they're basically guaranteed to win, but mostly clustered in the 0.25-0.75 range). We then compare these predictions against the actual outcomes in the remaining 25%, the evaluation set, with a loss counting as 0, a win as 1, and a tie (rare but possible) as 0.5.
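
Roughly, the process looks like the Python sketch below. The game record fields and the compute_ratings / predict_win helpers are hypothetical stand-ins rather than the actual scripts:

from itertools import combinations

def evaluation_pairs(games, cutoff):
    # Train on games finished before the cutoff, evaluate on the rest.
    training = [g for g in games if g["finished"] < cutoff]
    holdout = [g for g in games if g["finished"] >= cutoff]

    ratings = compute_ratings(training)              # hypothetical helper
    pairs = []
    for game in holdout:
        for a, b in combinations(game["players"], 2):
            predicted = predict_win(ratings, a, b)   # hypothetical, in [0, 1]
            actual = 0.5 if a["vp"] == b["vp"] else float(a["vp"] > b["vp"])
            pairs.append((predicted, actual))
    return pairs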

How should this comparison work? There's a few kinds of metrics we could compute. Some metrics are basically self-contained, completely independent of other prediction sets. "This prediction set scored 1500". Other metrics are derived from comparing two sets of predictions directly against each other. "A was better than B, 100 to 80", "A was better than C, 150 to 60", but this tells you nothing about how B and C would compare.

The rest of this section basically goes through my process of trying to find metrics that worked. It'll therefore stumble in and out of a couple of dead ends.

The simplest possible metric is just to sum up the absolute errors. If the prediction for a match is 0.8 and the actual result is 1, penalize the prediction set by 0.2 points. The closer to zero the score, the better the prediction set. This doesn't actually work. The problem is that it's suboptimal for a system to use its actual prediction; it should always round the prediction to the closest of 0 or 1.

Example: A and B are playing 5 matches. The rating system predicts an 80% win rate for A. And in fact A gets exactly that win rate. The absolute error would be: 4*(1 - 0.8) + 1*(0.8) = 1.6. What if we predict a 100% win rate instead? That'd mean: 4*(1 - 1) + 1*(1) = 1. The less accurate prediction was judged to be significantly better. That's just no good at all.

What you instead need is the sum of squares of errors (SSE). For the same example, the 80% prediction then produces 4*((1 - 0.8)**2) + 1*(0.8**2) = 0.8 while the 100% prediction gives 4*((1 - 1)**2) + 1*(1.0**2) = 1.
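
To make the worked example concrete, here's a quick Python check (the 0.8 prediction and the 4 wins out of 5 are taken straight from the example above):

def abs_error(prediction, results):
    return sum(abs(r - prediction) for r in results)

def squared_error(prediction, results):
    return sum((r - prediction) ** 2 for r in results)

results = [1, 1, 1, 1, 0]           # A wins 4 matches out of 5

print(abs_error(0.8, results))      # 1.6: the honest prediction looks worse
print(abs_error(1.0, results))      # 1.0
print(squared_error(0.8, results))  # 0.8 (up to float rounding): honest prediction wins
print(squared_error(1.0, results))  # 1.0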

What about a direct comparison between two rating systems? Could we just count how many times each system was closer to the actual result?

Example: if one system predicts 70% win rate for A over B while the other predicts a 90% win rate, award the first system a point every time A loses and the second a point for every win. The system with the higher score is better. This fails for a very similar reason as the first method, as accurate predictions are penalized. In this example, if A wins 4 out of 5 matches the first system would get one point while the second one gets 4 points. But both predictions were actually equally far from the actual win rate of 80%.

How can this be fixed? Clearly the reward for the better prediction can't be fixed, but must somehow depend on the odds that the predictions are implying. That is, the predictions are essentially used to place bets. Actually doing a betting system is a bit tricky though, since betting is fundamentally a random process, but we'd like our metrics to be computed in a deterministic manner. After some doodling, I came up with the following method that's deterministic but still retains the core flavor of betting.

Let $e1 and $e2 be the win probability each system gives to player A, and $res be the actual result (0, 0.5, 1). If the predictions are the same, there's no bet to be made. But otherwise both players should be happy to make a bet using the implied odds of the midpoint of those predictions:

  my $em = ($e1 + $e2) / 2;

The set that gave a higher prediction will then be adjusted by $res - $em points, while the other will be adjusted by $em - $res points. This has the effect that when the outcome is the one that both systems expected, the winner of the bet will be awarded a small number of points. While if the outcome is a surprise to both systems, the winner of the bet will get a larger amount. It will also be symmetrical: the outcome will be the same if you swap the players around (i.e. invert the predictions and the result).

Example: A and B are playing 5 matches, with A winning 4 games. System X predicts a win rate of 0.7 for A, system Y predicts 0.8, with an average of 0.75. Since Y's prediction was higher, it will receive 1 - 0.75 = 0.25 points for every game that A wins, and 0 - 0.75 = -0.75 points for the one A loses. The end result is that Y gains 4*0.25 - 0.75 = 0.25 points from these 5 matches (and X loses the same amount, since this is a zero sum process). This makes sense, since Y's prediction was indeed more accurate.
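
Here's a minimal Python sketch of that scoring rule, using the numbers from the example; the function name and structure are mine, not the actual evaluation script:

def betting_score(e1, e2, result):
    # e1, e2: each system's win probability for player A.
    # result: the actual outcome for A (0, 0.5 or 1).
    # Returns the score adjustments for (system 1, system 2).
    if e1 == e2:
        return (0.0, 0.0)       # identical predictions, no bet
    em = (e1 + e2) / 2          # implied odds: midpoint of the two predictions
    delta = result - em         # payout to whoever predicted the higher value
    return (delta, -delta) if e1 > e2 else (-delta, delta)

games = [1, 1, 1, 1, 0]         # A wins 4 of the 5 matches
y_total = sum(betting_score(0.7, 0.8, r)[1] for r in games)
print(y_total)                  # 0.25, as in the example above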

Finally, one more way to compare two rating systems is by only looking at matchups where the two systems produced split predictions; that is, one system gives a win probability of under 0.5 while the other gives a probability of over 0.5. In this metric we simply count which of the two picked the correct winner more often.

This system doesn't suffer from any tactical misprediction issues. But it means throwing away most of the data. It also makes the value judgement that the most important (only important?) part of the prediction space are the matches between players of almost equal skill. Whether that's the right call or not seems to depend on the ultimate purpose of the rating system. The needs of a matchmaking system are different from the needs of a system that tries to predict the results of games between arbitrary players.
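
For completeness, the split prediction count in the same illustrative style, over (e1, e2, result) triples:

def split_prediction_score(triples):
    # Each triple is (e1, e2, result): the two systems' win probabilities for
    # player A and the actual result (0, 0.5 or 1). Matchups where the systems
    # agree on the likely winner are ignored, as are draws.
    score1 = score2 = 0
    for e1, e2, result in triples:
        if (e1 - 0.5) * (e2 - 0.5) >= 0 or result == 0.5:
            continue
        if (e1 > 0.5) == (result == 1):
            score1 += 1
        else:
            score2 += 1
    return score1, score2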

TrueSkill

TrueSkill is a rating system developed by Microsoft for Xbox Live online games. It's got three interesting properties. First, it deals natively with multiplayer matches. Second, it supports teams with multiple members. And third, it doesn't track skill as a single number but as a combination of two numbers: an estimate of the skill, and an uncertainty of that estimate.

The system is also very complicated compared to something like Elo. The best explanation of how it works is an epic blog post by Jeff Moser, who also made the first open source implementation in C#. I wasn't man enough to re-implement the algorithm from scratch, so I used the Python TrueSkill implementation by Heungsub Lee.

For the purposes of this investigation, we'll represent each player + faction combination as a 2 player team, each game as a match between 3-5 such teams, and then just trust TrueSkill to do the right thing.
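
In terms of the trueskill Python package, a single game update might look roughly like this; the player and faction names are invented for the example, and the bookkeeping of looking up and storing ratings between games is left out:

import trueskill

# One Rating (skill estimate + uncertainty) per player and per faction.
alice, bob, carol = trueskill.Rating(), trueskill.Rating(), trueskill.Rating()
witches, nomads, engineers = trueskill.Rating(), trueskill.Rating(), trueskill.Rating()

# Each seat in the game becomes a two member team of (player, faction).
teams = [(alice, witches), (bob, nomads), (carol, engineers)]
ranks = [0, 2, 1]    # final placements; lower is better in the trueskill library

# rate() returns updated ratings in the same nested structure as the input.
new_teams = trueskill.rate(teams, ranks=ranks)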

The other thing we need is a way to derive a win probability from two TrueSkill ratings (each of which, again, is a combination of a skill estimate and an uncertainty). This isn't completely trivial, since it needs to take into account the possible skill distribution of all players (expressed as normal distributions) as well as the distribution for how different skill ranges affect the win probability.

Somewhat surprisingly, this doesn't appear to be addressed at all in the literature. I ended up writing the following based on a suggestion by Moser in one of the blog comments (linked above), but have to admit I didn't think too much about whether it's really correct.

import math
# BETA and cdf are assumed to come from the trueskill package here
# (a module-level constant and its normal CDF helper, respectively).
from trueskill import BETA
from trueskill.backends import cdf

def win_probability(a, b):
    # a and b are the two teams, given as lists of trueskill Rating objects
    deltaMu = sum([x.mu for x in a]) - sum([x.mu for x in b])
    sumSigma = sum([x.sigma ** 2 for x in a]) + sum([x.sigma ** 2 for x in b])
    playerCount = len(a) + len(b)
    denominator = math.sqrt(playerCount * (BETA * BETA) + sumSigma)
    return cdf(deltaMu / denominator)

Results

This section will go through individual changes to the rating system starting from nothing, ending up with the rating system in its current form. We'll then look at potential extra features, before finishing with the comparison to TrueSkill. If you skipped over the previous section on evaluating rating quality, below is a quick summary of the three metrics that I'll use:

  • SSE: Sum of squared errors between the predicted win probabilities and the actual results. Lower is better.
  • Betting: The predictions of the two systems are used to determine the odds for a bet on each matchup. The system that comes out ahead over the whole evaluation set made the better predictions (the metric is zero sum).
  • Split predictions: Look only at cases where the two systems disagree on who is going to win a given pairwise matchup. Give a point to the system that predicts the winner correctly.

The process was to split the game data into two parts. A training data set with about 140k pairwise matchups was used to compute ratings for all players / factions. These ratings were then used to predict the results on the evaluation set of about 55k matchups. The split between training and evaluation sets was done using a cutoff date. The training set contained the games that finished before 2015-06-01, the evaluation set had the games that finished later.

One final point on the test setup is that all the win predictions were done on pairwise submatches, rather than the match as a whole. That applies even in cases where the system used for computing the ratings from the training data was match-based rather than pairwise submatch-based.

A. The dummy rating system

Let's start with the stupidest possible system, which totally ignores the training set and just predicts a 50% win rate in every pairwise match. On the 29996 matchups in the evaluation set, we get an SSE of 7386.25. That should be our absolute baseline.

B. Normal Elo

Moving on, we'll use a normal Elo system with a K-factor of 24. The SSE drops to 6689, and on our "betting" metric B beats the dummy system A by 1658 to -1658. (There's no result on the split prediction test, since the dummy rated everything as 0.5. That metric needs one algorithm to give a result of over 0.5 and the other a result of under 0.5).

C. Normal Elo, minimum of 5 games

Then we introduce the restriction of users not having an effect on other players before they have played at least 5 games (as described in the rating system section). Compared to B, the results are inconsistent. There's a tiny improvement in SSE which drops to 6687. There's also a noticeable improvement in the betting metric, which C wins over B 152 to -152. We can now also get results on split predictions, where B is better by 390 to 370.

This feature seems unlikely to be worthwhile, but it's not harmful either and it's what's used in the current production implementation. So we'll carry on using it in the following test cases too.

D. Iterated Elo

This test uses the iteration model, with the default parameters as explained in the algorithm description: three iterations with K-factors of 16/4/1.77. SSE drops to 6612. More significantly, D wins the betting over C by 830 to -830, and the split predictions by 1355 to 1133.

E. Faction ratings

The next step is to compute faction ratings in parallel to the player ratings, to feed the faction ratings back into the player rating computations, and to take into account both the player and the faction ratings when predicting win probabilities. This is also the system I'm currently using, so it's kind of my "par" value. Any further changes would hopefully be improvements.

When faction ratings are mixed in, SSE drops by a lot to 6471. E wins the betting metric over D by 710 to -710, and split predictions by 2488 to 2110.

F. Per-map faction ratings

The faction ratings being global, rather than computed separately for each map, is a common criticism of the rating system from the player base. In this test each faction and map combination is treated as a distinct entity, with completely separate ratings. Doing this does indeed improve the results a bit. SSE is reduced to 6437. F also wins the betting over E by 383 to -383, and the split predictions by 1147 to 1033.

Making this change would however be tricky from a UI perspective. The rating UI wouldn't scale nicely to 60 "factions", but you can't really aggregate the results either. Some thought required on whether this change would be worth it.

G. Ignoring dropouts

The test driver skips any pairwise matchups where one or both players dropped out. What would happen if we did the same when computing ratings? That is, if either party of a pairwise matchup drops out, don't update the rating for either. This does indeed improve things a bit, which makes some sense. SSE drops to 6419. G also wins the betting over F by 113 to -113, and the split predictions by 840 to 814.

This isn't an acceptable change though. Introducing it would immediately make players start dropping out of games they expect to lose. But it is interesting to see what the effect might be in a perfect world, where the rating system would not affect player behavior. [2]

H. Different faction weights

Using faction weights lower than 1 produces almost imperceptibly better results, and not even across the board (it generally makes two of the three metrics better and one worse). This change also seems hard to justify from first principles, so it feels too much like curve fitting. So this is another idea not worth implementing.

I. Batched rating updates

One thing that was kind of dodgy in my original implementation is that the rating changes from a single multiplayer game were not applied as a single atomic unit. Instead we'd first compute the rating changes from one pairwise submatch, apply those changes, and then compute the rating changes for the next pairwise submatch based on the already updated ratings. What if the algorithm instead first computed all the changes from that one match, and then applied all of them in one go?
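
The difference between the two approaches, as an illustrative Python sketch (using a plain Elo update in place of the full system, so the numbers are only for illustration; factions and the new-player rules are left out):

def delta(ra, rb, result, k=16):
    # Plain Elo update for illustration; the real system layers more on top.
    expected = 1 / (1 + 10 ** ((rb - ra) / 400))
    return k * (result - expected)

ratings = {"A": 1000, "B": 1000, "C": 1000}
submatches = [("A", "B", 1), ("A", "C", 1), ("B", "C", 0)]  # A beat B and C, C beat B

# Streaming (the original behaviour): each submatch sees the updates
# already made by the earlier submatches of the same game.
streaming = dict(ratings)
for a, b, result in submatches:
    d = delta(streaming[a], streaming[b], result)
    streaming[a] += d
    streaming[b] -= d

# Batched (variant I): compute every change against the pre-game ratings,
# then apply them all in one go.
changes = {}
for a, b, result in submatches:
    d = delta(ratings[a], ratings[b], result)
    changes[a] = changes.get(a, 0) + d
    changes[b] = changes.get(b, 0) - d
batched = {p: r + changes.get(p, 0) for p, r in ratings.items()}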

This change seems like it should make sense, and also like it shouldn't have all that big an impact. Neither expectation turns out to be true. The SSE is significantly higher at 6523 (vs. 6437 for F), I loses the betting by -934 to 934, and the split predictions by 660 to 783.

I don't know why this change makes the system perform worse. The most likely reason is that batching the updates makes single events have larger effects on the ratings. That's not entirely satisfactory, since if that were the case just a slightly smaller K-factor should have a similar effect. But that's not what happens in practice.

J. TrueSkill without factions

I'll do all the TrueSkill comparisons against the "Per-map faction ratings" version of the system (F). The first variant to test is TrueSkill that ignores factions completely, both when computing the ratings and when making predictions on win probabilities. This (heavily crippled!) version of TrueSkill has an SSE of 6658, which is somewhere between B and D. J loses the betting against F by -732 to 732, and the split predictions by 2981 to 3500.

K. TrueSkill with factions

Next up we use TrueSkill as intended. Each faction is now treated as a TrueSkill player, and each player / faction combination is considered to be a two player TrueSkill team. This gives a much more respectable SSE of 6474 (still a bit off from the 6437 of F), loses the betting by only -181 to 181, and loses the split predictions by 2128 to 2203.

L. TrueSkill with per-map factions

Treating each faction and map combination separately (i.e. changing from E to F) gave a good boost to the accuracy of the iterative algorithm. Would the same approach work for TrueSkill? The SSE does indeed drop to 6441. L still loses to F in the other two metrics: betting by -148 to 148, and the split predictions by a very tight margin of 1921 to 1935.

Again, just as in H, tiny and somewhat inconsistent improvements could be achieved by tweaking the faction weights a bit. Doing so would not change the big picture, and the same arguments against doing so are still valid.

Conclusions

When I did the experiments, I was kind of expecting the conclusion to be a classic tradeoff between complexity and great results vs. simple and "good enough". That turns out to not be the case; the two rating systems were very closely matched, and in fact the quick hack that's currently used seemed to measure marginally better than the matching TrueSkill version.

The current production version isn't quite at a local optimum though; there's one improvement that would be worth adding to it (separate ratings for each map + faction combination). But there doesn't seem to be any great urge to switch to a completely different system. That's a happy ending, since I'd much rather have 150 lines of my own code than 1500 lines written by someone else that I don't entirely understand.

There are a couple of caveats. It's possible that the formula for going from sets of TrueSkill ratings to win probabilities is suboptimal, and skewing the results. The data set isn't huge and I can't compute a confidence interval of any kind, so it's possible that the results aren't actually significant (but on the other hand I'm happy even with a statistical tie).

Also, while here we just looked at how effective each rating system was at predicting outcomes, there are other criteria that matter in practice. For example, one thing my users are pretty vocally unhappy about is that they perceive the faction ratings to be too unstable. It seems likely that TrueSkill would do a better job there, since the faction ratings it produces have a low uncertainty (the factions have thousands or tens of thousands of games played), so new results would only cause tiny changes to them.

If anyone would like to test how their pet rating system works for this use case, the test data and the evaluation scripts are available.

Footnotes

[1] In the first draft, I wrote that this was "obviously" the goal. But on reflection, it's probably not obvious at all. After all, there are so many rating systems around that seem to have about zero predictive value. Maybe there's a bias towards people who play a lot, or there's no concept of opponent skill. Clearly these systems were designed with some goal in mind, but predictive power certainly wasn't it. Sirlin's delightfully cynical story on Starcraft 2's rating system might be relevant here.

[2] Now that I'm writing this article, it occurs to me that there might be a decent compromise solution: apply a rating penalty to the player who drops out, but don't give any reward to the winner. You'd lose the zero sum property, but that doesn't seem critical. The same is true of Elo variants that use variable K factors depending on number of games played, and they seem to work just fine.