Sunday, 17 January 2016

The Genius of Elo

In the previous post, we asked two questions. First, how do we compensate for the difficulty of a team's schedule when we assess its results? It's not such an easy thing to do: if you beat a supposedly strong team easily, doesn't that just prove it wasn't actually so strong? And secondly, how do we weight recent results against older ones? When a team records a new result, to what extent should we change our previous assessment of it?

The Elo system provides us with a very clear model for thinking about both points, starting with the second one, and it is based on how we make assessments in life generally. If something happens which is very unlikely according to our preconceptions, we adjust those preconceptions more radically than if something happens which is close to what we expected. If a cricket team beats a side we thought was slightly weaker than it, we're not very surprised, and our assessment of the two teams doesn't change much. But if a supposedly weak team wallops a supposedly mighty one, we may consider this a freak result; yet we are also far more likely to question our prior assumptions about the two sides' relative strength.

And the second key idea of the Elo system is this: when we say that one team is stronger than another, what we mean is that we expect it to do well, should the two teams play each other. This gives us a basic formula for adjusting a team's rating after a match: the new rating of a team A, R(A)1, equals R(A)0, its previous rating, plus some function of the difference between the result achieved and the result expected; where the expected result is itself some function of the difference between R(A)0 and R(B)0, the previous rating of the team's opponents.
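The post deliberately leaves the exact function for the next instalment, but as a concrete illustration, here is the conventional Elo choice: the expected score is a logistic function of the rating difference, and the correction is proportional to (actual − expected). The scale constant 400 and the factor K = 32 are the traditional chess values, assumed here purely for illustration.

```python
def expected_score(r_a, r_b):
    """Expected score for team A (a number between 0 and 1),
    as a logistic function of the rating difference."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a, r_b, score_a, k=32):
    """Return the new ratings after a match.
    score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1 - score_a) - (1 - e_a))  # B gains what A loses
    return new_a, new_b
```

With these values, a win over an equal opponent (expected score 0.5) moves each rating by 16 points, while a win over a much stronger opponent moves it by nearly the full 32.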

We’ll look into what that function might look like in the next post, but for now I just want to explore the difference between a rating of this sort and an average. After every game, R will change. Even if a team were of absolutely fixed underlying ability, it would win some matches and lose some, and its rating would fluctuate. We can think of the rating less as a summary of performance (a measurable thing) over a fixed time interval than as a guess at the un-measurable quality of ability, derived from the performance. After each game, we make a new guess at the correct rating, informed by the extent to which the team’s actual performance has deviated from what we would have observed if the ratings were already perfect predictors of outcome.

Such a rating is thus very different from a statistic that tells us (as a fact) that a team has won 40% of its last 10 matches. But here’s an interesting point: how do we start operating such a system? Before the first match we consider, how do we know what to expect? An obvious starting point is to assume all teams are equal until we have evidence to the contrary. But provided we are happy to give the system a little time to equilibrate, it doesn’t actually matter. If our initially allocated ratings are poor, results will differ more from expectations, and so the originally assigned ratings will self-correct quickly.
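This self-correction can be seen in a small simulation: start two teams at the same rating, let the genuinely stronger one play repeated matches, and watch the ratings pull apart from the uninformative equal start. The 75% true win probability and the K = 32 used to generate results are assumptions of this sketch, not part of the Elo machinery itself.

```python
import random

def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

random.seed(1)
strong, weak = 1500.0, 1500.0   # no prior evidence: everyone starts equal
K = 32
for _ in range(200):
    # assume the "strong" team truly wins 75% of the time
    score = 1.0 if random.random() < 0.75 else 0.0
    e = expected_score(strong, weak)
    strong += K * (score - e)
    weak += K * ((1 - score) - (1 - e))

print(strong > weak)  # the ratings have separated despite the equal start
```

Note that the updates are zero-sum, so the total rating in the system stays fixed; only its distribution across the teams changes as evidence accumulates.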

And thus we also solve the two problems we had previously. A system which uses rolling and/or weighted averages is in danger either of over-privileging old results (so that, for example, a team in decline keeps a good rating even once it is no longer performing well) or of suffering from absurd instability (whereby a couple of good matches could lead to a side being top rated). But under the Elo system, a team’s rating moves more decisively in response to recent results when those results differ dramatically from expectations. Thus the more a team underperforms, the greater the correction to its rating (and the greater the effective weight of more recent results). We will still need to pick a general stability parameter; but as we will see, the system gives us a framework for at least trying to set it in a non-arbitrary way.
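The stability parameter is conventionally written K: it multiplies the (actual − expected) correction, so a large K gives volatile, fast-adapting ratings and a small K gives stable, slow-moving ones. A minimal sketch of the trade-off, with the K values chosen purely for illustration:

```python
def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def run(k, results):
    """Track one team's rating through a series of results
    against a fixed 1500-rated opponent."""
    r = 1500.0
    for score in results:
        r += k * (score - expected_score(r, 1500.0))
    return r

results = [1, 1]          # two wins in a row
print(run(8, results))    # small K: the rating barely moves
print(run(64, results))   # large K: the same two wins move it a lot
```

Neither extreme is right in itself; the choice of K is exactly the stability question the averaging systems faced, but now isolated in a single tunable parameter.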


And our first problem is also solved. A team with weaker opponents will have better results expected of it, and so will need to win more games to beat those expectations. Playing five matches against weak opponents is logically no more likely to improve your rating than playing five games against a strong team. The principle of the system is thus very clean. The mathematics will be explored next time.
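This schedule-independence can be made precise: if the ratings are accurate, so that the true win probability equals the expected score, then the expected rating change from a game is zero whatever the opponent's strength. A sketch under that assumption (the ratings and K = 32 below are illustrative):

```python
def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def expected_rating_change(r_team, r_opp, k=32):
    """Expected one-game rating change, assuming the ratings are accurate,
    i.e. the true win probability equals the Elo expected score."""
    e = expected_score(r_team, r_opp)
    # E[k * (score - e)] with P(win) = e and P(loss) = 1 - e
    return k * (e * (1 - e) + (1 - e) * (0 - e))

for opp in (1300, 1500, 1700):
    print(opp, expected_rating_change(1500, opp))  # zero in every case
```

Against a weak opponent the team gains little per win but wins often; against a strong one it gains a lot per win but wins rarely. The two effects cancel exactly.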
