Wednesday, 27 January 2016

The k-factor, revisited



The k-factor, or stability parameter, determines the degree to which we change our ratings based on the difference between the result predicted by the current ratings and the actual outcome.  Earlier, I told you that my first guess for k was 40, based on the fact that a k of 100 would mean that a win in a game where the expected value of both sides was 0.5 would shift the winner's expected value for the next game to 2/3.  This seemed too big a shift, so I chose a smaller k.  But I also told you that I ultimately settled on an even smaller k (34).  Why did I make this choice?
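As a sketch, the update rule looks something like this (Python, with illustrative names rather than my actual code; the linear expected-value formula shown here is the one consistent with the 2/3 example above — see the earlier posts for the actual definition):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected result for side A on the 0-to-1 scale (1 = win, 0.5 = draw).

    Assumed linear form: 0.5 plus the rating difference over 600, clamped.
    """
    return min(1.0, max(0.0, 0.5 + (rating_a - rating_b) / 600.0))

def update(rating: float, expected: float, actual: float, k: float) -> float:
    """Shift the rating by k times the prediction error."""
    return rating + k * (actual - expected)

# The worked example from the text: two equal sides (expected value 0.5
# each) and k = 100. The winner gains 50 points, the loser drops 50, so
# the winner's expected value for a rematch becomes 0.5 + 100/600 = 2/3.
winner = update(1000.0, 0.5, 1.0, k=100.0)   # 1050.0
loser = update(1000.0, 0.5, 0.0, k=100.0)    # 950.0
```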

The idea is that if our ratings are good, our predictions should be accurate.  And we can use the historical data to measure exactly how good our ratings are, simply by summing the total difference between expected and actual results over all matches.  If every rating were perfect, this sum would be zero.

What if the ratings were as bad as possible?  If team A has a hugely larger rating than team B, its expected value will be close to 1, and its opponent's expected value close to zero.  But if team B were to unexpectedly win, the actual result for each team would be out by almost 1.  So if every one of the 2,200 matches played had seen the worst possible prediction (i.e. team A has an expected value of nearly 1, but loses), the sum of all the differences would be something close to 4,400.  Note that in a drawn game, the combined error of the two teams' predictions is at most 1 (0.5 per team, where the expected values were 1 and 0 respectively), so the more draws, the smaller the sum of the errors can be, regardless of how the ratings are determined.  In fact, roughly 1/3 of all matches have been drawn, so the worst possible score is more like ~3,500.

With my Elo system and a k of 34, the sum of all the errors is 1,416.  Is that any good?  Well, suppose the results of all matches were determined at random, but with the same overall frequency of wins to draws as has happened in reality.  Obviously in this situation the ratings have, by definition, no predictive value: a good rating indicates good recent results, but these have no bearing on the outcome of the next match.  I ran such a simulation, and the sum of the differences was 1,997.  So our actual ratings have better predictive value than if matches were determined at random, suggesting it is possible to get some indication of future results from past ones using my Elo formula.
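As a sketch, the per-match error measure and a simplified version of the random baseline look something like this (Python, with illustrative names; the draw rate and bounds are the rough figures quoted above, and my actual simulation re-ran the full rating system over the randomised results):

```python
import random

def match_error(expected_a: float, actual_a: float) -> float:
    # The two sides' expected values sum to 1, as do the actual results,
    # so their errors mirror each other: the per-match total is twice
    # side A's error.
    return abs(actual_a - expected_a) + abs((1.0 - actual_a) - (1.0 - expected_a))

# Worst case: a near-certain favourite (expected value ~1) loses, so each
# side is out by ~1 and the match contributes ~2; over 2,200 matches that
# bounds the sum at ~4,400.
assert match_error(1.0, 0.0) == 2.0

# A drawn match contributes at most 1 (0.5 per side), however bad the ratings.
assert match_error(1.0, 0.5) == 1.0

def random_baseline(expected_values, draw_rate=1/3, seed=0):
    """Error sum if results were random with the historical draw rate."""
    rng = random.Random(seed)
    total = 0.0
    for e in expected_values:
        r = rng.random()
        # Draw with probability draw_rate; otherwise win/loss equally likely.
        actual = 0.5 if r < draw_rate else (1.0 if r < draw_rate + (1 - draw_rate) / 2 else 0.0)
        total += match_error(e, actual)
    return total
```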
Note that there is an alternative, and perhaps more intuitive, way of getting a baseline value: applying random ratings to the actual results.  But it's harder to determine an underlying frequency distribution for generating random ratings than it is to determine one for generating random results.

How did I choose a k of 34? Simply by running the system with various different k-values and discovering that this k produces the lowest sum of errors over all the matches that have already been played. My first guess of 40 wasn't bad, but a slightly smaller k turns out to perform better.
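As a sketch, that search looks something like this (Python with synthetic data for illustration; my actual run replayed the full historical record, and the linear expected-value formula is the assumed one consistent with the earlier 2/3 example):

```python
import random

def expected_score(ra: float, rb: float) -> float:
    # Assumed linear expectation, clamped to [0, 1].
    return min(1.0, max(0.0, 0.5 + (ra - rb) / 600.0))

def total_error(matches, k: float) -> float:
    """Replay matches in order, updating ratings; return the error sum."""
    ratings = {}
    err = 0.0
    for team_a, team_b, actual_a in matches:
        ra, rb = ratings.get(team_a, 1000.0), ratings.get(team_b, 1000.0)
        e = expected_score(ra, rb)
        err += 2.0 * abs(actual_a - e)  # both sides' errors mirror each other
        ratings[team_a] = ra + k * (actual_a - e)
        ratings[team_b] = rb + k * ((1.0 - actual_a) - (1.0 - e))
    return err

def best_k(matches, candidates):
    """Return the k among the candidates with the lowest error sum."""
    return min(candidates, key=lambda k: total_error(matches, k))

# Illustration on synthetic matches (random results, roughly 1/3 draws).
rng = random.Random(1)
teams = ["A", "B", "C", "D"]
matches = []
for _ in range(500):
    a, b = rng.sample(teams, 2)
    r = rng.random()
    matches.append((a, b, 0.5 if r < 1/3 else (1.0 if r < 2/3 else 0.0)))

k_star = best_k(matches, range(10, 101))
```

On real data the error surface is very flat near the minimum (as the next section shows), so a coarse grid of candidate k-values is plenty.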

The next question is: how sensitive is the predictive power of the formula to the k-value?  Surprisingly, the answer appears to be not very sensitive at all. With a k of 34, the sum of the errors is, more precisely, 1415.869; with k = 35, it's 1415.86, and with k = 33, the sum is 1415.94.  At k = 40, the sum rises to 1416.42; at k = 30, it's 1416.31. Even with what are intuitively relatively extreme values of k, the sum is still not much larger: at k = 100, the sum is 1440; at k = 10, it's 1435.

In other words, the Elo system appears to have greater predictive value in the real world than in a random one, and this power is maintained over quite a wide range of k-values.  There's a lot of random variation in cricket from match to match, so there will be errors under any system (not least because our expected value lies on a continuous scale from 0 to 1, but only three actual results are possible: 1, 0.5, and 0, so in almost every match there will be some error, even if the most likely result actually happens). As far as I can tell, almost any sane k gives the system roughly equivalent predictive power. But there doesn't seem to be any reason not to take the best-performing k, even if its advantage is small.  So our k is set to 34.

And that’s the complete system described.  In the next post, we’ll look at the ratings of the current test sides.
