The k-factor, or stability parameter, determines the degree
to which we change our ratings based on the difference between the result
predicted by the current ratings and the actual outcome. Earlier, I told you that my first guess of k
was 40, based on the fact that a k of 100 would mean that a win in a game where
the expected value of both sides was 0.5 would shift the expected value for the
winner in the next game to 2/3. This
seemed too big a shift, so I chose a smaller k.
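The update rule behind this reasoning can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard chess-style logistic expectation curve with a 400-point scale and a nominal starting rating of 1000; the exact curve used for these cricket ratings may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected result for team A against team B (standard Elo curve)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating: float, expected: float, actual: float, k: float) -> float:
    """Move the rating by k times the prediction error."""
    return rating + k * (actual - expected)

# Two evenly matched teams: expected value 0.5 each. A wins (actual = 1),
# so with k = 40 the winner gains 20 points and the loser drops 20.
e = expected_score(1000, 1000)        # 0.5
winner = update(1000, e, 1.0, 40)     # 1020.0
loser = update(1000, 1 - e, 0.0, 40)  # 980.0
```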
But I also told you that I ultimately settled on an even smaller k, 34. Why did I make this choice?
The idea is that if our rankings are good, our predictions
should be accurate. And we can use the
historical data to measure exactly how good our rankings are, simply by summing
the total difference between expected and actual results over all matches. If every rating were perfect, this sum would
be zero. What if the ratings were as bad as possible? If team A has a much higher rating than
team B, its expected value will be close to 1, and team B's expected
value close to zero. But if team B were to win unexpectedly, the actual result for each team would be
out by almost 1. So if every match out of 2,200 played had seen the worst
possible prediction (i.e. team A has an expected value of nearly 1, but loses),
the sum of all differences would be something close to 4,400. Note that in a
drawn game, the largest possible combined error across the two teams is 1 (0.5 per team, where
the expected values were 1 and 0 respectively), so the more draws there are, the smaller the sum of the errors can be,
regardless of how the ratings are determined.
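The error measure described above — summing, over every match, how far each side's actual result fell from its expected value — can be sketched as a replay of the match history. This is illustrative only, not the author's actual code: it assumes the standard 400-point logistic curve and a common starting rating of 1000.

```python
def total_error(matches, k=34, base=1000.0):
    """Sum of |expected - actual| over both sides of every match.

    `matches` is a list of (team_a, team_b, result_a) tuples, where
    result_a is 1 for an A win, 0.5 for a draw, and 0 for an A loss.
    Ratings are updated as each match is replayed, so every prediction
    uses only earlier results.
    """
    ratings = {}
    error = 0.0
    for a, b, result_a in matches:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        ea = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))  # A's expected value
        eb = 1.0 - ea                               # B's expected value
        error += abs(result_a - ea) + abs((1 - result_a) - eb)
        ratings[a] = ra + k * (result_a - ea)
        ratings[b] = rb + k * ((1 - result_a) - eb)
    return error
```

Note that a single match between two previously unseen teams contributes exactly 1.0 to the sum when one side wins (each team's expectation of 0.5 is off by 0.5), consistent with the worst-case arithmetic above.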
In fact, roughly 1/3 of all matches have been drawn; so the worst
possible score might be ~3,500. With my Elo system and a k of 34, the sum of all the errors
is 1,416. Is that any good? Well, suppose the results of all matches
were determined at random, but with the same overall frequency of wins to draws
as has occurred in reality. Obviously, in
this situation the ratings have, by definition, no predictive value: a good
rating indicates good recent results, but these have no bearing on the outcome
of the next match. I ran such a simulation, and the sum of the differences was
1,997. So our actual ratings have better
predictive value than if matches were determined at random, suggesting it is
possible to get some indication of future results from past ones using my Elo
formula. Note that there is an alternative,
and perhaps more intuitive, way of getting a baseline: apply random ratings to the actual results.
But it is harder to determine an underlying
frequency distribution for creating random ratings than it is to determine
one for creating random outcomes.
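The random baseline just described can be generated along these lines: draw each outcome at random, holding the draw frequency at roughly the historical one-third mentioned below. A sketch, with the non-draw results split evenly between the two sides (an assumption on my part — the real simulation may have matched the historical win split more carefully).

```python
import random

def random_results(n_matches, draw_rate=1/3, seed=0):
    """Generate random match outcomes with a fixed draw frequency.

    draw_rate ~ 1/3 mirrors the historical share of drawn matches;
    the remaining matches are split evenly between the two sides.
    A fixed seed makes the simulation repeatable.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_matches):
        r = rng.random()
        if r < draw_rate:
            results.append(0.5)   # draw
        elif r < draw_rate + (1 - draw_rate) / 2:
            results.append(1.0)   # side A wins
        else:
            results.append(0.0)   # side B wins
    return results
```

Feeding these outcomes into the same error-summing replay gives the "no predictive value" benchmark against which the real ratings are compared.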
How did I choose a k of 34? The answer is simply by running the system with various k-values and finding that this one produces the lowest sum of errors over all the matches that have already been played. My first guess of 40 wasn't bad, but a smaller k turns out to perform better.
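The k-selection procedure just described is a simple grid search: replay all the historical matches at each candidate k and keep the one with the smallest summed error. A sketch, again assuming the standard 400-point logistic curve and equal starting ratings (both assumptions on my part):

```python
def best_k(matches, candidates=range(10, 101)):
    """Return the candidate k whose replay gives the smallest summed error.

    `matches` is a list of (team_a, team_b, result_a) tuples, with
    result_a in {1, 0.5, 0}. Each candidate k replays the full history
    from scratch.
    """
    def error_for_k(k):
        ratings = {}
        err = 0.0
        for a, b, result_a in matches:
            ra = ratings.setdefault(a, 1000.0)
            rb = ratings.setdefault(b, 1000.0)
            ea = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
            err += abs(result_a - ea) + abs((1 - result_a) - (1 - ea))
            ratings[a] = ra + k * (result_a - ea)
            ratings[b] = rb + k * ((1 - result_a) - (1 - ea))
        return err

    return min(candidates, key=error_for_k)
```

On a synthetic history where one team wins every match, a large k does best, because the ratings diverge quickly and the predictions catch up sooner; on real, noisier data the optimum is much smaller.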
The next question is: how sensitive is the predictive power of the formula to the
k-value? Surprisingly, the answer
appears to be hardly sensitive at all. With a k of 34, the sum of the errors is,
more precisely, 1415.869; with k = 35, it's 1415.86, and with k = 33, the sum
is 1415.94. At k = 40, the sum rises to
1416.42; at k = 30, it's 1416.31. Even with what are intuitively relatively
extreme values of k, the sum is still not much larger: at k = 100, the sum is
1440; at k = 10, it's 1435.
In other words, the Elo system appears to have greater
predictive value in the real world than in a random world; but this power is
maintained over quite a wide range of k-values.
There’s a lot of random variation in cricket from match to match, so
there will be errors under any system (not least because our expected value
lies on a continuous scale from 0 to 1, while only three actual results are possible:
1, 0.5, and 0, so in almost every match there will be some error, even if the most likely result actually happens). As far as I can tell, almost any sane k gives the system
roughly equivalent predictive power. But there doesn’t seem to be any reason
not to take the best-performing k, even if its advantage is small. So our k is set to 34.
And that’s the complete system described. In the next post, we’ll look at the ratings
of the current test sides.