Thursday, 28 January 2016

The best teams ever, take one



Having explained the ideas behind my system for rating test cricket teams, let’s now briefly discuss how I implemented it.  A database of all test cricket results can easily be downloaded from Cricinfo’s Statsguru service.  I wrote a program to parse and analyse these results in Perl: not a very fashionable language these days, but more than adequate for the task.  In fact, the core of the processing can be captured in a remarkably short program, since iterative mathematical calculations are exactly what computers do well, and all 2,000-plus matches are analysed in around a second on a fairly old desktop computer.  Most of the work in the program actually goes into organising the results in a human-friendly way.
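
In outline, the core of that processing looks something like this – a minimal sketch in Perl rather than the real program, with illustrative field names, and with an expected-value formula that is simply one consistent with the figures quoted in these posts (a 100-point gap corresponding to an expected value of 2/3):

use strict;
use warnings;

my @matches;    # filled in by a Statsguru parser (omitted here); each entry is
                # { home => ..., away => ..., result => 1, 0.5 or 0 for the first-named side }
my %rating;     # team name => current rating, starting from zero
my $K = 34;     # the stability parameter discussed in a later post

sub expected {
    # Expected result for a side $diff points stronger than its opponent;
    # a 100-point gap gives 2/3, matching the figures quoted in these posts.
    my ($diff) = @_;
    return 1 / (1 + 2 ** (-$diff / 100));
}

for my $match (@matches) {
    my ($a, $b, $result) = @$match{qw(home away result)};
    $rating{$a} //= 0;
    $rating{$b} //= 0;
    my $exp    = expected($rating{$a} - $rating{$b});
    my $change = $K * ($result - $exp);    # reciprocal adjustment
    $rating{$a} += $change;
    $rating{$b} -= $change;
}

Everything else – parsing the Statsguru download and formatting the output tables – is bookkeeping around this loop.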

So, let’s first ask the question: who were the best (or, more accurately, the most dominant) test teams of all time?  One issue here is that if team A had a high rating at date X, it almost certainly held nearly as high a rating shortly before and shortly after that peak; so the highest ratings ever recorded could all belong to a single team over one continuous period.  That wouldn’t be very interesting, so what I’ve done is divide test history into periods during which one team was on top, and take only the highest rating each team held within each period.  The top ten then comes out like this:

No. 1 from   No. 1 until  Team          Peak date    Peak rating
26 Dec 1999  24 Nov 2009  Australia      2 Jan 2008  283
18 Aug 1934  28 Jan 1955  Australia      5 Dec 1952  218
30 Nov 2012  25 Nov 2015  South Africa  22 Feb 2013  217
 6 Jul 2011   3 Feb 2012  England       18 Aug 2011  216
14 Sep 1983  26 Dec 1991  West Indies   11 Apr 1986  198
 5 Dec 1958  27 Jan 1961  Australia     21 Nov 1959  184
27 Jun 1930  23 Feb 1933  Australia      4 Mar 1932  169
23 Jan 1993  22 Jun 1995  West Indies   25 Mar 1994  168
 3 Aug 2010   6 Jul 2011  India          9 Oct 2010  166
 3 Feb 2012  30 Nov 2012  Australia      7 Apr 2012  163

Firstly, a few preliminaries.  In the table, the first column marks the start of a team’s period as world number one, and the second marks its end.  The team’s name is in the third column; the date of its highest rating during that period is in the fourth; and the rating itself is in the fifth.  There is, though, a complication with the dates.  Cricinfo makes the date on which a match starts available in an easy-to-download form.  So instead of recalculating the ratings after each individual match finishes, what I actually do is recalculate them once all the matches that started on a given day have ended – and it’s that start date which gets recorded against the resulting rating.  In other words, Australia had a rating of 283 after the conclusion of the match that started on 2 January 2008.  This explains the apparent oddity that teams seem always to have won the game immediately following the date of their peak rating: the rating dated that day already includes the result of the match which began on it.
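
For what it’s worth, that date handling amounts to something like the following sketch (again with made-up names; process_match stands for the rating update itself):

use strict;
use warnings;

my @matches;     # parsed from Statsguru; each record carries its start_date
my %rating;      # current ratings, updated as matches are processed
my %snapshot;    # start date => copy of the ratings after that day's batch

my %batch;
push @{ $batch{ $_->{start_date} } }, $_ for @matches;

for my $date (sort keys %batch) {          # assumes sortable YYYY-MM-DD keys
    process_match($_, \%rating) for @{ $batch{$date} };
    $snapshot{$date} = { %rating };        # the new ratings are dated by match start
}

sub process_match { }                      # placeholder for the Elo update step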

Secondly, some of the great teams have been rated the best in the world continuously for some very long periods.  Everyone knows the Australians were a great side in the early 21st century:  we see they enjoyed almost 10 years on top of the rankings.  The Australians of Bradman (and shortly after) had over 20 dominant years (although World War Two counts for a few of those, and I didn’t adjust my system to account for the absence of test cricket during this period, a decision that could be questioned: perhaps I should have suspended the system and restarted with all teams at zero?).  Eight years of dominance were enjoyed by the West Indies from 1983 onwards. And all these sides duly had very high ratings at their peak.

But the Australians of recent vintage really were exceptionally strong. Indeed, the ICC agrees (although their ratings only cover the post-war years): no other team has ever been this dominant.  In fact, some of their best players of this era (Warne, McGrath, Langer) retired a year or so before the team’s peak rating in 2008.  But the ratings inevitably represent past, not future performance; and throughout 2007, a side without these greats actually added to the performances of their predecessors. Here’s another fun fact: between 1930 and 1955, Australia were just briefly off the number one slot in late 1933 and early 1934.  What weakened Bradman’s otherwise invincible team during this period?  The answer, of course, is the infamous bodyline series, where England used leg-theory to negate the Australian giant.

But there are also some apparent problems.  England had a good side in 2011, and South Africa thereafter.  But few would consider these teams amongst the best of all time.  Nonetheless, here they are, at 3rd and 4th place in the all-time list. The teams in 9th and 10th place are also of recent vintage, and might also be considered surprising inclusions.

One explanation for this is simply that we’ve ranked teams here by their highest rating, but what really makes us consider a team to be great is a long period in the number one slot – not the degree of dominance, but the length of time for which a team is dominant.  This value is, after all, the first thing I noticed when reviewing the list: maybe I should have sorted the teams by that criterion instead?  But there’s also another problem.  A team’s rating shows its average dominance against the other test playing sides.  And in recent years, Bangladesh and Zimbabwe have been persistently weak with respect to other teams (Zimbabwe’s current rating is -315; Bangladesh had a rating of -334 in 2008).  In other words, both these teams have been weaker than any team has ever been strong. For what it’s worth, when Australia had a rating of 283, Bangladesh had a rating of -321, and the net ratings difference of 604 is associated with an expected value for Australia in a match between the two sides of 0.985 – i.e. the ratings would have predicted something very close to a certain Australian win.
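
To make that concrete, here is the arithmetic behind the 0.985 (the exact expected-value curve was set out in an earlier post; the form below is simply one consistent with the numbers quoted here):

use strict;
use warnings;

sub expected {
    my ($diff) = @_;
    return 1 / (1 + 2 ** (-$diff / 100));
}

my $diff = 283 - (-321);    # Australia minus Bangladesh, early 2008
printf "difference %d gives expected value %.3f\n", $diff, expected($diff);
# prints: difference 604 gives expected value 0.985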

And the presence of two consistently weak teams drags down the average strength of the field relative to the strongest sides – and it’s dominance over that average which the ratings measure.  So even though we tried to make the ratings comparable across eras (as a measure of dominance), the raw ratings still don’t necessarily tell us exactly what we want to know.  So in the next post, we’re going to play with some different criteria for ordering teams, and see which sides come to the fore under each of them.

Wednesday, 27 January 2016

The k-factor, revisited



The k-factor, or stability parameter, determines how much we change our ratings in response to the difference between the result predicted by the current ratings and the actual outcome.  Earlier, I told you that my first guess for k was 40, based on the fact that a k of 100 would mean that a win in a game where both sides had an expected value of 0.5 would shift the winner’s expected value for the next game to 2/3.  That seemed too big a shift, so I chose a smaller k.  But I also told you that I ultimately settled on an even smaller value, 34.  Why did I make that choice?
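
Before answering that, here is the k = 100 illustration worked through, using an illustrative expected-value curve chosen to match the figures quoted in these posts (a 100-point gap giving 2/3):

use strict;
use warnings;

sub expected { my ($d) = @_; return 1 / (1 + 2 ** (-$d / 100)) }

my $k = 100;
my ($a, $b) = (0, 0);           # two equally rated sides: expected value 0.5 each
my $e = expected($a - $b);
$a += $k * (1 - $e);            # the winner gains 50 points...
$b -= $k * (1 - $e);            # ...and the loser drops 50
printf "new gap %d, new expected value %.4f\n", $a - $b, expected($a - $b);
# prints: new gap 100, new expected value 0.6667 (i.e. 2/3)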

The idea is that if our rankings are good, our predictions should be accurate.  And we can use the historical data to measure exactly how good our rankings are, simply by summing the difference between expected and actual results over all matches.  If every rating were perfect, this sum would be zero.  What if the ratings were as bad as possible?  If team A has a hugely larger rating than team B, its expected value will be close to 1, and its opponent’s close to zero; if team B were then to win unexpectedly, the actual result for each team would be out by almost 1.  So if every one of the 2,200 matches played had seen the worst possible prediction (i.e. team A has an expected value of nearly 1, but loses), the sum of all the differences would be something close to 4,400.

Note that in a drawn game, the largest possible combined error for the two teams is 1 (0.5 per team, where the expected values were 1 and 0 respectively); so the more draws, the smaller the sum of the errors can be, regardless of how the ratings are determined.  In fact, roughly a third of all matches have been drawn, so the worst possible score might be more like 3,500.

With my Elo system and a k of 34, the sum of all the errors is 1,416.  Is that any good?  Well, suppose the results of all matches had been determined at random, but with the same overall frequency of wins to draws as has actually occurred.  In that situation the ratings have, by definition, no predictive value: a good rating indicates good recent results, but those have no bearing on the outcome of the next match.  I ran such a simulation, and the sum of the differences was 1,997.  So our actual ratings have better predictive value than if matches were determined at random, suggesting that it is possible to get some indication of future results from past ones using my Elo formula.  (There is an alternative, and perhaps more intuitive, way of getting a baseline – applying random ratings to the actual results – but it’s harder to choose a plausible frequency distribution for generating random ratings than for generating random results.)

How did I choose a k of 34?  Simply by running the system with a range of different k-values and finding that this value produces the lowest sum of errors over all the matches that have already been played.  My first guess of 40 wasn’t bad; but a smaller k turns out to perform better.
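
In code, that search amounts to something like this sketch (names illustrative; @matches again stands for the parsed Statsguru results, and the range of candidate k-values is just an example):

use strict;
use warnings;

my @matches;    # each: { home => ..., away => ..., result => 1, 0.5 or 0 }

sub expected { my ($d) = @_; return 1 / (1 + 2 ** (-$d / 100)) }

sub total_error {
    my ($k) = @_;
    my %rating;
    my $error = 0;
    for my $m (@matches) {
        my ($a, $b, $s) = @$m{qw(home away result)};
        $rating{$a} //= 0;
        $rating{$b} //= 0;
        my $e = expected($rating{$a} - $rating{$b});
        $error += abs($s - $e) + abs((1 - $s) - (1 - $e));    # one term per side
        my $change = $k * ($s - $e);
        $rating{$a} += $change;
        $rating{$b} -= $change;
    }
    return $error;
}

my ($best_k, $best_error);
for my $k (10 .. 100) {
    my $err = total_error($k);
    ($best_k, $best_error) = ($k, $err)
        if !defined $best_error || $err < $best_error;
}
printf "best k = %d (total error %.3f)\n", $best_k, $best_error;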

The next question is: how sensitive is the predictive power of the formula to the k-value?  Surprisingly, the answer appears to be hardly at all.  With a k of 34, the sum of the errors is, more precisely, 1415.869; with k = 35 it’s 1415.86, and with k = 33 it’s 1415.94.  At k = 40, the sum rises to 1416.42; at k = 30, it’s 1416.31.  Even with what are intuitively quite extreme values of k, the sum is still not much larger: at k = 100 it’s 1440, and at k = 10 it’s 1435.

In other words, the Elo system appears to have greater predictive value in the real world than in a random one, and that power is maintained over quite a wide range of k-values.  There’s a lot of random variation in cricket from match to match, so there will be errors under any system – not least because our expected value lies on a continuous scale from 0 to 1, while only three actual results are possible (1, 0.5 and 0), so in almost every match there will be some error even if the most likely result actually happens.  As far as I can tell, almost any sane k gives the system roughly equivalent predictive power; but there’s no reason not to take the best-performing value, even if its advantage is small.  So our k is set to 34.

And that’s the complete system described.  In the next post, we’ll look at the ratings of the current test sides.

Sunday, 24 January 2016

Exit and Entry



Our next problem is caused by the fact that, on two occasions, test-playing teams have been excluded from the sport for political reasons: South Africa were suspended in 1970, and Zimbabwe in 2005.  Both subsequently returned.

When a suspension occurs, it’s easy to work out what to do: we remove the team from the ratings, and adjust the ratings of the other teams up or down as appropriate to keep the mean rating at zero.  It’s more of a problem when the team returns.
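
Before turning to that problem, here is a minimal sketch of the removal step itself (the ratings below are invented purely for illustration):

use strict;
use warnings;
use List::Util qw(sum);

sub suspend_team {
    my ($ratings, $team) = @_;
    delete $ratings->{$team};
    my @rest = keys %$ratings;
    my $mean = sum(values %$ratings) / @rest;
    $ratings->{$_} -= $mean for @rest;    # shift everyone equally to restore a zero mean
}

my %rating = ( 'South Africa' => 120, 'England' => -40, 'Australia' => -80 );
suspend_team(\%rating, 'South Africa');
printf "%-10s %6.1f\n", $_, $rating{$_} for sort keys %rating;
# Australia and England both move up by 60, so their 40-point gap is preserved.

Because every remaining team shifts by the same amount, predictions for matches between them are unaffected.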

For the return, there are a number of options.  One is to bring the team back at the rating it had when it was expelled.  But this does not seem satisfactory: South Africa were the best team in the world when they were kicked out, and over twenty years later, with none of their 1970 players still in the side, it would be crazy to reintroduce them at the top of the rankings just because they’d been good two decades previously.

The next option is to treat their return as if they were a new team, at the bottom.  But when South Africa first played a test match in 1888 they were a developing cricket side.  By contrast, they were still an advanced cricket-playing nation when they returned in 1992 – domestically, cricket had still been played, even though they’d been banned from international matches. It seems absurd to assume they had regressed to the level of one hundred years previously.

So perhaps we should re-enter them at zero.  But Zimbabwe had never had a rating as high as zero in their first spell in test cricket.  To set them to zero on their re-introduction would have been to assume they were returning stronger than they had ever been before. This also seems wrong.

I considered one other possibility: re-introducing a team at its average rating during its previous spell.  But ultimately this seems a little too cute; and one can imagine a country leaving test cricket, its internal cricket culture changing completely (a civil war, say, leading to the complete abandonment of the sport and its later reintroduction), and its previous results being no guide at all by the time it eventually returns to the test arena.  In the end, I decided to reintroduce teams at the bottom, just as if they were starting out.  It should be noted that, under this system, South Africa’s rating turns positive (i.e. better than average) by 1995, and by 1998 the team have claimed first place once again.  So although the approach appears harsh, it doesn’t permanently impair their rating (and in fact, under the ICC’s system, it took South Africa a further year to reclaim the number one spot).

We’ve now almost finished our discussion of how my ratings are calculated.  But before we continue to look at the current ratings, we need to go back to our discussion of the k-factor.  That will be our next subject.

Saturday, 23 January 2016

Initialisation



In previous posts, we’ve established how an Elo rating system works, by making predictions based on existing ratings and adjusting those ratings according to the difference between the prediction and the actual outcome.  And we’ve explored how to set some of the parameters needed in the Elo formula.  But the basic concept is one of adjusting already-existing ratings.  How do we set them at the beginning and kick off the whole system?

The first thing to note is that the absolute ratings don’t matter, only the differences between the ratings of different teams.  So if team A has a rating of 100 and team B has a rating of zero, that’s exactly the same as if team A has a rating of one million, one hundred and team B a rating of exactly a million.  Secondly, the less accurate our initial guess at the relative ratings of two sides, the less accurate the predictions based on them will be – and thus the faster the ratings will correct themselves.  So unless one is very interested in the ability of teams in the early 1880s (England and Australia played the first ever test match in 1877), it doesn’t really matter how we initialise the system.  But we do have to start somewhere, and the cleanest assumption is to favour neither one side nor the other, and to set the initial ratings of both teams to the same value.  I chose zero for that value.  And because all changes to ratings are reciprocal (as one team’s rating improves, its rival’s falls by the same amount), the sum of all ratings, and the mean rating, are fixed at zero thereafter.

But of course, although the first test involved these two teams only, eight other teams have subsequently entered test cricket.  The obvious thing would be to introduce each of them with a rating of zero too; but there’s a problem.  In general, teams have been granted test status once their ability has improved enough to make matches between themselves and the existing test countries worth playing; but at the moment of admission they’re typically rather weaker than most established teams.  So giving a new side an initial rating of zero – the mean rating of all the existing teams – seems over-generous.

An alternative might be to add a new team with a rating equal to the lowest existing rating (or perhaps a fixed number of points below it).  This is probably the correct thing to do in terms of accurately predicting its first results (most teams entering test cricket have lost their early matches).  But it would also mean that each time a new team entered test cricket, the mean rating of all teams would fall.  The ratings might still be appropriate at each point in time, but an ordinary rating at one point would look like a strong rating some time later: we’d lose comparability between eras.

My compromise is to enter each team at a rating equal to the lowest current rating at the time of entry, but then to adjust the ratings of all pre-existing teams upwards in order to restore the average rating to zero.  Thus, in 1888, England had a rating of 77 and Australia of -77, and South Africa were about to play their first test.  They get added to the system with a rating of -77, and we add 38.5 to the ratings of each of England and Australia to bring the overall average back to zero.  Because of this adjustment, the new team actually ends up with a lower rating than the new rating of the previously worst-rated side.  Note that if England and Australia now play, the prediction for their next game is unchanged by these revisions: the gap between the two teams remains 154 points, just as it was previously, even though both teams now have higher ratings in absolute terms, because each has improved by the same amount.
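
In code, the entry rule comes out as something like this sketch, replaying that 1888 example:

use strict;
use warnings;
use List::Util qw(min);

sub add_team {
    my ($ratings, $newcomer) = @_;
    my @existing = keys %$ratings;
    my $entry    = min(values %$ratings);      # the newcomer starts at the lowest rating
    my $boost    = -$entry / @existing;        # shared equally among the old teams
    $ratings->{$_} += $boost for @existing;    # restores the zero mean
    $ratings->{$newcomer} = $entry;
}

my %rating = ( England => 77, Australia => -77 );
add_team(\%rating, 'South Africa');
printf "%-13s %7.1f\n", $_, $rating{$_} for sort keys %rating;
# Australia -38.5, England 115.5, South Africa -77.0: the mean is back at zero,
# and the England-Australia gap is still 154 points.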

But there is a kind of contradiction here.  We add the new team at a low rating on the defensible assumption that they’re probably not yet very good – from which it follows that the average quality of all test teams has gone down as a result of the new addition.  But we then adjust the ratings of all the pre-existing teams upwards because we don’t want a deflationary trend.  It appears that I’m assuming all the existing teams get better just because a bad team joins them!

There are two answers to this.  One is that no-one could argue that, in the long term, cricket has been weakened by the entrance of the West Indies (or indeed most of the other teams that have followed England and Australia into the test arena).  Test cricket itself tends to strengthen teams – which is a large part of why teams get admitted once they show a certain level of promise but while they are still relatively weak.  Maybe when a new team enters test cricket the average ability of all test teams is reduced; but the effect is only temporary.

This is true, but the better answer is that cricket is a test of relative strength.  From a set of cricket results, you cannot measure objectively how good a team actually is except in comparison to other teams from its own era.  If team A from era X has a higher rating than team B from era Y, this only shows that team A was more dominant in its era than team B was in its own; it tells us nothing about who would win were both sides to be miraculously resuscitated.  In 1890 the England side of W.G. Grace and George Lohmann had a rating of 99, whereas the current England side rates just 90.  But few would believe that even the good doctor would flourish if suddenly asked to face players with modern levels of fitness (although today’s players might similarly struggle on the kind of pitches their predecessors had to play on).  Comparing W.G. to, say, Ben Stokes is not really meaningful; nor can we directly compare their teams, except by considering how dominant (or not) each was compared with its contemporaries.  By keeping the average of all ratings at zero, an individual rating becomes a measure of a team’s superiority or inferiority compared with its average opponent.  And by this measure, adding a new, weaker opponent to the mix does indeed increase the strength of the rest.  Thus the decision can be defended; but inter-era comparisons remain difficult, for reasons we will see.

That’s it for teams entering test cricket; in the next post, we’ll look at teams leaving, and, more problematically, re-entering the sport.