There are three main techniques used in the 2014 competition: duplicate matches, common seeds for the cards, and bootstrapping. Duplicate poker has been used in the competition in previous years; it reduces the variance in outcomes caused by players receiving better (or worse) hands than their opponents. As in all previous years, we also used common seeds for the cards in the matches. For every hand, match #1 between players A and B uses the same cards as match #1 between players A and C, and the same cards as match #1 between players B and C. When comparing the total bankroll performance of two players, this use of common seeds reduces the variance in outcomes caused by seeing different cards: we measure performance against the same opponents, using the same cards.
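The common-seed idea can be sketched in a few lines (a hypothetical helper, not the ACPC dealer's actual implementation): the deal is a deterministic function of the match and hand number alone, never of who is seated, so every pairing playing the same match number sees identical cards.

```python
import random

def cards_for_hand(match_number, hand_number):
    """Deal a full deck deterministically from (match, hand) alone.

    Because the seed never depends on which players are seated, hand #7 of
    match #1 between A and B uses exactly the same cards as hand #7 of
    match #1 between A and C, or between B and C.
    """
    rng = random.Random(match_number * 100_000 + hand_number)
    deck = list(range(52))  # card indices 0..51
    rng.shuffle(deck)
    return deck
```

Any two calls with the same match and hand number return the same shuffled deck, which is all the variance-reduction scheme requires.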
In the no-limit event, we used all-in equity to evaluate situations where both players went all-in. Once both players have committed their entire stacks, there are no further player choices, so we can compute the expected winnings across all possible remaining board cards, rather than scoring a single dealt board as in previous years. Because the pots in all-in situations are very large, this can (though does not always) significantly reduce the variance. The tool can be found in the ACPC dealer code.
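The idea can be illustrated with a deliberately tiny game (a toy sketch, not the ACPC tool's hand evaluation): one hole card each, a one-card board, and the highest single rank wins. Instead of scoring the one board that happened to be dealt, we average over every board card still in the deck.

```python
# Toy deck: one card per rank, 2..14, no suits, to keep the sketch short.
DECK = list(range(2, 15))

def winnings(hole_a, hole_b, board, pot):
    """Player A's share of the pot at showdown in the toy game:
    the best single rank (hole card or board card) wins."""
    best_a, best_b = max(hole_a, board), max(hole_b, board)
    if best_a > best_b:
        return pot
    if best_b > best_a:
        return 0.0
    return pot / 2.0

def all_in_equity(hole_a, hole_b, pot):
    """Average player A's winnings over every possible remaining board
    card, rather than using the single board that was actually dealt."""
    remaining = [c for c in DECK if c not in (hole_a, hole_b)]
    return sum(winnings(hole_a, hole_b, b, pot) for b in remaining) / len(remaining)
```

With the ace (14) against a ten, A wins on every board, so the equity is the whole pot; with a two against a three, every board plays for both players and the pot is split.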
In the 2014 competition, we were not always able to complete a large enough number of matches between all opponents. The simple analysis based on confidence intervals breaks down in this case: if different pairs of players have played different numbers of matches, we must make a very pessimistic assumption when computing the confidence interval for the expected value against a random opponent (the AVG column in the crosstable). To avoid this, the 2014 competition used bootstrapping.
Bootstrapping is a method of re-sampling, where new competition results are generated by sampling from the actual results. If there are 6612 matches in a competition, each sample is a new set of 6612 matches drawn with replacement. We then compute some function(s) on the sampled competition and record the results. If we repeat this for some large number (say 1000) of bootstrap samples, we get some idea of how stable the function is for the given data. Note that bootstrapping can hide noisy results: with too little data, a statistic might appear stable only because there is not enough data to reveal the variation. We do not expect this to be the case in the 2014 competition results.
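The procedure above can be sketched directly (the function and the sample data are hypothetical, not taken from the competition): each bootstrap sample redraws the full set of matches with replacement, and the statistic of interest is recorded for every sample.

```python
import random
import statistics

def bootstrap(match_results, stat_fn, n_samples=1000, seed=0):
    """Draw n_samples bootstrap competitions, each resampling
    len(match_results) matches with replacement from the actual
    results, and return stat_fn evaluated on every sample."""
    rng = random.Random(seed)
    n = len(match_results)
    return [stat_fn(rng.choices(match_results, k=n))
            for _ in range(n_samples)]

# Hypothetical per-match winnings for one player; the spread of the
# resampled means shows how stable the mean is for this data.
results = [12.0, -5.0, 30.0, -8.0, 4.0, 19.0, -2.0, 7.0]
spread = bootstrap(results, statistics.mean, n_samples=1000)
```

The spread of `spread` (e.g. its 2.5th and 97.5th percentiles) gives a bootstrap confidence interval for the statistic.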
For the competition, we used a set of functions, one for every pair of players ROW and COLUMN. The function f_ROW,COLUMN() was 1 if player ROW's total bankroll was greater than player COLUMN's total bankroll, and 0 otherwise. The second table in the total bankroll results gives the average value of these functions across the bootstrap samples, with blank entries for functions whose average value was 0. These values directly address the question we would like to ask: how often should a player be ranked above some other player? Note that we could have tried directly counting the number of times an agent finished at each rank, but this loses some information. For example, player A might never be in third place, but rank counts alone cannot show the correlated information that every time player A is second, player B is third. This occurred in the two-player no-limit competition.
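Computing these pairwise averages is straightforward once each bootstrap sample has been reduced to total bankrolls per player (the data layout here is a hypothetical sketch, not the competition's actual format):

```python
def pairwise_beats(sampled_bankrolls):
    """Average, over bootstrap samples, of the indicator
    f_ROW,COLUMN = 1 if ROW's total bankroll exceeds COLUMN's, else 0.

    sampled_bankrolls: a list of dicts, one per bootstrap sample,
    mapping player name -> total bankroll in that sample.
    """
    players = sorted(sampled_bankrolls[0])
    n = len(sampled_bankrolls)
    return {
        (row, col): sum(s[row] > s[col] for s in sampled_bankrolls) / n
        for row in players for col in players if row != col
    }

# Four hypothetical bootstrap samples for two players:
samples = [{"A": 10, "B": 5}, {"A": 3, "B": 7},
           {"A": 8, "B": 1}, {"A": 9, "B": 2}]
table = pairwise_beats(samples)
```

Here A's bankroll exceeds B's in three of the four samples, so the (A, B) entry is 0.75 and the (B, A) entry is 0.25.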
The SUM value can be considered in two ways. First, it is its own statistic: the expected number of players that the row player beats, averaged across the bootstrap samples. Second, it is also an upper bound on all of the individual "A beats B" functions: if SUM < 0.05, we can be confident that the associated player is in last place. The instant run-off competition uses this second property. The individual comparison functions were used to determine a player to eliminate in each round, but we were fortunate that the SUM value happened to summarise the outcome in an acceptable way. (In general, there could have been a situation where A lost to B 96% of the time and lost to C 96% of the time: the individual comparisons would single out A as significantly worse, but A's SUM would be 0.04 + 0.04 = 0.08, above the 0.05 threshold.)
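As a sketch, SUM is just the row sum of the pairwise averages; the numbers below reproduce the hypothetical situation from the parenthetical, where A loses each comparison 96% of the time yet SUM does not fall below 0.05:

```python
def sum_statistic(beats, players):
    """SUM for each row player: the expected number of opponents beaten,
    i.e. the row sum of the pairwise 'row beats column' averages."""
    return {row: sum(beats[(row, col)] for col in players if col != row)
            for row in players}

# Hypothetical pairwise averages for three players.
beats = {("A", "B"): 0.04, ("A", "C"): 0.04,
         ("B", "A"): 0.96, ("B", "C"): 0.70,
         ("C", "A"): 0.96, ("C", "B"): 0.30}
totals = sum_statistic(beats, ["A", "B", "C"])
# A loses each pairwise comparison 96% of the time, yet
# SUM_A = 0.04 + 0.04 = 0.08, so SUM < 0.05 alone would not flag A.
```

Each individual entry is non-negative, so every "A beats B" value is indeed bounded above by A's SUM.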
There is one final wrinkle in the bootstrapping process. Different matches (and their associated cards that get dealt out) had different sets of participating agents, and we wish to keep the correlation between the performance of different agents on the same set of cards and opponents. To handle this, for each match in the original competition, we only sample the results from matches where all the necessary agents participated. For example, consider a competition with 4 players A, B, C, and D. If only players A, B, and D played match #10, in the bootstrap sample we could generate a new match #10 from all the matches where only A, B, and D played (including match #10) as well as all matches where A, B, C, and D played.
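This constrained resampling can be sketched as follows (the match records are hypothetical): each original match is redrawn only from the pool of matches whose participants include everyone who played the original match.

```python
import random

def resample_competition(matches, seed=0):
    """Resample one bootstrap competition while preserving correlation:
    each original match is redrawn only from matches whose participants
    include everyone who played the original match."""
    rng = random.Random(seed)
    sample = []
    for m in matches:
        pool = [x for x in matches if m["players"] <= x["players"]]
        sample.append(rng.choice(pool))
    return sample

# A match with A, B, and D may be redrawn from any match involving at
# least those three (itself, or an A-B-C-D match), but never from an
# A-C-D match that is missing B.
matches = [
    {"players": frozenset("ABD")},
    {"players": frozenset("ABCD")},
    {"players": frozenset("ACD")},
]
sampled = resample_competition(matches)
```

Because the replacement pool for each slot always contains the original match itself, the pool is never empty.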
The tool used for analysis is available for download.