2016 Results

Amendment

Note that the results originally announced at AAAI 2016 were erroneous.  In particular, some additional matches that were run to separate the top four agents (seeds 100 and higher) were biased, because all of these matches used the same sequence of cards.  This bias invalidates the previously posted results of the Bankroll Instant Run-off competition.  We gratefully thank Vahe Hagopian, who brought this to our attention.

To correct this error, additional matches between the top four, and subsequently the top two agents, were run.  The log files, crosstable and ordering listed below have been corrected.

Results

The results of the tenth Annual Computer Poker Competition were announced on February 13th at the Thirtieth AAAI Conference on Artificial Intelligence in Phoenix, during the Computer Poker and Imperfect Information Games Workshop.

This year, the competition used two-player no-limit Hold'em as its testbed.  There was insufficient interest in the 3-player Kuhn poker event, and the limit Hold'em events were temporarily removed until they can be revamped.  No-limit Texas Hold'em had an event using the total bankroll winner determination rule, to encourage learning techniques, as well as an instant run-off event that encourages defensive play, as in traditional game-theoretic techniques.

About the Results

The rules for this year's competition called for matches of duplicate poker to help mitigate the effects of "luck" and more accurately determine this year's winner. In total, about 15 million hands of poker were played.

In all of the PDF result files, the top table is a crosstable including all combinations of matches between players in the given variant of poker.  Crosstable entries are the expected winnings, in thousandths of a big blind per hand, for the row player.  The second value is the 95% confidence interval around the mean.  Cells are shaded bright green/red if the row player won/lost by a statistically significant margin (95% confidence).  Cells are shaded light green/red if the row player won/lost by a statistically insignificant margin.  If a cell is grey, no data is available for that combination of players (e.g., matches of an agent against itself, or against an agent entered exclusively under a different winner determination rule).
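As a rough illustration of how one crosstable cell could be derived, the sketch below converts per-match winnings into thousandths of a big blind per hand and attaches a 95% interval.  It assumes a normal approximation over per-match results; the function names, the fixed hands_per_match argument and the handling of an exactly zero mean are hypothetical, and this is not the competition's actual analysis tool.

```python
import math
from statistics import mean, stdev

def crosstable_entry(match_winnings_bb, hands_per_match, z=1.96):
    """Return (mean, half_width) in thousandths of a big blind per hand.

    match_winnings_bb: the row player's total winnings in each match, in big blinds.
    z=1.96 gives a 95% interval under a normal approximation (an assumption).
    """
    per_hand_mbb = [1000.0 * w / hands_per_match for w in match_winnings_bb]
    m = mean(per_hand_mbb)
    half_width = z * stdev(per_hand_mbb) / math.sqrt(len(per_hand_mbb))
    return m, half_width

def shade(m, half_width):
    """Map a cell to the shading scheme described in the text."""
    if m - half_width > 0:
        return "bright green"   # significant win for the row player
    if m + half_width < 0:
        return "bright red"     # significant loss for the row player
    # Insignificant margin; a mean of exactly zero is put on the green side arbitrarily.
    return "light green" if m >= 0 else "light red"

m, hw = crosstable_entry([30.0, 10.0, 20.0, 20.0], hands_per_match=1000)
print(round(m, 1), round(hw, 1), shade(m, hw))
```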

The raw, uncapped result files have a mixture of instant run-off and total bankroll agents, with no further processing or information.  For the total bankroll result files, the second table gives the probability that the row player has a higher total bankroll than the column player.  The SUM column is simply the sum of the values in the row, and is NOT used in winner determination.  For the instant run-off results, the second table gives, for each round of the run-off, the SUM statistic describing the total bankroll performance of the remaining players.

Statistical confidence could not be achieved for every result.  We have noted all such decisions with a '*'.  Please see the section discussing the statistical analysis used to generate the results for additional details.


Heads-up No-Limit Texas Hold'em

The raw crosstable is available here.

The competition logs are available in three parts, one, two, and three.  The all-in EV processed competition logs are available in three parts, one, two, and three.

Total Bankroll

gold_medal Baby Tartanian8 (Carnegie Mellon University, USA)
silver_medal Slumbot (Independent Researcher, USA) *
bronze_medal Act1 (Unfold Poker, USA) *

Bankroll Instant Run-off

gold_medal Baby Tartanian8 (Carnegie Mellon University, USA)
silver_medal Slumbot (Independent Researcher, USA)
bronze_medal Act1 (Unfold Poker, USA)

 


Statistical Analysis

There are three main techniques used in the 2016 competition: duplicate matches, common seeds for the cards, and bootstrapping.  Duplicate Poker has been used in the competition in previous years, and it helps reduce the variance in outcomes caused by players receiving better (or worse) hands than their opponents.  As in all previous years, we also used common seeds for the cards in the matches.  For every hand, match #1 between players A and B will use the same cards as match #1 between players A and C, and the same cards as match #1 between players B and C.  If we're comparing the total bankroll performance of two players, this use of common seeds can reduce the variance in outcomes caused by seeing different cards: we will be measuring the performance against the same opponents, using the same cards.
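The effect of duplicate matches and common seeds can be sketched with a toy game.  Everything below is hypothetical (a one-card, high-card-wins game, the agent functions, and the seed handling) and is not the ACPC dealer; it only shows that a seed fixes the cards, and that replaying the same cards with the seats swapped cancels the luck of the deal.

```python
import random

def deal_hands(seed, num_hands):
    """Deal one distinct card to each seat per hand; the seed fully determines the cards."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(52), 2)) for _ in range(num_hands)]

def play_match(seat0_agent, seat1_agent, hands):
    """Toy zero-sum game: return seat 0's total winnings over all hands."""
    total = 0
    for card0, card1 in hands:
        # Each agent names a stake from its own card; the hand is played for the smaller one.
        stake = min(seat0_agent(card0), seat1_agent(card1))
        total += stake if card0 > card1 else -stake
    return total

def duplicate_result(agent_a, agent_b, seed, num_hands=1000):
    """Play the same cards twice with seats swapped; return A's combined winnings."""
    hands = deal_hands(seed, num_hands)
    forward = play_match(agent_a, agent_b, hands)    # A sits in seat 0
    reverse = -play_match(agent_b, agent_a, hands)   # A sits in seat 1, same cards
    return forward + reverse

def cautious(card):
    return 2 if card >= 40 else 1

def aggressive(card):
    return 2 if card >= 20 else 1

# Two identical agents always net exactly zero over a duplicate match,
# whatever cards the seed produces.
print(duplicate_result(cautious, cautious, seed=1))
print(duplicate_result(cautious, aggressive, seed=1))
```

Because the seed alone determines the cards, calling deal_hands with the same seed for A vs. B, A vs. C and B vs. C reproduces the common-seed arrangement described above.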

In the no-limit event, we used all-in equity to evaluate matches where both players went all-in.  If both players have committed their entire stack, there are no future player choices, so we can compute the expected winnings across all future board cards, rather than using a single board as in previous years.  This can, but does not always, significantly reduce the variance because the magnitude of all-in situations is very large.  The tool can be found in the ACPC dealer code.

Bootstrapping is a method of re-sampling, where new competition results are generated by sampling, with replacement, from the actual results.  If there are 6612 matches in a competition, each sample is a new set of 6612 matches.  We then compute some function(s) on the sampled competition and record the results.  If we repeat this for some large number (say 1000) of bootstrap samples, we get some idea of how stable the function is for the given data.  Note that this can potentially hide noisy results: if we don't have enough data, the result might appear stable, but only because we don't have enough data to see the variation.  We do not expect this to be the case in the 2016 competition results.
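A minimal sketch of the resampling loop, assuming each match is summarised by a single number and that sampling is done with replacement (the usual bootstrap convention).  The function name and the example data are hypothetical.

```python
import random
from statistics import mean

def bootstrap_statistic(matches, statistic, num_samples=1000, seed=0):
    """Evaluate `statistic` on `num_samples` resampled competitions.

    Each resampled competition has as many matches as the original,
    drawn with replacement from the original results.
    """
    rng = random.Random(seed)
    n = len(matches)
    return [statistic(rng.choices(matches, k=n)) for _ in range(num_samples)]

# Hypothetical per-match winnings for one player.
results = bootstrap_statistic([1.0, 2.0, 3.0, 4.0], mean, num_samples=200)
print(min(results), max(results))

# The caveat about hidden noise: with a single match, every resample is identical,
# so the statistic looks perfectly stable even though nothing has been learned.
print(set(bootstrap_statistic([5.0], mean, num_samples=50)))
```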

For the competition, we used a set of functions for every pair of players ROW and COLUMN.  The function f_ROW,COLUMN() was 1 if player ROW's total bankroll was greater than player COLUMN's total bankroll, and 0 otherwise.  The second table in the total bankroll results gives the average value of these functions, with blank entries for the functions that had an average value of 0.  These values directly address the question we would like to ask: how often should a player be ranked above some other player?  Note that we could have tried directly counting the number of times an agent was in each rank, but this can lose some information.  For example, a player A might never be in third place, but just counting ranks is not enough to show the correlated information that every time player A is second, player B is third.  This occurred in the two-player no-limit competition.
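The pairwise functions can be sketched as follows.  This assumes each match is recorded as a mapping from player name to winnings and that a player's total bankroll is the sum of its winnings over the sampled matches; the data layout and names are hypothetical rather than those of the actual analysis tool.

```python
import random
from itertools import permutations

def total_bankrolls(matches, players):
    """Sum each player's winnings over a (resampled) set of matches."""
    return {p: sum(m.get(p, 0.0) for m in matches) for p in players}

def pairwise_table(matches, players, num_samples=1000, seed=0):
    """Average of f_ROW,COLUMN over bootstrap samples, keyed by (row, column)."""
    rng = random.Random(seed)
    wins = {pair: 0 for pair in permutations(players, 2)}
    for _ in range(num_samples):
        sample = rng.choices(matches, k=len(matches))  # resample with replacement
        totals = total_bankrolls(sample, players)
        for row, col in wins:
            # f_ROW,COLUMN is 1 when ROW's total bankroll exceeds COLUMN's, else 0.
            if totals[row] > totals[col]:
                wins[row, col] += 1
    return {pair: count / num_samples for pair, count in wins.items()}

# Hypothetical results: A beats B and C in every match, so every resample agrees.
matches = [{"A": 3.0, "B": 1.0, "C": -4.0}, {"A": 2.0, "B": 0.5, "C": -2.5}]
table = pairwise_table(matches, ["A", "B", "C"], num_samples=100)
print(table[("A", "B")], table[("C", "A")])
```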

The SUM value can be considered in two ways.  First, it is its own statistic: the expected number of players that the row player beats, averaged across the bootstrap samples.  Second, it is also an upper bound on all of the individual "A beats B" functions: if SUM < 0.05, we can be confident that the associated player is in last place.  The instant run-off competition uses this second property.  The individual comparison functions were used to determine a player to eliminate in each round, but we were fortunate that the SUM value happened to summarise the outcome in an acceptable way.  (In general, there could have been a situation where A lost to B 96% of the time and lost to C 96% of the time.  We would then have said that A was significantly behind both B and C, although the SUM value, at least 0.04 + 0.04 = 0.08, did not show that.)
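The relationship between SUM and the individual comparisons can be checked on the example from the parenthetical above.  This is a sketch only: the table layout, the function names, and the B versus C entries are hypothetical, and the 0.05 threshold is applied per opponent, mirroring the text.

```python
def sum_statistic(table, players):
    """SUM for each row player: the sum of its averaged 'row beats column' values."""
    return {
        row: sum(table.get((row, col), 0.0) for col in players if col != row)
        for row in players
    }

def below_every_opponent(table, players, row, alpha=0.05):
    """True if the row player beats each remaining opponent less than alpha of the time."""
    return all(table.get((row, col), 0.0) < alpha for col in players if col != row)

players = ["A", "B", "C"]
# A loses to B and to C in 96% of the bootstrap samples; B vs. C is made up.
table = {
    ("A", "B"): 0.04, ("A", "C"): 0.04,
    ("B", "A"): 0.96, ("C", "A"): 0.96,
    ("B", "C"): 0.50, ("C", "B"): 0.50,
}
sums = sum_statistic(table, players)

# Every entry is non-negative, so SUM bounds each individual entry from above:
# SUM < 0.05 implies every comparison is below 0.05, but not the other way round.
print(sums["A"] < 0.05, below_every_opponent(table, players, "A"))
```

Here the per-opponent check succeeds for A while the SUM test does not, which is the situation the text describes.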

There is one final wrinkle in the bootstrapping process.  Different matches (and their associated cards that get dealt out) had different sets of participating agents, and we wish to keep the correlation between the performance of different agents on the same set of cards and opponents.  To handle this, for each match in the original competition, we only sample the results from matches where all the necessary agents participated.  For example, consider a competition with 4 players A, B, C, and D.  If only players A, B, and D played match #10, in the bootstrap sample we could generate a new match #10 from all the matches where only A, B, and D played (including match #10) as well as all matches where A, B, C, and D played.
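The participant-aware resampling can be sketched as below, again assuming each match is a mapping from player name to winnings.  One detail is an assumption not stated in the text: when a match with extra participants is drawn, only the results of the players needed for the slot are kept.

```python
import random

def resample_with_participants(matches, seed=0):
    """Build one bootstrap sample that preserves each slot's set of participants.

    For each original match, draw from all matches in which every one of its
    participants played (this always includes the original match itself).
    """
    rng = random.Random(seed)
    sample = []
    for match in matches:
        needed = set(match)
        eligible = [m for m in matches if needed <= set(m)]
        drawn = rng.choice(eligible)
        # Assumption: discard results of players who were not in the original slot.
        sample.append({player: drawn[player] for player in needed})
    return sample

# Hypothetical competition: the first match lacks player C.
matches = [
    {"A": 1.0, "B": -1.0, "D": 0.0},
    {"A": 2.0, "B": 0.0, "C": -1.0, "D": -1.0},
    {"A": 0.0, "B": 1.0, "C": -1.0, "D": 0.0},
]
for slot in resample_with_participants(matches, seed=3):
    print(sorted(slot))
```

The first slot may be filled from any of the three matches, while the four-player slots can only be filled from the two four-player matches.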

The tool used for analysis is available for download.