- Category: Results
- Published on Monday, 15 July 2013 17:58
- Written by Super User

The results of the eighth Annual Computer Poker Competition were announced July 14th at the Twenty-Seventh AAAI Conference on Artificial Intelligence in Bellevue during the second Computer Poker Symposium. There was a good field of participants.

Three variations of Texas Hold'em poker were played: heads-up limit, heads-up no-limit, and three-player limit. Two different winner determination rules were used to decide the winners in each variant: bankroll instant run-off and total bankroll.

As in previous years, the rules for this year's competition called for matches of duplicate poker to help mitigate the effects of "luck" and more accurately determine this year's winners. In total, about 50 million hands of poker were played.

Each competition has three PDF files with numerical results. In all files, the top table in the PDF is a crosstable including all combinations of matches between players in the given variant of poker. The first value in each crosstable entry is the expected winnings of the row player, in thousandths of a big blind per hand. The second value is the 95% confidence interval around that mean. Cells are shaded bright green/red if the row player won/lost by a statistically significant margin (95% confidence). Cells are shaded light green/red if the row player won/lost by a statistically insignificant margin. If a cell is grey, no data is available for that combination of players (e.g., matches of an agent against itself or an agent entered exclusively under a different winner determination rule).
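As a rough illustration of how one crosstable entry could be derived, the sketch below turns per-match winnings into a mean in thousandths of a big blind per hand with a 95% interval. It assumes a normal approximation, independent matches, and a fixed number of hands per match; the function names are hypothetical and this is not necessarily how the competition's own tool computes the entries.

```python
import math
import statistics

def mbb_per_hand_with_ci(match_winnings_bb, hands_per_match):
    """Return (mean, half_width) in thousandths of a big blind per hand.

    match_winnings_bb: the row player's total winnings in each match, in big blinds.
    hands_per_match: hands played in every match (assumed constant in this sketch).
    """
    per_hand_mbb = [1000.0 * w / hands_per_match for w in match_winnings_bb]
    mean = statistics.fmean(per_hand_mbb)
    # Standard error of the mean, treating matches as independent samples.
    stderr = statistics.stdev(per_hand_mbb) / math.sqrt(len(per_hand_mbb))
    # 1.96 standard errors gives a 95% interval under a normal approximation.
    return mean, 1.96 * stderr

def is_significant(mean, half_width):
    """A cell would be shaded brightly when the interval excludes zero."""
    return abs(mean) > half_width
```

For example, `mbb_per_hand_with_ci([10.0, 20.0, 30.0], 1000)` describes three matches of 1000 hands each; the result would be shaded as significant only if `is_significant` returns true for it.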

The raw, uncapped result files have a mixture of instant run-off and total bankroll agents, with no further processing or information. For the total bankroll result files, the second table gives the probability that the row player has a higher total bankroll than the column player. The SUM column is simply the sum of the values in the row, and is NOT used in winner determination. For the instant run-off results, the second table gives, for each round of the run-off, the SUM statistic describing the total bankroll performance of the remaining players.

Statistical confidence was not possible for all results. We have noted all such decisions with a '*'. Please see the section discussing the statistical analysis used to generate the results for additional details.

Download raw results and logs

Full results for total bankroll

Marv (Marv Anderson, UK)

Feste (Francois Pays, France) *

Hyperborean (University of Alberta, Canada) *

* Feste and Hyperborean could not be separated with statistical significance

Full results for instant run-off

Neo Poker Lab (Alexander Lee, Spain)

Hyperborean (University of Alberta, Canada)

Zbot (Ilkka Rajala, Finland) *

* Zbot and Marv could not be separated with statistical significance

Download raw results and logs

Full results for total bankroll

Slumbot NL (Eric Jackson, USA) *

Hyperborean (University of Alberta, Canada) *

Tartanian6 (Carnegie Mellon University, USA) *

* Slumbot / Hyperborean and Hyperborean / Tartanian6 could not be separated with statistical significance

Full results for instant run-off

Hyperborean (University of Alberta, Canada)

Slumbot NL (Eric Jackson, USA)

Tartanian6 (Carnegie Mellon University, USA) *

* Tartanian6 and Nyx could not be separated with statistical significance

Download raw results and logs

Full results for total bankroll

Hyperborean (University of Alberta, Canada)

Little Rock (Rod Byrnes, Australia)

Neo Poker Lab (Alexander Lee, Spain)

Full results for instant run-off

Hyperborean (University of Alberta, Canada)

Little Rock (Rod Byrnes, Australia)

Neo Poker Lab (Alexander Lee, Spain)

Three main techniques were used in the statistical analysis of the 2013 competition: duplicate matches, common seeds for the cards, and bootstrapping. Duplicate poker has been used in the competition in previous years, and it helps reduce the variance in outcomes caused by players receiving better (or worse) hands than their opponents. As in all previous years, we also used common seeds for the cards in the matches. For every hand, match #1 between players A and B will use the same cards as match #1 between players A and C, and the same cards as match #1 between players B and C. If we're comparing the total bankroll performance of two players, this use of common seeds can reduce the variance in outcomes caused by seeing different cards: we will be measuring the performance against the same opponents, using the same cards.
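The idea of common seeds and duplicate seatings can be sketched as follows. The seeding formula, the card encoding, and the function names are assumptions made for illustration; the competition's actual dealer program is not described here. The point is only that the cards depend on the match and hand number, never on who is playing.

```python
import random

RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [rank + suit for rank in RANKS for suit in SUITS]

def deal(match_number, hand_number, num_players=2):
    """Deal hole cards and a board that depend only on the match and hand index."""
    # Hypothetical seeding scheme: any deterministic function of the two indices works.
    rng = random.Random(match_number * 1_000_003 + hand_number)
    cards = rng.sample(DECK, 2 * num_players + 5)
    hole = [cards[2 * i: 2 * i + 2] for i in range(num_players)]
    board = cards[2 * num_players:]
    return hole, board

def duplicate_seatings(players):
    """For a two-player match, the same cards are played once in each seat order."""
    a, b = players
    return [(a, b), (b, a)]
```

Because `deal(1, 0)` returns the same cards no matter which pair of agents is seated, match #1 between A and B and match #1 between A and C see identical hands, and `duplicate_seatings` lets each agent play both sides of those hands.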

In the 2013 competition, we were not able to complete a large number of matches between all opponents, especially in the two-player no-limit division. The simple analysis based on confidence intervals breaks down in this case. If there are a varying number of matches between different players, we must make a very pessimistic assumption in computing the confidence interval for the expected value against a random opponent (the AVG column in the crosstable). To avoid this, the 2013 competition used bootstrapping.

Bootstrapping is a method of re-sampling, where new competition results are generated by sampling from the actual results. If there are 6612 matches in a competition, each sample is a new set of 6612 matches drawn with replacement from the originals. We then compute some function(s) on the sampled competition and record the results. If we repeat this for a large number (say 1000) of bootstrap samples, we get some idea of how stable the function is for the given data. Note that this can potentially hide noisy results: if we don't have enough data, the result might appear stable, but only because we don't have enough data to see the variation. We do not expect this to be the case in the 2013 competition results.
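A minimal sketch of the re-sampling loop described above, assuming each match result can be treated as one item in a list and that `statistic` is any function of a sampled competition. The names and the default of 1000 samples are illustrative, not taken from the competition's tool.

```python
import random

def bootstrap(matches, statistic, num_samples=1000, seed=0):
    """Evaluate `statistic` on `num_samples` re-sampled copies of the competition."""
    rng = random.Random(seed)
    n = len(matches)
    results = []
    for _ in range(num_samples):
        # Each bootstrap sample has as many matches as the original competition,
        # drawn with replacement from the original results.
        sample = [matches[rng.randrange(n)] for _ in range(n)]
        results.append(statistic(sample))
    return results
```

The spread of the returned values indicates how stable the statistic is. If every original match has the same outcome, every bootstrap value is identical, which is the "appears stable only because there is too little data" caveat in its most extreme form.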

For the competition, we used a set of functions for every pair of players ROW and COLUMN. The function f_ROW,COLUMN() was 1 if player ROW's total bankroll was greater than player COLUMN's total bankroll, and 0 otherwise. The second table in the total bankroll results gives the average value of these functions, with blank entries for the functions that had an average value of 0. These values directly address the question we would like to ask: how often should a player be ranked above some other player? Note that we could have tried directly counting the number of times an agent was in each rank, but this can lose some information. For example, a player A might never be in third place, but just counting ranks is not enough to show the correlated information that every time player A is second, player B is third. This occurred in the two-player no-limit competition.
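The pairwise functions can be sketched as below. This is a simplified illustration: each match is assumed to be a dictionary from player name to winnings, and a player's total bankroll is taken to be the plain sum of those winnings, which may differ from how the competition aggregates results across opponents. All names are hypothetical.

```python
import random
from collections import defaultdict

def total_bankrolls(sample):
    """Sum each player's winnings over a list of {player: winnings} match results."""
    totals = defaultdict(float)
    for match in sample:
        for player, winnings in match.items():
            totals[player] += winnings
    return totals

def pairwise_win_rates(matches, players, num_samples=1000, seed=0):
    """Average f_ROW,COLUMN over bootstrap samples for every ordered pair of players."""
    rng = random.Random(seed)
    counts = {(r, c): 0 for r in players for c in players if r != c}
    for _ in range(num_samples):
        sample = rng.choices(matches, k=len(matches))  # draw with replacement
        totals = total_bankrolls(sample)
        for (r, c) in counts:
            # f_ROW,COLUMN is 1 when ROW's total bankroll exceeds COLUMN's, else 0.
            if totals[r] > totals[c]:
                counts[(r, c)] += 1
    return {pair: n / num_samples for pair, n in counts.items()}
```

Keeping one value per ordered pair, rather than a count of ranks per player, is what preserves the correlated information mentioned above.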

The SUM value can be considered in two ways. First, it is its own statistic: the expected number of players that the row player beats, averaged across the bootstrap samples. Second, it is also an upper bound on all of the individual "A beats B" functions: if SUM < 0.05, we can be confident that the associated player is in last place. The instant run-off competition uses this second property. The individual comparison functions were used to determine a player to eliminate in each round, but we were fortunate enough that the SUM value just happened to summarise the outcome in an acceptable way. (In general, there could have been a situation where A lost to B 96% of the time and lost to C 96% of the time. We would have said that A was in last place with statistical significance, although the SUM value, which could be as high as 0.08 in that case, did not show that.)
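The relationship between SUM and the individual comparisons can be illustrated with a small sketch. The 0.04 and 0.96 values come from the parenthetical example above; the 0.5 entries between B and C, the function names, and the `alpha` parameter are made up for illustration.

```python
def sum_column(win_rates, players):
    """SUM for each row player: the sum of its averaged 'row beats column' values."""
    return {
        row: sum(win_rates.get((row, col), 0.0) for col in players if col != row)
        for row in players
    }

def last_by_sum(sums, player, alpha=0.05):
    """SUM bounds every individual entry, so SUM < alpha implies each entry < alpha."""
    return sums[player] < alpha

def last_by_pairwise(win_rates, player, players, alpha=0.05):
    """Check each individual comparison instead of the (looser) SUM bound."""
    return all(
        win_rates.get((player, col), 0.0) < alpha
        for col in players if col != player
    )

players = ["A", "B", "C"]
win_rates = {
    ("A", "B"): 0.04, ("A", "C"): 0.04,   # A beats each opponent 4% of the time
    ("B", "A"): 0.96, ("C", "A"): 0.96,
    ("B", "C"): 0.5, ("C", "B"): 0.5,     # illustrative values only
}
sums = sum_column(win_rates, players)
```

Here A's SUM is 0.04 + 0.04 = 0.08, so `last_by_sum` does not flag A, while `last_by_pairwise` does, since each individual comparison is below 0.05.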

There is one final wrinkle in the bootstrapping process. Different matches (and their associated cards that get dealt out) had different sets of participating agents, and we wish to keep the correlation between the performance of different agents on the same set of cards and opponents. To handle this, for each match in the original competition, we only sample the results from matches where all the necessary agents participated. For example, consider a competition with 4 players A, B, C, and D. If only players A, B, and D played match #10, in the bootstrap sample we could generate a new match #10 from all the matches where only A, B, and D played (including match #10) as well as all matches where A, B, C, and D played.
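One way to sketch this constrained sampling is shown below. It assumes each numbered match is stored as a dictionary from agent name to that agent's result on that set of cards, and that a replacement drawn from a match with extra agents is restricted to the agents needed; both are assumptions for illustration, not a description of the released analysis tool.

```python
import random

def constrained_bootstrap_sample(matches, rng):
    """Build one bootstrap sample, respecting which agents played each match.

    matches: list of dicts mapping agent name -> result in that match.
    rng: a random.Random instance.
    """
    sample = []
    for original in matches:
        needed = set(original)
        # Eligible replacements are matches in which every needed agent took part.
        # The original match always qualifies, so the list is never empty.
        candidates = [m for m in matches if needed <= set(m)]
        chosen = rng.choice(candidates)
        # Keep results only for the agents that played the original match.
        sample.append({agent: chosen[agent] for agent in needed})
    return sample
```

With matches `{"A", "B", "D"}` and `{"A", "B", "C", "D"}`, the first slot can be filled from either match, while the second can only be filled from the four-agent match, mirroring the match #10 example above.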

The tool used for analysis is available for download.