- Category: Results
- Published on Monday, 15 July 2013 17:58
- Written by Super User

The results of the ninth Annual Computer Poker Competition were announced July 14th at the Twenty-Eighth Conference on Artificial Intelligence in Quebec City during the second Computer Poker Symposium. The list of participants will be available shortly.

A range of games was played to explore different problems and techniques for intelligent agents in imperfect-information games, from traditional two-player game-theoretic techniques to multi-player modeling and online learning. This year, the competition used two-player limit Hold'em, two-player no-limit Hold'em, three-player Hold'em, and a three-player variant of Kuhn poker. All games had an event using the total-bankroll winner determination rule, to encourage learning techniques. Two-player no-limit Hold'em and three-player Hold'em also had an instant run-off event that encourages defensive play, as in traditional game-theoretic techniques.
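To make the difference between the two winner determination rules concrete, here is a minimal Python sketch for a two-player event. It assumes a hypothetical `crosstable` that maps each player to their average winnings against each opponent, and it leaves out the statistical significance tests applied in the actual competition. The player names and numbers are invented for illustration only.

```python
def total_bankroll(crosstable, players):
    """Average winnings of each player against the other listed players."""
    return {
        p: sum(crosstable[p][q] for q in players if q != p) / (len(players) - 1)
        for p in players
    }


def instant_runoff(crosstable):
    """Repeatedly drop the player with the lowest total bankroll among those left.

    Returns the players ranked from best to worst.
    """
    remaining = sorted(crosstable)
    eliminated = []
    while len(remaining) > 1:
        scores = total_bankroll(crosstable, remaining)
        worst = min(remaining, key=scores.get)
        eliminated.append(worst)
        remaining.remove(worst)
    return remaining + eliminated[::-1]


# Invented zero-sum results: X exploits the weak player W but loses to Y.
crosstable = {
    "X": {"Y": -10.0, "W": 100.0},
    "Y": {"X": 10.0, "W": 20.0},
    "W": {"X": -100.0, "Y": -20.0},
}
tbr = total_bankroll(crosstable, sorted(crosstable))
ranking = instant_runoff(crosstable)
```

In this made-up example X has the highest total bankroll because of its large win against W, while Y wins the instant run-off because X loses to Y once W has been eliminated.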

The rules for this year's competition called for matches of duplicate poker to help mitigate the effects of "luck" and more accurately determine this year's winner. In total, about 50 million hands of poker were played.
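As a rough illustration of why duplicate play reduces the effect of luck, the following Python sketch scores a two-player duplicate match. It assumes each set of cards is played twice with the seats swapped and that the two results are averaged; the function name and the sample numbers are hypothetical and not taken from the competition software.

```python
def duplicate_scores(forward, reverse):
    """Per-hand duplicate result for one player, in chips.

    forward[i]: the player's winnings on hand i in the first seating.
    reverse[i]: the player's winnings on the same cards with seats swapped.
    """
    if len(forward) != len(reverse):
        raise ValueError("both passes must cover the same hands")
    return [(f + r) / 2 for f, r in zip(forward, reverse)]


# On hand 0 the player holds the stronger cards in the forward pass (+10)
# and the weaker cards in the reverse pass (-8); only the difference remains.
forward = [10.0, -2.0, 4.0]
reverse = [-8.0, 2.0, -1.0]
scores = duplicate_scores(forward, reverse)  # [1.0, 0.0, 1.5]
```

The large swings caused by the cards largely cancel, so what is left mostly reflects the difference in how the two players handled the same situations.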

In all of the PDF result files, the top table is a crosstable including all combinations of matches between players in the given variant of poker. Crosstable entries are the expected winnings, in thousandths of a big blind per hand, for the row player. The second value is the 95% confidence interval around the mean. Cells are shaded bright green/red if the row player won/lost by a statistically significant margin (95% confidence). Cells are shaded light green/red if the row player won/lost by a statistically insignificant margin. If a cell is grey, no data is available for that combination of players (e.g., matches of an agent against itself or an agent entered exclusively under a different winner determination rule).
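The sketch below shows one way a crosstable entry and its shading could be derived from per-match results. It assumes a normal approximation (mean plus or minus 1.96 standard errors) for the 95% interval, which the results page does not specify; the function names and sample data are hypothetical.

```python
import math
import statistics


def crosstable_entry(match_winnings):
    """Return (mean, half_width) for per-match winnings of the row player.

    match_winnings: one value per match, in thousandths of a big blind per hand.
    half_width: 1.96 standard errors, assuming a normal approximation.
    """
    n = len(match_winnings)
    mean = statistics.fmean(match_winnings)
    standard_error = statistics.stdev(match_winnings) / math.sqrt(n)
    return mean, 1.96 * standard_error


def is_significant(mean, half_width):
    """True if the interval around the mean excludes zero (bright shading)."""
    return abs(mean) > half_width


# Invented per-match results for one row/column pair.
mean, half_width = crosstable_entry([10.0, 20.0, 30.0, 40.0])
bright_cell = is_significant(mean, half_width)
```

Here the mean of 25 exceeds the half-width of roughly 12.65, so the cell would be shaded bright green under this assumed rule.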

The raw, uncapped result files have a mixture of instant run-off and total bankroll agents, with no further processing or information. For the total bankroll result files, the second table gives the probability that the row player has a higher total bankroll than the column player. The SUM column is simply the sum of the values in the row, and is NOT used in winner determination. For the instant run-off results, the second table gives, for each round of the run-off, the SUM statistic describing the total bankroll performance of the remaining players.

Statistical significance could not be established for all results. We have marked each such placing with a '*'. Please see the section discussing the statistical analysis used to generate the results for additional details.

Download logs

Full results for total bankroll

Escabeche (Marv Andersen, UK)

SmooCT (Johannes Heinrich, UK)

Hyperborean (University of Alberta, Canada) *

* Hyperborean and Feste could not be separated with statistical significance

** An error by the chairs caused the originally reported results to be 1st: Escabeche, 2nd/3rd: Feste/Slugathorus

Download raw results and logs

Full results for total bankroll

Tartanian7 (Carnegie Mellon University, USA)

Nyx (Charles University, Prague) *

Prelude (Unfold Poker, USA) *

* Nyx and Prelude could not be separated with statistical significance

Full results for instant run-off

Tartanian7 (Carnegie Mellon University, USA)

Prelude (Unfold Poker, USA) *

Hyperborean (University of Alberta, Canada) *

* Prelude, Hyperborean, and Slumbot could not be separated with statistical significance

Download raw results and logs

Full results for total bankroll

Hyperborean (University of Alberta, Canada)

SmooCT (Johannes Heinrich, UK)

KEmpfer (Knowledge Engineering Group -- Technische Universität Darmstadt, Germany)

Full results for instant run-off

Hyperborean (University of Alberta, Canada)

SmooCT (Johannes Heinrich, UK)

KEmpfer (Knowledge Engineering Group -- Technische Universität Darmstadt, Germany)

Download logs

Full results for total bankroll

Hyperborean (University of Alberta, Canada)

Lucifer (PokerCPT, University of Porto, Portugal) *

HITSZ (School of Computer Science and Technology HIT, China) *

Three main techniques were used in the 2014 competition: duplicate matches, common seeds for the cards, and bootstrapping. Duplicate poker has been used in the competition in previous years, and it helps reduce the variance in outcomes caused by players receiving better (or worse) hands than their opponents. As in all previous years, we also used common seeds for the cards in the matches. For every hand, match #1 between players A and B will use the same cards as match #1 between players A and C, and the same cards as match #1 between players B and C. If we're comparing the total bankroll performance of two players, this use of common seeds can reduce the variance in outcomes caused by seeing different cards: we will be measuring the performance against the same opponents, using the same cards.
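The idea of common seeds can be sketched in a few lines of Python. This is an illustration only: it assumes the match number itself is used as the random seed and represents cards as integers 0 to 51, neither of which is claimed to match the actual dealer program.

```python
import random


def deal_match(match_number, num_hands, cards_per_hand=9):
    """Deal every hand of a match from a seed that depends only on the match number.

    cards_per_hand=9 covers two hole cards for each of two players plus five
    board cards. Cards are the integers 0..51.
    """
    rng = random.Random(match_number)  # assumption: the match number is the seed
    deck = list(range(52))
    return [rng.sample(deck, cards_per_hand) for _ in range(num_hands)]


# Match #1 between A and B and match #1 between A and C see identical cards,
# because neither the seed nor the deal depends on who is playing.
cards_a_vs_b = deal_match(1, num_hands=3)
cards_a_vs_c = deal_match(1, num_hands=3)
```

Because the cards are a function of the match number alone, any difference between A's results against B and against C on match #1 comes from the players' decisions rather than from the deal.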

In the no-limit event, we used all-in equity to evaluate matches where both players went all-in. If both players have committed their entire stack, there are no future player choices, so we can compute the expected winnings across all future board cards, rather than using a single board as in previous years. This can, but does not always, significantly reduce the variance because the amounts at stake in all-in situations are very large. The tool for this computation can be found in the ACPC dealer code.
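The following Python sketch illustrates the all-in equity idea: enumerate every possible completion of the board and average the outcome. It is not the ACPC tool. The `showdown` callback is a hypothetical evaluator, and the example uses a deliberately simplified toy rule (sum of the best two cards) on a ten-card deck instead of real poker hand ranking.

```python
from itertools import combinations


def all_in_equity(hole_a, hole_b, board, pot, showdown, deck):
    """Expected share of `pot` won by player A over all board completions.

    showdown(hole_a, hole_b, full_board) is a hypothetical evaluator that
    returns 1.0 if A wins, 0.5 for a split pot, and 0.0 if B wins.
    """
    seen = set(hole_a) | set(hole_b) | set(board)
    remaining = [card for card in deck if card not in seen]
    missing = 5 - len(board)
    total = 0.0
    count = 0
    for extra in combinations(remaining, missing):
        total += showdown(hole_a, hole_b, list(board) + list(extra))
        count += 1
    return pot * total / count


def toy_showdown(hole_a, hole_b, full_board):
    """Toy rule, not real poker: the higher sum of a player's best two cards wins."""
    def best_two(hole):
        return sum(sorted(list(hole) + list(full_board))[-2:])
    a, b = best_two(hole_a), best_two(hole_b)
    return 1.0 if a > b else 0.5 if a == b else 0.0


# Ten-card toy deck; four board cards are out, so one card is still to come.
equity = all_in_equity([9, 0], [8, 7], [1, 2, 3, 4], pot=100,
                       showdown=toy_showdown, deck=range(10))
```

With a single dealt river, the recorded result in this toy example would be either 0 or 50 chips depending on which of the two remaining cards appears; averaging over both gives 25 regardless of the deal.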

In the 2014 competition, we were not always able to complete a large enough number of matches between all opponents. The simple analysis based on confidence intervals breaks down in this case. If the number of matches varies between different pairs of players, we must make a very pessimistic assumption in computing the confidence interval for the expected value against a random opponent (the AVG column in the crosstable). To avoid this, the 2014 competition used bootstrapping.

Bootstrapping is a method of re-sampling, where new competition results are generated by sampling from the actual results. If there are 6612 matches in a competition, each sample is a new set of 6612 matches. We then compute some function(s) on the sampled competition and record the results. If we repeat this for a large number (say 1000) of bootstrap samples, we get some idea of how stable the function is for the given data. Note that this can potentially hide noisy results: if we don't have enough data, the result might appear stable, but only because we don't have enough data to see the variation. We do not expect this to be the case in the 2014 competition results.
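A minimal sketch of the basic bootstrap procedure in Python is shown below. It assumes matches are independent and can be sampled uniformly with replacement, and it ignores the participant-matching detail of the competition's analysis; the function name, the seed parameter, and the sample data are hypothetical.

```python
import random
import statistics


def bootstrap(matches, statistic, num_samples=1000, seed=0):
    """Evaluate `statistic` on `num_samples` re-sampled copies of `matches`.

    Each re-sampled competition has the same number of matches as the
    original, drawn uniformly with replacement.
    """
    rng = random.Random(seed)
    n = len(matches)
    results = []
    for _ in range(num_samples):
        sample = [matches[rng.randrange(n)] for _ in range(n)]
        results.append(statistic(sample))
    return results


# Invented per-match winnings for one player.
matches = [12.0, -3.0, 5.0, 8.0, -1.0, 4.0]
sampled_means = bootstrap(matches, statistics.fmean)
share_positive = sum(1 for m in sampled_means if m > 0) / len(sampled_means)
```

The spread of `sampled_means` gives a rough picture of how stable the statistic is. With only six matches, as here, the caveat above applies: the spread can only reflect variation that is present in the data.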

For the competition, we used a set of functions for every pair of players ROW and COLUMN. The function f_ROW,COLUMN() was 1 if player ROW's total bankroll was greater than player COLUMN's total bankroll, and 0 otherwise. The second table in the total bankroll results gives the average value of these functions, with blank entries for the functions that had an average value of 0. These values directly address the question we would like to ask: how often should a player be ranked above some other player? Note that we could have tried directly counting the number of times an agent was in each rank, but this can lose some information. For example, a player A might never be in third place, but just counting ranks is not enough to show the correlated information that whenever player A is second, player B is third. This occurred in the two-player no-limit competition.
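The pairwise functions described above can be sketched as follows. The input format (one dictionary of total bankrolls per bootstrap sample) and the numbers are assumptions made for illustration, not the format of the competition's analysis tool.

```python
def pairwise_win_table(sampled_bankrolls):
    """Average of f_ROW,COLUMN over bootstrap samples.

    sampled_bankrolls: list of dicts {player: total bankroll}, one per sample.
    Returns {(row, column): fraction of samples where row's bankroll is higher}.
    """
    players = sorted(sampled_bankrolls[0])
    n = len(sampled_bankrolls)
    table = {}
    for row in players:
        for col in players:
            if row == col:
                continue
            wins = sum(1 for s in sampled_bankrolls if s[row] > s[col])
            table[(row, col)] = wins / n
    return table


# Four invented bootstrap samples for three players.
samples = [
    {"A": 10.0, "B": 5.0, "C": 1.0},
    {"A": 8.0, "B": 9.0, "C": 2.0},
    {"A": 7.0, "B": 3.0, "C": 0.0},
    {"A": 6.0, "B": 4.0, "C": 5.0},
]
table = pairwise_win_table(samples)
```

In this toy data A is ahead of B in three of the four samples, so the entry for (A, B) is 0.75, and C is never ahead of A, so the entry for (C, A) is 0 and would be left blank in the result tables.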

The SUM value can be considered in two ways. First, it is its own statistic: the expected number of players that the row player beats, averaged across the bootstrap samples. Second, it is also an upper bound on all of the individual "A beats B" functions: if SUM < 0.05, we can be confident that the associated player is in last place. The instant run-off competition uses this second property. The individual comparison functions were used to determine a player to eliminate in each round, but we were fortunate enough that the SUM value just happened to summarise the outcome in an acceptable way. (In general, there could have been a situation where A lost to B 96% of the time and lost to C 96% of the time. We would then have said that A's last place was statistically significant, although the SUM value of 0.04 + 0.04 = 0.08 would not have shown it.)
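The relationship between the SUM column and the individual comparisons can be checked with a short Python sketch. The table values reproduce the hypothetical 96% situation from the paragraph above; the function names and the 0.5 entries between B and C are invented for illustration.

```python
def sum_column(table, players):
    """SUM for each row: the sum of that row's pairwise win fractions."""
    return {
        row: sum(table[(row, col)] for col in players if col != row)
        for row in players
    }


def last_by_sum(sums, player, alpha=0.05):
    """Sufficient test: SUM bounds every individual entry from above."""
    return sums[player] < alpha


def last_by_pairs(table, player, players, alpha=0.05):
    """Test each individual comparison against the threshold."""
    return all(table[(player, col)] < alpha for col in players if col != player)


players = ["A", "B", "C"]
table = {
    ("A", "B"): 0.04, ("A", "C"): 0.04,   # A loses to each opponent 96% of the time
    ("B", "A"): 0.96, ("B", "C"): 0.50,
    ("C", "A"): 0.96, ("C", "B"): 0.50,
}
sums = sum_column(table, players)
sum_test = last_by_sum(sums, "A")               # False: 0.08 is not below 0.05
pair_test = last_by_pairs(table, "A", players)  # True: each entry is below 0.05
```

Since every entry is non-negative, a SUM below the threshold implies that every individual entry is below it too, but, as this example shows, the reverse does not hold.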

There is one final wrinkle in the bootstrapping process. Different matches (and their associated cards that get dealt out) had different sets of participating agents, and we wish to keep the correlation between the performance of different agents on the same set of cards and opponents. To handle this, for each match in the original competition, we sample only from the matches in which all the necessary agents participated. For example, consider a competition with 4 players A, B, C, and D. If only players A, B, and D played match #10, in the bootstrap sample we could generate a new match #10 from all the matches where only A, B, and D played (including match #10) as well as all matches where A, B, C, and D played.
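The participant-aware sampling step could look roughly like the Python sketch below. The match records, field names, and the choice to keep only the results of the agents from the original match are assumptions made for this illustration; the actual analysis tool may differ.

```python
import random


def eligible_matches(needed_agents, matches):
    """Matches in which every needed agent participated."""
    needed = set(needed_agents)
    return [m for m in matches if needed <= set(m["results"])]


def resample_match(original, matches, rng):
    """Draw a replacement for `original` from the eligible matches.

    Only the results of agents that played the original match are kept
    (an assumption of this sketch).
    """
    needed = set(original["results"])
    chosen = rng.choice(eligible_matches(needed, matches))
    return {
        "id": original["id"],
        "source": chosen["id"],
        "results": {agent: chosen["results"][agent] for agent in needed},
    }


# Invented data: match 9 had all four agents, match 10 lacked C, match 11 lacked D.
matches = [
    {"id": 9, "results": {"A": 3.0, "B": -1.0, "C": 2.0, "D": -4.0}},
    {"id": 10, "results": {"A": 1.0, "B": 2.0, "D": -3.0}},
    {"id": 11, "results": {"A": -2.0, "B": 0.5, "C": 1.5}},
]
rng = random.Random(0)
new_match_10 = resample_match(matches[1], matches, rng)
candidate_ids = [m["id"] for m in eligible_matches({"A", "B", "D"}, matches)]
```

For match #10, the candidates are matches 9 and 10, mirroring the example in the text; match 11 is excluded because D did not play it.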

The tool used for analysis is available for download.