2018 Results

Results

The results of the 11th Annual Computer Poker Competition are presented below. This year, there were three events in two games. Heads-up no-limit Texas Hold'em had an event using the total-bankroll winner determination rule to encourage learning techniques, as well as an instant run-off event that encourages defensive play, as in traditional game-theoretic techniques.

For the first time this year, we also had an event in six-player no-limit Texas Hold'em. There were nine submissions, of which three were from Harbin Institute of Technology and two were from Southeast University. To mitigate the potential problem of implicit collusion, only one agent from each university was allowed to advance to the "final table" of six agents, at which point total bankroll was used to determine the ordering of agents.


Heads-up No-Limit Texas Hold'em

Total Bankroll

gold_medal Slumbot (Independent, USA)
silver_medal Feste (Independent, France)
bronze_medal HITSZ_LMW_2pn (HITSZ, China)

Bankroll Instant Run-off

gold_medal Slumbot (Independent, USA)
silver_medal Feste (Independent, France)
bronze_medal HITSZ_LMW_2pn (HITSZ, China)

 


Six-Player No-Limit Texas Hold'em

Total Bankroll

gold_medal PokerBot5 (Independent, Serbia)
silver_medal Paco (Independent, France)
bronze_medal HITSZ-Jaysen (HITSZ, China)

 

2017 Results

Results

The results of the tenth Annual Computer Poker Competition were to be announced February 5th at the Thirty-first Conference on Artificial Intelligence in San Francisco during the Computer Poker and Imperfect Information Games Workshop, but significant improvements to the competition infrastructure were required to keep the competition within our budget.  Thus, the results were delayed until November 2nd.

This year, the competition used two-player no-limit Hold'em as its testbed.  No-limit Texas Hold'em had an event using the total-bankroll winner determination rule to encourage learning techniques, as well as an instant run-off event that encourages defensive play, as in traditional game-theoretic techniques.

About the Results

The rules for this year's competition called for matches of duplicate poker to help mitigate the effects of "luck" and more accurately determine this year's winner. In total, about 45 million hands of poker were played.
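
To make the duplicate idea concrete, here is a minimal Python sketch (with invented numbers, not competition data) of duplicate scoring for a two-player match: the same deals are replayed with the seats swapped, and the two passes are averaged so that a lucky run of cards largely cancels out.

    def duplicate_average(forward, mirrored):
        # The game is zero-sum, so player A's result on the mirrored pass is the
        # negation of B's; A's duplicate-averaged winnings per hand are (f - m) / 2.
        return [(f - m) / 2.0 for f, m in zip(forward, mirrored)]

    forward = [3.0, -1.0, 0.5, 2.0]     # A's winnings per hand, original seating (big blinds)
    mirrored = [2.5, -1.5, 1.0, 1.5]    # B's winnings on the same cards, seats swapped
    print(duplicate_average(forward, mirrored))   # A's luck-reduced per-hand results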

In all of the PDF result files, the top table is a crosstable including all combinations of matches between players in the given variant of poker.  Crosstable entries are the expected winnings, in thousandths of a big blind per hand, for the row player.  The second value is the 95% confidence interval around the mean.  Cells are shaded bright green/red if the row player won/lost by a statistically significant margin (95% confidence). Cells are shaded light green/red if the row player won/lost by a statistically insignificant margin. If a cell is grey, no data is available for that combination of players (e.g., matches of an agent against itself or an agent entered exclusively under a different winner determination rule).
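
For concreteness, the Python sketch below shows how one such crosstable cell could be computed from the row player's per-hand winnings; the 1.96 factor is the usual normal-approximation 95% interval, and the per-hand data is invented rather than taken from the logs.

    import math

    def crosstable_entry(per_hand_winnings_bb):
        # Mean winnings and 95% confidence half-width, converted to
        # thousandths of a big blind (mbb) per hand.
        n = len(per_hand_winnings_bb)
        mean_bb = sum(per_hand_winnings_bb) / n
        var_bb = sum((x - mean_bb) ** 2 for x in per_hand_winnings_bb) / (n - 1)
        half_width_bb = 1.96 * math.sqrt(var_bb / n)
        return 1000.0 * mean_bb, 1000.0 * half_width_bb

    hands_bb = [1.0, -0.5, 0.0, 2.5, -1.0, 0.5, -0.5, 0.0]   # invented per-hand results
    mean_mbb, ci_mbb = crosstable_entry(hands_bb)
    print(f"{mean_mbb:.0f} +/- {ci_mbb:.0f} mbb/hand")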

The raw, uncapped result files have a mixture of instant run-off and total bankroll agents, with no further processing or information.  For the total bankroll result files, the second table gives the probability that the row player has a higher total bankroll than the column player.  The SUM column is simply the sum of the values in the row, and is NOT used in winner determination.  For the instant run-off results, the second table gives, for each round of the run-off, the SUM statistic describing the total bankroll performance of the remaining players.

Statistical confidence was not possible for all results. We have noted all such decisions with a '*'. Please see the section discussing the statistical analysis used to generate the results for additional details.


Heads-up No-Limit Texas Hold'em

The raw crosstable is available here.

The competition logs are available here.  The all-in EV processed competition logs are available here.

Total Bankroll

gold_medal Intermission (Unfold Poker, Las Vegas, USA)
silver_medal PokerBot5 (Independent, Serbia) *
bronze_medal Feste (Independent, France)

Bankroll Instant Run-off

gold_medal Intermission (Unfold Poker, Las Vegas, USA)
silver_medal Feste (Independent, France)
bronze_medal HITSZ_HKL (Harbin Institute of Technology, China) *

 


Statistical Analysis

There are three main techniques used in the 2017 competition: duplicate matches, common seeds for the cards, and bootstrapping.  Duplicate Poker has been used in the competition in previous years, and it helps reduce the variance in outcomes caused by players receiving better (or worse) hands than their opponents.  As in all previous years, we also used common seeds for the cards in the matches.  For every hand, match #1 between players A and B will use the same cards as match #1 between players A and C, and the same cards as match #1 between players B and C.  If we're comparing the total bankroll performance of two players, this use of common seeds can reduce the variance in outcomes caused by seeing different cards: we will be measuring the performance against the same opponents, using the same cards.
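
The Python sketch below illustrates the common-seed idea: the cards for a given match and hand number are a function of those indices only, not of which agents are playing, so match #1 sees the same deals in every pairing. The shuffling here is purely illustrative; the competition uses the ACPC dealer's own card generation.

    import random

    RANKS = "23456789TJQKA"
    DECK = [r + s for r in RANKS for s in "cdhs"]

    def deck_for(match_number, hand_number):
        # The seed depends only on (match, hand), never on the participants.
        rng = random.Random(match_number * 1000003 + hand_number)
        deck = DECK[:]
        rng.shuffle(deck)
        return deck

    # The same (match, hand) pair yields identical cards regardless of who plays:
    assert deck_for(1, 0) == deck_for(1, 0)
    print(deck_for(1, 0)[:4])   # first four cards dealt in hand 0 of match #1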

In the no-limit event, we used all-in equity to evaluate matches where both players went all-in.  If both players have committed their entire stack, there are no future player choices, so we can compute the expected winnings across all future board cards, rather than using a single board as in previous years.  This can, but does not always, significantly reduce the variance because the magnitude of all-in situations is very large.  The tool can be found in the ACPC dealer code.
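
The Python sketch below shows the enumeration involved; rank_hand() is a deliberately crude placeholder rather than a real poker evaluator, and the actual computation is performed by the tool in the ACPC dealer code.

    from itertools import combinations

    RANKS = "23456789TJQKA"
    DECK = [r + s for r in RANKS for s in "cdhs"]

    def rank_hand(cards):
        # Placeholder strength: card ranks sorted high-to-low (ignores suits,
        # pairs, straights, etc.); NOT a correct poker evaluator.
        return sorted((RANKS.index(c[0]) for c in cards), reverse=True)

    def all_in_equity(hole1, hole2, board_so_far):
        # Average player 1's share of the pot over every possible runout.
        dead = set(hole1) | set(hole2) | set(board_so_far)
        live = [c for c in DECK if c not in dead]
        total = count = 0
        for runout in combinations(live, 5 - len(board_so_far)):
            board = list(board_so_far) + list(runout)
            r1, r2 = rank_hand(hole1 + board), rank_hand(hole2 + board)
            total += 1.0 if r1 > r2 else 0.0 if r1 < r2 else 0.5
            count += 1
        return total / count

    print(all_in_equity(["Ah", "Kh"], ["Qd", "Qc"], ["2s", "7d", "9c"]))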

Bootstrapping is a method of re-sampling, where new competition results are generated by sampling from the actual results.  If there are 100 matches in a competition, each sample is a new set of 100 matches.  We then compute some function(s) on the sampled competition and record the results.  If we repeat this for some large (say 1000) bootstrap samples, we get some idea how stable the function is for the given data.  Note that this can potentially hide noisy results: if we don't have enough data, the result might appear stable, but only because we don't have enough data to see the variation.  We do not expect this to be the case in the 2017 competition results.
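
A bare-bones version of this resampling loop, with invented per-match results and the sample mean as the recorded statistic, might look like this in Python:

    import random

    def bootstrap(match_results, statistic, samples=1000, seed=0):
        # Resample the matches with replacement and recompute the statistic
        # on each sample.
        rng = random.Random(seed)
        n = len(match_results)
        values = []
        for _ in range(samples):
            resample = [match_results[rng.randrange(n)] for _ in range(n)]
            values.append(statistic(resample))
        return sorted(values)

    results = [120, -40, 15, 60, -10, 95, -70, 30, 45, 5]   # one agent's per-match winnings
    means = bootstrap(results, lambda s: sum(s) / len(s))
    print("central 95% of bootstrapped means:", means[25], "to", means[974])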

For the competition, we used a set of functions for every pair of players ROW and COLUMN.  The function f_ROW,COLUMN() was 1 if player ROW's total bankroll was greater than player COLUMN's total bankroll, and 0 otherwise.  The second table in the total bankroll results gives the average value of these functions, with blank entries for functions whose average value was 0.  These values directly address the question we would like to ask: how often should a player be ranked above some other player?  Note that we could have tried directly counting the number of times an agent was in each rank, but this can lose some information.  For example, a player A might never be in third place, but just counting ranks is not enough to show correlated information such as player B finishing third in every sample where player A finishes second.  This occurred in the two-player no-limit competition.
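
The Python sketch below shows how such a table could be assembled from bootstrap samples; the matches are invented and represented as {player: winnings} dictionaries, and the SUM column of the real tables is simply the row sum of these averages.

    import random

    def pairwise_table(matches, players, samples=1000, seed=0):
        # counts[(ROW, COLUMN)] = number of samples where ROW's total bankroll
        # exceeded COLUMN's; dividing by the sample count gives the table entry.
        rng = random.Random(seed)
        counts = {(r, c): 0 for r in players for c in players if r != c}
        n = len(matches)
        for _ in range(samples):
            sample = [matches[rng.randrange(n)] for _ in range(n)]
            bankroll = {p: 0.0 for p in players}
            for match in sample:
                for p, w in match.items():
                    bankroll[p] += w
            for (r, c) in counts:
                if bankroll[r] > bankroll[c]:
                    counts[(r, c)] += 1
        return {pair: count / samples for pair, count in counts.items()}

    players = ["A", "B", "C"]
    matches = [{"A": 10, "B": -10}, {"A": 5, "C": -5}, {"B": 2, "C": -2},
               {"A": -3, "B": 3}, {"B": -8, "C": 8}]
    table = pairwise_table(matches, players)
    print(table[("A", "B")], table[("A", "C")])
    print("SUM for A:", table[("A", "B")] + table[("A", "C")])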

The SUM value can be considered in two ways.  First, it is its own statistic: the expected number of players that the row player beats, averaged across the bootstrap samples.  Second, it is also an upper bound on all of the individual "A beats B" functions: if SUM < 0.05, we can be confident that the associated player is in last place. The instant run-off competition uses this second property.  The individual comparison functions were used to determine a player to eliminate in each round, but we were fortunate enough that the SUM value just happened to summarise the outcome in an acceptable way.  (In general, there could have been a situation where A lost to B 96% of the time and lost to C 96% of the time; the individual comparisons would then have said that A was significantly in last place, even though the SUM value did not show that.)
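
The sketch below gives an illustrative elimination loop in the same spirit: each round, total bankrolls are recomputed using only matches whose participants are all still in the field, and the weakest remaining player is dropped. The real competition makes the elimination decision with the bootstrapped pairwise comparisons described above; this is only a sketch of the round structure.

    def instant_runoff(matches, players):
        remaining = list(players)
        eliminated = []
        while len(remaining) > 1:
            bankroll = {p: 0.0 for p in remaining}
            for match in matches:                       # match: {player: winnings}
                if all(p in bankroll for p in match):   # only remaining players
                    for p, w in match.items():
                        bankroll[p] += w
            loser = min(remaining, key=bankroll.get)    # lowest total bankroll
            eliminated.append(loser)
            remaining.remove(loser)
        return remaining[0], eliminated

    players = ["A", "B", "C"]
    matches = [{"A": 10, "B": -10}, {"A": 5, "C": -5}, {"B": 2, "C": -2},
               {"A": -3, "B": 3}, {"B": -8, "C": 8}]
    print(instant_runoff(matches, players))   # winner and elimination order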

There is one final wrinkle in the bootstrapping process.  Different matches (and their associated cards that get dealt out) had different sets of participating agents, and we wish to keep the correlation between the performance of different agents on the same set of cards and opponents.  To handle this, for each match in the original competition, we only sample the results from matches where all the necessary agents participated.  For example, consider a competition with 4 players A, B, C, and D.  If only players A, B, and D played match #10, in the bootstrap sample we could generate a new match #10 from all the matches where only A, B, and D played (including match #10) as well as all matches where A, B, C, and D played.
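
A short Python sketch of that constraint, again with invented matches: when redrawing match #i, we sample only from original matches whose participant set contains everyone who played match #i.

    import random

    def constrained_sample(matches, seed=0):
        # matches: list of {player: winnings} dicts for one competition.
        rng = random.Random(seed)
        sample = []
        for match in matches:
            needed = set(match)
            candidates = [m for m in matches if needed <= set(m)]
            sample.append(rng.choice(candidates))
        return sample

    matches = [{"A": 4, "B": -4},                   # A and B only
               {"A": 1, "B": -3, "D": 2},           # A, B and D
               {"A": -2, "B": 1, "C": 3, "D": -2}]  # all four players
    print(constrained_sample(matches))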

The tool used for analysis is available for download.

2013 Results

The results of the eighth Annual Computer Poker Competition were announced July 14th at the Twenty-Seventh Conference on Artificial Intelligence in Bellevue during the second Computer Poker Symposium. There was a good field of participants.

Three variations of Texas Hold'em poker were played: heads-up limit, heads-up no-limit, and three-player limit. Two different winner determination rules were used to decide the winners in each variant: bankroll instant run-off and total bankroll.

About the Results

As with previous years, the rules for this year's competition called for matches of duplicate poker to help mitigate the effects of "luck" and more accurately determine this year's winner. In total, about 50 million hands of poker were played.

Each competition has three PDF files with numerical results. In all files, the top table in the PDF is a crosstable including all combinations of matches between players in the given variant of poker.  Crosstable entries are the expected winnings, in thousandths of a big blind per hand, for the row player.  The second value is the 95% confidence interval around the mean.  Cells are shaded bright green/red if the row player won/lost by a statistically significant margin (95% confidence). Cells are shaded light green/red if the row player won/lost by a statistically insignificant margin. If a cell is grey, no data is available for that combination of players (e.g., matches of an agent against itself or an agent entered exclusively under a different winner determination rule).

The raw, uncapped result files have a mixture of instant run-off and total bankroll agents, with no further processing or information.  For the total bankroll result files, the second table gives the probability that the row player has a higher total bankroll than the column player.  The SUM column is simply the sum of the values in the row, and is NOT used in winner determination.  For the instant run-off results, the second table gives, for each round of the run-off, the SUM statistic describing the total bankroll performance of the remaining players.

Statistical confidence was not possible for all results. We have noted all such decisions with a '*'. Please see the section discussing the statistical analysis used to generate the results for additional details.



 

Heads-up Limit Texas Hold'em

Download raw results and logs

Total Bankroll

Full results for total bankroll

gold_medal Marv (Marv Andersen, UK)
silver_medal Feste (Francois Pays, France) *
bronze_medal Hyperborean (University of Alberta, Canada) *

* Feste and Hyperborean could not be separated with statistical significance

Bankroll Instant Run-off

Full results for instant run-off

gold_medal Neo Poker Lab (Alexander Lee, Spain)
silver_medal Hyperborean (University of Alberta, Canada)
bronze_medal Zbot (Ilkka Rajala, Finland) *

* Zbot and Marv could not be separated with statistical significance



Heads-up No-Limit Texas Hold'em

Download raw results and logs

Total Bankroll

Full results for total bankroll

gold_medal Slumbot NL (Eric Jackson, USA) *
silver_medal Hyperborean (University of Alberta, Canada) *
bronze_medal Tartanian6 (Carnegie Mellon University, USA) *

* Slumbot / Hyperborean and Hyperborean / Tartanian6 could not be separated with statistical significance

Bankroll Instant Run-off

Full results for instant run-off

gold_medal Hyperborean (University of Alberta, Canada)
silver_medal Slumbot NL (Eric Jackson, USA)
bronze_medal Tartanian6 (Carnegie Mellon University, USA) *

* Tartanian6 and Nyx could not be separated with statistical significance



3-player Limit Texas Hold'em

Download raw results and logs

Total Bankroll

Full results for total bankroll

gold_medal Hyperborean (University of Alberta, Canada)
silver_medal Little Rock (Rod Byrnes, Australia)
bronze_medal Neo Poker Lab (Alexander Lee, Spain)

Bankroll Instant Run-off

Full results for instant run-off

gold_medal Hyperborean (University of Alberta, Canada)
silver_medal Little Rock (Rod Byrnes, Australia)
bronze_medal Neo Poker Bot (Alexander Lee, Spain)


Statistical Analysis

There are three main techniques used in the 2013 competition: duplicate matches, common seeds for the cards, and bootstrapping.  Duplicate Poker has been used in the competition in previous years, and it helps reduce the variance in outcomes caused by players receiving better (or worse) hands than their opponents.  As in all previous years, we also used common seeds for the cards in the matches.  For every hand, match #1 between players A and B will use the same cards as match #1 between players A and C, and the same cards as match #1 between players B and C.  If we're comparing the total bankroll performance of two players, this use of common seeds can reduce the variance in outcomes caused by seeing different cards: we will be measuring the performance against the same opponents, using the same cards.

In the 2013 competition, we were not able to complete a large number of matches between all opponents, especially in the two-player no-limit division.  The simple analysis based on confidence intervals breaks down in this case.  If the number of matches varies between different pairs of players, we must make a very pessimistic assumption when computing the confidence interval for the expected value against a random opponent (the AVG column in the crosstable).  To avoid this, the 2013 competition used bootstrapping.

Bootstrapping is a method of re-sampling, where new competition results are generated by sampling from the actual results.  If there are 6612 matches in a competition, each sample is a new set of 6612 matches.  We then compute some function(s) on the sampled competition and record the results.  If we repeat this for some large (say 1000) bootstrap samples, we get some idea how stable the function is for the given data.  Note that this can potentially hide noisy results: if we don't have enough data, the result might appear stable, but only because we don't have enough data to see the variation.  We do not expect this to be the case in the 2013 competition results.

For the competition, we used a set of functions for every pair of players ROW and COLUMN.  The function f_ROW,COLUMN() was 1 if player ROW's total bankroll was greater than player COLUMN's total bankroll, and 0 otherwise.  The second table in the total bankroll results gives the average value of these functions, with blank entries for functions whose average value was 0.  These values directly address the question we would like to ask: how often should a player be ranked above some other player?  Note that we could have tried directly counting the number of times an agent was in each rank, but this can lose some information.  For example, a player A might never be in third place, but just counting ranks is not enough to show correlated information such as player B finishing third in every sample where player A finishes second.  This occurred in the two-player no-limit competition.

The SUM value can be considered in two ways.  First, it is its own statistic: the expected number of players that the row player beats, averaged across the bootstrap samples.  Second, it is also an upper bound on all of the individual "A beats B" functions: if SUM < 0.05, we can be confident that the associated player is in last place. The instant run-off competition uses this second property.  The individual comparison functions were used to determine a player to eliminate in each round, but we were fortunate enough that the SUM value just happened to summarise the outcome in an acceptable way.  (In general, there could have been a situation where A lost to B 96% of the time and lost to C 96% of the time; the individual comparisons would then have said that A was significantly in last place, even though the SUM value did not show that.)

There is one final wrinkle in the bootstrapping process.  Different matches (and their associated cards that get dealt out) had different sets of participating agents, and we wish to keep the correlation between the performance of different agents on the same set of cards and opponents.  To handle this, for each match in the original competition, we only sample the results from matches where all the necessary agents participated.  For example, consider a competition with 4 players A, B, C, and D.  If only players A, B, and D played match #10, in the bootstrap sample we could generate a new match #10 from all the matches where only A, B, and D played (including match #10) as well as all matches where A, B, C, and D played.

The tool used for analysis is available for download.

2014 Results

The results of the ninth Annual Computer Poker Competition were announced July 14th at the Twenty-Eighth Conference on Artificial Intelligence in Quebec City during the second Computer Poker Symposium.  The list of participants will be available shortly.

A range of games was played to explore different problems and techniques for intelligent agents in imperfect information games, from traditional two-player game-theoretic techniques to multi-player modeling and online learning.  This year, the competition used two-player limit Hold'em, two-player no-limit Hold'em, three-player limit Hold'em, and a three-player variant of Kuhn poker.  All games had an event using the total-bankroll winner determination rule to encourage learning techniques.  Two-player no-limit Hold'em and three-player limit Hold'em also had an instant run-off event that encourages defensive play, as in traditional game-theoretic techniques.

About the Results

The rules for this year's competition called for matches of duplicate poker to help mitigate the effects of "luck" and more accurately determine this year's winner. In total, about 50 million hands of poker were played.

In all of the PDF result files, the top table is a crosstable including all combinations of matches between players in the given variant of poker.  Crosstable entries are the expected winnings, in thousandths of a big blind per hand, for the row player.  The second value is the 95% confidence interval around the mean.  Cells are shaded bright green/red if the row player won/lost by a statistically significant margin (95% confidence). Cells are shaded light green/red if the row player won/lost by a statistically insignificant margin. If a cell is grey, no data is available for that combination of players (e.g., matches of an agent against itself or an agent entered exclusively under a different winner determination rule).

The raw, uncapped result files have a mixture of instant run-off and total bankroll agents, with no further processing or information.  For the total bankroll result files, the second table gives the probability that the row player has a higher total bankroll than the column player.  The SUM column is simply the sum of the values in the row, and is NOT used in winner determination.  For the instant run-off results, the second table gives, for each round of the run-off, the SUM statistic describing the total bankroll performance of the remaining players.

Statistical confidence was not possible for all results. We have noted all such decisions with a '*'. Please see the section discussing the statistical analysis used to generate the results for additional details.



 

Heads-up Limit Texas Hold'em

Download logs

Total Bankroll

Full results for total bankroll

gold_medal Escabeche (Marv Andersen, UK)
silver_medal SmooCT (Johannes Heinrich, UK)
bronze_medal Hyperborean (University of Alberta, Canada) *

* Hyperborean and Feste could not be separated with statistical significance

** An error by the chairs caused the originally reported results to be 1st: Escabeche, 2nd/3rd: Feste/Slugathorus



Heads-up No-Limit Texas Hold'em

Download raw results and logs

Total Bankroll

Full results for total bankroll

gold_medal Tartanian7 (Carnegie Mellon University, USA)
silver_medal Nyx (Charles University, Prague) *
bronze_medal Prelude (Unfold Poker, USA) *

* Nyx and Prelude could not be separated with statistical significance

Bankroll Instant Run-off

Full results for instant run-off

gold_medal Tartanian7 (Carnegie Mellon University, USA)
silver_medal Prelude (Unfold Poker, USA) *
bronze_medal Hyperborean (University of Alberta, Canada) *

* Prelude, Hyperborean, and Slumbot could not be separated with statistical significance



3-player Limit Texas Hold'em

Download raw results and logs

Total Bankroll

Full results for total bankroll

gold_medal Hyperborean (University of Alberta, Canada)
silver_medal SmooCT (Johannes Heinrich, UK)
bronze_medal KEmpfer (Knowledge Engineering Group -- Technische Universität Darmstadt, Germany)

Bankroll Instant Run-off

Full results for instant run-off

gold_medal Hyperborean (University of Alberta, Canada)
silver_medal SmooCT (Johannes Heinrich, UK)
bronze_medal KEmpfer (Knowledge Engineering Group -- Technische Universität Darmstadt, Germany)


 

Three player Kuhn poker

Download logs

Total Bankroll

Full results for total bankroll

gold_medal Hyperborean (University of Alberta, Canada)
silver_medal Lucifer (PokerCPT, University of Porto, Portugal) *
bronze_medal HITSZ (School of Computer Science and Technology HIT, China) *


Statistical Analysis

There are three main techniques used in the 2014 competition: duplicate matches, common seeds for the cards, and bootstrapping.  Duplicate Poker has been used in the competition in previous years, and it helps reduce the variance in outcomes caused by players receiving better (or worse) hands than their opponents.  As in all previous years, we also used common seeds for the cards in the matches.  For every hand, match #1 between players A and B will use the same cards as match #1 between players A and C, and the same cards as match #1 between players B and C.  If we're comparing the total bankroll performance of two players, this use of common seeds can reduce the variance in outcomes caused by seeing different cards: we will be measuring the performance against the same opponents, using the same cards.

In the no-limit event, we used all-in equity to evaluate matches where both players went all-in.  If both players have committed their entire stack, there are no future player choices, so we can compute the expected winnings across all future board cards, rather than using a single board as in previous years.  This can, but does not always, significantly reduce the variance because the magnitude of all-in situations is very large.  The tool can be found in the ACPC dealer code.

In the 2014 competition, we were not always able to complete a large enough number of matches between all opponents.  The simple analysis based on confidence intervals breaks down in this case.  If the number of matches varies between different pairs of players, we must make a very pessimistic assumption when computing the confidence interval for the expected value against a random opponent (the AVG column in the crosstable).  To avoid this, the 2014 competition used bootstrapping.

Bootstrapping is a method of re-sampling, where new competition results are generated by sampling from the actual results.  If there are 6612 matches in a competition, each sample is a new set of 6612 matches.  We then compute some function(s) on the sampled competition and record the results.  If we repeat this for some large (say 1000) bootstrap samples, we get some idea how stable the function is for the given data.  Note that this can potentially hide noisy results: if we don't have enough data, the result might appear stable, but only because we don't have enough data to see the variation.  We do not expect this to be the case in the 2014 competition results.

For the competition, we used a set of functions for every pair of players ROW and COLUMN.  The function f_ROW,COLUMN() was 1 if player ROW's total bankroll was greater than player COLUMN's total bankroll, and 0 otherwise.  The second table in the total bankroll results gives the average value of these functions, with blank entries for functions whose average value was 0.  These values directly address the question we would like to ask: how often should a player be ranked above some other player?  Note that we could have tried directly counting the number of times an agent was in each rank, but this can lose some information.  For example, a player A might never be in third place, but just counting ranks is not enough to show correlated information such as player B finishing third in every sample where player A finishes second.  This occurred in the two-player no-limit competition.

The SUM value can be considered in two ways.  First, it is its own statistic: the expected number of players that the row player beats, averaged across the bootstrap samples.  Second, it is also an upper bound on all of the individual "A beats B" functions: if SUM < 0.05, we can be confident that the associated player is in last place. The instant run-off competition uses this second property.  The individual comparison functions were used to determine a player to eliminate in each round, but we were fortunate enough that the SUM value just happened to summarise the outcome in an acceptable way.  (In general, there could have been a situation where A lost to B 96% of the time and lost to C 96% of the time; the individual comparisons would then have said that A was significantly in last place, even though the SUM value did not show that.)

There is one final wrinkle in the bootstrapping process.  Different matches (and their associated cards that get dealt out) had different sets of participating agents, and we wish to keep the correlation between the performance of different agents on the same set of cards and opponents.  To handle this, for each match in the original competition, we only sample the results from matches where all the necessary agents participated.  For example, consider a competition with 4 players A, B, C, and D.  If only players A, B, and D played match #10, in the bootstrap sample we could generate a new match #10 from all the matches where only A, B, and D played (including match #10) as well as all matches where A, B, C, and D played.

The tool used for analysis is available for download.

2016 Results

Amendment

Note that the results originally announced at AAAI 2016 were erroneous.  In particular, some additional matches that were run to separate the top four agents (those using seeds 100 and higher) were biased, as all of these matches used the same sequence of cards.  This bias invalidates the previously posted results of the Bankroll Instant Run-off competition.  We gratefully thank Vahe Hagopian, who brought this to our attention.

To correct this error, additional matches were run between the top four agents, and subsequently between the top two.  The log files, crosstable, and ordering listed below have been corrected.

Results

The results of the tenth Annual Computer Poker Competition were announced February 13th at the Thirtieth Conference on Artificial Intelligence in Phoenix during the Computer Poker and Imperfect Information Games Workshop.

This year, the competition used two-player no-limit Hold'em as its testbed.  There was insufficient interest in the three-player Kuhn poker event, and the limit Hold'em events were temporarily removed until they can be revamped.  No-limit Texas Hold'em had an event using the total-bankroll winner determination rule to encourage learning techniques, as well as an instant run-off event that encourages defensive play, as in traditional game-theoretic techniques.

About the Results

The rules for this year's competition called for matches of duplicate poker to help mitigate the effects of "luck" and more accurately determine this year's winner. In total, about 15 million hands of poker were played.

In all of the PDF result files, the top table is a crosstable including all combinations of matches between players in the given variant of poker.  Crosstable entries are the expected winnings, in thousandths of a big blind per hand, for the row player.  The second value is the 95% confidence interval around the mean.  Cells are shaded bright green/red if the row player won/lost by a statistically significant margin (95% confidence). Cells are shaded light green/red if the row player won/lost by a statistically insignificant margin. If a cell is grey, no data is available for that combination of players (e.g., matches of an agent against itself or an agent entered exclusively under a different winner determination rule).

The raw, uncapped result files have a mixture of instant run-off and total bankroll agents, with no further processing or information.  For the total bankroll result files, the second table gives the probability that the row player has a higher total bankroll than the column player.  The SUM column is simply the sum of the values in the row, and is NOT used in winner determination.  For the instant run-off results, the second table gives, for each round of the run-off, the SUM statistic describing the total bankroll performance of the remaining players.

Statistical confidence was not possible for all results. We have noted all such decisions with a '*'. Please see the section discussing the statistical analysis used to generate the results for additional details.



Heads-up No-Limit Texas Hold'em

The raw crosstable is available here.

The competition logs are available in three parts: one, two, and three.  The all-in EV processed competition logs are available in three parts: one, two, and three.

Total Bankroll

gold_medal Baby Tartanian8 (Carnegie Mellon University, USA)
silver_medal Slumbot (Independent Researcher, USA) *
bronze_medal Act1 (Unfold Poker, USA) *

Bankroll Instant Run-off

gold_medal Baby Tartanian8 (Carnegie Mellon University, USA)
silver_medal Slumbot (Independent Researcher, USA)
bronze_medal Act1 (Unfold Poker, USA)

 



Statistical Analysis

There are three main techniques used in the 2016 competition: duplicate matches, common seeds for the cards, and bootstrapping.  Duplicate Poker has been used in the competition in previous years, and it helps reduce the variance in outcomes caused by players receiving better (or worse) hands than their opponents.  As in all previous years, we also used common seeds for the cards in the matches.  For every hand, match #1 between players A and B will use the same cards as match #1 between players A and C, and the same cards as match #1 between players B and C.  If we're comparing the total bankroll performance of two players, this use of common seeds can reduce the variance in outcomes caused by seeing different cards: we will be measuring the performance against the same opponents, using the same cards.

In the no-limit event, we used all-in equity to evaluate matches where both players went all-in.  If both players have committed their entire stack, there are no future player choices, so we can compute the expected winnings across all future board cards, rather than using a single board as in previous years.  This can, but does not always, significantly reduce the variance because the magnitude of all-in situations is very large.  The tool can be found in the ACPC dealer code.

Bootstrapping is a method of re-sampling, where new competition results are generated by sampling from the actual results.  If there are 6612 matches in a competition, each sample is a new set of 6612 matches.  We then compute some function(s) on the sampled competition and record the results.  If we repeat this for some large (say 1000) bootstrap samples, we get some idea how stable the function is for the given data.  Note that this can potentially hide noisy results: if we don't have enough data, the result might appear stable, but only because we don't have enough data to see the variation.  We do not expect this to be the case in the 2016 competition results.

For the competition, we used a set of functions for every pair of players ROW and COLUMN.  The function f_ROW,COLUMN() was 1 if player ROW's total bankroll was greater than player COLUMN's total bankroll, and 0 otherwise.  The second table in the total bankroll results gives the average value of these functions, with blank entries for functions whose average value was 0.  These values directly address the question we would like to ask: how often should a player be ranked above some other player?  Note that we could have tried directly counting the number of times an agent was in each rank, but this can lose some information.  For example, a player A might never be in third place, but just counting ranks is not enough to show correlated information such as player B finishing third in every sample where player A finishes second.  This occurred in the two-player no-limit competition.

The SUM value can be considered in two ways.  First, it is its own statistic: the expected number of players that the row player beats, averaged across the bootstrap samples.  Second, it is also an upper bound on all of the individual "A beats B" functions: if SUM < 0.05, we can be confident that the associated player is in last place. The instant run-off competition uses this second property.  The individual comparison functions were used to determine a player to eliminate in each round, but we were fortunate enough that the SUM value just happened to summarise the outcome in an acceptable way.  (In general, there could have been a situation where A lost to B 96% of the time and lost to C 96% of the time; the individual comparisons would then have said that A was significantly in last place, even though the SUM value did not show that.)

There is one final wrinkle in the bootstrapping process.  Different matches (and their associated cards that get dealt out) had different sets of participating agents, and we wish to keep the correlation between the performance of different agents on the same set of cards and opponents.  To handle this, for each match in the original competition, we only sample the results from matches where all the necessary agents participated.  For example, consider a competition with 4 players A, B, C, and D.  If only players A, B, and D played match #10, in the bootstrap sample we could generate a new match #10 from all the matches where only A, B, and D played (including match #10) as well as all matches where A, B, C, and D played.

The tool used for analysis is available for download.
