Combining No-regret and Q-learning

4 years ago

Combining No-regret and Q-learning

Counterfactual Regret Minimization (CFR) has found success in settings like poker which have both terminal states and perfect recall. We seek to understand how to relax these requirements. As a first step, we introduce a simple algorithm, local no-regret learning (LONR), which uses a Q-learning-like update rule to allow learning without terminal states or perfect recall. We prove its convergence for the basic case of MDPs (where Q-learning already suffices), as well as limited extensions of them. With a straightforward modification, we extend the basic premise of LONR to work in multi-agent settings and present empirical results showing that it achieves last iterate convergence in a number of settings. Most notably, we show this for NoSDE games, a class of Markov games specifically designed to be impossible for Q-value-based methods to learn and where no prior algorithm is known to achieve convergence to a stationary equilibrium even on average. Furthermore, by leveraging last iterate converging no-regret algorithms (one of which we introduce), we show empirical last iterate convergence in all domains tested with LONR.

Publisher URL: https://www.microsoft.com/en-us/research/publication/combining-no-regret-and-q-learning-2/

DOI: microsoft_corporation.25090.129