Tradeoffs in Contextual Bandit Learning

Monday, May 16, 2016 - 2:00pm - 2:50pm
Keller 3-180
Alekh Agarwal (Microsoft Research)
The contextual bandit problem is a classic example of the statistical tradeoff between exploration and exploitation. Upon taking an action, a learner only observes the reward of that action, while no feedback on the other possible actions is received. In additional to the natural information tradeoff between experimenting with unknown actions and relying on known good actions, the problem also has some intriguing tradeoffs between computational and statistical considerations. In this talk, we will discuss two old and one new algorithms which span different points along this tradeoff.

Without computational considerations, the EXP4 algorithm of Auer et al. (2002) solves the problem in its most general adversarial formulation. With suboptimal statistical guarantees, the Epoch-Greedy approach of Langford and Zhang (2008) gives essentially a computationally optimal approach. More recently, Agarwal et al. (2014) develop a new scheme which is statistically optimal in a weaker i.i.d. setting, and is also computationally efficient although more expensive than the Epoch-Greedy approach. We will further discuss the various trade-offs that arise in implementing this approach, along with the resulting theoretical considerations.
MSC Code: