Delay and Cooperation in Nonstochastic Bandits
Tuesday, May 17, 2016 - 11:10am - 12:00pm
We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network G to receive messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm, and prove a bound on the average per-agent regret after T rounds. We then show that for any connected graph G of K agents, our regret bound is strictly better than the minimax regret for noncooperating agents. More informed choices of d lead to bounds that are arbitrarily close to the full-information minimax regret when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
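To make the delayed-feedback setting concrete, the following is a minimal illustrative sketch (not the talk's actual Exp3-Coop algorithm) of a single Exp3 learner whose importance-weighted loss estimates only become available d rounds after each action is played; the function name, parameters, and loss interface are all assumptions made for illustration.

```python
import math
import random

def exp3_delayed(K, T, eta, d, loss_fn, rng=random.Random(0)):
    """Sketch of Exp3 with loss feedback delayed by d rounds.

    K: number of actions, T: horizon, eta: learning rate, d: delay.
    loss_fn(t, arm) -> adversarial loss in [0, 1] for that arm at round t.
    Returns the total loss incurred by the learner.
    """
    weights = [1.0] * K
    pending = []  # (round when estimate becomes available, arm, loss estimate)
    total_loss = 0.0
    for t in range(T):
        # Apply the importance-weighted estimates whose delay has elapsed.
        ready = [u for u in pending if u[0] <= t]
        pending = [u for u in pending if u[0] > t]
        for _, arm, est in ready:
            weights[arm] *= math.exp(-eta * est)
        z = sum(weights)
        probs = [w / z for w in weights]
        # Sample an action from the exponential-weights distribution.
        arm = rng.choices(range(K), weights=probs)[0]
        loss = loss_fn(t, arm)
        total_loss += loss
        # The unbiased estimate loss / p(arm) is revealed only d rounds later.
        pending.append((t + d, arm, loss / probs[arm]))
    return total_loss
```

In the cooperative setting of the talk, each agent would additionally fold in the (delayed) loss estimates received from its neighbors in G; the sketch above isolates only the delay mechanism.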