# Approximation algorithms for restless bandit problems

@article{Guha2010ApproximationAF, title={Approximation algorithms for restless bandit problems}, author={Sudipto Guha and Kamesh Munagala and Peng Shi}, journal={ArXiv}, year={2010}, volume={abs/0711.3861} }

The restless bandit problem is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit (MAB) problem in decision theory. In its ultimate generality, the restless bandit problem is known to be PSPACE-Hard to approximate to any nontrivial factor, and little progress has been made on this problem despite its significance in modeling activity allocation under uncertainty.
In this article, we consider the Feedback MAB problem, where the reward obtained by… Expand

#### 116 Citations

Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret

- 2013

In this paper we consider the problem of learning the optimal dynamic policy for uncontrolled restless bandit problems. In an uncontrolled restless bandit problem, there is a finite set of arms, each… Expand

Approximations of the Restless Bandit Problem

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2019

It is shown that under some conditions on the $\varphi$-mixing coefficients, a modified version of UCB can prove effective, and a sub-class of the multi-armed restless bandit problem is characterized where approximate solutions can be found using tractable approaches. Expand

Approximations of the Restless Bandit Problem

- 2016

The multi-armed restless bandit problem is studied in the case where the pay-offs are not necessarily independent over time nor across the arms, but the joint distribution over the arms is φ-mixing.… Expand

Optimality of Myopic Policy for Restless Multiarmed Bandit with Imperfect Observation

- Computer Science, Mathematics
- 2016 IEEE Global Communications Conference (GLOBECOM)
- 2016

This paper performs an analytical study on the considered RMAB problem, and establishes a set of closed-form conditions to guarantee the optimality of the myopic policy. Expand

Multi-policy posterior sampling for restless Markov bandits

- Mathematics, Computer Science
- 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
- 2014

A polynomial time algorithm is proposed that learns transitional parameters for each arm and selects the perceived optimal policy from a set of predefined policies using a beliefs or probability distributions using randomized probability matching or better known as Thompson Sampling. Expand

The non-Bayesian restless multi-armed bandit: A case of near-logarithmic regret

- Computer Science, Mathematics
- 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2011

This work develops an original approach to the RMAB problem that is applicable when the corresponding Bayesian problem has the structure that the optimal solution is one of a prescribed finite set of policies, and develops a novel sensing policy for opportunistic spectrum access over unknown dynamic channels. Expand

The Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Strict Regret

- Computer Science, Mathematics
- ArXiv
- 2011

It is proved that the original approach to the non-Bayesian RMAB problem, in which the parameters of the Markov chain are assumed to be unknown, achieves near-logarithmic regret, which leads to the same average reward that can be achieved by the optimal policy under a known model. Expand

Optimal Adaptive Learning in Uncontrolled Restless Bandit Problems

- Computer Science, Mathematics
- 2012

This paper proposes a learning algorithm with logarithmic regret uniformly over time with respect to the optimal finite horizon policy for uncontrolled restless bandit problems, and extends the optimal adaptive learning of MDPs to POMDPs. Expand

Online Learning of Rested and Restless Bandits

- Computer Science, Mathematics
- IEEE Transactions on Information Theory
- 2012

It is shown that logarithmic regret algorithms exist both for the centralized rested and restless bandit problems and for the decentralized setting, and an algorithm with logarathmic regret with respect to the optimal centralized arm allocation is proposed. Expand

Approximation Algorithms for Bayesian Multi-Armed Bandit Problems

- Mathematics, Computer Science
- ArXiv
- 2013

It is shown that by restricting the state spaces of arms the authors can find single arm policies and that these single arm Policies can be combined into global (near) index policies where the approximate version of the exchange property is true in expectation. Expand

#### References

SHOWING 1-10 OF 89 REFERENCES

On Index Policies for Restless Bandit Problems

- Computer Science
- 2007

This paper considers the restless bandit problem, which is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit problems in decision theory, and shows that for an interesting and general subclass that the authors term RECOVERING bandits, a surprisingly simple and intuitive greedy policy yields a factor 2 approximation. Expand

Approximation Algorithms for Partial-Information Based Stochastic Control with Markovian Rewards

- Mathematics, Computer Science
- 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)
- 2007

A constant factor approximation to the feedback MAB problem is designed by solving and rounding a natural LP relaxation to this problem, which is the first approximation algorithm for a POMDP problem. Expand

The Nonstochastic Multiarmed Bandit Problem

- Mathematics, Computer Science
- SIAM J. Comput.
- 2002

A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. Expand

Indexability of Restless Bandit Problems and Optimality of Whittle Index for Dynamic Multichannel Access

- Computer Science, Mathematics
- IEEE Transactions on Information Theory
- 2010

This work establishes the indexability and obviates the need to know the Markov transition probabilities in Whittle index policy, and develops efficient algorithms for computing a performance upper bound given by Lagrangian relaxation. Expand

Gambling in a rigged casino: The adversarial multi-armed bandit problem

- Computer Science, Mathematics
- Proceedings of IEEE 36th Annual Foundations of Computer Science
- 1995

A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs is given. Expand

Approximation algorithms for budgeted learning problems

- Mathematics, Computer Science
- STOC '07
- 2007

The first approximation algorithms for a large class of budgeted learning problems, including the budgeted multi-armed bandit problem, are presented, providing approximate policies that achieve a reward within constant factor of the reward optimal policy. Expand

Finite-time Analysis of the Multiarmed Bandit Problem

- Computer Science
- Machine Learning
- 2004

This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support. Expand

Restless Bandits, Linear Programming Relaxations, and a Primal-Dual Index Heuristic

- Mathematics, Computer Science
- Oper. Res.
- 2000

A mathematical programming approach for the classicalPSPACE-hard restless bandit problem in stochastic optimization is developed and a priority-index heuristic scheduling policy from the solution to the firstorder relaxation is proposed, where the indices are defined in terms of optimal dual variables. Expand

A Restless Bandit Formulation of Multi-channel Opportunistic Access: Indexablity and Index Policy

- Computer Science, Economics
- ArXiv
- 2008

This work forms the problem of optimal sequential channel selection as a restless multi-a rmed bandit process, for which a powerful index policy—Whittle’s index policy)—can be implemented with remarkably low complexity on the indexability of the system. Expand

Adapting to a Changing Environment: the Brownian Restless Bandits

- Computer Science
- COLT
- 2008

The goal here is to characterize the cost of learning and adapting to the changing environment, in terms of the stochastic rate of the change, which is an infinite time horizon and defined with respect to a hypothetical algorithm that at every step plays the arm with the maximum expected reward at this step. Expand