Operations Research Transactions

Previous Articles     Next Articles

A survey on online learning methods: Thompson sampling and others

HE Simai1,*  JIN Yujia2  WANG HuaGE Dongdong3   

  1. 1. School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China 2. School of Mathematical Sciences, Fudan University, Shanghai 200433, China 3. Research Institute for Interdisciplinary Sciences, Shanghai University of Finance and Economics, Shanghai 200433, China
  • Received:2017-08-30 Online:2017-12-15 Published:2017-12-15

Abstract:

The paper is a survey on the latest research results, major theories and algorithms in the field of online learning. The topic of online learning is a broad one, and we aim at introducing the principles of the basic algorithms and ideas to the readers. We start from the most standard models and algorithm design, and extend all the way to a more general presentation on the latest developments in the area.

To begin with, we take the standard online optimization model, the Multi-Armed Bandit Problem, as an example. Then we discuss Thompson Sampling algorithms and Upper Confidence Bound algorithms, analyzing and presenting the main idea and newest theoretical achievements, with further discussion about the extensions and applications of Thompson Sampling in some more complicated real-world online learning scenarios. Furthermore, the paper gives a brief introduction about online convex optimization, which serves as an effective and well-known framework in solving Multi-Armed Bandit problem and other application problems.

Key words: online learning, multi-armed bandit, Thompson sampling, upper confidence bound, contextual multi-armed bandit, online convex optimization