What is Thompson Sampling?
Thompson Sampling is a Bayesian algorithm for solving the multi-armed bandit problem — the challenge of choosing between multiple options (arms) when the reward of each option is uncertain. It works by maintaining a probability distribution for each option's expected reward and sampling from these distributions to make decisions, naturally balancing exploration (trying uncertain options) with exploitation (choosing the best-known option).
How Thompson Sampling Works
1. Prior Distribution — Each option starts with a prior belief about its reward rate, typically a Beta distribution for binary outcomes
2. Sampling — At each decision point, draw a random sample from each option's current distribution
3. Selection — Choose the option whose sample is highest
4. Update — After observing the outcome, update the chosen option's distribution (Bayesian posterior update)
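The loop above can be sketched in a few lines of Python using the standard library's Beta sampler. This is a minimal illustration, not production code; the three "true" success rates exist only to simulate outcomes and would be unknown in a real deployment.

```python
import random

random.seed(0)  # seeded so the simulation is reproducible

# True success rates for the simulation only; in practice these are unknown.
true_rates = [0.3, 0.5, 0.7]
n_arms = len(true_rates)

# Prior: Beta(1, 1) (uniform) per arm. alpha counts successes, beta failures.
alpha = [1.0] * n_arms
beta = [1.0] * n_arms

for _ in range(2000):
    # Sampling: one draw from each arm's current Beta posterior.
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
    # Selection: the arm whose sample is highest.
    arm = max(range(n_arms), key=lambda i: samples[i])
    # Update: observe a binary reward and update that arm's posterior.
    reward = 1 if random.random() < true_rates[arm] else 0
    alpha[arm] += reward
    beta[arm] += 1 - reward

# Posterior means concentrate near the true rates as evidence accumulates.
estimates = [alpha[i] / (alpha[i] + beta[i]) for i in range(n_arms)]
```

Note that the prior and the update are the only state the algorithm keeps: two counters per arm. The conjugacy of the Beta prior with Bernoulli outcomes is what makes the posterior update a pair of increments.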
Why It Works So Well
Thompson Sampling is probability-matching: options with higher uncertainty get explored more often because their samples have higher variance. As evidence accumulates, distributions narrow and the algorithm naturally shifts from exploration to exploitation. This makes it remarkably efficient — it achieves near-optimal regret bounds while being simple to implement.
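The variance-driven exploration can be made concrete with the Beta distribution's standard deviation. In this sketch, two arms share the same empirical success rate (2/3), but the rarely observed arm has a much wider posterior, so its samples overshoot more often and it gets picked more frequently than its mean alone would justify:

```python
from math import sqrt

def beta_std(a: float, b: float) -> float:
    # Standard deviation of a Beta(a, b) distribution.
    return sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Same posterior mean (2/3), very different amounts of evidence:
wide = beta_std(2, 1)      # few observations: high spread, explored often
narrow = beta_std(200, 100)  # many observations: tight posterior, near-greedy
```

As the counts grow, the standard deviation shrinks roughly like 1/sqrt(n), which is exactly the exploration-to-exploitation transition described above.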
Comparison with Other Approaches
Compared to epsilon-greedy (explore uniformly at random a fixed fraction ε of the time) and UCB (Upper Confidence Bound), Thompson Sampling typically requires fewer trials to identify the best option. Epsilon-greedy wastes exploration budget on clearly inferior options. UCB can over-explore in high-dimensional spaces. Thompson Sampling's probabilistic approach allocates exploration budget proportionally to actual uncertainty.
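The difference can be seen in a small head-to-head simulation, a rough sketch with illustrative parameters (5000 rounds, ε = 0.1, three Bernoulli arms). Epsilon-greedy keeps spending its fixed ε budget on all arms forever, including clearly inferior ones, while Thompson Sampling's exploration fades as posteriors narrow:

```python
import random

def run_epsilon_greedy(true_rates, rounds=5000, eps=0.1, seed=0):
    """Return the fraction of rounds the best arm was pulled."""
    rng = random.Random(seed)
    n = len(true_rates)
    counts, values = [0] * n, [0.0] * n
    best = max(range(n), key=lambda i: true_rates[i])
    best_pulls = 0
    for _ in range(rounds):
        if rng.random() < eps:
            arm = rng.randrange(n)  # explores all arms uniformly, forever
        else:
            arm = max(range(n), key=lambda i: values[i])
        reward = 1 if rng.random() < true_rates[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
        best_pulls += arm == best
    return best_pulls / rounds

def run_thompson(true_rates, rounds=5000, seed=0):
    """Return the fraction of rounds the best arm was pulled."""
    rng = random.Random(seed)
    n = len(true_rates)
    a, b = [1.0] * n, [1.0] * n
    best = max(range(n), key=lambda i: true_rates[i])
    best_pulls = 0
    for _ in range(rounds):
        arm = max(range(n), key=lambda i: rng.betavariate(a[i], b[i]))
        reward = 1 if rng.random() < true_rates[arm] else 0
        a[arm] += reward
        b[arm] += 1 - reward  # exploration shrinks as posteriors tighten
        best_pulls += arm == best
    return best_pulls / rounds

ts_frac = run_thompson([0.3, 0.5, 0.7])
eg_frac = run_epsilon_greedy([0.3, 0.5, 0.7])
```

Note the structural limit in epsilon-greedy: with ε = 0.1 and three arms it misallocates roughly ε·(n−1)/n of all rounds to suboptimal arms no matter how much evidence it has, whereas Thompson Sampling's misallocation rate keeps falling.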
Applications Beyond Bandits
Thompson Sampling extends beyond simple A/B testing: ad placement optimization, clinical trial design, recommendation systems, hyperparameter tuning, and AI agent reward optimization. Any scenario where you must repeatedly choose between options with uncertain outcomes is a candidate.
Thompson Sampling in RendereelStudio
In RendereelStudio's multi-agent systems, Thompson Sampling is used to optimize agent decision-making. When an agent has multiple strategies for accomplishing a task (e.g., different beat detection algorithms, different upscaling models), Thompson Sampling selects the strategy most likely to produce the best result based on historical performance, while still occasionally exploring alternatives that might prove superior.
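The strategy-selection pattern described above could look roughly like the following. This is a hypothetical sketch: the `StrategySelector` class and the strategy names are illustrative placeholders, not RendereelStudio's actual API.

```python
import random

class StrategySelector:
    """Hypothetical Thompson Sampling selector over named strategies.

    Not RendereelStudio's actual API; a minimal sketch of the pattern.
    """

    def __init__(self, strategies, seed=None):
        self._rng = random.Random(seed)
        # Beta(1, 1) prior per strategy: [successes + 1, failures + 1].
        self._stats = {name: [1.0, 1.0] for name in strategies}

    def choose(self):
        # Sample each strategy's posterior; pick the highest sample.
        samples = {name: self._rng.betavariate(a, b)
                   for name, (a, b) in self._stats.items()}
        return max(samples, key=samples.get)

    def record(self, name, success):
        # Bayesian posterior update from the observed outcome.
        self._stats[name][0 if success else 1] += 1

# Illustrative usage with made-up beat-detection strategy names:
selector = StrategySelector(["onset_energy", "spectral_flux"], seed=1)
strategy = selector.choose()
selector.record(strategy, success=True)
```

Because the selector still samples rather than always taking the posterior mean, a strategy that has underperformed historically keeps a small, shrinking chance of being retried, which is how an alternative that later improves can still be discovered.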
See It in Action