in old slang, a slot machine is called a "one-armed bandit" — because of its lever/arm, and its tendency to steal your money.
the setup
imagine a row of slot machines (one-armed bandits). each one has a different, unknown payout distribution. you have a fixed number of pulls. which lever do you pull, and how often?
that's the multi-armed bandit problem. it's the cleanest possible model of a tension that shows up everywhere:
- exploration — try new things to learn what works
- exploitation — do the thing you already know works
pull only the best-known machine and you might miss out on a better one. spread pulls evenly and you waste pulls on bad machines. the goal is to minimize regret — the gap between what you got and what an oracle who knew the best machine from the start would have gotten.
here's where regret comes from in a classic a/b-test workflow:
every user who lands on a losing variant during the collect/learn/test phases is regret. the bigger the red region, the more user value you spent paying for information. bandit algorithms work by shrinking that red region — they shift traffic toward the winner while learning, instead of waiting until the rollout.
why it matters
bandit problems aren't really about casinos. they're a clean lens on a class of decisions that recur constantly:
- ad ranking — show the variant that's converting, but keep testing new ones
- recommender systems — surface known winners without freezing the catalog
- a/b testing — but adaptive instead of fixed-horizon
- clinical trials — assign more patients to the treatment that's working
the framing is a sequential decision under uncertainty with limited feedback. you only learn about the arm you pull.
algorithms
epsilon-greedy
imagine you just moved to a new city with five pizza places nearby. you want to eat the best pizza most often, but you don't yet know which one is best.
two extreme strategies:
- always go to your current favorite — you might miss out on a better place you haven't tried.
- always pick a random place — you'll learn a lot, but waste many meals on bad pizza.
epsilon-greedy says: most of the time, go to the place you currently think is best. occasionally (with probability ε), pick a random place — just to keep an open mind. that "occasionally" is the whole trick.
let:
K= number of armsε= a small number between 0 and 1, e.g.0.1μ̂ₐ= current estimated mean reward for arma
at each time step:
p = random()
if p < ε:
explore — pick a random arm uniformly from all K arms
else:
exploit — pick the arm with the highest current μ̂
observe the reward
update μ̂ for the chosen arm (running average)
two lines of real logic. that simplicity is its main appeal.
main weakness — it explores uniformly, even when it's already very confident some arms are bad. it never stops exploring unless you decay ε over time. you'll see this below: even after thousands of impressions, the three losing anime keep getting surfaced around 2.5% of the time each (a third of ε).
netflix testing which anime thumbnail to surface in the 'for you' rail.
upper confidence bound (ucb)
ucb's philosophy fits in a phrase: be optimistic about arms you haven't tried much.
think about it like hiring. you have:
- candidate A: interviewed 50 times. average score 7.5/10. you're confident she's a 7.5.
- candidate B: interviewed only 2 times. average score 7.0/10. but B could actually be a 9. or a 4. you're not sure.
a greedy algorithm would always pick A (higher mean). but B has more uncertainty — and that uncertainty might hide a much better candidate. ucb says: give B the benefit of the doubt. assume the optimistic case until proven otherwise.
this solves the explore/exploit dilemma without needing a random ε. exploration emerges naturally from uncertainty itself.
for each arm a at time step t, compute its ucb score:
UCB_a(t) = μ̂_a + √(2 · ln t / n_a)
then pick the arm with the highest ucb score.
breaking it down:
μ̂_a— current estimated mean reward of arma(the "exploit" term)n_a— number of times armahas been pulledt— total number of pulls so far across all arms√(2 · ln t / n_a)— the exploration bonus, or confidence radius
intuition for the bonus term:
- smaller
n_a→ bonus is larger → arm gets more attractive. uncertainty calls for optimism. - larger
n_a→ bonus shrinks → arm's score approaches its true average. - larger
t(more total time) → bonus grows slowly, logarithmically — keeps a small exploration impulse alive.
in words: estimated value + uncertainty bonus = optimistic estimate.
regret bounds are logarithmic in the number of pulls, which is provably optimal up to constants. unlike ε-greedy, ucb stops exploring losing arms once it's confident — anime that nobody adds to their list get nudged less and less, instead of forever.
netflix testing which anime to highlight in an 'add to my list' nudge.
thompson sampling
a wine tasting analogy. imagine three wine bottles. after a few sips of each, you have:
- bottle A: pretty sure it's around 7/10 (tried 50 sips)
- bottle B: maybe 6/10? but could be 8 or 4 (tried 5 sips)
- bottle C: no idea — maybe 5? (tried 1 sip)
for each bottle, in your head, you draw a plausible rating given your current uncertainty:
- A: "i'd guess 7.1" — narrow distribution, sampled value close to 7
- B: "i'd guess 7.8 today" — wide distribution, today's sample came out high
- C: "i'd guess 4.2" — very wide, today's sample came out low
you pick B for your next sip — not because B's average is best, but because today's imagined draw was highest. tomorrow you might draw A: 7.0, B: 5.5, C: 8.0 — and try C instead. this naturally balances exploring uncertain arms with exploiting confident winners.
how it works
thompson sampling is bayesian. for each arm, maintain a posterior distribution over its true mean — initially wide (weak prior), narrowing as evidence accumulates.
on each round:
for each arm a:
draw θ_a ~ posterior(a)
pull arg max θ_a
observe reward r
update posterior(arg max θ_a) with r
it explores more when posteriors are wide (early on) and exploits more as they sharpen. arms with high uncertainty occasionally produce optimistic samples and get pulled — but only as long as their posteriors stay wide.
for bernoulli rewards (success/failure), beta priors make the math trivial. each arm is a Beta(α, β) where:
α= 1 + number of successesβ= 1 + number of failures
drawing from Beta(α, β) is one line in any stats library. empirically, thompson sampling usually beats ucb. theoretically, it has the same logarithmic regret bound — provably optimal up to constants.
netflix testing which anime to autoplay first — did the viewer finish episode 1?
what this looks like in practice
netflix runs bandits at every stage of the funnel — and the right metric depends on what the surface is trying to do.
- 'for you' rail → click-through rate. did the thumbnail get a click?
- 'add to my list' nudge → add-to-list rate. did the user save it for later?
- autoplay → episode-1 completion rate. did they actually finish the first episode?
three different surfaces, three different metrics, same anime catalog. each one has a hidden true rate the algorithm doesn't see. it just picks an anime, watches whether the user acted, and updates.
across the three demos above, the winner doesn't get declared. it gradually absorbs more impressions until the losing options barely show up. how fast that happens depends on the algorithm: thompson reallocates fastest, ε-greedy slowest (it never stops exploring), and ucb sits in between.
that gradual reallocation is the whole point: every viewer is a little more likely than the last to see the anime most likely to make them act — at whichever step of the funnel you're optimizing for.