multi-arm bandit problem

notes on the exploration / exploitation tradeoff and the algorithms that try to balance it.

in old slang, a slot machine is called a "one-armed bandit" — because of its lever/arm, and its tendency to steal your money.

the setup

imagine a row of slot machines (one-armed bandits). each one has a different, unknown payout distribution. you have a fixed number of pulls. which lever do you pull, and how often?

that's the multi-armed bandit problem. it's the cleanest possible model of a tension that shows up everywhere:

exploration — try new things to learn what works
exploitation — do the thing you already know works

pull only the best-known machine and you might miss out on a better one. spread pulls evenly and you waste pulls on bad machines. the goal is to minimize regret — the gap between what you got and what an oracle who knew the best machine from the start would have gotten.

here's where regret comes from in a classic a/b-test workflow:

every user who lands on a losing variant during the collect/learn/test phases is regret. the bigger the red region, the more user value you spent paying for information. bandit algorithms work by shrinking that red region — they shift traffic toward the winner while learning, instead of waiting until the rollout.

why it matters

bandit problems aren't really about casinos. they're a clean lens on a class of decisions that recur constantly:

ad ranking — show the variant that's converting, but keep testing new ones
recommender systems — surface known winners without freezing the catalog
a/b testing — but adaptive instead of fixed-horizon
clinical trials — assign more patients to the treatment that's working

the framing is a sequential decision under uncertainty with limited feedback. you only learn about the arm you pull.

algorithms

epsilon-greedy

imagine you just moved to a new city with five pizza places nearby. you want to eat the best pizza most often, but you don't yet know which one is best.

two extreme strategies:

always go to your current favorite — you might miss out on a better place you haven't tried.
always pick a random place — you'll learn a lot, but waste many meals on bad pizza.

epsilon-greedy says: most of the time, go to the place you currently think is best. occasionally (with probability ε), pick a random place — just to keep an open mind. that "occasionally" is the whole trick.

let:

K = number of arms
ε = a small number between 0 and 1, e.g. 0.1
μ̂ₐ = current estimated mean reward for arm a

at each time step:

p = random()

if p < ε:
  explore — pick a random arm uniformly from all K arms
else:
  exploit — pick the arm with the highest current μ̂

observe the reward
update μ̂ for the chosen arm (running average)

two lines of real logic. that simplicity is its main appeal.

main weakness — it explores uniformly, even when it's already very confident some arms are bad. it never stops exploring unless you decay ε over time. you'll see this below: even after thousands of impressions, the three losing anime keep getting surfaced around 2.5% of the time each (a third of ε).

ε-greedy (ε = 0.1) · 1 impressions

attack on titantruth 32%0.0% shown · click-through rate 0.0%

demon slayertruth 25%0.0% shown · click-through rate 0.0%

jujutsu kaisentruth 18%0.0% shown · click-through rate 0.0%

one piecetruth 15%0.0% shown · click-through rate 0.0%

netflix testing which anime thumbnail to surface in the 'for you' rail.

upper confidence bound (ucb)

ucb's philosophy fits in a phrase: be optimistic about arms you haven't tried much.

think about it like hiring. you have:

candidate A: interviewed 50 times. average score 7.5/10. you're confident she's a 7.5.
candidate B: interviewed only 2 times. average score 7.0/10. but B could actually be a 9. or a 4. you're not sure.

a greedy algorithm would always pick A (higher mean). but B has more uncertainty — and that uncertainty might hide a much better candidate. ucb says: give B the benefit of the doubt. assume the optimistic case until proven otherwise.

this solves the explore/exploit dilemma without needing a random ε. exploration emerges naturally from uncertainty itself.

for each arm a at time step t, compute its ucb score:

UCB_a(t) = μ̂_a + √(2 · ln t / n_a)

then pick the arm with the highest ucb score.

breaking it down:

μ̂_a — current estimated mean reward of arm a (the "exploit" term)
n_a — number of times arm a has been pulled
t — total number of pulls so far across all arms
√(2 · ln t / n_a) — the exploration bonus, or confidence radius

intuition for the bonus term:

smaller n_a → bonus is larger → arm gets more attractive. uncertainty calls for optimism.
larger n_a → bonus shrinks → arm's score approaches its true average.
larger t (more total time) → bonus grows slowly, logarithmically — keeps a small exploration impulse alive.

in words: estimated value + uncertainty bonus = optimistic estimate.

regret bounds are logarithmic in the number of pulls, which is provably optimal up to constants. unlike ε-greedy, ucb stops exploring losing arms once it's confident — anime that nobody adds to their list get nudged less and less, instead of forever.

ucb1 · 1 impressions

attack on titantruth 42%0.0% shown · add-to-list rate 0.0%

demon slayertruth 28%0.0% shown · add-to-list rate 0.0%

jujutsu kaisentruth 22%0.0% shown · add-to-list rate 0.0%

one piecetruth 16%0.0% shown · add-to-list rate 0.0%

netflix testing which anime to highlight in an 'add to my list' nudge.

thompson sampling

a wine tasting analogy. imagine three wine bottles. after a few sips of each, you have:

bottle A: pretty sure it's around 7/10 (tried 50 sips)
bottle B: maybe 6/10? but could be 8 or 4 (tried 5 sips)
bottle C: no idea — maybe 5? (tried 1 sip)

for each bottle, in your head, you draw a plausible rating given your current uncertainty:

A: "i'd guess 7.1" — narrow distribution, sampled value close to 7
B: "i'd guess 7.8 today" — wide distribution, today's sample came out high
C: "i'd guess 4.2" — very wide, today's sample came out low

you pick B for your next sip — not because B's average is best, but because today's imagined draw was highest. tomorrow you might draw A: 7.0, B: 5.5, C: 8.0 — and try C instead. this naturally balances exploring uncertain arms with exploiting confident winners.

how it works

thompson sampling is bayesian. for each arm, maintain a posterior distribution over its true mean — initially wide (weak prior), narrowing as evidence accumulates.

on each round:

for each arm a:
  draw θ_a ~ posterior(a)

pull arg max θ_a

observe reward r
update posterior(arg max θ_a) with r

it explores more when posteriors are wide (early on) and exploits more as they sharpen. arms with high uncertainty occasionally produce optimistic samples and get pulled — but only as long as their posteriors stay wide.

for bernoulli rewards (success/failure), beta priors make the math trivial. each arm is a Beta(α, β) where:

α = 1 + number of successes
β = 1 + number of failures

drawing from Beta(α, β) is one line in any stats library. empirically, thompson sampling usually beats ucb. theoretically, it has the same logarithmic regret bound — provably optimal up to constants.

thompson sampling · 1 starts

attack on titantruth 55%0.0% shown · episode-1 completion 0.0%

demon slayertruth 41%0.0% shown · episode-1 completion 0.0%

jujutsu kaisentruth 32%0.0% shown · episode-1 completion 0.0%

one piecetruth 24%0.0% shown · episode-1 completion 0.0%

netflix testing which anime to autoplay first — did the viewer finish episode 1?

what this looks like in practice

netflix runs bandits at every stage of the funnel — and the right metric depends on what the surface is trying to do.

'for you' rail → click-through rate. did the thumbnail get a click?
'add to my list' nudge → add-to-list rate. did the user save it for later?
autoplay → episode-1 completion rate. did they actually finish the first episode?

three different surfaces, three different metrics, same anime catalog. each one has a hidden true rate the algorithm doesn't see. it just picks an anime, watches whether the user acted, and updates.

across the three demos above, the winner doesn't get declared. it gradually absorbs more impressions until the losing options barely show up. how fast that happens depends on the algorithm: thompson reallocates fastest, ε-greedy slowest (it never stops exploring), and ucb sits in between.

that gradual reallocation is the whole point: every viewer is a little more likely than the last to see the anime most likely to make them act — at whichever step of the funnel you're optimizing for.

sources

stitch fix — multi-armed bandits and the explore/exploit trade-off