a follow-up on the multi-armed bandit note. thompson sampling on its own picks one winning option for everybody. netflix does something cleverer.
the actual problem
every title on netflix has several candidate artworks. the same show — stranger things — gets shown with a horror-vibe thumbnail to one user, a kids-on-bikes thumbnail to another. a horror fan and a romcom fan click on completely different art for the same content. so "which thumbnail wins?" has no single answer.
a vanilla a/b test would converge on the one thumbnail with the highest average ctr. whenever segments disagree about which thumbnail is best, that's worse than showing each segment the one that works for them.
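to make that concrete, here's a tiny worked example. every number in it is made up: the segment names, the thumbnail names, the ctrs and the 50/50 audience split are assumptions for illustration, not netflix data.

```python
# hypothetical click-through rates per (segment, thumbnail); invented for illustration
ctr = {
    "horror_fan": {"dark_art": 0.12, "bikes_art": 0.04},
    "romcom_fan": {"dark_art": 0.03, "bikes_art": 0.09},
}
share = {"horror_fan": 0.5, "romcom_fan": 0.5}  # assumed audience mix
arms = ["dark_art", "bikes_art"]


def average_ctr(arm):
    """ctr of one thumbnail averaged over the whole audience."""
    return sum(share[seg] * ctr[seg][arm] for seg in share)


# a/b-style outcome: one global winner shown to everybody
global_winner = max(arms, key=average_ctr)
global_ctr = average_ctr(global_winner)

# personalized outcome: each segment sees its own best thumbnail
personalized_ctr = sum(share[seg] * max(ctr[seg].values()) for seg in share)

print(global_winner, round(global_ctr, 3), round(personalized_ctr, 3))
```

with these invented numbers the global winner is the dark artwork at an average ctr of 0.075, while per-segment selection reaches 0.105. the gap only exists because the two segments disagree on the best thumbnail.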
the trick: contextual thompson sampling
the standalone version of thompson sampling keeps one belief per arm: Beta(S, F). clicks → S goes up. non-clicks → F goes up. sample, pick highest, update. nothing about who the user is.
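a minimal sketch of that non-contextual loop, using only python's standard library. the names (`BetaArm`, `choose`, `simulate`) are mine, and starting every arm at Beta(1, 1) is an assumption. the simulated "true" ctrs exist only so there's something to click on.

```python
import random


class BetaArm:
    """one belief per arm: Beta(s, f), starting from a flat Beta(1, 1) (assumed)."""

    def __init__(self, s=1.0, f=1.0):
        self.s = s  # prior pseudo-count + observed clicks
        self.f = f  # prior pseudo-count + observed non-clicks

    def sample(self, rng):
        return rng.betavariate(self.s, self.f)

    def update(self, clicked):
        if clicked:
            self.s += 1
        else:
            self.f += 1


def choose(arms, rng):
    """sample one guess per arm and return the index of the highest guess."""
    guesses = [arm.sample(rng) for arm in arms]
    return max(range(len(arms)), key=guesses.__getitem__)


def simulate(true_ctrs, rounds, seed=0):
    """run sample -> pick -> update against simulated users (illustration only)."""
    rng = random.Random(seed)
    arms = [BetaArm() for _ in true_ctrs]
    pulls = [0] * len(true_ctrs)
    for _ in range(rounds):
        i = choose(arms, rng)
        clicked = rng.random() < true_ctrs[i]
        arms[i].update(clicked)
        pulls[i] += 1
    return arms, pulls
```

calling something like `simulate([0.05, 0.10, 0.30], rounds=5000)` should show the pull counts piling up on the last arm as its belief tightens. note that nothing in here ever looks at who the user is.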
the contextual version keeps a belief per (arm, user) — or, more practically, per (arm, user-feature-bucket). same sample-pick-update loop, but the guess for "thumbnail 3" now depends on whether the current user has watched a lot of horror, or it's a saturday night, or whatever signals you've decided to encode.
so the loop is:
for the incoming user:
    for each thumbnail:
        guess = sample(belief(thumbnail, user_features))
    show the thumbnail with the highest guess
    log impression + (later) click
    update that thumbnail's belief for users like this one
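here's how that loop might look as runnable python, in the simplest "one Beta belief per (thumbnail, user-feature-bucket)" form. this is a sketch under assumptions, not netflix's implementation: the class name, the bucket labels, the flat Beta(1, 1) starting belief and the simulated click rates are all hypothetical, and a real system would derive the bucket from actual user features.

```python
import random
from collections import defaultdict


class ContextualThompson:
    """thompson sampling with one Beta(S, F) belief per (thumbnail, bucket)."""

    def __init__(self, thumbnails, seed=None):
        self.thumbnails = list(thumbnails)
        self.rng = random.Random(seed)
        # beliefs[(thumbnail, bucket)] = [S, F]; unseen pairs start at Beta(1, 1)
        self.beliefs = defaultdict(lambda: [1.0, 1.0])

    def pick(self, bucket):
        """sample a guess per thumbnail for this bucket, return the highest."""
        def guess(thumbnail):
            s, f = self.beliefs[(thumbnail, bucket)]
            return self.rng.betavariate(s, f)
        return max(self.thumbnails, key=guess)

    def update(self, thumbnail, bucket, clicked):
        """update only the belief for this thumbnail and this kind of user."""
        self.beliefs[(thumbnail, bucket)][0 if clicked else 1] += 1


def simulate(true_ctr, rounds, seed=0):
    """drive the loop with simulated users; true_ctr[bucket][thumbnail] is made up."""
    rng = random.Random(seed)
    buckets = list(true_ctr)
    thumbnails = list(next(iter(true_ctr.values())))
    policy = ContextualThompson(thumbnails, seed=seed + 1)
    shown = defaultdict(int)
    for _ in range(rounds):
        bucket = rng.choice(buckets)                 # the incoming user
        thumbnail = policy.pick(bucket)              # sample, pick highest
        clicked = rng.random() < true_ctr[bucket][thumbnail]
        policy.update(thumbnail, bucket, clicked)    # update for users like this one
        shown[(bucket, thumbnail)] += 1
    return shown
```

if you feed it two buckets with opposite preferences, the impression counts should end up favouring a different thumbnail in each bucket. the bucketed form is the crudest way to add context. it shares nothing between buckets, so each one has to learn from scratch.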
what makes it actually work
a few things i'd miss if i wasn't paying attention:
- no fixed split. there's no "10% control, 90% experiment." every user gets a fresh sample. share of traffic per thumbnail just emerges from how confident the system is that each one is best — for that kind of user.
- probability matching. share of traffic per thumbnail roughly equals the current posterior probability that it's the best. nice property: you explore exactly as much as your uncertainty justifies, no more.
- batched updates. strictly per-user updates don't scale. in practice you sample per request, batch the outcomes (every few minutes), and refresh beliefs in mini-batches. you lose a little optimality, gain operability.
- delayed reward. clicks come fast. watch-time arrives later. log the impression now, reconcile the reward when it lands (e.g. a 24h attribution window), update beliefs once it's final.
- cold start. new thumbnails get a wide prior (flat Beta(1, 1), or warm-started from the platform's average ctr). flat priors get explored aggressively early because their samples are wild. they tighten with data.
- floors and caps. force every arm to keep at least ~1% of traffic so you don't permanently kill an unlucky-early one. cap any single arm at e.g. 90% so you keep collecting signal in case tastes shift.
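two of the points above (probability matching, floors and caps) are easy to sketch together. one possible approach, and it's only my assumption about how you might wire it up: estimate each arm's probability of being best by monte carlo, then turn those into serving shares that respect a floor and a cap. the function names and the defaults (1% floor, 90% cap, 10,000 draws) are illustrative. with context you'd run this once per user bucket.

```python
import random


def prob_best(beliefs, draws=10_000, seed=0):
    """monte carlo estimate of P(arm is best) from a list of Beta (S, F) pairs."""
    rng = random.Random(seed)
    wins = [0] * len(beliefs)
    for _ in range(draws):
        samples = [rng.betavariate(s, f) for s, f in beliefs]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


def apply_floor_and_cap(p_best, floor=0.01, cap=0.90):
    """turn P(best) into traffic shares with a per-arm floor and a single-arm cap."""
    n = len(p_best)
    if n < 2 or n * floor > 1 or cap <= 0.5:
        raise ValueError("need >= 2 arms, n * floor <= 1, and cap > 0.5")
    # every arm gets `floor`; the remaining traffic follows P(best)
    shares = [floor + (1 - n * floor) * p for p in p_best]
    # at most one arm can exceed a cap above 0.5; spread its excess evenly
    top = max(range(n), key=shares.__getitem__)
    if shares[top] > cap:
        extra = (shares[top] - cap) / (n - 1)
        shares = [cap if i == top else s + extra for i, s in enumerate(shares)]
    return shares
```

at serving time you could then draw the arm with `random.choices(arms, weights=shares)` instead of taking the raw highest sample. that trades a bit of the pure thompson behaviour for explicit guarantees: an arm the posterior has nearly written off still gets its floor, and a runaway winner still leaves at least 10% of traffic for everything else.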
why this beats running an a/b test
an a/b test holds the split fixed for two weeks and stops learning the moment you "finalize." thompson sampling shifts traffic toward winners as evidence builds and never stops learning. add a new thumbnail tomorrow and it gets a wide prior — the algorithm folds it in without anyone running a fresh experiment.
the personalization part is what makes the netflix case interesting. without context, the system finds one winner. with context, it finds a different winner for every kind of viewer — and that's the whole point.