iOS AND ANDROID
Topics:Choose a tag
“The Multi-armed bandit problem
The multi-armed bandit problem takes its terminology from a casino. You are faced with a wall of slot machines, each with its own lever. You suspect that some slot machines pay out more frequently than others. How can you learn which machine is the best, and get the most coins in the fewest trials?
Like many techniques in machine learning, the simplest strategy is hard to beat. More complicated techniques are worth considering, but they may eke out only a few hundredths of a percentage point of performance. The strategy that has been shown to win out time after time in practical problems is the epsilon-greedy method. We always keep track of the number of pulls of the lever and the amount of rewards we have received from that lever. 10% of the time, we choose a lever at random. The other 90% of the time, we choose the lever that has the highest expectation of rewards.
def choose(): if math.random() < 0.1: # exploration! # choose a random lever 10% of the time. else: # exploitation! # for each lever, # calculate the expectation of reward. # This is the number of trials of the lever divided by the total reward # given by that lever. # choose the lever with the greatest expectation of reward. # increment the number of times the chosen lever has been played. # store test data in redis, choice in session key, etc.. def reward(choice, amount): # add the reward to the total for the given lever.
Lets say we are choosing a colour for the “Buy now!” button. The choices are orange, green, or white. We initialize all three choices to 1 win out of 1 try. It doesn’t really matter what we initialize them too, because the algorithm will adapt. So when we start out, the internal test data looks like this.
|1/1 = 100%||1/1=100%||1/1=100%|
Then a web site visitor comes along and we have to show them a button. We choose the first one with the highest expectation of winning. The algorithm thinks they all work 100% of the time, so it chooses the first one: orange. But, alas, the visitor doesn’t click on the button.
|1/2 = 50%||1/1=100%||1/1=100%|
Another visitor comes along. We definately won’t show them orange, since we think it only has a 50% chance of working. So we choose Green. They don’t click. The same thing happens for several more visitors, and we end up cycling through the choices. In the process, we refine our estimate of the click through rate for each option downwards.
|1/4 = 25%||1/4=25%||1/4=25%|
But suddenly, someone clicks on the orange button! Quickly, the browser makes an Ajax call to our reward function
$.ajax(url:"/reward?testname=buy-button"); and our code updates the results:
|2/5 = 40%||1/4=25%||1/4=25%|
When our intrepid web developer sees this, he scratches his head. What the F*? The orange button is the worst choice. It’s font is tiny! The green button is obviously the better one. All is lost! The greedy algorithm will always choose it forever now!
But wait, let’s see what happens if Orange is really the suboptimal choice. Since the algorithm now believes it is the best, it will always be shown. That is, until it stops working well. Then the other choices start to look better.
|2/9 = 22%||1/4=25%||1/4=25%|
After many more visits, the best choice, if there is one, will have been found, and will be shown 90% of the time. Here are some results based on an actual web site that I have been working on. We also have an estimate of the click through rate for each choice.
|114/4071 = 2.8%||205/6385=3.2%||59/2264=2.6%|
I have not discussed the randomization part. The randomization of 10% of trials forces the algorithm to explore the options. It is a trade-off between trying new things in hopes of something better, and sticking with what it knows will work. There are several variations of the epsilon-greedy strategy. In the epsilon-first strategy, you can explore 100% of the time in the beginning and once you have a good sample, switch to pure-greedy. Alternatively, you can have it decrease the amount of exploration as time passes. The epsilon-greedy strategy that I have described is a good balance between simplicity and performance. Learning about the other algorithms, such as UCB, Boltzmann Exploration, and methods that take context into account, is fascinating, but optional if you just want something that works.”