August 2019

Volume 34 Number 8

### [Test Run]

# The UCB1 Algorithm for Multi-Armed Bandit Problems

Imagine you’re in a casino standing in front of three slot machines. You have 30 tokens. Each machine wins according to a different probability distribution, and these distributions are unknown to you. Your goal is to quickly find the best machine so you can maximize the amount of money you win. This is an example of a multi-armed bandit problem—named as such because slot machines are informally referred to as one-armed bandits.

In your day-to-day work environment, it’s unlikely you’ll have to deal with casino slot machines. But the multi-armed bandit scenario corresponds to many real-life problems. For example, a pharmaceutical company that has three new drugs for a medical condition has to find which drug is the most effective with a minimum number of clinical trials on human subjects. And an online advertising campaign with several new approaches needs to find which approach maximizes revenue as quickly as possible.

There are many different algorithms that can be used for a multi-armed bandit problem. The UCB1 (upper confidence bound, version 1) algorithm is one of the most mathematically sophisticated, but somewhat surprisingly, one of the easiest algorithms to implement. A good way to understand what the UCB1 algorithm is and to see where this article is headed is to take a look at a demo run in **Figure 1**.

**Figure 1 UCB1 Algorithm Demo Run**

The demo sets up three machines, indexed 0 to 2, each with a different probability of winning. The three probabilities are (0.3, 0.7, 0.5), so machine [1] is the best machine. On each pull, a machine pays $1 if it wins and $0 if it loses. The UCB1 algorithm starts by playing each machine once. In the demo, machines [0] and [1] won but machine [2] lost.

The UCB1 algorithm is iterative. The demo specifies six trials after the initialization pulls. In the first trial, the algorithm computes the average reward for each machine. Because machines [0] and [1] won in the initialization phase, their current average reward = $1.00 / 1 pull = $1.00. Because machine [2] lost, its current average reward = $0.00 / 1 pull = $0.00.

The current average rewards and the current trial number are used in an ingenious way to compute a decision value for each machine. For trial No. 1, the decision values are the same as the average rewards. The arm/machine to play is the one with the largest decision value. At this point machines [0] and [1] are tied with the largest decision value. Because ties go to the machine with the smallest index, machine [0] is selected rather than machine [1]. Machine [0] is then played, but it loses.

In trial No. 2, the updated average reward for machine [0] is $1.00 / 2 pulls = $0.50. The average rewards for machines [1] and [2] are still $1.00 and $0.00, respectively, because neither machine was played. The decision values are computed as (1.33, 2.18, 1.18), so machine [1] is selected and it wins.

The process continues through trial No. 6. At that point, the UCB1 algorithm appears to be successful because machine [1], the best machine, has been played the most times (4) and has the highest average reward ($0.75).

The UCB1 algorithm is quite clever. Look at trial No. 5 in **Figure 1**. The cumulative rewards are ($1.00, $3.00, $0.00) and the numbers of times the machines have been played are (2, 4, 1). Notice that machine [2] hasn’t been tried since its initial loss in the initialization phase. An unsophisticated algorithm would continue by selecting machine [0] or [1], but UCB1 balances exploration of machine characteristics with exploitation of the best machine found, and selects machine [2].

The UCB1 algorithm is designed specifically for bandit problems where the payout values are 0 or 1. This is called a Bernoulli process. UCB1 can be applied to other types of problems, such as one where the payout follows a Gaussian distribution. But the more different a payout distribution is from Bernoulli, the worse the UCB1 algorithm will perform. I don’t recommend UCB1 for non-Bernoulli problems, but some of my colleagues believe that UCB1 can be successful if used conservatively.

The information in this article is based on the 2002 research paper titled “Finite-Time Analysis of the Multiarmed Bandit Problem” by P. Auer, N. Cesa-Bianchi and P. Fischer. In addition to UCB1, the paper presents an algorithm named UCB-Normal intended for use with Gaussian distribution multi-armed bandit problems.

This article assumes you have intermediate or better programming skills with C# or another language such as Java or Python, but doesn’t assume you know anything about the UCB1 algorithm. The demo is coded using C#, but you shouldn’t have any trouble refactoring the code to another language if you wish. The complete demo code is presented in this article. The source code is also available in the accompanying download.

## Understanding the UCB1 Algorithm

The key to the UCB1 algorithm is a function that converts a set of average rewards at trial t into a set of decision values, which are then used to determine which machine to play. The equation is shown in **Figure 2**. In words: at trial t, select the arm a that has the largest sum of average reward (the r-hat term) and upper confidence bound (the square-root term). Here, n(a) is the number of times arm a has been pulled.

**Figure 2 The Key Equation for UCB1**
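Because Figure 2 is an image, it’s useful to restate the equation in plain text. At trial t, the decision value for each arm a is:

```
decision(a) = avgReward(a) + sqrt( 2 * ln(t) / n(a) )
```

where avgReward(a) is the r-hat term (cumulative reward for arm a divided by the number of pulls of arm a) and n(a) is the number of times arm a has been pulled.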

The equation is simpler than it appears and is best explained by example. Suppose, as in the demo, the algorithm is at trial t = 5, and the cumulative rewards are (1.00, 3.00, 0.00), and the arm counts are (2, 4, 1). The first step is to compute the average reward for each arm: (1.00 / 2, 3.00 / 4, 0.00 / 1) = (0.50, 0.75, 0.00). Then, the decision value for arm [0] is computed as:

```
decision[0] = 0.50 + sqrt( 2 * ln(5) / 2 )
= 0.50 + sqrt(1.61)
= 0.50 + 1.27
= 1.77
```

Similarly, the decision values for arms [1] and [2] are:

```
decision[1] = 0.75 + sqrt( 2 * ln(5) / 4 )
= 0.75 + sqrt(0.80)
= 0.75 + 0.90
= 1.65
decision[2] = 0.00 + sqrt( 2 * ln(5) / 1 )
= 0.00 + sqrt(3.22)
= 0.00 + 1.79
= 1.79
```

Because the number of times an arm has been played is in the denominator of the fraction part of the upper confidence bound term, small values increase the decision value. This allows rarely pulled arms to eventually get a chance against arms with a high average reward. Very nice.
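The hand calculation above can be verified with a short, self-contained sketch. The method name ComputeDecisions is my addition for illustration; it’s not part of the demo program:

```csharp
using System;

class UcbCheck
{
  // Compute UCB1 decision values at trial t from the
  // cumulative rewards and pull counts of each arm
  public static double[] ComputeDecisions(double[] cumReward,
    int[] armCounts, int t)
  {
    double[] dec = new double[cumReward.Length];
    for (int i = 0; i < cumReward.Length; ++i) {
      double avg = cumReward[i] / armCounts[i];
      dec[i] = avg + Math.Sqrt(2.0 * Math.Log(t) / armCounts[i]);
    }
    return dec;
  }

  static void Main()
  {
    double[] cum = { 1.00, 3.00, 0.00 };  // trial t = 5 from the demo
    int[] counts = { 2, 4, 1 };
    double[] dec = ComputeDecisions(cum, counts, 5);
    for (int i = 0; i < dec.Length; ++i)
      Console.WriteLine("decision[" + i + "] = " +
        dec[i].ToString("F2"));
    // decision values: 1.77, 1.65, 1.79
  }
}
```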

## The Demo Program

The complete demo program, with a few minor edits to save space, is presented in **Figure 3**. To create the program, I launched Visual Studio and created a new console application named BanditUCB. I used Visual Studio 2017, but the demo has no significant .NET Framework dependencies.

**Figure 3 The UCB1 Algorithm Demo Program**

```
using System;
namespace BanditUCB
{
  class BanditProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin UCB1 bandit demo \n");
      Console.WriteLine("Three arms with true means u1 = " +
        "0.3, u2 = 0.7, u3 = 0.5");

      Random rnd = new Random(20);
      int N = 3;
      int trials = 6;
      double p = 0.0;
      double[] means = new double[] { 0.3, 0.7, 0.5 };
      double[] cumReward = new double[N];
      int[] armCounts = new int[N];
      double[] avgReward = new double[N];
      double[] decValues = new double[N];

      // Play each arm once to get started
      Console.WriteLine("Playing each arm once: ");
      for (int i = 0; i < N; ++i) {
        Console.Write("[" + i + "]: ");
        p = rnd.NextDouble();
        if (p < means[i]) {
          Console.WriteLine("win");
          cumReward[i] += 1.0;
        }
        else {
          Console.WriteLine("lose");
          cumReward[i] += 0.0;
        }
        ++armCounts[i];
      }
      Console.WriteLine("-------------");

      for (int t = 1; t <= trials; ++t) {
        Console.WriteLine("trial #" + t);
        Console.Write("curr cum reward: ");
        for (int i = 0; i < N; ++i)
          Console.Write(cumReward[i].ToString("F2") + " ");
        Console.Write("\ncurr arm counts: ");
        for (int i = 0; i < N; ++i)
          Console.Write(armCounts[i] + " ");
        Console.Write("\ncurr avg reward: ");
        for (int i = 0; i < N; ++i) {
          avgReward[i] = (cumReward[i] * 1.0) / armCounts[i];
          Console.Write(avgReward[i].ToString("F2") + " ");
        }
        // Map the UCB1 equation directly to code
        Console.Write("\ndecision values: ");
        for (int i = 0; i < N; ++i) {
          decValues[i] = avgReward[i] +
            Math.Sqrt((2.0 * Math.Log(t)) / armCounts[i]);
          Console.Write(decValues[i].ToString("F2") + " ");
        }
        int selected = ArgMax(decValues);
        Console.WriteLine("\nSelected machine = [" + selected + "]");
        p = rnd.NextDouble();
        if (p < means[selected]) {
          cumReward[selected] += 1.0;
          Console.WriteLine("result: a WIN");
        }
        else {
          cumReward[selected] += 0.0;
          Console.WriteLine("result: a LOSS");
        }
        ++armCounts[selected];
        Console.WriteLine("-------------");
      } // t

      Console.WriteLine("End bandit UCB1 demo ");
      Console.ReadLine();
    } // Main

    // Return the index of the largest value in vector;
    // ties go to the first (smallest) index
    static int ArgMax(double[] vector)
    {
      double maxVal = vector[0];
      int result = 0;
      for (int i = 1; i < vector.Length; ++i) {
        if (vector[i] > maxVal) {
          maxVal = vector[i];
          result = i;
        }
      }
      return result;
    }
  } // Program
} // ns
```

After the template code loaded, at the top of the editor window I removed all unneeded namespace references, leaving just the reference to the top-level System namespace. In the Solution Explorer window, I right-clicked on file Program.cs, renamed it to the more descriptive BanditProgram.cs and allowed Visual Studio to automatically rename class Program.

All normal error checking has been omitted to keep the main ideas as clear as possible. All the control logic is contained in the Main method. There’s a single helper function named ArgMax that returns the index of the largest value in a numeric array. For example, if an array holds values (5.0, 7.0, 2.0, 9.0), then ArgMax returns 3.

## Setting Up UCB1

The demo program begins with these statements:

```
Random rnd = new Random(20);
int N = 3;
int trials = 6;
double p = 0.0;
double[] means = new double[] { 0.3, 0.7, 0.5 };
```

The Random object is used to determine whether a selected machine wins or loses. The seed value, 20, is used only because it gives a representative demo. The array named means could have been named something like probsWin instead. But because each machine pays either $1 or $0, the average (mean) value for each machine is the same as the probability of winning. For example, if the probability of winning for a machine is 0.6 and you play the machine 1,000 times, you’d expect to win $1 about 600 times. The mean value per pull is $600 / 1000 = 0.60.
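The relationship between win probability and mean reward is easy to confirm with a short simulation. This sketch is my addition, not part of the demo; the 0.6 probability and the 1,000-pull count are just the illustrative values from the paragraph above:

```csharp
using System;

class MeanCheck
{
  // Simulate n pulls of a Bernoulli machine with win probability pWin
  // and return the average reward per pull
  public static double AvgReward(double pWin, int n, int seed)
  {
    Random rnd = new Random(seed);
    double total = 0.0;
    for (int i = 0; i < n; ++i)
      if (rnd.NextDouble() < pWin)
        total += 1.0;  // a win pays $1, a loss pays $0
    return total / n;
  }

  static void Main()
  {
    double avg = AvgReward(0.6, 1000, 0);
    Console.WriteLine("Average reward per pull = " +
      avg.ToString("F2"));  // close to 0.60
  }
}
```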

## Computing the Decision Values

The demo program computes decision values using a direct mapping to the UCB1 equation in **Figure 2**:

```
for (int i = 0; i < N; ++i) {
  decValues[i] = avgReward[i] +
    Math.Sqrt((2.0 * Math.Log(t)) / armCounts[i]);
  Console.Write(decValues[i].ToString("F2") + " ");
}
```

The computation will throw an exception if t = 0 (the log of 0 is negative infinity) or if any arm count is 0 (division by 0). However, the UCB1 initialization phase where each machine is tried once prevents either of the exception conditions from occurring.

After the decision values have been computed, the machine to play is determined by this statement:

```
int selected = ArgMax(decValues);
Console.WriteLine("Selected machine = [" + selected + "]");
```

The ArgMax function returns the index of the largest decision value. If two or more decision values are tied as largest, ArgMax returns the first index encountered. This introduces a slight bias toward smaller-indexed machines. One approach to eliminate this bias would be to refactor ArgMax so that if there’s a tie, one of the tied indexes is chosen randomly.
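One possible refactoring along those lines is sketched below. The method name ArgMaxRandomTie and the Random parameter are my additions, not part of the demo program:

```csharp
using System;
using System.Collections.Generic;

class ArgMaxDemo
{
  // Return the index of the largest value in vector;
  // if two or more values tie, pick one tied index at random
  public static int ArgMaxRandomTie(double[] vector, Random rnd)
  {
    double maxVal = vector[0];
    for (int i = 1; i < vector.Length; ++i)
      if (vector[i] > maxVal)
        maxVal = vector[i];
    // Exact equality is OK here: tied decision values
    // arise from identical computations
    List<int> tied = new List<int>();
    for (int i = 0; i < vector.Length; ++i)
      if (vector[i] == maxVal)
        tied.Add(i);
    return tied[rnd.Next(tied.Count)];
  }

  static void Main()
  {
    Random rnd = new Random(20);
    double[] dec = { 1.00, 1.00, 0.00 };  // machines [0] and [1] tied
    Console.WriteLine("Selected machine = [" +
      ArgMaxRandomTie(dec, rnd) + "]");
  }
}
```

With no tie, the method behaves exactly like the demo’s ArgMax; with a tie, repeated calls spread selections evenly over the tied machines.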

## The Epsilon-Greedy Algorithm

The UCB1 algorithm is closely related to another multi-armed bandit algorithm called epsilon-greedy. The epsilon-greedy algorithm begins by specifying a small value for epsilon. Then at each trial, a random probability value between 0.0 and 1.0 is generated. If the generated probability is less than (1 - epsilon), the arm with the current largest average reward is selected. Otherwise, an arm is selected at random. An epsilon-greedy implementation based on the demo program structure could look like:

```
//int selected = ArgMax(decValues); // UCB1
double epsilon = 0.05; // Epsilon-greedy
int selected;
double pr = rnd.NextDouble(); // [0.0, 1.0)
if (pr < (1 - epsilon))
  selected = ArgMax(avgReward); // Usually
else
  selected = rnd.Next(0, N); // 5% of the time
```

One of several variations of the basic epsilon-greedy algorithm is to slowly decrease the value of epsilon over time. This has the effect of concentrating on exploration early in the run and then emphasizing exploitation of the best arm found, later in the run. The biggest problem with epsilon-greedy is that there’s no easy way to determine a good value for epsilon.
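One way to implement the decaying-epsilon idea is sketched below. The 1/t decay schedule, the 0.20 starting value and all names here are illustrative choices of mine, not from the demo:

```csharp
using System;

class DecayGreedy
{
  // Run epsilon-greedy where epsilon decays as epsilon0 / t,
  // so exploration is common early and rare later.
  // Returns how many times each arm was pulled.
  public static int[] Run(double[] means, double epsilon0,
    int trials, int seed)
  {
    Random rnd = new Random(seed);
    int N = means.Length;
    double[] cumReward = new double[N];
    int[] counts = new int[N];
    for (int i = 0; i < N; ++i) {  // play each arm once to start
      if (rnd.NextDouble() < means[i]) cumReward[i] += 1.0;
      ++counts[i];
    }
    for (int t = 1; t <= trials; ++t) {
      double currEpsilon = epsilon0 / t;  // 0.20, 0.10, 0.067, ...
      int sel;
      if (rnd.NextDouble() < 1 - currEpsilon) {
        sel = 0;  // exploit: arm with largest average reward
        for (int i = 1; i < N; ++i)
          if (cumReward[i] / counts[i] > cumReward[sel] / counts[sel])
            sel = i;
      }
      else {
        sel = rnd.Next(0, N);  // explore: a random arm
      }
      if (rnd.NextDouble() < means[sel]) cumReward[sel] += 1.0;
      ++counts[sel];
    }
    return counts;
  }

  static void Main()
  {
    int[] counts = Run(new double[] { 0.3, 0.7, 0.5 }, 0.20, 100, 20);
    for (int i = 0; i < counts.Length; ++i)
      Console.WriteLine("machine [" + i + "] pulled " +
        counts[i] + " times");
  }
}
```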

In my opinion, the research results comparing UCB1 and epsilon-greedy and many other multi-armed bandit algorithms are inconclusive. Based on my experience, there’s no single, consistently best algorithm to use and, if possible, it’s good practice to run a few experiments with different algorithms using a simulation of your real problem.

The standard way to compare different multi-armed bandit algorithms is to compute a regret metric. Regret is the difference between the expected value of the system, assuming you know the best arm, and the actual value of the system in experiments. For example, suppose you played the three machines of the demo system 10 times and you won six times and lost four times. The total reward is $6.00. But if you hypothetically pulled the best arm (with probability of winning equal to 0.7) all 10 times, the expected total reward would be $7.00. Therefore, the regret is $7.00 - $6.00 = $1.00.
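The regret calculation from the example above can be expressed directly in code. This is a minimal sketch; the method name Regret is my addition:

```csharp
using System;

class RegretDemo
{
  // Regret = (expected total reward if the best arm were
  // pulled every time) - (actual total reward observed)
  public static double Regret(double bestMean, int pulls,
    double actualReward)
  {
    return bestMean * pulls - actualReward;
  }

  static void Main()
  {
    // 10 pulls, 6 wins at $1 each; best arm wins with probability 0.7
    double r = Regret(0.7, 10, 6.00);
    Console.WriteLine("Regret = $" + r.ToString("F2"));  // Regret = $1.00
  }
}
```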

## Wrapping Up

I mentally organize machine learning into three categories: supervised learning, where you have training data with known, correct answers; unsupervised learning, where you have data without correct answers; and reinforcement learning (RL), where a correct or incorrect result is called a reward (which could be negative) and comes from a function instead of data. Multi-armed bandit problems are usually considered a part of RL, but some of my research colleagues consider the multi-armed bandit a distinct type of problem.

There are dozens of algorithms for multi-armed bandit scenarios. Based on my experience, in addition to the UCB1 and epsilon-greedy algorithms described in this article, the most common algorithm used in practice is called Thompson Sampling. You can learn about that algorithm in the February 2018 issue of MSDN Magazine at msdn.com/magazine/mt829274.

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products including Azure and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.*

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd