November 2018

Volume 33 Number 11

Artificially Intelligent - A Closer Look at Reinforcement Learning

By Frank La Vigne

In last month’s column, I explored a few basic concepts of reinforcement learning (RL), first trying a strictly random approach to navigating a simple environment and then implementing a Q-Table to remember both past actions and which actions led to which rewards. In the demo, an agent acting randomly was able to reach the goal state approximately 1 percent of the time, and roughly half the time when using a Q-Table to remember previous actions. However, that experiment only scratched the surface of the promising and expanding field of RL.

Recall that in the previous column, an RL problem space consists of an environment, an agent, actions, states and rewards. An agent examines the state of an environment and takes an action. The action then changes the state of the agent and/or environment. The agent receives a reward and examines the updated state of its environment. The cycle then restarts and runs for a number of iterations until the agent succeeds or fails at a predefined goal. When an agent succeeds or fails, the simulation ends. With a Q-Table, an agent remembers which actions yielded positive rewards and references that knowledge when making decisions in subsequent simulations.

Multi-Armed Bandit Problem

One of the classic problems in RL is the tension between exploration and exploitation. Slot machines, often referred to as “one-armed bandits,” are the inspiration for this problem; a bank of slot machines then creates a “multi-armed bandit.” Each of these slot machines has some probability of paying out a jackpot. The probability of a given pull resulting in a jackpot may be represented as P, and the probability of not paying out is 1 – P. If a machine has a jackpot probability (JP) of .5, then each pull of the lever has an equal chance of winning or losing. Conversely, a machine with a JP of .1 would yield a losing result 90 percent of the time.
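To make this concrete, a single pull can be modeled as a draw from a uniform distribution compared against P. This is a minimal sketch of that idea; the pull_lever name is my own, not from the article’s code:

```python
import numpy as np

def pull_lever(p):
    """Simulate one pull of a machine with jackpot probability p."""
    return np.random.uniform(0, 1) <= p

# With p = 0.5, roughly half of many simulated pulls should hit the jackpot.
np.random.seed(0)  # seeded only so this sketch is reproducible
wins = sum(pull_lever(0.5) for _ in range(10_000))
print(wins / 10_000)
```

Over many pulls, the observed jackpot frequency converges toward P, which is exactly what the agent later tries to estimate.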

Now, imagine a bank of five slot machines, where the player (or agent) has the goal of maximizing winnings and minimizing losses. With no foreknowledge of any machine’s JP, the agent must take some risks at first. Suppose that with the first pull of the lever, the agent wins and receives a payout, and that subsequent tries reveal this machine pays out about half of the time, a JP of .54. As slot machines go, this is quite generous. The agent must decide whether to exploit the currently known resource or explore a new machine. If the probability of the first slot machine paying out is this generous, is it worth trying any of the other machines in the bank to see if their payout chances are better?

The best way to further explore this problem space is with some Python code in a Jupyter notebook. Create a Python 3 notebook on your preferred platform; I covered Jupyter notebooks in a previous article. Create an empty cell, enter the following code and execute the cell.

import numpy as np
import matplotlib.pyplot as plt
number_of_slot_machines = 5
np.random.seed(100)  # fix the seed so the results are reproducible
JPs = np.random.uniform(0, 1, number_of_slot_machines)
print(JPs)
plt.bar(np.arange(number_of_slot_machines), JPs)
plt.show()

The output should read as follows, and show a plot of the values, as shown in Figure 1.

[0.54340494 0.27836939 0.42451759 0.84477613 0.00471886]

Figure 1 Jackpot Probabilities of the Five Slot Machines

The code creates an array of JP values for a series of five slot machines ranging from 0.004 to 0.844. However, the first machine the agent tried, while generous, is not the best. Clearly, the fourth slot machine with an 84.4 percent payout rate is the best paying machine in the environment. It is also worth noting that the final slot machine has the worst odds of paying out a jackpot. Remember that the agent has no prior knowledge of the payout rates and it must discover them on its own. Had the agent stayed on the first machine, choosing exploitation over exploration, the agent would never have found the best paying slot machine.

To represent what the agent knows at the start of a simulation, add the following code to a new cell:

known_JPs = np.zeros(number_of_slot_machines)

This creates an array of zeros, meaning that the agent assumes that the JP of each slot machine is zero. While this may not be the best initial value in all cases, it will suffice for our purposes here. To create a simulation of a slot machine, add the following code to a new cell and execute it:

def play_machine(slot_machine):
  x = np.random.uniform(0, 1)
  if (x <= JPs[slot_machine]):
    return(10)
  else:
    return(-1)

This code snippet simulates a slot machine, paying out a reward of 10 if the machine hits a jackpot and a reward of -1 if it does not. Odds of a payout are based on the likelihood defined in the JPs numpy array. To test the code, enter the following Python code into a new cell and execute it:

# Test Slot Machine 4 (the machines are zero-indexed, so machine 4 is index 3)
for machines in range(10):
  print(play_machine(3))
print ("------")
# Test Slot Machine 5
for machines in range(10):
  print(play_machine(4))

This code pits the best-performing machine against the worst-performing machine. As this is all based on chance, there’s no guarantee of the exact output, but the results should show mostly 10 values for machine 4 and nearly all -1 values for machine 5. With the simulated slot machine code behaving as expected, it’s now time to examine a common algorithm in RL: Epsilon Greedy.
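Given payouts of 10 for a win and -1 for a loss, each machine’s expected reward per pull follows directly from its JP. A quick sketch of that arithmetic (the expected_reward helper is my own, not part of the article’s code):

```python
# Expected reward per pull for a machine with jackpot probability p,
# using the simulation's payouts of +10 (win) and -1 (loss):
def expected_reward(p):
    return 10 * p + (-1) * (1 - p)  # simplifies to 11p - 1

print(expected_reward(0.84477613))  # machine 4: about 8.29 per pull
print(expected_reward(0.00471886))  # machine 5: about -0.95 per pull
# A machine breaks even when 11p - 1 = 0, that is, at p = 1/11, roughly 0.09.
```

This shows why finding machine 4 matters so much: each pull on it is worth about 8.29 on average, while nearly every machine with a JP below roughly .09 loses money over time.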

The Epsilon Greedy Algorithm

The core dilemma the agent faces here is whether to prioritize greed, the desire to exploit a known resource, or curiosity, the desire to explore other slot machines in the hopes of a better chance of rewards. One of the simplest algorithms for resolving this dilemma is known as the Epsilon Greedy algorithm, where the agent chooses at random between using the slot machine with the best odds of payout observed thus far, or trying another machine in the hopes that it may provide a better payout. With a low value of Epsilon, this algorithm mostly follows the greedy strategy, but will occasionally try another slot machine. For instance, if the Epsilon value is .1, the algorithm will opt to exploit 90 percent of the time and explore only 10 percent of the time. Typically, default values of Epsilon tend to fall between .05 and .1. In short, the agent will primarily play the best slot machine it knows of and sometimes try a new machine. Remember that each pull of the lever comes at a cost and the agent doesn’t know what we know: that slot 4 pays out the best.
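The selection rule itself is just one comparison against a uniform random draw. Here is a minimal sketch of that decision step in isolation (variable names are mine), checking empirically that exploration happens about 10 percent of the time when epsilon is .1:

```python
import numpy as np

np.random.seed(42)  # seeded only so this sketch is reproducible
epsilon = 0.1
trials = 100_000
explore_count = 0

for _ in range(trials):
    if np.random.uniform(0, 1) > epsilon:
        pass  # exploit: play the machine with the best estimated odds
    else:
        explore_count += 1  # explore: pick a machine at random

print(explore_count / trials)  # close to 0.1
```

The same comparison appears inside the multi_armed_bandit function later in the article; everything else in that function is bookkeeping around this one branch.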

This underscores a central notion of RL: the agent knows nothing about the environment initially, so it needs to explore first and exploit later, while learning continues throughout the entire process. Essentially, this is delayed gratification, and it’s in the agent’s best interest not to be totally greedy, leaving some room for exploration.

Testing the Epsilon Greedy Hypothesis

To test this hypothesis, add the code in Figure 2 to a new cell and execute it. This code creates the multi_armed_bandit function, which simulates a series of runs against a collection of slot machines. The function stores the observed odds of a jackpot payout. At each iteration, the agent either plays the slot machine with the best payout it has observed thus far, or arbitrarily tries another machine. The argmax function returns the index of the highest value in the numpy array; here, that means the slot machine with the best observed odds of hitting a jackpot. The function’s parameters allow for control over the number of slot machines, the number of iterations to run and the value of epsilon.

Figure 2 Reinforcement Learning Code

def multi_armed_bandit(arms, iterations, epsilon):
  total_reward, optimal_action = [], []
  estimated_payout_odds = np.zeros(arms)
  count = np.zeros(arms)
  for i in range(0, iterations):
    epsilon_random = np.random.uniform(0, 1)
    if epsilon_random > epsilon :
      # exploit
      action = np.argmax(estimated_payout_odds)
    else :
      # explore
      action = np.random.choice(np.arange(arms))
    reward = play_machine(action)
    estimated_payout_odds[action] = estimated_payout_odds[action] + \
      (1/(count[action]+1)) * \
      (reward - estimated_payout_odds[action])
    total_reward.append(reward)
    optimal_action.append(action == np.argmax(estimated_payout_odds))
    count[action] += 1
  return(estimated_payout_odds, total_reward)
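The update inside the loop is the standard incremental-mean formula: each new reward nudges the estimate toward the running average of that machine’s rewards, without storing every past reward. A quick check (using my own toy reward values, not the article’s data) that it matches a plain average:

```python
import numpy as np

rewards = [10, -1, -1, 10, 10]  # arbitrary sample rewards for one machine
estimate, count = 0.0, 0
for r in rewards:
    # Same update as in the multi_armed_bandit function above
    estimate = estimate + (1 / (count + 1)) * (r - estimate)
    count += 1

print(estimate, np.mean(rewards))  # both 5.6
```

This is why estimated_payout_odds ends up close to each machine’s average reward, provided the machine is played often enough.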

With the RL code in place, now it’s time to test the Epsilon Greedy algorithm. Enter the code from Figure 3 into an empty cell and execute it. The results show the chart from Figure 1 for easy reference, followed by the odds that the RL code observed.

Figure 3 Code to Compare the Actual Slot Machine Odds with the Agent’s Observations

print ("Actual Odds")
print (JPs)
iterations = 1000
print ("Learned Odds with epsilon of .1")
learned_payout_odds, reward = multi_armed_bandit(number_of_slot_machines, iterations, .1)
print (learned_payout_odds)
print ("Reward: ", sum(reward))

As you can see in Figure 4, the algorithm did an excellent job, not only of determining the slot machine with the most favorable odds, but also of producing fairly accurate payout probabilities for the other four slot machines. The graphs line up rather well, the exception being the fifth slot machine, whose odds of a payout are so low that it scored negatively in the agent’s observations.

Figure 4 Results with an Epsilon Value of .1

Now, with the baseline established, it’s time to experiment some more. What would happen if epsilon were set to zero, meaning that the algorithm will never explore? Enter the following code in a new cell and execute it to run that experiment:

print ("Learned Odds with epsilon of 0")
learned_payout_odds, reward = multi_armed_bandit(number_of_slot_machines, iterations, 0)
print (learned_payout_odds)
print ("Reward: ", sum(reward))

The resulting chart shows only one value higher than zero: one machine dominates the others, making it quite clear that the agent found one machine and stuck with it. However, run the code several times and you may notice that occasionally an interesting pattern develops: one or more machines with negative values, alongside a single machine with a value higher than zero. In these cases, the agent lost on one or more machines before winning on another. Once the agent discovers a winning machine, it sticks with that machine, as that is the machine the argmax function will choose. With epsilon set to zero, the agent may still explore, but only accidentally, driven by early losses. As a result, the observed slot machine odds are nowhere near the actual odds. It is also worth noting that this “greedy” method produces a lower reward score than when epsilon was set to .1. Greed, at least absolute greed, would appear to be counterproductive.

What if epsilon were set to 1, making the agent explore every time and not exploit at all? Enter the following code into a new cell and execute it:

print ("Learned Odds with epsilon of 1")
learned_payout_odds, reward = multi_armed_bandit(number_of_slot_machines, iterations, 1)
print (learned_payout_odds)
print ("Reward: ", sum(reward))

The results will show that the agent did an excellent job of observing odds similar to the true odds, and the chart lines up very closely with Figure 1. In fact, the results of setting epsilon to 1 look very similar to when the value was .1. Take note of the Reward value, however: there is a stark difference. The reward when epsilon was set to .1 will nearly always be higher than when it’s set to 1. When the agent is set only to explore, it tries a machine at random at every iteration; while it may be learning from observation, it is not acting on those observations.
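The three experiments above can be run side by side by sweeping epsilon and comparing total rewards. The following sketch is self-contained, re-declaring the machines and a compact version of the two functions so it runs on its own; unlike the Figure 2 code, this variant returns only the total reward, which is all the comparison needs:

```python
import numpy as np

np.random.seed(100)  # same fixed seed as the earlier setup
JPs = np.random.uniform(0, 1, 5)

def play_machine(slot_machine):
    return 10 if np.random.uniform(0, 1) <= JPs[slot_machine] else -1

def multi_armed_bandit(arms, iterations, epsilon):
    estimates, counts, total = np.zeros(arms), np.zeros(arms), 0
    for _ in range(iterations):
        if np.random.uniform(0, 1) > epsilon:
            action = np.argmax(estimates)               # exploit
        else:
            action = np.random.choice(np.arange(arms))  # explore
        reward = play_machine(action)
        estimates[action] += (1 / (counts[action] + 1)) * (reward - estimates[action])
        counts[action] += 1
        total += reward
    return total

results = {}
for eps in [0, 0.05, 0.1, 0.5, 1]:
    results[eps] = multi_armed_bandit(5, 1000, eps)
    print(eps, results[eps])
```

The exact numbers vary from run to run, but the small-but-nonzero epsilon values should consistently land near the top, with pure exploration (epsilon of 1) well behind them.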

Wrapping Up

RL remains one of the most exciting spaces in artificial intelligence. In this article, I explored the Epsilon Greedy algorithm with the classic “Multi-Armed Bandit” problem, specifically drilling into the explore-or-exploit dilemma that agents face. I encourage you to further explore the trade-offs by experimenting with different values of epsilon and larger numbers of slot machines.

Frank La Vigne works at Microsoft as an AI Technology Solutions professional, where he helps companies achieve more by getting the most out of their data with analytics and AI. He also co-hosts the DataDriven podcast. He blogs regularly, and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).

Thanks to the following technical expert for reviewing this article: Andy Leonard
