August 2018

Volume 33 Number 8

### [Test Run]

# Introduction to Q-Learning Using C#

Reinforcement learning (RL) is a branch of machine learning that tackles problems where there’s no explicit training data with known, correct output values. Q-learning is an algorithm that can be used to solve some types of RL problems. In this article, I explain how Q-learning works and provide an example program.

The best way to see where this article is headed is to take a look at the simple maze in **Figure 1** and the associated demo program in **Figure 2**. The 3x4 maze has 12 cells, numbered from 0 to 11. The goal is to get from cell 8 in the lower-left corner, to cell 11 in the lower-right corner, in the fewest moves. You can move left, right, up or down, but not diagonally.

**Figure 1 Simple Maze Problem**

**Figure 2 Q-Learning Demo Program**

The demo program sets up a representation of the maze in memory and then uses the Q-learning algorithm to find a Q matrix. The Q stands for quality, where larger values are better. The row indices are the “from” cells and the column indices are the “to” cells. If the starting cell is 8, then scanning that row shows the largest Q value is 0.44 at to-cell 9. Then from cell 9, the largest value in the row is 1.08 at to-cell 5. The process continues until the program reaches the goal state at cell 11.

It’s not likely you’ll need to write code that solves a maze, but this is the Hello World for Q-learning because the problem is easy to grasp. I’ll explain how Q-learning can generalize to more realistic problems later in this article.

This article assumes you have intermediate or better programming skills, but doesn’t assume you know anything about Q-learning. The demo program is coded using C#, but you shouldn’t have too much trouble refactoring the code to another language, such as Visual Basic or Python. The code for the demo program is presented in its entirety in this article and is also available in the accompanying file download.

## Show Me the Code

For me at least, Q-learning is a bit unusual because I think the concepts are best understood by examining specific demo code rather than by starting with general principles. The overall structure of the demo program, with a few minor edits to save space, is shown in **Figure 3**.

**Figure 3 Q-Learning Demo Program Structure**

```
using System;
using System.Collections.Generic;
namespace QLearning
{
  class QLearningProgram
  {
    static Random rnd = new Random(1);

    static void Main(string[] args)
    {
      Console.WriteLine("Begin Q-learning maze demo");
      Console.WriteLine("Setting up maze and rewards");
      int ns = 12;
      int[][] FT = CreateMaze(ns);
      double[][] R = CreateReward(ns);
      double[][] Q = CreateQuality(ns);

      Console.WriteLine("Analyzing maze using Q-learning");
      int goal = 11;
      double gamma = 0.5;
      double learnRate = 0.5;
      int maxEpochs = 1000;
      Train(FT, R, Q, goal, gamma, learnRate, maxEpochs);
      Console.WriteLine("Done. Q matrix: ");
      Print(Q);

      Console.WriteLine("Using Q to walk from cell 8 to 11");
      Walk(8, 11, Q);
      Console.WriteLine("End demo");
      Console.ReadLine();
    }

    static void Print(double[][] Q) { . . }
    static int[][] CreateMaze(int ns) { . . }
    static double[][] CreateReward(int ns) { . . }
    static double[][] CreateQuality(int ns) { . . }
    static List<int> GetPossNextStates(int s, int[][] FT) { . . }
    static int GetRandNextState(int s, int[][] FT) { . . }
    static void Train(int[][] FT, double[][] R, double[][] Q,
      int goal, double gamma, double lrnRate, int maxEpochs) { . . }
    static void Walk(int start, int goal, double[][] Q) { . . }
    static int ArgMax(double[] vector) { . . }
  } // Program
} // ns
```

To create the demo program, I launched Visual Studio and created a new C# console application project named QLearning. I used Visual Studio 2017, but the demo has no significant .NET dependencies so any version of Visual Studio will work fine. After the template code loaded into the editor, I removed all unneeded using statements, leaving just the reference to the System namespace. Then I added a using statement for the System.Collections.Generic namespace because the demo uses a List<int> collection.

The demo program has a class-scope Random object because Q-learning involves a random selection component, as you’ll see shortly. Variable ns stands for the number of states, which is synonymous with the number of cells in the maze. Object FT (feasible transitions) is an array-of-arrays-style matrix. Matrix R is the reward matrix, and matrix Q is the quality matrix.

The Q-learning algorithm requires parameters gamma (also known as the discount factor) and learnRate. I’ll explain these later. Q-learning is iterative, so the demo sets up a maxEpochs variable to control how many training iterations the algorithm uses to find the Q matrix.

## Setting up the Maze and the Rewards

The maze is created by method CreateMaze, which is defined as follows:

```
static int[][] CreateMaze(int ns) {
  int[][] FT = new int[ns][];
  for (int i = 0; i < ns; ++i) FT[i] = new int[ns];
  FT[0][1] = FT[0][4] = FT[1][0] = FT[1][5] = FT[2][3] = 1;
  FT[2][6] = FT[3][2] = FT[3][7] = FT[4][0] = FT[4][8] = 1;
  FT[5][1] = FT[5][6] = FT[5][9] = FT[6][2] = FT[6][5] = 1;
  FT[6][7] = FT[7][3] = FT[7][6] = FT[7][11] = FT[8][4] = 1;
  FT[8][9] = FT[9][5] = FT[9][8] = FT[9][10] = FT[10][9] = 1;
  FT[11][11] = 1; // Goal
  return FT;
}
```

The method returns a matrix that defines allowable moves. For example, you can move from cell 4 to cell 8, but you can’t move from cell 4 to cell 5 because there’s a wall in the way. Recall that C# initializes int arrays to 0, so CreateMaze needs to specify only allowable moves. Notice that you can’t move from a cell to itself, except for the goal-cell 11.
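To see that the matrix encodes the maze as intended, here is a minimal standalone check. It reproduces CreateMaze from above and probes a few transitions:

```csharp
using System;

class MazeCheck
{
  static int[][] CreateMaze(int ns) {
    int[][] FT = new int[ns][];
    for (int i = 0; i < ns; ++i) FT[i] = new int[ns];
    FT[0][1] = FT[0][4] = FT[1][0] = FT[1][5] = FT[2][3] = 1;
    FT[2][6] = FT[3][2] = FT[3][7] = FT[4][0] = FT[4][8] = 1;
    FT[5][1] = FT[5][6] = FT[5][9] = FT[6][2] = FT[6][5] = 1;
    FT[6][7] = FT[7][3] = FT[7][6] = FT[7][11] = FT[8][4] = 1;
    FT[8][9] = FT[9][5] = FT[9][8] = FT[9][10] = FT[10][9] = 1;
    FT[11][11] = 1; // Goal
    return FT;
  }

  static void Main() {
    int[][] FT = CreateMaze(12);
    Console.WriteLine(FT[4][8]);   // 1: moving from cell 4 to cell 8 is allowed
    Console.WriteLine(FT[4][5]);   // 0: a wall blocks cell 4 from cell 5
    Console.WriteLine(FT[11][11]); // 1: the goal cell transitions to itself
  }
}
```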

The reward matrix is defined by:

```
static double[][] CreateReward(int ns) {
  double[][] R = new double[ns][];
  for (int i = 0; i < ns; ++i) R[i] = new double[ns];
  R[0][1] = R[0][4] = R[1][0] = R[1][5] = R[2][3] = -0.1;
  R[2][6] = R[3][2] = R[3][7] = R[4][0] = R[4][8] = -0.1;
  R[5][1] = R[5][6] = R[5][9] = R[6][2] = R[6][5] = -0.1;
  R[6][7] = R[7][3] = R[7][6] = R[7][11] = R[8][4] = -0.1;
  R[8][9] = R[9][5] = R[9][8] = R[9][10] = R[10][9] = -0.1;
  R[7][11] = 10.0; // Goal
  return R;
}
```

In this example, moving to goal-cell 11 gives a reward of 10.0, but any other move gives a negative reward of -0.1. These values are somewhat arbitrary. In general, when working with RL, the reward structure is entirely problem-dependent. Here, the small negative reward punishes every move a bit, which has the effect of preferring shorter paths over longer paths to the goal. Notice you don’t have to set a reward for moves that aren’t allowed because they will never happen.
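A quick bit of arithmetic shows why the per-move penalty favors short paths. The path 8->9->5->6->7->11 makes four -0.1 moves before the rewarded move into the goal; a hypothetical wandering path that makes eight -0.1 moves first accumulates less total reward:

```csharp
using System;

class RewardDemo
{
  static void Main() {
    // Four -0.1 moves, then +10.0 for the move into goal-cell 11
    double shortPath = (4 * -0.1) + 10.0;
    // Hypothetical longer route: eight -0.1 moves before the goal
    double longPath = (8 * -0.1) + 10.0;
    Console.WriteLine(shortPath.ToString("F1")); // 9.6
    Console.WriteLine(longPath.ToString("F1"));  // 9.2
  }
}
```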

The goal of Q-learning is to find the value of the Q matrix. Initially, all Q values are set to 0.0 and the Q matrix is created like so:

```
static double[][] CreateQuality(int ns) {
  double[][] Q = new double[ns][];
  for (int i = 0; i < ns; ++i)
    Q[i] = new double[ns];
  return Q;
}
```

## Defining Possible Next States

As you’ll see shortly, the Q-learning algorithm needs to know what states the system can transition to, given a current state. In this example, a state of the system is the same as the location in the maze so there are only 12 states. Method GetPossNextStates is defined like so:

```
static List<int> GetPossNextStates(int s, int[][] FT) {
  List<int> result = new List<int>();
  for (int j = 0; j < FT.Length; ++j)
    if (FT[s][j] == 1) result.Add(j);
  return result;
}
```

For example, if the current state s is 5, then GetPossNextStates returns a List<int> collection holding (1, 6, 9). The Q-learning algorithm sometimes goes from the current state to a random next state. That functionality is defined by method GetRandNextState:

```
static int GetRandNextState(int s, int[][] FT) {
  List<int> possNextStates = GetPossNextStates(s, FT);
  int ct = possNextStates.Count;
  int idx = rnd.Next(0, ct);
  return possNextStates[idx];
}
```

So, if the current state s is 5, then GetRandNextState returns either 1 or 6 or 9 with equal probability (0.33 each).
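The two methods can be exercised standalone. The sketch below builds a 12x12 FT matrix with only cell 5's row filled in, lists the feasible next states, and then draws many random next states to show that each feasible state is selected:

```csharp
using System;
using System.Collections.Generic;

class NextStateDemo
{
  static Random rnd = new Random(1);

  static List<int> GetPossNextStates(int s, int[][] FT) {
    List<int> result = new List<int>();
    for (int j = 0; j < FT.Length; ++j)
      if (FT[s][j] == 1) result.Add(j);
    return result;
  }

  static int GetRandNextState(int s, int[][] FT) {
    List<int> possNextStates = GetPossNextStates(s, FT);
    return possNextStates[rnd.Next(0, possNextStates.Count)];
  }

  static void Main() {
    // Minimal FT: only cell 5's feasible moves (to 1, 6, 9) are set
    int[][] FT = new int[12][];
    for (int i = 0; i < 12; ++i) FT[i] = new int[12];
    FT[5][1] = FT[5][6] = FT[5][9] = 1;

    Console.WriteLine(string.Join(" ", GetPossNextStates(5, FT))); // 1 6 9

    // Over many draws, every feasible next state shows up
    var seen = new SortedSet<int>();
    for (int t = 0; t < 1000; ++t)
      seen.Add(GetRandNextState(5, FT));
    Console.WriteLine(string.Join(" ", seen)); // 1 6 9
  }
}
```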

## The Q-Learning Algorithm

The key update equation for Q-learning is based on the mathematical Bellman equation and is shown at the bottom of Figure 1. The algorithm is implemented in method Train. In high-level pseudo-code, the Q-learning algorithm is:

```
loop maxEpochs times
  set currState = a random state
  while currState != goalState
    pick a random next-state but don't move yet
    find largest Q for all next-next-states
    update Q[currState][nextState] using Bellman
    move to nextState
  end-while
end-loop
```

The algorithm is not at all obvious, and for me, it’s best understood by examining the code. The definition begins:

```
static void Train(int[][] FT, double[][] R, double[][] Q,
  int goal, double gamma, double lrnRate, int maxEpochs)
{
  for (int epoch = 0; epoch < maxEpochs; ++epoch) {
    int currState = rnd.Next(0, R.Length);
    ...
```

The number of training epochs must be determined by trial and error. An alternative design is to iterate until the values in the Q matrix don’t change, or until they stabilize to very small changes per iteration. The inner loop iterates until the current state becomes the goal state, cell 11 in the case of the demo maze:

```
while (true) {
  int nextState = GetRandNextState(currState, FT);
  List<int> possNextNextStates = GetPossNextStates(nextState, FT);
  double maxQ = double.MinValue;
  for (int j = 0; j < possNextNextStates.Count; ++j) {
    int nns = possNextNextStates[j]; // short alias
    double q = Q[nextState][nns];
    if (q > maxQ) maxQ = q;
  }
  ...
```

Imagine you’re in a maze. You see that you can go to three different rooms, A, B, C. You pick B, but don’t move yet. You ask a friend to go into room B and the friend tells you that from room B you can go to rooms X, Y, Z and that of those rooms Y has the best Q value. In other words, Y is the best next-next state.

The update to the Q matrix is performed like so:

```
...
Q[currState][nextState] =
  ((1 - lrnRate) * Q[currState][nextState]) +
  (lrnRate * (R[currState][nextState] + (gamma * maxQ)));
currState = nextState;
if (currState == goal) break;
} // while
} // for
} // Train
```
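As mentioned earlier, an alternative to a fixed maxEpochs is to stop when the Q values stabilize. One way to sketch that check is to snapshot Q before each epoch and measure the largest absolute change afterward; the helper below and the 1.0e-5 threshold are hypothetical, not part of the demo:

```csharp
using System;

class ConvergenceCheck
{
  // Largest absolute change between two snapshots of the Q matrix
  static double MaxDelta(double[][] prevQ, double[][] Q) {
    double maxDelta = 0.0;
    for (int i = 0; i < Q.Length; ++i)
      for (int j = 0; j < Q[i].Length; ++j) {
        double d = Math.Abs(Q[i][j] - prevQ[i][j]);
        if (d > maxDelta) maxDelta = d;
      }
    return maxDelta;
  }

  static void Main() {
    // Hypothetical 2x2 snapshots taken before and after an epoch
    double[][] prevQ = { new[] { 0.40, 1.00 }, new[] { 0.00, 0.50 } };
    double[][] Q     = { new[] { 0.44, 1.08 }, new[] { 0.00, 0.50 } };
    double delta = MaxDelta(prevQ, Q);
    Console.WriteLine(delta.ToString("F2")); // 0.08
    // Inside Train, the epoch loop could break when delta drops below 1.0e-5
    Console.WriteLine(delta < 1.0e-5); // False: keep training
  }
}
```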

The update equation has two parts. The first part, ((1 - lrnRate) * Q[currState][nextState]), is called the exploit component and retains a fraction of the old value. The second part, (lrnRate * (R[currState][nextState] + (gamma * maxQ))), is called the explore component. Larger values of lrnRate increase the influence of both current rewards and future rewards (explore) at the expense of past rewards (exploit). The value of gamma, the discount factor, influences the importance of future rewards.
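One update worked by hand makes the equation concrete. Suppose (hypothetically) Q[8][9] is currently 0.00, the reward R[8][9] is -0.1, and the best Q value among cell 9's next states is 1.08. With lrnRate = 0.5 and gamma = 0.5, the new value is 0.5 * 0.00 + 0.5 * (-0.1 + 0.5 * 1.08) = 0.22:

```csharp
using System;

class UpdateDemo
{
  static void Main() {
    double lrnRate = 0.5, gamma = 0.5;
    double oldQ = 0.00;   // hypothetical current Q[8][9]
    double reward = -0.1; // R[8][9]
    double maxQ = 1.08;   // hypothetical best Q among cell 9's next states

    double newQ = ((1 - lrnRate) * oldQ) +
                  (lrnRate * (reward + (gamma * maxQ)));
    Console.WriteLine(newQ.ToString("F2")); // 0.22
  }
}
```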

## Using the Q Matrix

After the quality matrix has been computed, it can be used to find an optimal path from any starting state to the goal state. Method Walk implements this functionality:

```
static void Walk(int start, int goal, double[][] Q) {
  int curr = start; int next;
  Console.Write(curr + "->");
  while (curr != goal) {
    next = ArgMax(Q[curr]);
    Console.Write(next + "->");
    curr = next;
  }
  Console.WriteLine("done");
}
```

Notice the method assumes that the goal state is reachable from the starting state. The method uses helper ArgMax to find the best next state:

```
static int ArgMax(double[] vector) {
  double maxVal = vector[0]; int idx = 0;
  for (int i = 0; i < vector.Length; ++i) {
    if (vector[i] > maxVal) {
      maxVal = vector[i]; idx = i;
    }
  }
  return idx;
}
```

For example, if a vector has values (0.5, 0.3, 0.7, 0.2) then ArgMax returns 2. The demo defines a Print method to display the Q matrix. You can get the pretty-print version in the accompanying file download. A simplified version is:

```
static void Print(double[][] Q) {
  int ns = Q.Length;
  Console.WriteLine("[0] [1] . . [11]");
  for (int i = 0; i < ns; ++i) {
    for (int j = 0; j < ns; ++j) {
      Console.Write(Q[i][j].ToString("F2") + " ");
    }
    Console.WriteLine();
  }
}
```
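As a quick sanity check, the ArgMax example above can be verified standalone:

```csharp
using System;

class ArgMaxCheck
{
  static int ArgMax(double[] vector) {
    double maxVal = vector[0]; int idx = 0;
    for (int i = 0; i < vector.Length; ++i) {
      if (vector[i] > maxVal) {
        maxVal = vector[i]; idx = i;
      }
    }
    return idx;
  }

  static void Main() {
    // Largest value 0.7 is at index 2
    Console.WriteLine(ArgMax(new[] { 0.5, 0.3, 0.7, 0.2 })); // 2
  }
}
```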

## Wrapping Up

The Q-learning example presented here should give you a good understanding of the main principles involved. The problem scenario presented in this article is one with discrete states, but Q-learning can work with continuous states, too. The general challenge is to maximize the long-term reward, which for the maze example is the same as getting to the goal state in the fewest number of moves. Q-learning can be useful when you can safely train a system with many trials, such as training an industrial robot how to perform a task. But Q-learning isn’t applicable in scenarios like training a driverless car how to navigate through traffic. Q-learning and RL remind me somewhat of neural networks in the 1980s—there are relatively few practical applications right now, but there are intense research efforts. Many of my colleagues believe that at some point RL will explode into usefulness in unexpected ways.

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.*

Thanks to the following Microsoft technical experts who reviewed this article: Asli Celikyilmaz, Chris Lee, Ricky Loynd, Amr Sharaf, Ken Tran