December 2013

Volume 28 Number 12

# Test Run - Radial Basis Function Network Training

A radial basis function (RBF) network is a software system that can classify data and make predictions. RBF networks have some superficial similarities to neural networks, but are actually quite different. An RBF network accepts one or more numeric inputs and generates one or more numeric outputs. The output values are determined by the input values, plus a set of parameters called the RBF centroids, a set called the RBF widths, a set called the RBF weights and a set called the RBF biases.

For simplicity of terminology, the combination of RBF weights and bias values are sometimes collectively referred to as the weights. The context in which the term weights is used usually makes the meaning clear. For more information, see my article, “Radial Basis Function Networks for Programmers” in the October issue of MSDN Magazine (msdn.microsoft.com/magazine/dn451445).

When using an RBF network for classification and prediction, the challenge is to find a set of values for the centroids, widths, weights and biases so that computed outputs best match a set of known outputs. This is called training the RBF network. Although researchers have studied RBF networks since their introduction in 1988, there’s not much practical guidance that explains how to implement RBF network training. This article will present and describe a complete demo RBF network.

Take a look at the demo program in **Figure 1**. The program creates an RBF model that predicts the species of an iris flower (“setosa,” “versicolor” or “virginica”) from four numeric values for the flower’s sepal length and width, and petal length and width. The demo program’s data source consists of 30 items that are a subset of a well-known 150-item benchmark set called Fisher’s Iris data. The 30 data items have been preprocessed. The four numeric x-values have been normalized so that values less than zero mean shorter-than-average length or width, and values greater than zero mean longer-than-average length or width. The y-value for species has been encoded as (0,0,1), (0,1,0), or (1,0,0) for setosa, versicolor, and virginica, respectively.

**Figure 1 A Radial Basis Function Network Demo Program**

The demo splits the 30-item data set into a 24-item set to be used for training. There’s also a holdout six-item set for testing/evaluation of the resulting RBF model. It instantiates an RBF network with four input nodes (one for each input data value), five hidden processing nodes and three output nodes (one for each output data value). Determining the best number of hidden nodes is mostly a matter of trial and error. The choice of five in the demo was arbitrary.

In **Figure 1** you can see that training an RBF network consists of three phases. The first phase determines the centroids. You can think of centroids as representative x-values selected from the training data. An RBF network requires one centroid for every hidden node, so the demo needs five centroids. The training algorithm selects the x-values from training data items [9], [19], [21], [20] and [4]. In other words, the first centroid is (-0.362, -2.019, 0.074, 0.112).

The second phase of training determines widths. You can think of widths as values that describe the distance between the centroids. An RBF network requires one width for every hidden node. The demo computes a single common width with the value 3.3318 for all five hidden nodes, rather than computing five separate widths.

The third phase of training determines the RBF weights and bias values. You can think of weights and bias values as numeric constants. If an RBF network has NI number of input nodes, NH number of hidden nodes, and NO number of output nodes, then the network requires (NH * NO) weight values and NO bias values. So, because the demo RBF network has a 4-5-3 architecture, it needs 5 * 3 = 15 weights plus three biases, for a total of 18 weights and bias values. The demo program uses particle swarm optimization (PSO) to determine the 18 weights and biases.

After you’ve trained the demo RBF network using the 24-item training data set, you feed the six-item test data set into the network. In this example, the RBF network correctly predicts the species of five out of the six test items, for a classification accuracy of 0.8333.

This article assumes you have advanced programming skills with C# and a basic familiarity with the radial basis function network input-process-output mechanism. I discussed that mechanism in my October column. The source code for the demo program is too long to present in its entirety in this article, but the complete code download is available at msdn.com/magazine/msdnmag1213.

## Overall Program Structure

To create the demo program, I launched Visual Studio 2012 and created a C# console application named RadialNetworkTrain. The demo has no significant .NET dependencies so any version of Visual Studio should work. After the template code loaded, in the Solution Explorer window I renamed file Program.cs to the more descriptive RadialTrainProgram.cs and Visual Studio automatically renamed associated class Program. At the top of the source code, I deleted all unnecessary references to .NET namespaces, leaving just the reference to the System namespace.

The overall program structure, with some WriteLine statements removed and a few minor edits, is presented in **Figure 2**. In addition to the program class that houses the Main method, the demo has a RadialNetwork class that encapsulates RBF network creation and training, a Particle class that defines a particle object for use with the RBF training algorithm that determines weights and bias values, and a Helpers class that contains utility display methods.

Figure 2 Overall RBF Network Demo Program Structure

```
using System;
namespace RadialNetworkTrain
{
class RadialTrainProgram
{
static void Main(string[] args)
{
Console.WriteLine("Begin radial basis function (RBF) network training demo");
double[][] allData = new double[30][];
allData[0] = new double[] { -0.784, 1.255, -1.332, -1.306, 0, 0, 1 };
allData[1] = new double[] { -0.995, -0.109, -1.332, -1.306, 0, 0, 1 };
// Etc.
allData[28] = new double[] { 0.904, -1.473, 1.047, 0.756, 1, 0, 0 };
allData[29] = new double[] { 1.431, 1.528, 1.209, 1.659, 1, 0, 0 };
Console.WriteLine("First four and last line of normalized, encoded input data:");
Helpers.ShowMatrix(allData, 4, 3, true, true);
double[][] trainData = null;
double[][] testData = null;
int seed = 8; // Gives a good demo
GetTrainTest(allData, seed, out trainData, out testData);
Helpers.ShowMatrix(trainData, trainData.Length, 3, true, false);
Helpers.ShowMatrix(testData, testData.Length, 3, true, false);
int numInput = 4;
int numHidden = 5;
int numOutput = 3;
RadialNetwork rn = new RadialNetwork(numInput, numHidden, numOutput);
Console.WriteLine("Beginning RBF training");
int maxIterations = 100; // Max for PSO
double[] bestWeights = rn.Train(trainData, maxIterations);
Console.WriteLine("Evaluating RBF accuracy on the test data");
rn.SetWeights(bestWeights);
double acc = rn.Accuracy(testData);
Console.WriteLine("Classification accuracy = " + acc.ToString("F4"));
Console.WriteLine("End RBF network training demo");
}
static void GetTrainTest(double[][] allData, int seed,
out double[][] trainData, out double[][] testData) { . . }
}
public class RadialNetwork
{
private static Random rnd = null;
private int numInput;
private int numHidden;
private int numOutput;
private double[] inputs;
private double[][] centroids;
private double[] widths;
private double[][] hoWeights;
private double[] oBiases;
private double[] outputs;
public RadialNetwork(int numInput, int numHidden, int numOutput) { . . }
private static double[][] MakeMatrix(int rows, int cols) { . . }
public void SetWeights(double[] weights) { . . }
public double[] GetWeights() { . . }
private double MeanSquaredError(double[][] trainData,
double[] weights) { . . }
public double Accuracy(double[][] testData) { . . }
private static int MaxIndex(double[] vector) { . . }
public double[] ComputeOutputs(double[] xValues) { . . }
private static double[] Softmax(double[] rawOutputs) { . . }
public double[] Train(double[][] trainData, int maxIterations) { . . }
private void DoCentroids(double[][] trainData) { . . }
private static double AvgAbsDist(double[] v1, double[] v2,
int numTerms) { . . }
private int[] DistinctIndices(int n, int range) { . . }
private void DoWidths(double[][] centroids) { . . }
private double[] DoWeights(double[][] trainData, int maxIterations) { . . }
private static double EuclideanDist(double[] v1, double[] v2,
int numTerms) { . . }
private static void Shuffle(int[] sequence) { . . }
}
public class Particle
{
// Implementation here
}
public class Helpers
{
// Implementation here
}
}
```

Class RadialNetwork isn’t quite as complex as the program structure suggests because most of the class methods are helpers. Method Train performs the three-phase training process by calling helpers DoCentroids, DoWidths, and DoWeights. Private methods AvgAbsDist and DistinctIndices are helpers for DoCentroids. Method DoWeights uses private method Shuffle to process training data items in a different order each time through the iterative particle swarm optimization algorithm.

The heart of the demo is fairly simple. First, the normalized and encoded data is set up:

```
double[][] allData = new double[30][];
allData[0] = new double[] { -0.784, 1.255, -1.332, -1.306, 0, 0, 1 };
allData[1] = new double[] { -0.995, -0.109, -1.332, -1.306, 0, 0, 1 };
// Etc.
allData[28] = new double[] { 0.904, -1.473, 1.047, 0.756, 1, 0, 0 };
allData[29] = new double[] { 1.431, 1.528, 1.209, 1.659, 1, 0, 0 };
```

Here the data is hardcoded for simplicity. In most scenarios your data will be stored in a text file or a SQL table. Next, the data is split into training and test subsets:

```
double[][] trainData = null;
double[][] testData = null;
int seed = 8;
GetTrainTest(allData, seed, out trainData, out testData);
```

The RBF network is instantiated:

```
int numInput = 4;
int numHidden = 5;
int numOutput = 3;
RadialNetwork rn = new RadialNetwork(numInput,
numHidden, numOutput);
```

As mentioned in the previous section, the optimal number of hidden processing nodes must be determined essentially by trial and error. The network is trained:

```
int maxIterations = 100;
double[] bestWeights = rn.Train(trainData, maxIterations);
```

And, finally, the resulting model is evaluated:

```
rn.SetWeights(bestWeights);
double acc = rn.Accuracy(testData);
Console.WriteLine("Classification accuracy = " +
acc.ToString("F4"));
```

The bestWeights array holds the RBF weights and bias values as determined by the Train method. Method SetWeights loads these weights and bias values. You don’t need to have the centroids and widths explicitly loaded because these values were set by method Train.

## Radial Basis Function Network Input-Process-Output

To understand the RBF network training process, you need to understand the RBF network input-process-output mechanism. The diagram in **Figure 3** shows how the demo RBF network computes the outputs for test data item [1] = (0.482, 0.709, 0.452, 0.498) after the network has been trained. The input x-values are passed to each hidden node. Each hidden node computes its local output using its own centroid and width.

**Figure 3 Radial Basis Function Network Architecture**

For example, the top-most hidden node’s centroid is (-0.362, -2.019, 0.074, 0.112) and its width is 3.3318. The local outputs from each hidden node are then used to determine preliminary output values by computing a weighted sum of inputs, plus a bias value. For example, if hOutput[0] represents the local output of hidden node 0, then the preliminary output for the topmost output node is (hOutput[0] * w[0][0]) + (hOutput[1] * w[1][0]) + (hOutput[2] * w[2][0]) + (hOutput[3] * w[3][0]) + (hOutput[4] * w[4][0]) + bias[0] = -12.7999.

After the three preliminary output values have been computed, they’re converted to final output values using a softmax function. The idea is to modify the preliminary output values so the final output values are all between 0.0 and 1.0, and sum to 1.0. This lets the output values be loosely interpreted as probabilities.

In **Figure 3**, the final outputs are (0.2897, 0.6865, 0.0237). Because the middle node has the highest value, it’s interpreted as a 1, and the other two values are interpreted as 0, giving an inferred output of (0, 1, 0). Recall the test data is (0.482, 0.709, 0.452, 0.498, 0.000, 1.000, 0.000), where the first four values are inputs and the last three values are the target values, so the RBF network makes a correct prediction of the species (Iris versicolor in this case). The question now is: Where did the values for the RBF network’s centroids, widths, weights and biases come from?

## Determining RBF Network Centroids

The Train method of the RadialNetwork class is essentially a wrapper around three helper methods that do all the actual work:

```
public double[] Train(double[][] trainData, int maxIterations)
{
DoCentroids(trainData);
DoWidths(this.centroids);
double[] bestWeights = DoWeights(trainData, maxIterations);
return bestWeights;
}
```

Method DoCentroids determines representative input x-values. There are many possibilities here. One common approach is to use a k-means or k-medoids clustering algorithm that iteratively assigns and reassigns data items so similar data items are grouped together. When finished, each cluster will have a representative data member. You can use these as RBF centroids.

A different approach is to extract the x-values from randomly selected training data items. This is simple, but has the risk that bad centroids may be selected by chance.

The demo program uses what might be termed a lightweight clustering approach suggested by this pseudocode:

```
initialize maxDistance
intialize bestIndices
loop
select numHidden random indices into train data
compute an estimated distance between selected data items
if estimated distance > maxDistance then
set maxDistance = curr distance
set bestIndices = curr indices
end if
end loop
fetch the x-values in train data at bestIndices
store x-values into RBF centroids
```

The idea is best illustrated by example. Suppose the training data consists of the 24 items shown in **Figure 1**. Further suppose that the first time through the processing loop the four randomly selected indices are [0], [1], [2] and [3]. These correspond to:

```
0: ( 1.537, -0.382, 1.317, 0.756)
1: (-0.468, 2.346, -1.170, -1.048)
2: ( 1.115, 0.164, 0.560, 0.370)
3: ( 1.220, 0.436, 0.452, 0.241)
```

These are candidate centroids. The idea is to get representative x-values, which means you don’t want values that are close together. So, you compute some measure of distance between these candidate centroids. Here, there are many possible approaches. The demo estimates an average distance between all possible pairs of candidate centroids by computing an average distance between adjacent pairs of candidates, instead of computing an average distance between all possible pairs. For this example, it computes the distances between candidates [0] and [1], between [1] and [2], and between [2] and [3].

A common approach to computing a distance is to use Euclidean distance, which is the square root of the sum of squared differences between values. (Note: The demo RBF network uses a Gaussian kernel, which uses Euclidean distance to compute hidden node local output values.) However, the demo program uses a variation of Manhattan distance, where distance is an average of the difference of absolute values. So, the distance between candidates [0] and [1] is:

```
d = abs(1.537 - (-0.468)) + . . . + abs(0.756 - (-1.048)) / 4
= 2.256
```

The process of generating a set of candidate centroids and computing an estimated average distance for the set of candidates is repeated a specified number of times, and the set of candidates with the greatest estimated average distance is selected as the RBF centroid set.

Notice that determining RBF network centroids can be considered an unsupervised training technique because the target values (such as 0, 1, 0) in the training data aren’t needed or used. This means centroids can be determined quickly. Also, RBF network widths, weights, and bias values can—in theory at least—be computed much more quickly than the roughly equivalent neural network weights and bias values. This gives RBF networks a potential advantage over neural networks (but, as you’ll see, there’s more to the story).

During the process of determining RBF network centroids, determining the candidate indices is an interesting subproblem. The demo program uses a clever algorithm called reservoir sampling. The idea is to pick the first possible n indices, then probabilistically replace the initial indices with the remaining possible indices:

```
private int[] DistinctIndices(int n, int range)
{
// Reservoir sampling. assumes rnd exists
int[] result = new int[n];
for (int i = 0; i < n; ++i)
result[i] = i;
for (int t = n; t < range; ++t) {
int m = rnd.Next(0, t + 1);
if (m < n) result[m] = t;
}
return result;
}
```

Although the method is short, it’s subtle. Alternatives include using a brute-force approach where random indices are generated and then checked to see if there are any duplicates.

## Determining RBF Network Widths

The RBF network input-process-output mechanism requires a width value for each hidden node. There are many possibilities for determining the width values. The simplest approach, and the one used by the demo program, is to compute one common width, which all hidden processing nodes can use. Research in this area tends to be hazy and conclusions are sometimes contradictory. The demo program computes a common width as the average Euclidean distance between all possible pairs of centroids. In pseudocode:

```
sumOfDists = 0.0
for each pair of centroids
accumulate Euclidean distance between curr pair
end loop
return accumulated sumOfDists / number of pairs
```

Based on my experience, the effectiveness of RBF networks is extremely sensitive to the values used for hidden node widths. Research shows a width that’s too small tends to over-fit the training data, leading to poor classification accuracy. A width that’s too large tends to under-fit the data, which also leads to poor classification. If you experiment with the demo code by manually setting the values of the RBF network widths, you can see this effect in action.

Besides using the average distance between centroids, davg, for a common hidden nodes width value, research also suggests using (2 * davg), or (davg / sqrt(2 * numHidden)), and many other values. And instead of using a common width, there are many possibilities for computing different width values for each hidden node. In my opinion, the high sensitivity of RBF networks to width values, along with the related lack of convincing research results on how to best compute width values, are the major disadvantages of using RBF networks compared to alternatives such as neural networks and support vector machines.

## Determining RBF Network Weights and Biases

After determining centroids and widths, the final step in training an RBF network is determining the values for the weights and biases. Theoretically, you can compute RBF network weights easily and quickly because, loosely speaking, there are n equations with n unknown values. So, standard numerical techniques can, in theory, be used to solve for the weight values.

Unfortunately, in practice, using standard techniques runs into many practical problems. For example, many standard techniques for solving systems of equations involve the use of matrix inversion. Matrix inversion can fail for many reasons.

Rather than use a deterministic but possibly brittle numerical technique to solve for RBF network weights exactly, the demo program uses particle swarm optimization to estimate the best values. PSO is a meta-heuristic based on coordinated group behavior, such as flocks of birds or schools of fish. In PSO, a particle has a position that represents a potential solution (the best set of weight values in this case). Each particle has a velocity that determines the particle’s next position.

In PSO, a set of particles is created. In each simulated time tick, each particle moves to a new position based on the particle’s current position and velocity, the best-known historical position of the particle, and the best-known historical position of any of the particles. Here’s PSO in high-level pseudocode:

```
set number of particles
set maxIterations
initialize all particles to random positions
loop maxIterations times
for each particle
update curr particle's velocity
use new velocity to compute new position
compute error for new position
check if new particle best position
check if new best position for all particles
end for
end loop
return best position found by any particle
```

PSO is a fascinating topic in its own right. You can learn more about it by reading my August 2011 article, “Particle Swarm Optimization” (msdn.microsoft.com/magazine/hh335067). PSO requires the specification of several free parameters, including weight constants that control the relative influence of a particle’s current position, best historical position and best global historical position. PSO also requires specifying the number of particles, the maximum number of iterations and, optionally, an error threshold for early algorithm exit. You can experiment with these factors using the demo code.

In addition to PSO and traditional numerical techniques, there are many alternatives for finding RBF network weights, including simple gradient descent, and real-valued genetic algorithms. Although the theory of RBF networks has been studied fairly extensively, there are relatively few convincing research results on the comparative effectiveness of different training techniques.

## Wrapping Up

The demo program code along with the explanation presented here should give you a solid foundation for investigating RBF networks. Although RBF networks are well-known in the research community, they don’t seem to be used very often in the software developer community compared to alternatives such as neural network classifiers, naive Bayes classifiers, and logistic regression. One possible reason for this could be the scarcity of practical implementation examples. Another possible reason is the uncertainty surrounding fundamental RBF network factors, especially those related to the computation of RBF network widths. In my opinion, there’s no solid research evidence to answer the question of whether RBF networks are more effective, less effective, or roughly equivalent to alternative machine-learning techniques.

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. He can be reached at jammc@microsoft.com.*

Thanks to the following technical expert for reviewing this article: Kirk Olynyk (Microsoft Research)