January 2019

Volume 34 Number 1

# Machine Learning Through Probabilistic Programming

By Yordan Zaykov | January 2019

Programming that’s probabilistic? Really? That doesn’t make much sense ... Or that’s what I thought when I started to work in this domain. The researchers I was listening to didn’t have the traditional view of placing machine learning (ML) problems into categories. Instead, they just expressed the real-world processes that lead to the generation of their data in a computer-readable form. That’s what they call a model and its representation—a probabilistic program.

The paradigm is actually quite appealing. First, you don’t need to learn the hundreds of ML algorithms available out there. You just have to learn how to express your problems in a probabilistic program. This involves some knowledge of statistics, because you’re modeling the uncertainty of the real world. Let’s say, for example, you want to predict the price of a house, which you decide is a linear combination of some features (location, size and so forth). Your model is that the price is the sum of the products of each feature and some weight, with noise added. This so happens to be called a linear regressor. In abstract terms:

For each house:
score = Weights * house.Features;
house.Price = score + Noise;

You can learn the weights during training and then use them directly in prediction. If you make a tiny change in your model by having the noisy score at the end thresholded against zero, the new label is suddenly one of two classes. You’ve just modeled a binary linear classifier, maybe without even knowing what it’s called.

Second, you don’t need to try and adapt your problem and data to one of the existing ML algorithms. This should be obvious—you designed the model for your problem, so it fits your data. Modern probabilistic programming tools can automatically generate an ML algorithm from the model you specified, using a general-purpose inference method. You don’t even need to know much about it, because it’s already implemented for you. So, an application-specific model combined with a general-purpose inference method gives an application-specific ML algorithm.

Finally, the approach seems to withstand the test of time. Most successful ML algorithms fit within this paradigm—a model, representing a set of assumptions, plus an inference method, which does the computations. The methodology has evolved over time. These days, deep neural networks are popular; the model is a composition of thresholded linear functions, and the inference method is called stochastic gradient descent (SGD). Modifying the structure of the network to be recurrent or convolutional; that is, changing the model allows targeting different applications. But the inference method can remain the same—SGD, as shown in Figure 1. So, though the choice of model has evolved over the years, the general approach of designing a model and applying an inference method has remained constant.

Figure 1 Different Models for Different Applications, All Using the Same Inference Method

So, the task for the developer is to come up with a model for his or her application. Let’s learn how to do this.

## Hello Uncertain World!

The first program everyone writes in a new language is “Hello world.” The equivalent in a probabilistic setup is, of course, “Hello uncertain world.” You can think of the probabilistic program as a simulation or a data sampler. I’ll start with some parameters and use them to generate data. For instance, let’s have two strings that are completely unknown. That is, they can likely be any string, or in more statistical terms, they’re string random variables drawn from a uniform distribution. I’ll concatenate them with a space in the middle and constrain the result to be the string, “Hello uncertain world”:

str1 = String.Uniform()
str2 = String.Uniform()
Constrain(str1 + " " + str2 == "Hello uncertain world")
Print(str1)
Print(str2)

This is a probabilistic model—I laid out my assumptions of how the data got generated. Now I can run some inference method that will do the necessary computations. The space between the two strings can be either of the two spaces—the one between “Hello” and “uncertain,” or the one between “uncertain” and “world.” So I can’t be sure about the value of either of the two variables. Therefore, the result is a distribution that captures the uncertainty about the possible values. Str1 is equally likely to be “Hello” or “Hello uncertain,” and str2 is 50 percent “uncertain world” and 50 percent “world.”

## A Real Example

Let’s move on to a more realistic example now. Probabilistic programming can be used to solve an enormous range of ML problems. For instance, my team developed a recommender system some time ago and shipped it in Azure Machine Learning. Before that, we productized an e-mail classifier in Exchange. Now, we’re working on improving the player matchmaking in Xbox by upgrading the skill-rating system. We’re also in the process of developing a system for Bing that automatically extracts knowledge from the Internet by modeling unstructured text. All of this is achieved using the same paradigm—define the model as a set of assumptions represented in a probabilistic program, and then have a general-purpose inference method do the necessary computations.

The skill-rating system—called TrueSkill—demonstrates many of the advantages of probabilistic programming, including the ability to interpret the behavior of the system, to incorporate domain knowledge in the model, and to learn as new data arrives. So, let’s implement a simplified version of the model that’s running in production in blockbuster titles like “Halo” and “Gears of War.”

Problem and Data The problem to solve is rating players in games. This has many uses, such as player matchmaking (where fair and enjoyable games are achieved by matching players or teams with similar skills). For simplicity, let’s assume there are only two participants in each match and that the outcome is a win or a loss with no draws. So, the data for each match will be the unique identifiers of the two players and an indicator of who wins. I’ll use a small handmade dataset in this article, but the approach scales to hundreds of millions of matches in Xbox.

What I’d like to do is learn the skills of all players and be able to predict the winner of future matchups. A naïve approach would be to simply count the number of wins and losses of each player, but this doesn’t consider the strength of the opponents in these matches. There’s a much better way.

Model Design Let’s start by making assumptions about how the data came into existence. First, each player has some hidden (or latent) skill that’s never observed directly—you only see the effects of their skill. I’ll assume this is a real number, but I also need to specify how it was generated. A reasonable choice is that it was generated from a normal (or Gaussian) distribution. In a more complex example, the parameters of this Gaussian would be unknown and learnable. For simplicity I’ll set them directly.

The model of the skill can be represented graphically, as shown in the leftmost sketch in Figure 2, which simply says that the random variable skill is drawn from a normal distribution.

Figure 2 The Composition of a Two-Player Game

Another assumption I can make is that in each match players have a performance number. This number is close to their latent skill, but can vary above or below it, depending on whether the player plays better or worse than their typical level. In other words, the performance is a noisy version of the skill. And noise is typically also modeled to be Gaussian, as depicted in the center diagram in Figure 2. Here, the performance is a random variable drawn from a Gaussian whose mean is the player’s skill and whose variance is fixed, indicated by the hatched noise variable. In a more complicated model I’d try to learn the noise variance from the data, but for simplicity here I’ll fix it at 1.

The player who performs better wins. In a two-player match I can represent this graphically as shown in the rightmost diagram in Figure 2. I cheated a bit in this notation by having the Boolean Player 1 Wins variable “somewhat” hatched. That’s because its value is observed during training, where the outcomes of the matches are given, but it’s not observed during prediction.

Before putting this all together, I need to introduce a couple of new notations. The first is called a plate and represents a foreach loop. It’s a rectangle that captures parts of the model that need to be repeated over a given range (for example, over players or over games). Second, I’ll use dashed lines to indicate selection, as when I select the skills of the two players in each game. The simplified TrueSkill model is shown in Figure 3.

Figure 3 The Simplified TrueSkill Model

Here, I use a plate over the range of players. Then, for each game, I select the latent skills of the two players in the game, add some noise and compare the performances. The noise variable isn’t inside any plates because its value is assumed not to change per player or game.

This visualization of the model, also called a factor graph, is a convenient representation in this simple case. But when models grow larger, the diagram becomes confusing and difficult to maintain. That’s why developers prefer to express it in code—as a probabilistic program.

## Infer.NET

The probabilistic programming framework for .NET is called Infer.NET. Developed by Microsoft Research, it went open source a couple of months ago. Infer.NET provides a modeling API to specify the statistical model, a compiler to generate the ML algorithm from the user-defined model, and a runtime on which the algorithm executes.

Infer.NET is becoming more tightly integrated with ML.NET and is now under the Microsoft.ML.Probabilistic namespace. You can install Infer.NET by running:

The compiler will automatically pull down the runtime package. Note that Infer.NET runs on .NET Standard and, therefore, on Windows, Linux and macOS.

Let’s use C#, and start by including the Infer.NET namespaces:

using Microsoft.ML.Probabilistic.Models;
using Microsoft.ML.Probabilistic.Utilities;
using Microsoft.ML.Probabilistic.Distributions;

Then, I’ll implement the model I defined earlier and see how to train it and how to make predictions.

Model Implementation The model is written like a program that simulates a game, in terms of player skill, performance and outcome, but this program will not actually be run. Behind the scenes, it builds up a data structure that represents the model. When this data structure is given to an inference engine, it’s compiled into ML code, which is then executed to perform the actual computation. Let’s implement the model in three steps: define the skeleton of the plates over players and games, define the content of the player plate, and define the content of the game plate.

Plates are implemented in Infer.NET using the Range class. Its instances are basically control variables in probabilistic foreach loops. I need to define the size of these plates, which is the number of players and games, respectively. Because I don’t know these upfront, they’ll be variables used as placeholders for these values. Conveniently, Infer.NET provides the Variable class for just this purpose:

var playerCount = Variable.New<int>();
var player = new Range(playerCount);
var gameCount = Variable.New<int>();
var game = new Range(gameCount);

With the plates defined, let’s focus on their contents. For the player plate, I’ll need an array of double random variables for the skills of each player. In Infer.NET this is done using Variable.Array<T>. I’ll also need an array of Gaussian random variables for the prior distribution over the skills of the players. Then, I’ll go over the players and connect their skills to the priors. This is achieved using the Variable<T>.Random method. Note how Variable.ForEach(Range) provides the means to implement the internals of the plate:

var playerSkills = Variable.Array<double>(player);
var playerSkillsPrior = Variable.Array<Gaussian>(player);
using (Variable.ForEach(player))
{
playerSkills[player] = Variable<double>.Random(playerSkillsPrior[player]);
}

The last piece of the model is the game plate. I’ll start by defining the arrays that will contain the training data—the first and second player in each game and the outcomes of the games. Notice how I’m creating a model specifically for my data, as opposed to trying to rejig the data to fit some algorithm. With the data containers ready, I need to go over the games, select the skills of the players in each game, add some noise to them, and compare the performances to generate the game outcome:

var players1 = Variable.Array<int>(game);
var players2 = Variable.Array<int>(game);
var player1Wins = Variable.Array<bool>(game);
const double noise = 1.0;
using (Variable.ForEach(game))
{
var player1Skill = playerSkills[players1[game]];
var player1Performance =
Variable.GaussianFromMeanAndVariance(player1Skill, noise);
var player2Skill = playerSkills[players2[game]];
var player2Performance =
Variable.GaussianFromMeanAndVariance(player2Skill, noise);
player1Wins[game] = player1Performance > player2Performance;
}

Interestingly, the same model is used for both training and prediction. The difference is that the observed data will be different. During training, you know the match outcomes, while in prediction you don’t. So, while the model remains the same, the generated algorithms will differ. Luckily, the Infer.NET compiler takes care of all that.

Training All queries to the model (training, prediction and so forth) go through three steps: set priors, observe data, run inference. Both training and prediction are called “inference” because fundamentally they do the same thing: They use observed data to move from a prior distribution to a posterior distribution. In training, you start from a broad prior distribution over skills, indicating that uncertainty over the skills is high. I’ll use a Gaussian distribution for the prior. After observing data, I obtain a narrower Gaussian posterior distribution over the skills.

For the prior on skills I’ll just borrow parameters learned from “Halo 5”—a good choice for mean and variance are 6.0 and 9.0, respectively. I set these values by assigning them to the ObservedValue property of the variable holding the prior. For four players, the code would look like this:

const int PlayerCount = 4;
playerSkillsPrior.ObservedValue =
Enumerable.Repeat(Gaussian.FromMeanAndVariance(6, 9),
PlayerCount).ToArray();

Next comes the data. For each game, I have the two players and the outcome. Let’s work with a fixed example. Figure 4 shows three matches played by four players, with arrows indicating a match that was played, each pointing to the loser of the match.

Figure 4 The Results of Three Matches

To simplify the code, I’ll assume each player has been assigned a unique id. In this example, the first game is between Alice and Bob, and the arrow indicates that Alice wins. In the second game Bob beats Charlie, and finally Donna beats Charlie. Here it’s expressed in code, again using ObservedValue:

playerCount.ObservedValue = PlayerCount;
gameCount.ObservedValue = 3;
players1.ObservedValue = new[] { 0, 1, 2 };
players2.ObservedValue = new[] { 1, 2, 3 };
player1Wins.ObservedValue = new[] { true, true, false };

Finally, I run inference by instantiating an inference engine and calling Infer on the variables I want to obtain. In this case, I’m interested only in the posterior over the player skills:

var inferenceEngine = new InferenceEngine();
var inferredSkills = inferenceEngine.Infer<Gaussian[]>(playerSkills);

The Infer statement here actually does a lot of work because it performs two compilations (Infer.NET and C#) and an execution. What happens is that the Infer.NET compiler traces the passed playerSkills variable to the probabilistic model, builds an abstract syntax tree from the model code and generates an inference algorithm in C#. Then the C# compiler gets invoked on the fly and the algorithm is executed against the observed data. Because of all that, you might find the calls to Infer somewhat slow, even though you’re operating on very little data here. The Infer.NET manual explains how to speed this process up, by working with precompiled algorithms. That is, you perform the compilation steps upfront, so that only the computation part is executed in production.

For this example, the inferred skills are:

Alice: Gaussian(8.147, 5.685)
Bob: Gaussian(5.722, 4.482)
Charlie: Gaussian(3.067, 4.814)
Donna: Gaussian(7.065, 6.588)

Prediction Having inferred the player skills, I can now make predictions for future matches. I’ll follow the same three steps as in training, but this time I’ll be interested in inferring the posterior over the player1Wins variable. The way I like to think about this is that in training the information flows upward in the factor graph—from the data at the bottom to the model parameters at the top. By contrast, in prediction you already have the model parameters (learned in training) and information flows downward.

In my dataset, both Alice and Donna have one win and no losses. However, intuitively it feels like in a match between the two, Alice has higher chances of winning because her win is of more significance—it’s against Bob, who is more likely a stronger player than Charlie. Let’s try to predict the outcome of the match between Alice and Donna (see Figure 5).

Figure 5 Predicting the Outcome of a Match

In this case, the prior distribution over skills is the posterior distribution inferred in training. The observed data is one game between player 0 (Alice) and player 3 (Donna), with an unknown outcome. To make the outcome unknown, I need to clear the previously observed value of player1Wins, because this is what I want a posterior over:

playerSkillsPrior.ObservedValue = inferredSkills;
gameCount.ObservedValue = 1;
players1.ObservedValue = new[] { 0 };
players2.ObservedValue = new[] { 3 };
player1Wins.ClearObservedValue();
var player0Vs3 = inferenceEngine.Infer<Bernoulli[]>(player1Wins).First();

It’s worth mentioning that uncertainty is propagated through this model all the way to the outcome of the match. This means that the obtained posterior over the predicted result is not simply a Boolean variable, but a value indicating the probability of the first player to win the match. Such a distribution is called a Bernoulli.

The value of the inferred variable is Bernoulli(0.6127). This means that Alice has a greater than 60 percent chance of winning the match against Donna, which aligns with my intuition.

Evaluation This example demonstrates how to build a well-known probabilistic model—TrueSkill. In practice, coming up with the right model requires multiple iterations over its design. Different models are compared by carefully selecting a set of metrics that indicate the model performance on the given data.

Evaluation is a central topic in ML, far too broad to cover here. It’s also not something specific to probabilistic programming. It’s worth mentioning, though, that having a model allows you to compute a unique “metric”—the model evidence. This is the probability that the training data was produced by this particular model. It’s great for comparing different models—no need for a test set!

## Benefits of Probabilistic Programming

The approach shown in this article probably looks more difficult than what you’ve seen before. For instance, the Connect(); special  issue of the magazine (msdn.com/magazine/mt848634) introduced you to ML.NET, which focuses less on model design and more on data transformation. And in many cases, that’s the right path to follow—if the data seems to match an existing algorithm and you’re comfortable with treating the model as a black box, start with something off the shelf. However, if you need a bespoke model, probabilistic programming may be the right choice for you.

There are other cases when you might want to use probabilistic programming. One of the main advantages of having a model you understand is your improved ability to explain the behavior of the system. When the model isn’t a black box, you can examine its internals and look at the learned parameters. These would make sense to you because you designed the model. For instance, in the previous example you can look at the learned skills of the players. In a more advanced version of TrueSkill, called TrueSkill 2, many more aspects of the game are modeled, including how performance in one game mode is connected to that in another. Understanding this connection helps game designers realize how similar the different game modes are. Interpretability of an ML system is also crucial for debugging. When a black-box model doesn’t produce the desired results, you may not even know where to start looking for the problem.

Another advantage of probabilistic programming is the capability to incorporate domain knowledge in the model. This is captured both in the structure of the model and the ability to set priors. By contrast, traditional approaches usually only look at data, without allowing the domain expert to inform the behavior of the system. This capability is required in certain domains, such as health care, where there’s strong domain knowledge and data can be scarce.

An advantage of the Bayesian approach, and one that is very well supported in Infer.NET, is the ability to learn as new data arrives. This is called online inference, and it’s particularly useful in systems that interact with user data. For example, TrueSkill needs to update the skills of the players after each match, and the knowledge extraction system needs to continuously learn from the Internet as it grows. This is all so easy, though—you just plug in the inferred posteriors as the new priors, and the system is ready to learn from new data!

Probabilistic programming also naturally applies to problems with certain data traits, such as heterogeneous data, scarce data, unlabeled data, data with missing parts and data collected with known biases.

## What Next?

There are two ingredients to successfully building a probabilistic model. The first one, obviously, is to learn how to model. In this article I introduced the main concepts and techniques of the methodology, but if you want to learn more, I recommend a freely available online book, “Model-Based Machine Learning” (mbmlbook.com). It gives a gentle introduction to model-based ML in production and is targeted at developers (as opposed to scientists).

Once you know how to design models, you have to learn how to express them in code. The Infer.NET user guide is an excellent resource for this. There are also tutorials and examples that cover many scenarios, which can all be reached from our repository at github.com/dotnet/infer, where we also warmly welcome you to join and contribute.

Yordan Zaykov is the principal research software engineering lead of the Probabilistic Inference Development team at Microsoft Research Cambridge. He focuses on applications based on the Infer.NET ML framework, including work on e-mail classification, recommendations, player ranking and matchmaking, health care, and knowledge mining.

Thanks to the following Microsoft technical experts who reviewed this article: John Guiver, James McCaffrey, Tom Minka, John Winn