May 2013

Volume 28 Number 05

# CLR - Shortest-Path Graph Analysis Using a CLR Stored Procedure

By James McCaffrey | May 2013

Graph analysis is becoming increasingly important in software applications. Here a graph is a collection of nodes and edges, not a data visualization such as a bar chart. This article presents a demonstration of how to perform shortest-path analysis using a SQL CLR stored procedure. The techniques presented here can also be used for many other data-access programming tasks.

Shortest-path graph analysis really involves two closely related problems. The first is to determine the shortest path from a specified graph start node to an end node in terms of number of hops. The second problem is to determine the length of the shortest path if graph connections have some kind of weight. For example, suppose the nodes in a graph represent people and the edges between nodes represent an e-mail communication. You might be interested in the shortest number of hops between two people, where you’re implicitly assuming that each hop has a weight of 1. This is similar to the “six degrees of Kevin Bacon” game or the Erdos number for a researcher. If each graph edge has a weight—for example, representing some measure of importance of a relationship—you might want to determine the shortest path between two people taking the importance of weight into account.

But why use a CLR stored procedure? Traditional shortest-path algorithms assume that the problem graph representation can be completely stored in machine memory, typically in a matrix or adjacency list. For large graphs—for example, graphs representing social networks—this approach often isn’t feasible. Large graphs can be conveniently stored in a SQL database. One approach for performing shortest-path analysis of graphs stored in a SQL database is to write a native SQL language stored procedure. The MSDN Magazine article, “Graph-Based Shortest-Path Analysis with SQL” (msdn.microsoft.com/magazine/jj863138), explains that approach in detail. However, using a CLR stored procedure instead of a pure SQL approach can provide dramatically better performance and much greater flexibility for customization.

Take a look at the dummy graph representation in Figure 1. The graph has eight nodes. The numbers next to each arrow represent a weight. The shortest path between node 222 and node 444 is 222 -> 555 -> 666 -> 777 -> 444, which has a weighted distance 1.0 + 1.0 + 1.0 + 2.0 = 5.0. Notice that 222 -> 333 -> 666 -> 777 -> 444 is also a shortest path from 222 to 444.

Figure 1 Dummy Graph for Shortest-Path Analysis

Figure 2 shows a screenshot of a call to a CLR stored procedure named csp_ShortestPath that determines the shortest path between node 222 and node 444. In this case the shortest path is displayed as a semicolon-delimited string in reverse order. The output is at the bottom of the image. The top part of the image contains SQL statements that create a graph corresponding to the one in Figure 1.

Figure 2 Calling a Shortest-Path CLR Stored Procedure

This article assumes you have advanced C# programming skills and a basic familiarity with SQL. I present all the SQL code to create the dummy graph and all the C# code to create the CLR stored procedure, and I also describe several ways to call the CLR stored procedure. All the code for this article is available at msdn.com/magazine/msdnmag0513.

## Creating the Database

To create the dummy SQL-based graph, I launched Microsoft SQL Server Management Studio (SSMS) and connected to an instance of a SQL Server 2008 database. CLR stored procedures are supported by SQL Server 2005 and later. First, I created a database named dbShortPathWithCLR using the following commands:

``````use master
go
if exists(select * from sys.sysdatabases
where name='dbShortPathWithCLR')
drop database dbShortPathWithCLR
go
create database dbShortPathWithCLR
go
use dbShortPathWithCLR
go
``````

I strongly recommend experimenting with a dummy database rather than with a production database. The commands to create a table that hold node and edge data are:

``````create table tblGraph
(
fromNode bigint not null,
toNode bigint not null,
edgeWeight real not null
)
go
``````

Node IDs are stored as SQL type bigint, which roughly corresponds to C# type long. The edge weights are stored as type real, which is a synonym for SQL type float(24), which roughly corresponds to C# type double. In many situations you won’t be concerned with an edge weight and the edgeWeight column can be omitted.

The 14 statements in Figure 3 define the graph.

Figure 3 Defining the Graph

``````insert into tblGraph values(111,222,1.0)
insert into tblGraph values(111,555,1.0)
insert into tblGraph values(222,111,2.0)
insert into tblGraph values(222,333,1.0)
insert into tblGraph values(222,555,1.0)
insert into tblGraph values(333,666,1.0)
insert into tblGraph values(333,888,1.0)
insert into tblGraph values(444,888,1.0)
insert into tblGraph values(555,111,2.0)
insert into tblGraph values(555,666,1.0)
insert into tblGraph values(666,333,2.0)
insert into tblGraph values(666,777,1.0)
insert into tblGraph values(777,444,2.0)
insert into tblGraph values(777,888,1.0)
go
``````

If you compare the statements in Figure 3 with the image in Figure 1 you’ll see that each statement corresponds to an edge between nodes.

Next, the database is prepared for shortest-path analysis:

``````create nonclustered index idxTblGraphFromNode
on tblGraph(fromNode)
go
create nonclustered index idxTblGraphToNode
on tblGraph(toNode)
go
create nonclustered index idxTblGraphEdgeWeight
on tblGraph(edgeWeight)
go
alter database dbShortPathWithCLR set trustworthy on
go
select is_trustworthy_on from sys.databases
where name = 'dbShortPathWithCLR'
go
``````

The first three statements create indexes on the fromNode, toNode and edgeWeight columns. When working with large graphs, indexes are almost always necessary to give reasonable performance. The next two statements alter the database so the trustworthy property is set to “on.” The default value of the property is “off.” I’ll explain why the trustworthy property must be set to “on” shortly.

At this point the dummy SQL-based graph is created. The next step is to use Visual Studio to create a CLR stored procedure to perform shortest-path analysis.

## Creating the CLR Stored Procedure

To create the CLR shortest-path stored procedure, I launched Visual Studio 2010. To create CLR stored procedures your development environment must include the Microsoft .NET Framework 3.5 or later. From the File | New | Project menu I selected the Database | SQL Server project template group and then selected the Visual C# SQL CLR Database Project option as shown in Figure 4. I named the project CreateStoredProc.

Figure 4 A New CLR Project in Visual Studio 2010

Notice that the .NET Framework 3.5 is selected in the New Project dialog dropdown control. The version of the framework to target must match the version of the framework on the SQL Server hosting the database. Because the dummy database is on an instance of SQL Server 2008, I targeted .NET Framework 3.5. If the dummy database had been on an instance of SQL Server 2005, I would’ve targeted .NET Framework 2.0. The CLR stored procedure documentation is a bit murky in describing the correlations between the framework versions on the development environment, on the SQL Server and in the C# stored procedure creation project. You might have to resort to a bit of trial and error.

After clicking OK to create the CLR stored procedure creation project, Visual Studio prompts with a request for information about the name and authentication model of the target database (see Figure 5). After clicking OK on the New Database Reference dialog, Visual Studio loads a new project but doesn’t directly create a template. To generate a template, right-click on the project name—CreateStoredProc in this case—and select Add | New Item from the context menu (see Figure 6).

Figure 5 New Database Reference for the Project

Figure 6 New Item Stored Procedure

I selected the Stored Procedure item and named it csp_ShortestPath.cs. This name, without the .cs extension, will become the name of the stored procedure in the target database. As a matter of style, I generally prepend CLR stored procedure names with csp to distinguish them from system stored procedures (sp), extended stored procedures (xp) and user-defined stored procedures (usp). After clicking the Add button, Visual Studio generated a lightweight template of a partial class named StoredProcedures:

``````using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class StoredProcedures
{
[Microsoft.SqlServer.Server.SqlProcedure]
public static void csp_ShortestPath()
{
}
};
``````

## A Poor Man’s Priority Queue

The shortest-path algorithm presented in this article requires a random-access priority queue data structure. The .NET Framework doesn’t supply a built-in priority queue that will exactly meet the needs of the shortest-path algorithm, so you must code your own priority queue.

A priority queue is similar to a normal first-in-first-out (FIFO) queue with a few differences. The items in a priority queue are assumed to have some sort of priority field in addition to a data field. For example, a priority queue item for a group of customers waiting for technical support might consist of a customer’s name and the length of time the customer has been waiting for service.

Priority queues support a Dequeue operation that always removes the item with the highest priority. Here, the meaning of highest depends on the problem context and can be a largest value or a smallest value. A random-access priority queue supports an operation that modifies the priority field of an item with a specified ID.

This article presents a poor man’s priority queue—one that gets the job done but has much room for improvement. The shortest-path algorithm operates on a priority queue where items in the queue have two fields: a node ID (such as 111 or 222) and a distance field. The distance field is used by the algorithm to store the best (shortest)  known distance between the start node and the item node at any given time during the algorithm’s execution. The distance field acts as the item’s priority, and smaller values for distance represent higher priority.

So, to support a random-access priority queue, the C# stored procedure template needs an additional using statement that references the System.Collections.Generic namespace and two additional program-defined class definitions placed below the partial StoredProcedures class:

``````public class NodeDistance
{
// Definition here
}
public class MyPriorityQueue
{
// Definition here
}
Class NodeDistance is simple:
public class NodeDistance
{
public long nodeID;
public double distance;
public NodeDistance(long nodeID, double distance)
{
this.nodeID = nodeID;
this.distance = distance;
}
}
``````

I use public scope for the class fields for simplicity. The priority queue is essentially a List of NodeDistance items:

``````public class MyPriorityQueue
{
public List<NodeDistance> list;
public MyPriorityQueue()
{
this.list = new List<NodeDistance>();
}
``````

Again, I use public scope for simplicity only. The Dequeue operation removes the NodeDistance item with the smallest distance value:

``````public NodeDistance Dequeue()
{
int i = IndexOfSmallestDist();
NodeDistance result = this.list[i];
this.list.RemoveAt(i);
return result;
}
``````

Method Dequeue calls a helper method to find the location of the item that has the smallest distance, saves a reference to that item and then uses the built-in List.RemoveAt method to remove the item.

Helper method IndexOfSmallestDist is defined as:

``````private int IndexOfSmallestDist() {
double smallDist = this.list[0].distance;
int smallIndex = 0;
for (int i = 0; i < list.Count; ++i) {
if (list[i].distance < smallDist) {
smallDist = list[i].distance;
smallIndex = i;
}
}
return smallIndex;
}
``````

The method does a simple linear search through the underlying List collection. Note that this approach means that Dequeue has O(n) performance.

The Enqueue operation is defined as:

``````public void Enqueue(NodeDistance nd)
{
}
``````

The change-priority operation is defined as:

``````public void ChangePriority(long nodeID, double newDist)
{
int i = IndexOf(nodeID);
list[i].distance = newDist;
}
``````

Method ChangePriority calls helper method IndexOf that locates the position of a NodeDistance item, given the item’s ID, and is defined as:

``````private int IndexOf(long nodeID)
{
for (int i = 0; i < list.Count; ++i)
if (list[i].nodeID == nodeID)
return i;
return -1;
}
``````

Again, note that because the IndexOf method performs a linear search, its performance is O(n).

The shortest-path algorithm needs to know the number of items in the priority queue at any given time:

``````public int Count()
{
return this.list.Count;
}
``````

To summarize, the poor man’s random-access priority queue defined here supports an Enqueue operation; a Count property; a Dequeue operation that removes the NodeDistance item with the smallest distance; and a ChangePriority operation that modifies the distance of a specified item. Operations Dequeue and ChangePriority have O(n) performance.

For large graphs, the performance of the shortest-path algorithm is highly dependent on the efficiency of the priority queue used. While O(n) performance is often acceptable, it’s possible to implement a priority queue that has much better O(log n) performance. Such implementations typically use a binary heap data structure and are quite tricky. I describe an efficient priority queue in my Visual Studio Magazine article, “Priority Queues with C#,” available at bit.ly/QWU1Hv.

## Creating the Stored Procedure

With the custom priority queue defined, the shortest-path algorithm can be implemented. The method signature is:

``````public static void csp_ShortestPath(System.Data.SqlTypes.SqlInt64
startNode, SqlInt64 endNode, SqlInt32 maxNodesToCheck,
out SqlString pathResult, out SqlDouble distResult)
{
``````

Method csp_ShortestPath accepts three input parameters and has two out result parameters. Parameter startNode holds the node ID of the node to begin the search. Recall that in the SQL table definition, columns fromNode and toNode are defined as SQL type bigint, which corresponds to C# type long. When defining a CLR stored procedure, the procedure’s parameters use a type model defined in namespace System.Data.SqlTypes. These types are usually pretty easy to map to SQL types and C# types. Here, type bigint maps to SqlInt64. Input parameter endNode is also declared as type SqlInt64.

Input parameter maxNodesToCheck is used to prevent the stored procedure from spinning out of control on extremely large graphs. Here, maxNodesToCheck is declared as type SqlInt32, which corresponds to C# type int.

If you’re new to SQL stored procedures, you might be surprised to learn that they can have no explicit return value, or can return an int value, but they can’t explicitly return any other data type. So in situations where a SQL stored procedure must return two or more values, or return a value that isn’t an int type, the approach taken is to use out parameters. Here, the CLR stored procedure returns the shortest path as a string, such as “444;777;666;333;222,” and also returns the total distance of the shortest path as a numeric value, such as 5.0. So out parameter pathResult is declared as type SqlString and out parameter distResult is declared as type Sql­Double, corresponding to C# types string and double, respectively.

The CLR stored procedure definition continues by setting up four data structures:

``````Dictionary<long, double> distance = new Dictionary<long, double>();
Dictionary<long, long> previous = new Dictionary<long, long>();
Dictionary<long, bool> beenAdded = new Dictionary<long, bool>();
MyPriorityQueue PQ = new MyPriorityQueue();
``````

As the shortest-path algorithm executes, at any given point, the algorithm needs to access the current best (shortest) known total distance from the start node to all other nodes. The Dictionary collection named “distance” holds this information. The dictionary key is a node ID and the dictionary value is the shortest known total distance. The Dictionary collection named “previous” stores the immediate predecessor node ID to a key node ID in the shortest path. For example, in the example shown in Figure 2, the end node is 444. Its immediate predecessor in the shortest path is 777. So previous[444] = 777. In essence the previous collection holds the actual shortest path in an encoded way.

The Dictionary collection named “beenAdded” holds information that indicates whether a graph node has been added to the algorithm’s priority queue. The Boolean value is a dummy value because it isn’t really needed to determine if a node is in the collection. You might want to use a Hashtable instead of a Dictionary collection. The custom priority queue named “PQ” was defined and explained in the previous section of this article.

The stored procedure definition continues:

``````SqlConnection conn = null;
long startNodeAsLong = (long)startNode;
long endNodeAsLong = (long)endNode;
int maxNodesToCheckAsInt = (int)maxNodesToCheck;
``````

The SqlConnection object is the single connection to the target graph database. I declare it here so that I can check its state later if an Exception occurs. Although not strictly necessary, when writing CLR stored procedures I prefer to explicitly create local C# typed variables that correspond to the SQL typed parameter variables.

The definition continues:

``````distance[startNodeAsLong] = 0.0;
previous[startNodeAsLong] = -1;
PQ.Enqueue(new NodeDistance(startNodeAsLong, 0.0));
``````

These lines of code initialize the start node. The value in the distance Dictionary is set to 0.0 because the distance from the start node to itself is zero. The immediate predecessor to the start node doesn’t exist so a value of -1 is used to indicate this. The priority queue is initialized with the start node, and the beenAdded dictionary collection is updated.

The definition continues:

``````try
{
string connString = "Server=(local);Database=dbShortPathWithCLR;" +
"Trusted_Connection=True;MultipleActiveResultSets=True";
conn = new SqlConnection(connString);
conn.Open();
double alt = 0.0;  // 'Test distance'
``````

When writing CLR stored procedures I prefer using explicit try-catch blocks rather than the more elegant using statement. When setting up the connection string you have two options. In many cases, because the stored procedure is running on the same machine as the SQL database, you can simply set the connection string to “context connection=true.” A context connection in theory will deliver better performance than a standard connection. However, a context connection has several limitations. One limitation is that it can support only a single SQL data reader. As you’ll see shortly, shortest-path analysis often (but not always) requires two SQL data readers.

Because this CLR stored procedure requires two readers, a standard connection string that includes a clause that sets the MultipleActiveResultSets property to true is used. This clause isn’t currently supported by SQL context connections. Because the stored procedure is using a standard connection rather than a context connection, the Database Permission Level for the Visual Studio stored procedure creation project must be set to External, as shown in Figure 7. However, in order to set this property the SQL database must have its trustworthy property set to “on,” as shown back in Figure 2.

Figure 7 Setting Database Permission Level

To summarize, when creating the graph database, the database trustworthy property is set to “on.” This allows the Visual Studio project Database Permission Level property to be set to External. This allows the stored procedure definition to use a standard connection rather than a context connection. This allows the connection’s MultipleActiveResultSets property to be set to true. And this allows two SQL data readers in the stored procedure.

The stored procedure definition continues:

``````while (PQ.Count() > 0 &&
{
NodeDistance u = PQ.Dequeue();
if (u.nodeID == endNode) break;
``````

The algorithm used here is a variation of Dijkstra’s shortest path, one of the most famous in computer science. Although short, the algorithm is very subtle and can be modified in many ways. The heart of the algorithm is a loop that terminates when the priority queue is empty. Here, an additional sanity check based on the total number of graph nodes processed is added. Inside the main loop, the priority queue Dequeue call returns the node in the queue that has the best (shortest) known total distance from the start node. If the node just removed from the priority queue is the end node, then the shortest path has been found and the loop terminates.

The definition continues:

``````SqlCommand cmd = new SqlCommand(
"select toNode from tblGraph where fromNode=" + u.nodeID);
cmd.Connection = conn;
long v;  // ID of a neighbor to u
v = long.Parse(sdr[0].ToString());
distance[v] = double.MaxValue;
previous[v] = -1;
PQ.Enqueue(new NodeDistance(v, double.MaxValue));
}
``````

These lines of code get all neighbors to the current node u. Notice that this requires a first SQL data reader. Each neighbor node v is checked to see if it’s the first appearance of the node in the algorithm. If so, a NodeDistance item with v as its nodeID is instantiated and added to the priority queue. Instead of adding nodes to the priority queue as they’re encountered, an alternative design is to initially add all nodes in the graph to the priority queue. However, for very large graphs this could require a very large amount of machine memory and take a very long time.

``````SqlCommand distCmd =
new SqlCommand("select edgeWeight from tblGraph where fromNode=" +
u.nodeID + " and toNode=" + v);
distCmd.Connection = conn;
double d = Convert.ToDouble(distCmd.ExecuteScalar());
alt = distance[u.nodeID] + d;
``````

This code queries the database to get the distance from the current node u to the current neighbor node v. Notice a second data reader is needed to do this. The existence of the second data reader necessitates the multiple changes to properties including the database trustworthy property and the Visual Studio project Permission property. If your shortest-path analysis uses an unweighted graph—that is, one where all edge weights are assumed to be 1—you can simplify by eliminating the second reader and substituting

``````alt = distance[u.nodeID] + 1;
``````

for

``````double d = Convert.ToDouble(distCmd.ExecuteScalar());
alt = distance[u.nodeID] + d;
``````

Variable alt is a test distance from the start node to the current neighbor node v. If you examine the assignment statement carefully, you’ll see that alt is the shortest known distance from the start node to node u plus the actual distance from node u to node v. This represents a potential new shorter distance from the start node to node v.

The inner read-all-neighbors loop and the main algorithm loop continue with:

``````if (alt < distance[v])
{
distance[v] = alt;
previous[v] = u.nodeID;
PQ.ChangePriority(v, alt);
}
sdr.Close();
} // Main loop
conn.Close();
``````

If the test distance from the start node to v is smaller than the shortest known distance from the start node to v (stored in the distance Dictionary), then a new shorter distance from start to v has been found and local data structures distance, previous and the priority queue are updated accordingly.

The stored-procedure logic now determines if the main algorithm loop terminated because a shortest path was in fact found, or terminated because no path between start node and end was found:

``````pathResult = "NOTREACHABLE";
distResult = -1.0;
if (distance.ContainsKey(endNodeAsLong) == true) {
double sp = distance[endNodeAsLong];
if (sp != double.MaxValue) {
pathResult = "";
long currNode = endNodeAsLong;
while (currNode != startNodeAsLong) {
pathResult += currNode.ToString() + ";";
currNode = previous[currNode];
}
pathResult += startNode.ToString();
distResult = sp;
``````

Recall that there are really two results to this shortest-path implementation: the shortest path expressed as a semicolon-­delimited string in reverse order, and the shortest path measured by the sum of the edge weights between nodes in the shortest path. The stored procedure uses defaults of “NOTREACHABLE” and -1.0 for situations where the end node isn’t reachable from the start node. The while loop extracts the shortest-path nodes from the previous Dictionary and stitches them together from end node to start node using string concatenation. If you’re ambitious you can use a stack and construct the result string from start node to end node. Recall that the two results are returned via out parameters pathResult and distResult.

The stored procedure definition concludes by checking for errors:

``````} // Try
catch(Exception ex)
{
pathResult = ex.Message;
distResult = -1.0;
}
finally
{
if (conn != null && conn.State == ConnectionState.Open)
conn.Close();
}
}  // csp_ShortestPath()
``````

If an Exception has occurred—typically if the SQL Server runs out of memory due to an extremely large graph or due to a SQL connection time-out—the code attempts to clean the SQL connection up and return results that have some meaning.

## Building, Deploying and Calling the Stored Procedure

After making sure you’ve set the database trustworthy property to “on” and the Database Permission Level to External, select Build CreateStoredProc from the Visual Studio Build menu. If the build succeeds, select Deploy Create­StoredProc from the Build menu to copy the CLR stored procedure from your development machine to the SQL Server that hosts the SQL-based graph.

There are several ways to call the CLR stored procedure. The Visual Studio project contains a test template you can use. Or you can call the stored procedure directly from SSMS as shown in Figure 2. For example:

``````declare @startNode bigint
declare @endNode bigint
declare @maxNodesToCheck int
declare @pathResult varchar(4000)
declare @distResult float
set @startNode = 222
set @endNode = 444
set @maxNodesToCheck = 100000
exec csp_ShortestPath @startNode, @endNode, @maxNodesToCheck,
@pathResult output, @distResult output
select @pathResult
select @distResult
``````

Another option is to call the CLR stored procedure from within a C# application using ADO.NET, along the lines of Figure 8.

Or, if you’re a real glutton for punishment, you can call the CLR stored procedure using LINQ technology.

Figure 8 Calling a Stored Procedure from Within C# Using ADO.NET

``````SqlConnection sc = null;
string connString = "Server=" + dbServer + ";Database=" +
database + ";Trusted_Connection=True";
sc = new SqlConnection(connString);
SqlCommand cmd = new SqlCommand("csp_ShortestPath", sc);
cmd.CommandType = System.Data.CommandType.StoredProcedure;
// sp signature: (System.Data.SqlTypes.SqlInt64 startNode, SqlInt64 endNode,   SqlInt32 maxNodesToCheck, out SqlString path)
cmd.CommandTimeout = commandTimeout;  // Seconds
System.Data.SqlDbType.BigInt);
sqlStartNode.Direction = ParameterDirection.Input;
sqlStartNode.Value = sn;
// ...
cmd.ExecuteNonQuery();
string result = (string)cmd.Parameters["@pathResult"].Value;
distResult = (double)cmd.Parameters["@distResult"].Value;
``````

## More Than Just Shortest Path

The code and explanation presented here should allow you to tackle shortest-path graph analysis using a CLR stored procedure. Additionally, you might find the design paradigm useful in general. Instead of fetching data from SQL to a client application and then doing filtering and complex processing on the client, use a CLR stored procedure to fetch, filter and process on the server—and then transfer results to the client. I’ve found this approach to be extremely useful in several situations with large databases and requirements for real-time performance.

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products including Internet Explorer and Bing. He holds degrees from the University of California at Irvine and the University of Southern California. His personal blog is at jamesmccaffrey.wordpress.com. He can be reached at jammc@microsoft.com.

Thanks to the following technical expert for reviewing this article: Darren Gehring (Microsoft)
Darren Gehring is a test manager at Microsoft Research in Redmond, Wash.  Prior to working in Microsoft Research, he worked in the Microsoft SQL Server product group for 10 years. His Web page is at research.microsoft.com/people/darrenge/.