Transitive closure clustering with CLR and JSON
Transitive closure is a graph algorithm that tries to follow paths in graph edges and tries to find all elements that can be reached from some element, or groups of elements that are mutually reachable. Although SQL Server still don’t provides native function for transitive closure, this algorithm can be implemented using CLR aggregates that can be placed in SQL database.
Imagine that you have some relation in a database such as customer who bought product 17 also bought product 25, customer who bought product 25 also bought product 48. You can easily find these kinds of relations in foreign keys of the tables - for example, join OrderLines with same OrderID and take ProductIDs from the order lines like in the following query (from WideWorldImporters sample database):
SELECT ol1.StockItemID AS Product1, ol2.StockItemID AS Product2 FROM Sales.OrderLines AS ol1 INNER JOIN Sales.OrderLines AS ol2 ON ol1.OrderLineID = ol2.OrderLineID
The harder job is to and create groups of the products that are frequently bought together like [17,25,48] in the example above. In this case, you would need to follow the “edges” and reach all products in the chain.
Davide Mauri (SQL MVP) created an interesting aggregate that implements this clustering algorithm. The aggregate takes the pairs of number that represent related product ids, people ids, etc.:
The values in this table might be related products, friendship relationships or any other pairs of related ids.
In this example you can see that there are two groups (cluster) – one with root 1 and another with root 7. Davide built an CLR aggregate that takes the nodes from the rows that represent relations and returns the groups containing clustered nodes – in this case:
You can download the aggregate from this GitHub repository: yorek/non-scalar-uda-transitive-closure, build solution, and create assembly in SQL database:
CREATE ASSEMBLY TransitiveClosure FROM 'D:\...\TransitiveClosure\bin\Release\TransitiveClosureAggregatorLibrary.dll'; GO CREATE SCHEMA TC; GO CREATE AGGREGATE TC.CLUSTERING(@id1 INT, @id2 INT) RETURNS NVARCHAR(MAX) EXTERNAL NAME TransitiveClosure.[TransitiveClosure.Aggregate];
Clustering in action
The simple code that performs clustering is shown in the following example:
declare @edges table(n1 int, n2 int); insert into @edges values (1,2),(2,3),(3,4),(4,5),(2,21),(2,22), (7,8),(8,9),(9,10); select TC.CLUSTERING(n1,n2) from @edges;
TC.CLUSTERING aggregate performs clustering using transitive closure and returns groups formatted as JSON collection. To unpack these groups, we can use OPENJSON function:
declare @edges table(n1 int, n2 int); insert into @edges values (1,2),(2,3),(3,4),(4,5),(2,21),(2,22), (7,8),(8,9),(9,10); select TC.CLUSTERING(n1,n2) from @edges; select cluster = [key], elements = value from openjson( (select TC.CLUSTERING(n1,n2) from @edges) );
This code returns the following result:
As you might see, with this CLR aggregate you can easily perform grouping using transitive closure algorithm.