SQL Azure Horizontal Partitioning: Part 2

Article
06/24/2010

[This article was contributed by the SQL Azure team.]

SQL Azure currently supports 1 GB and 10 GB databases, and on June 28^th, 2010 there will be 50 GB support. If you want to store larger amounts of data in SQL Azure you can divide your tables across multiple SQL Azure databases. This article will discuss how to use a data access layer to join two tables on different SQL Azure databases using LINQ. This technique horizontal partitions your data in SQL Azure.

In our version of horizontal partitioning, every table exists in all the databases in the partition set. We are using a hash base partitioning schema in this example – hashing on the primary key of the row. The middle layer determines which database to write each row based on the primary key of the data being written. This allows us to evenly divide the data across all the databases, regardless of individual table growth. The data access knows how to find the data based on the primary key, and combines the results to return one result set to the caller.

This is considered hash based partitioning. There is another style of horizontal portioning that is range based. If you are using integers as primary keys you can implement your middle layer to fill databases in consecutive order, as the primary key grows. You can read more about the different types of partitioning here.

Performance Gain

There is also a performance gain to be obtained from partitioning your database. Since SQL Azure spreads your databases across different physical machines, you can get more CPU and RAM resources by partitioning your workload. For example, if you partition your database across 10 - 1 GB SQL Azure databases you get 10X the CPU and memory resources. There is a case study (found here) by TicketDirect, who partitioning their workload across hundreds of SQL Azure databases during peak load.

Considerations

When horizontal partitioning your database you lose some of the features of having all the data in a single database. Some considerations when using this technique include:

Foreign keys across databases are not supported. In other words, a primary key in a lookup table in one database cannot be referenced by a foreign key in a table on another database. This is a similar restriction to SQL Server’s cross database support for foreign keys.
You cannot have transactions that span two databases, even if you are using Microsoft Distributed Transaction Manager on the client side. This means that you cannot rollback an insert on one database, if an insert on another database fails. This restriction can be mitigated through client side coding – you need to catch exceptions and execute “undo” scripts against the successfully completed statements.
All the primary keys need to be uniqueidentifier. This allows us to guarantee the uniqueness of the primary key in the middle layer.
The example code shown below doesn’t allow you to dynamically change the number of databases that are in the partition set. The number of databases is hard coded in the SqlAzureHelper class in the ConnectionStringNames property.
Importing data from SQL Server to a horizontally partitioned database requires that you move each row one at a time emulating the hashing of the primary keys like the code below.

The Code

The code will show you how to make multiple simultaneous requests to SQL Azure and combine the results to take advantage of those resources. Before you read this post you should familiarize yourself with our previous article about using Uniqueidentifier and Clustered Indexes and Connections and SQL Azure. In order to accomplish horizontal partitioning, we are using the same SQLAzureHelper class as we used in the vertical partitioning blog post.

The code has these goals:

Use forward only cursors to maximize performance.
Combine multiple responses into a complete response using Linq.
Only access one database for primary key requests.
Evenly divide row data across all the databases.

Accounts Table

The example table I am using is named Accounts and has a primary key of a uniqueidentifier, with a clustered index built on the Date column. I created it with this script on two databases:

 CREATE TABLE [dbo].[Accounts](
    [Id] [uniqueidentifier] NOT NULL,
    [Name] [nvarchar](max) NULL,
    [Date] [datetime] NULL,
 CONSTRAINT [PK_Accounts] PRIMARY KEY NONCLUSTERED 
(
    [Id] ASC
)
)


ALTER TABLE [dbo].[Accounts] ADD  CONSTRAINT [DF__Accounts__Date__7C8480AE]  DEFAULT (getdate()) FOR [Date]


CREATE CLUSTERED INDEX [idxDate] ON [dbo].[Accounts] 
(
    [Date] ASC
)

Partitioning by Primary Key

Because the table we are partitioning is using uniqueidentifiers as its primary key, and these are generated by Guid.NewGuid(), we have a fairly random primary key. The primary key is hashed to figure out which database contains the row for that key. The code looks like this:

 /// <summary>
/// Names of the Databases In Horizontal Partition
/// </summary>
public static String[] ConnectionStringNames = 
{ "Database001", "Database002" };

/// <summary>
/// Connections Strings In the Horizontal Partition
/// </summary>
/// <returns></returns>
public static IEnumerable<String> ConnectionStrings()
{
    foreach (String connectionStringName in ConnectionStringNames)
        yield return ConfigurationManager.
ConnectionStrings[connectionStringName].ConnectionString;
}

/// <summary>
/// Return the Index to the Database For the Primary Key
/// </summary>
/// <param name="primaryKey"></param>
/// <returns></returns>
private static int DatabaseIndex(Guid primaryKey)
{
    return (BitConverter.ToInt32(primaryKey.ToByteArray(), 0));
}

/// <summary>
/// Returns the Connectiong String Name for the Primary Key
/// </summary>
/// <param name="primaryKey"></param>
/// <returns></returns>
private static String ConnectionStringName(Guid primaryKey)
{
    return (ConnectionStringNames[DatabaseIndex(primaryKey) % 
ConnectionStringNames.Length]);
}

/// <summary>
/// Returns the Connection String For the Primary Key
/// </summary>
/// <param name="primaryKey"></param>
/// <returns></returns>
public static String ConnectionString(Guid primaryKey)
{
    return (ConfigurationManager.ConnectionStrings[ConnectionStringName(primaryKey)]
.ConnectionString);
}

Notice that the ConnectionString() method returns the connection string to use when referencing the primary key. Configure the connection strings in the .config file for your application or web site. The array of databases referenced by the .config file is contained in the ConnectionStrings property.

Fetching A Single Row

I am building on top of the SQLAzureHelper class defined in the blog post. The idea behind the class is to have a multipurpose access layer to connect to SQL Azure. Within the SQLAzureHelper class, the code to find an account name in the example databases using the primary key looks like this:

 static String AccountName(Guid id)
{
    var accountDataReader = SQLAzureHelper.ExecuteReader(
        SQLAzureHelper.ConnectionString(id),
        sqlConnection =>
        {
            String sql = @"SELECT [Name] FROM [Accounts] WHERE Id = @Id";
            SqlCommand sqlCommand = new SqlCommand(sql, sqlConnection);
            sqlCommand.Parameters.AddWithValue("@Id", id);
            return (sqlCommand.ExecuteReader());
        });

    return ((from row in accountDataReader select (string)row["Name"]).
        FirstOrDefault());
}

Notice that we use the primary key to calculate the connection string and the as well as a parameter to the SqlCommand.

Inserting a Single Row

When inserting a single row, we need to know the primary key before connecting to the database. In order to accomplish this we call Guid.NewGuid() in the C# code, instead of NewID() on the SQL Azure. The code using the ExecutionContext class looks like this:

 static Guid InsertAccount(String name)
{
    Guid id = Guid.NewGuid();

    SQLAzureHelper.ExecuteNonQuery(
         SQLAzureHelper.ConnectionString(id),
         sqlConnection =>
         {
             String sql = @"INSERT INTO [Accounts] ([Id], [Name]) VALUES (@Id, @Name)";
             SqlCommand sqlCommand = new SqlCommand(sql, sqlConnection);
             sqlCommand.Parameters.AddWithValue("@Name", name);
             sqlCommand.Parameters.AddWithValue("@Id", id);
             sqlCommand.ExecuteNonQuery();
         });

    return (id);
}

Summary

In part three, I will show how to fetch a result set that is merged from multiple responses and how to insert multiple rows into the partitioned tables, including some interesting multi-threaded aspects of calling many SQL Azure databases at the same time. Do you have questions, concerns, comments? Post them below and we will try to address them.