October 2012

Volume 27 Number 10

The Working Programmer - Cassandra NoSQL Database, Part 2: Programming

By Ted Neward | October 2012

In my August 2012 column, “Cassandra NoSQL Database: Getting Started,” I examined Apache Cassandra. It’s described as the “open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon Dynamo and its data model on Google Bigtable” in the book, “Cassandra: The Definitive Guide” (O’Reilly Media, 2010). To be more precise, I looked at how to install Cassandra (which, because it’s a Java-based database, also required getting a Java Virtual Machine up and running on your machine if you didn’t have one already), how to connect to it from the command line and what its data model looked like. The data model bears repeating because it’s quite noticeably different in structure from the relational database with which most developers are familiar.

As I discussed last time (msdn.microsoft.com/magazine/JJ553519), Cassandra is a “column-oriented” data store, which means that instead of storing identically structured tuples of data arranged according to a fixed structure (the table schema), Cassandra stores “column families” in “keyspaces.” In more descriptive terms, Cassandra associates a key value with a varying number of name/value pairs (columns) that might be entirely different from one “row” to another.

For example, consider the keyspace “Earth” I created last time, with a column family named “People,” into which I’ll write rows that (may or may not) look like this:

RowKey: tedneward
  ColumnName:"FirstName", ColumnValue:"Ted"
  ColumnName:"LastName", ColumnValue:"Neward"
  ColumnName:"Age", ColumnValue:41
  ColumnName:"Title", ColumnValue:"Architect"
RowKey: rickgaribay
  ColumnName:"FirstName", ColumnValue:"Rick"
  ColumnName:"LastName", ColumnValue:"Garibay"
RowKey: theartistformerlyknownasprince
  ColumnName:"Identifier", ColumnValue: <image>
  ColumnName:"Title", ColumnValue:"Rock Star"

As you can see, each “row” contains conceptually similar data, but not all “rows” will have the same data, depending on what the developer or business needed to store for any particular row key. I don’t know Rick’s age, so I couldn’t store it. In a relational database, if the schema mandated that age was a non-NULLABLE column, I couldn’t have stored Rick at all. Cassandra says, “Why not?”
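To make the shape of that data concrete, here is a minimal in-memory sketch in C#. It is not how Cassandra or any client library stores anything; the ColumnFamilySketch class and all of its members are hypothetical names invented for illustration. The only point is that each row key owns its own set of name/value pairs, and nothing forces two rows to share the same column names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only: an in-memory stand-in for a column family.
// Each row key maps to its own independent set of name/value pairs.
public class ColumnFamilySketch
{
    private readonly Dictionary<string, SortedDictionary<string, object>> rows =
        new Dictionary<string, SortedDictionary<string, object>>();

    // Adds or overwrites one column on the given row, creating the row if needed.
    public void Put(string rowKey, string columnName, object value)
    {
        SortedDictionary<string, object> columns;
        if (!rows.TryGetValue(rowKey, out columns))
        {
            // Sorted by name here only so that enumeration order is deterministic.
            columns = new SortedDictionary<string, object>(StringComparer.Ordinal);
            rows[rowKey] = columns;
        }
        columns[columnName] = value;
    }

    // Returns true and the value if the row has that column; false otherwise.
    public bool TryGet(string rowKey, string columnName, out object value)
    {
        value = null;
        SortedDictionary<string, object> columns;
        return rows.TryGetValue(rowKey, out columns)
            && columns.TryGetValue(columnName, out value);
    }

    // Lists the column names present on one row (empty if the row doesn't exist).
    public IEnumerable<string> ColumnNames(string rowKey)
    {
        SortedDictionary<string, object> columns;
        return rows.TryGetValue(rowKey, out columns)
            ? (IEnumerable<string>)columns.Keys
            : new string[0];
    }

    // Builds the three sample rows from the listing above
    // (the image-valued "Identifier" column is left out of this sketch).
    public static ColumnFamilySketch SamplePeople()
    {
        var people = new ColumnFamilySketch();
        people.Put("tedneward", "FirstName", "Ted");
        people.Put("tedneward", "LastName", "Neward");
        people.Put("tedneward", "Age", 41);
        people.Put("tedneward", "Title", "Architect");
        people.Put("rickgaribay", "FirstName", "Rick");
        people.Put("rickgaribay", "LastName", "Garibay");
        people.Put("theartistformerlyknownasprince", "Title", "Rock Star");
        return people;
    }
}
```

With the sample data, asking for Ted’s “Age” succeeds, while asking for Rick’s “Age” simply reports that the column isn’t there. No placeholder value is stored for it.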

My previous column demonstrated inserting and removing data from the command line, but this isn’t particularly helpful if the goal is to write applications that will access and store data. So, without further background, let’s dive into what it takes to write applications that read from and store to Cassandra.

Cassandra, O Cassandra, Wherefore Art Thou Cassandra?

To start, I need to connect to Cassandra from the Microsoft .NET Framework. Doing so involves one of two techniques: I can use the native Apache Thrift API, or I can use a third-party wrapper on top of the native Thrift API. Thrift is a binary remote procedure call toolkit, similar in many ways to DCOM (bet you haven’t thought of that in a few years) or CORBA or .NET Remoting. It’s a particularly low-level approach to communicating with Cassandra, and while Thrift has C# support, it’s not trivial to get all that up and running. Alternatives to Thrift include FluentCassandra, cassandra-sharp, Cassandraemon and Aquiles (the Spanish translation of Achilles, which keeps the ancient Greek theme alive and well). All of these are open source and offer some nicer abstractions over the Cassandra API. For this column, I’m going to use FluentCassandra, but any of them seem to work pretty well, the odd Internet flame war notwithstanding.

FluentCassandra is available as a NuGet package, so the easiest way to get started is to fire up the NuGet Package Manager in a Visual Studio Test project (so I can write exploration tests) and do an “Install-Package FluentCassandra.” (The most recent version as of this writing is 1.1.0.) Once that’s done, and I’ve double-checked that the Cassandra server is still running after I toyed with it for the August column, I can write the first exploration test: connecting to the server.

FluentCassandra lives in the namespace “FluentCassandra” and two nested namespaces (“Connections” and “Types”), so I’ll bring those in, and then write a test to see about connecting to the database:

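// At the top of the test file:
using FluentCassandra;
using FluentCassandra.Connections;
using FluentCassandra.Types;
using Microsoft.VisualStudio.TestTools.UnitTesting;

// Inside the test class:
private static readonly Server Server = new Server("localhost");
[TestMethod]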
public void CanIConnectToCassandra()
{
  using (var db = new 
    CassandraContext(keyspace: "system", server:Server))
  {
    var version = db.DescribeVersion();
    Assert.IsNotNull(version);
    testContextInstance.WriteLine("Version = {0}", version);
    Assert.AreEqual("19.30.0", version);
  }
}

Note that by the time you read this, it’s possible that the version number will be different from when I wrote it, so if that second assertion fails, check the output window to see the returned string. (Remember, exploration tests are about testing your understanding of the API, so writing output isn’t as much of a bad idea as it is in an automated unit test.)
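If the exact-match assertion turns out to be too brittle, one option is to parse the string and assert only on the part you care about. The sketch below assumes the returned value is a dotted numeric version such as “19.30.0” (the value shown above) and uses System.Version from the .NET base class library. The VersionCheck helper is a hypothetical name of my own, not part of FluentCassandra, and which component is worth pinning is a judgment call.

```csharp
using System;

// Hypothetical helper: a looser alternative to comparing the whole version string.
public static class VersionCheck
{
    // Returns true when 'reported' parses as a dotted version (at least
    // "major.minor") whose major component equals 'expectedMajor'.
    // Null or unparseable input yields false rather than an exception.
    public static bool HasMajor(string reported, int expectedMajor)
    {
        Version parsed;
        return Version.TryParse(reported, out parsed)
            && parsed.Major == expectedMajor;
    }
}
```

In the exploration test, Assert.IsTrue(VersionCheck.HasMajor(version, 19)) could then stand in for the exact string comparison.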

The CassandraContext class has five different overloads for connecting to a running Cassandra server, all of them pretty easy to infer—they all deal with connection information of one form or another. In this particular case, because I haven’t created the keyspace in which I want to store (and later read) the data, I’m connecting to the “system” keyspace, which is used by Cassandra to store various systemic details in much the same way that most relational databases have one instance reserved for database metadata and security and such. But this means I don’t want to write to that system keyspace; I want to create my own, which forms the next exploration test, as shown in Figure 1.

Figure 1 Creating a Keyspace

[TestMethod]
public void DoesMyKeyspaceExistAndCreateItIfItDoesnt()
{
  using (var db = new CassandraContext(keyspace: "system", server:Server))
  {
    bool foundEarth = false;
    foreach (CassandraKeyspace keyspace in db.DescribeKeyspaces())
    {
      Apache.Cassandra.KsDef def = keyspace.GetDescription();
      if (def.Name == "Earth")
        foundEarth = true;
    }
    if (!foundEarth)
    {
      var keyspace = new CassandraKeyspace(new CassandraKeyspaceSchema
      {
        Name = "Earth"
      }, db);
      keyspace.TryCreateSelf();
    }
    Assert.IsTrue(db.KeyspaceExists("Earth"));
  }
}

Admittedly, the loop through all the keyspaces in the database is unnecessary—I do it here to demonstrate that there are places in the FluentCassandra API where the underlying Thrift-based API peeks through, and the “Apache.Cassandra.KsDef” type is one of those.

Now that I have a keyspace, I need at least one column family within that keyspace. The easiest way to create this uses Cassandra Query Language (CQL), a vaguely SQL-like language, as shown in Figure 2.

Figure 2 Creating a Column Family Using Cassandra Query Language

[TestMethod]
public void CreateAColumnFamily()
{
  using (var db = new CassandraContext(keyspace: "Earth", server: Server))
  {
    CassandraColumnFamily cf = db.GetColumnFamily("People");
    if (cf == null)
    {
      db.ExecuteNonQuery(@"CREATE COLUMNFAMILY People (
        KEY ascii PRIMARY KEY,
        FirstName text,
        LastName text,
        Age int,
        Title text
);");
    }
    cf = db.GetColumnFamily("People");
    Assert.IsNotNull(cf);
  }
}

The danger of CQL is that its deliberately SQL-like grammar combines with the easy misperception that “Cassandra has columns, therefore it must have tables like a relational database” to trick the unwary developer into thinking in relational terms. This leads to conceptual assumptions that are wildly wrong. Consider, for example, the columns in Figure 2. In a relational database, only those five columns would be allowed in this column family. In Cassandra, those are just “guidelines” (in a quaintly “Pirates of the Caribbean” sort of way). But the alternative (to not use CQL at all) is less attractive by far: FluentCassandra offers the API TryCreateColumnFamily (not shown), but no matter how many times I try to wrap my head around it, it still feels clunkier and more confusing than the CQL approach.

‘Data, Data, Data! I Cannot Make Bricks Without Clay!’

Once the column family is in place, the real power of the FluentCassandra API emerges as I store some objects into the database, as shown in Figure 3.

Figure 3 Storing Objects in the Database

[TestMethod]
public void StoreSomeData()
{
  using (var db = new CassandraContext(keyspace: "Earth", server: Server))
  {
    var peopleCF = db.GetColumnFamily("People");
    Assert.IsNotNull(peopleCF);
    Assert.IsNull(db.LastError);
    dynamic tedneward = peopleCF.CreateRecord("TedNeward");
    tedneward.FirstName = "Ted";
    tedneward.LastName = "Neward";
    tedneward.Age = 41;
    tedneward.Title = "Architect";
    db.Attach(tedneward);
    db.SaveChanges();
    Assert.IsNull(db.LastError);
  }
}

Notice the use of the “dynamic” facilities of C# 4.0 to reinforce the idea that the column family is not a strictly typed collection of name/value pairs. This allows the C# code to reflect the nature of the column-oriented data store. I can see this when I store a few more people into the keyspace, as shown in Figure 4.

Figure 4 Storing More People in the Keyspace

[TestMethod]
public void StoreSomeMoreData()
{
  using (var db = new CassandraContext(keyspace: "Earth", server: Server))
  {
    var peopleCF = db.GetColumnFamily("People");
    Assert.IsNotNull(peopleCF);
    Assert.IsNull(db.LastError);
    dynamic tedneward = peopleCF.CreateRecord("TedNeward");
    tedneward.FirstName = "Ted";
    tedneward.LastName = "Neward";
    tedneward.Age = 41;
    tedneward.Title = "Architect";
    dynamic rickgaribay = peopleCF.CreateRecord("RickGaribay");
    rickgaribay.FirstName = "Rick";
    rickgaribay.LastName = "Garibay";
    rickgaribay.HomeTown = "Phoenix";
    dynamic theArtistFormerlyKnownAsPrince =
      peopleCF.CreateRecord("TAFKAP");
    theArtistFormerlyKnownAsPrince.Title = "Rock Star";
    db.Attach(tedneward);
    db.Attach(rickgaribay);
    db.Attach(theArtistFormerlyKnownAsPrince);
    db.SaveChanges();
    Assert.IsNull(db.LastError);
  }
}

Again, just to drive the point home, notice how Rick has a HomeTown column, which wasn’t specified in the earlier description of this column family. This is completely acceptable, and quite common.
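For readers wondering how a “dynamic” record can accept a property such as HomeTown that no class ever declared, the following is a minimal sketch built on System.Dynamic.DynamicObject from the .NET Framework. It is not FluentCassandra’s implementation, which I haven’t reproduced here; DynamicRecordSketch is a hypothetical name, and the choice to return null for an unknown column is my own assumption for the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;

// Hypothetical illustration: a record whose "properties" are really
// entries in a dictionary of name/value pairs.
public class DynamicRecordSketch : DynamicObject
{
    private readonly Dictionary<string, object> columns =
        new Dictionary<string, object>();

    public string RowKey { get; private set; }

    public DynamicRecordSketch(string rowKey)
    {
        RowKey = rowKey;
    }

    // Invoked for "record.SomeName = value": store the pair under that name.
    public override bool TrySetMember(SetMemberBinder binder, object value)
    {
        columns[binder.Name] = value;
        return true;
    }

    // Invoked for "record.SomeName": look the name up. In this sketch an
    // unknown column reads as null instead of throwing.
    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        columns.TryGetValue(binder.Name, out result);
        return true;
    }

    // Reports the column names that have been set so far.
    public override IEnumerable<string> GetDynamicMemberNames()
    {
        return columns.Keys;
    }
}
```

Declared as “dynamic rick = new DynamicRecordSketch("RickGaribay");”, the object accepts “rick.HomeTown = "Phoenix";” without any HomeTown property existing at compile time, because the C# runtime binder routes the assignment to TrySetMember.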

Also notice that the FluentCassandra API offers the “LastError” property, which contains a reference to the last exception thrown out of the database. This can be useful to check when the state of the database isn’t known already (such as when returning out of a set of calls that might have eaten the exception thrown, or if the database is configured to not throw exceptions).

Once Again, with Feeling

Connecting to the database, creating the keyspace (and later dropping it), defining the column families and putting in some seed data: I’m probably going to want to do these things a lot within these tests. That sequence of code is a great candidate to put into pre-test setup and post-test teardown methods. By dropping the keyspace after and recreating it before each test, I keep the database pristine and in a known state each time I run a test, as shown in Figure 5. Sweet.

Figure 5 Running a Test

[TestInitialize]
public void Setup()
{
  using (var db = new CassandraContext(keyspace: "Earth", server: Server))
  {
    var keyspace = new CassandraKeyspace(new CassandraKeyspaceSchema {
      Name = "Earth",
      }, db);
    keyspace.TryCreateSelf();
    db.ExecuteNonQuery(@"CREATE COLUMNFAMILY People (
      KEY ascii PRIMARY KEY,
      FirstName text,
      LastName text,
      Age int,
      Title text);");
    var peopleCF = db.GetColumnFamily("People");
    dynamic tedneward = peopleCF.CreateRecord("TedNeward");
    tedneward.FirstName = "Ted";
    tedneward.LastName = "Neward";
    tedneward.Age = 41;
    tedneward.Title = "Architect";
    dynamic rickgaribay = peopleCF.CreateRecord("RickGaribay");
    rickgaribay.FirstName = "Rick";
    rickgaribay.LastName = "Garibay";
    rickgaribay.HomeTown = "Phoenix";
    dynamic theArtistFormerlyKnownAsPrince =
      peopleCF.CreateRecord("TAFKAP");
    theArtistFormerlyKnownAsPrince.Title = "Rock Star";
    db.Attach(tedneward);
    db.Attach(rickgaribay);
    db.Attach(theArtistFormerlyKnownAsPrince);
    db.SaveChanges();
  }
}
[TestCleanup]
public void TearDown()
{
  // Dispose the context when done, as the other tests do
  using (var db = new CassandraContext(keyspace: "Earth", server: Server))
  {
    if (db.KeyspaceExists("Earth"))
      db.DropKeyspace("Earth");
  }
}

‘Look Upon My Works, All Ye Mighty, and Despair!’

Reading data from Cassandra takes a couple of forms. The first is to fetch the data out of the column family using the Get method on the CassandraColumnFamily object, shown in Figure 6.

Figure 6 Fetching Data with the Get Method

[TestMethod]
public void StoreAndFetchSomeData()
{
  using (var db = new CassandraContext(keyspace: "Earth", server: Server))
  {
    var peopleCF = db.GetColumnFamily("People");
    Assert.IsNotNull(peopleCF);
    Assert.IsNull(db.LastError);
    dynamic jessicakerr = peopleCF.CreateRecord("JessicaKerr");
    jessicakerr.FirstName = "Jessica";
    jessicakerr.LastName = "Kerr";
    jessicakerr.Gender = "F";
    db.Attach(jessicakerr);
    db.SaveChanges();
    Assert.IsNull(db.LastError);
    dynamic result = peopleCF.Get("JessicaKerr").FirstOrDefault();
    Assert.AreEqual(jessicakerr.FirstName, result.FirstName);
    Assert.AreEqual(jessicakerr.LastName, result.LastName);
    Assert.AreEqual(jessicakerr.Gender, result.Gender);
  }
}

This is great if I know the key ahead of time, but much of the time, that’s not the case. In fact, it’s arguable that most of the time, the exact record or records won’t be known. So, another approach (not shown) is to use the FluentCassandra LINQ integration to write a LINQ-style query. This isn’t quite as flexible as traditional LINQ, however. Because the column names aren’t known ahead of time, it’s a lot harder to write LINQ queries to find all the Newards (looking at the LastName name/value pair in the column family) in the database, for example.

Fortunately, CQL rides to the rescue, as shown in Figure 7.

Figure 7 Using Cassandra Query Language to Query by Column Value

[TestMethod]
public void StoreAndFetchSomeDataADifferentWay()
{
  using (var db = new CassandraContext(keyspace: "Earth", server: Server))
  {
    var peopleCF = db.GetColumnFamily("People");
    Assert.IsNotNull(peopleCF);
    Assert.IsNull(db.LastError);
    dynamic charlotte = peopleCF.CreateRecord("CharlotteNeward");
    charlotte.FirstName = "Charlotte";
    charlotte.LastName = "Neward";
    charlotte.Gender = "F";
    charlotte.Title = "Domestic Engineer";
    charlotte.RealTitle = "Superwife";
    db.Attach(charlotte);
    db.SaveChanges();
    Assert.IsNull(db.LastError);
    var newards =
      db.ExecuteQuery("SELECT * FROM People WHERE LastName='Neward'");
    Assert.IsTrue(newards.Count() > 0);
    foreach (dynamic neward in newards)
    {
      Assert.AreEqual(neward.LastName, "Neward");
    }
  }
}

Note, however, that if I run this code as is, it will fail: Cassandra won’t let me use a name/value pair within a column family as a filter criterion unless an index is defined explicitly on it. Doing so requires another CQL statement:

db.ExecuteNonQuery(@"CREATE INDEX ON People (LastName)");

Usually, I want to set that up at the time the column family is created. Note as well that because Cassandra is schema-less, the “SELECT *” part of that query is a bit deceptive—it will return all the name/value pairs in the column family, but that doesn’t mean that every record will have every column. This means, then, that a query with “WHERE Gender=‘F’” will never consider the records that don’t have a “Gender” column in them, which leaves Rick, Ted and “The Artist Formerly Known as Prince” out of consideration. This is completely different from a relational database management system, where every row in a table must have values for each and every one of the columns (though I often duck that responsibility by storing “NULL” in those columns, which is considered by some to be a cardinal sin).
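The “missing column means the record is never considered” behavior can be sketched without a database at all. The code below is an in-memory illustration in plain C# and LINQ to Objects, not Cassandra’s query engine and not FluentCassandra’s LINQ integration; SparseFilterSketch, KeysWhere and SamplePeople are hypothetical names, and the sample rows mirror the people stored in the earlier tests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of filtering sparse rows by one column's value.
public static class SparseFilterSketch
{
    // Returns the keys (in ordinal order) of the rows that HAVE the named
    // column and whose value for it equals 'wanted'. Rows without the
    // column never match; no NULL stands in for the missing value.
    public static List<string> KeysWhere(
        IDictionary<string, Dictionary<string, object>> rows,
        string columnName,
        object wanted)
    {
        return rows
            .Where(row =>
            {
                object actual;
                return row.Value.TryGetValue(columnName, out actual)
                    && object.Equals(actual, wanted);
            })
            .Select(row => row.Key)
            .OrderBy(key => key, StringComparer.Ordinal)
            .ToList();
    }

    // Sample rows shaped like the records stored in the tests above.
    public static Dictionary<string, Dictionary<string, object>> SamplePeople()
    {
        return new Dictionary<string, Dictionary<string, object>>
        {
            { "TedNeward", new Dictionary<string, object> {
                { "FirstName", "Ted" }, { "LastName", "Neward" },
                { "Age", 41 }, { "Title", "Architect" } } },
            { "RickGaribay", new Dictionary<string, object> {
                { "FirstName", "Rick" }, { "LastName", "Garibay" },
                { "HomeTown", "Phoenix" } } },
            { "TAFKAP", new Dictionary<string, object> {
                { "Title", "Rock Star" } } },
            { "JessicaKerr", new Dictionary<string, object> {
                { "FirstName", "Jessica" }, { "LastName", "Kerr" },
                { "Gender", "F" } } },
            { "CharlotteNeward", new Dictionary<string, object> {
                { "FirstName", "Charlotte" }, { "LastName", "Neward" },
                { "Gender", "F" }, { "Title", "Domestic Engineer" } } }
        };
    }
}
```

Filtering the sample on Gender equal to “F” yields only CharlotteNeward and JessicaKerr. Ted, Rick and TAFKAP aren’t rejected for having the wrong gender; they’re skipped because they have no “Gender” column to compare.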

The full CQL language is too much to describe here, but a full reference is available on the Cassandra Web site at bit.ly/MHcWr6.

Wrapping up, for Now

I’m not quite done with the cursed prophetess just yet—while getting data in and out of Cassandra is the most interesting part to a developer (as that’s what they do all day), multi-node configuration is also a pretty big part of the Cassandra story. Doing that on a single Windows box (for development purposes; you’ll see how it would be easier to do across multiple servers) is not exactly trivial, which is why I’ll wrap up the discussion on Cassandra by doing that next time.

For now, happy coding!


Ted Neward is an architectural consultant with Neudesic LLC. He has written more than 100 articles and authored or coauthored a dozen books, including “Professional F# 2.0” (Wrox, 2010). He is an F# MVP and noted Java expert, and speaks at both Java and .NET conferences around the world. He consults and mentors regularly—reach him at ted@tedneward.com or Ted.Neward@neudesic.com if you’re interested in having him come work with your team. He blogs at blogs.tedneward.com and can be followed on Twitter at Twitter.com/tedneward.

Thanks to the following technical expert for reviewing this article: Kelly Sommers