Data Modeling Guidelines

Article
10/05/2016

Think of your access patterns before your design your schema. Unlike traditional SQL technologies, C* works best when the tables are designed for the specific queries you want to run against them. This gives good in depth coverage if everything that cqlsh syntax provides. At times, you may want to pre-aggregate / compute the data before its fed into Cassandra to reduce the number of duplicate copies to serve different API needs.

C* 2.1.13 has a bug around column name optimizations. Make sure your column names are less than 15 characters.

Do not use secondary indexes except when the indexed column isn't updated frequently and does not have a large cardinality. For example, if you have a "status" column that can only take "Success" and "Failure" as valid values and isn't changed often, it is a good candidate for secondary index. Using secondary indexes must be an exception in your data model, and not a norm. Simply create another table with partition keys on the columns that you wanted to do secondary index on to solve such problem.

This is a good read on on Spark Cassandra connector works: https://www.slideshare.net/RussellSpitzer/spark-cassandralocality

Pick correct compaction strategy. We have not used Leveled Compaction yet. For any data that looks like time series, use Date Tiered Compaction Strategy. For other types of data, use Size Tiered Compaction Strategy. Do not try to reason about how DTCS creates time buckets, that code is incredibly hard to understand.

Do not use Allow Filtering. Instead, change your data model and potentially duplicate.

Do not use queries with IN in Partition Key, else you will run into really bad performance: https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/

Think about spreading your data across partitions so that no single server gets hot. You can add things like month of a date field, device that the user is using and such to the partition key for this to happen.

Use TTL on your tables. Avoid explicit deletes (CQL delete from Foo) statements if possible.

Don't use C* distributed counters. There are known issues with using them.

EBay has some good articles on data modelling. Read this and this.

The sum of all data within a specific partition should be < 100MB. You can adjust TTL or Partition Key to make this happen.

Please avoid adding timestamp to partition key. Unless you have enough other elements to it, this will create hotspots in cluster.

Use static columns to model one to many relationships. Always set partition+clustering key values when setting static column value. Note that you can't add static columns to an existing schema, so it can be a prohibitive design choice.

Prefer prepared statements over simple statements.

Data Modeling Guidelines

Additional resources