Designing an effective data model is essential to scaling any application, in both the relational and non-relational worlds. In this respect, data modelling in Apache Cassandra is no different from modelling for a relational database: it requires a good understanding of the target domain and of the usage patterns within that domain. A deep understanding of your target data store is also essential to building a robust data model. In Cassandra it is especially important to get the data model right, because a bad data model will erase the benefits of using Cassandra. This blog post introduces key Cassandra data modelling principles.
Key Cassandra Guidelines
Simple But Often Overlooked Principles
- Understand Apache Cassandra Architecture – This may seem obvious but it often gets overlooked. Picking the correct data model requires a deep understanding of Cassandra’s internals. At a minimum, data modellers must understand how Cassandra distributes data across a ring. They must also have a good understanding of Cassandra’s read and write paths. For a detailed discussion, please refer to my previous post on Apache Cassandra Architecture.
- Understand your domain – Again, an obvious but sometimes overlooked principle. As with any database, it is important to understand your domain: identify the entities and the relationships between them and, just as with an RDBMS, understand your data and its usage patterns. However, do not derive your Cassandra data model directly from your entities and their relationships. Understanding your domain is important, but you should never model based on your domain.
Two Key But Conflicting Principles
There are two main yet conflicting principles that you must adhere to in order to build an efficient model. Any Cassandra data model must spread data evenly across the cluster. It must also minimise the number of nodes queried to retrieve the requested data. As these two requirements conflict, you will need to find a happy medium.
- Data must be evenly spread across the cluster – Apache Cassandra is a distributed database, i.e. a single database distributed across a cluster of machines. Cassandra makes it simple to spread data evenly across the cluster, but it is up to the data modeller to ensure this actually happens. Unbalanced data distribution puts pressure on a small subset of nodes, which leads to performance and scalability issues.
- Minimise the number of partitions queried when retrieving data – A partition is a group of CQL rows that share a partition key. Rows are distributed across the nodes based on a hash of the partition key, i.e. the first element of the primary key. Different partitions can live on different nodes, so when querying we need to minimise the number of nodes we visit to retrieve the required data. Thus we need to minimise the number of partitions we read from.
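The tension between these two principles can be illustrated with a small simulation. This is only a sketch under stated assumptions: the four node names are hypothetical, MD5 stands in for Cassandra's real partitioner, and a simple modulo replaces token ranges and replication. It shows how the choice of partition key affects both data distribution and the number of nodes a query has to visit.

```python
import hashlib
from collections import Counter

# Hypothetical four-node cluster; real clusters assign token ranges to nodes.
NODES = ["node-a", "node-b", "node-c", "node-d"]


def node_for(partition_key: str) -> str:
    """Map a partition key to a node by hashing it.

    Assumption: MD5 plus modulo is a stand-in for Cassandra's partitioner.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


def rows_per_node(partition_keys):
    """Count how many rows land on each node, given each row's partition key."""
    return Counter(node_for(key) for key in partition_keys)


def nodes_touched(partition_keys):
    """Return the set of nodes a query must visit to read the given partitions."""
    return {node_for(key) for key in partition_keys}


# High-cardinality key (one partition per user): rows spread across the cluster.
by_user = rows_per_node(f"user-{i}" for i in range(10_000))

# Low-cardinality key (every row shares one value): all rows land on one node.
by_country = rows_per_node("GB" for _ in range(10_000))

print("rows per node, keyed by user   :", dict(by_user))
print("rows per node, keyed by country:", dict(by_country))
print("nodes read for one user :", len(nodes_touched(["user-42"])))
print("nodes read for 50 users :", len(nodes_touched(f"user-{i}" for i in range(50))))
```

Keying by user spreads the rows evenly, but a query that needs many users has to visit many nodes. Keying by a single shared value keeps every read on one node, but overloads that node. A good partition key sits between these extremes.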
Ignore these Principles at your own Peril
- Model around your queries – You must identify your queries and model around them. Each table/column family should be built to answer a high-level query that needs to be supported. Modelling around queries minimises the number of reads needed to answer each query, which is why it is critical. It also leads to data duplication and a heavier write load, which is the preferred tradeoff in Cassandra.
- Writes are Cheap and Fast – RDBMSs were designed decades ago to be space efficient because storage was expensive. Storage is now much cheaper, and Cassandra has been designed with this in mind. Cassandra is optimised for writing data, so increasing the number of writes to improve read performance is a good tradeoff. In Cassandra reads tend to be more expensive than writes and are trickier to tune.
- Denormalisation – If you are coming from a relational background, it is important to understand that you must denormalise your data. In the relational world the pros and cons of denormalisation are well understood. Normalisation leads to a clear, space-efficient schema, sometimes at the cost of inefficient queries due to the need to join a number of tables. Denormalised structures are more efficient to query but are space inefficient. Normalised data models in a distributed database also lead to inefficient queries, but the performance consequences are greatly magnified by the distributed nature of the database. Duplicate data to satisfy different versions of a read query. Queries must be as simple as possible, and the data presented back to the application should require minimal manipulation. Most NoSQL databases discourage joins or do not support them at all. Joining at the application level is an alternative, but we must be careful not to hit a large number of nodes in the cluster to satisfy our query. Denormalisation is thus a very useful tool, as disk space is cheap and writes are extremely fast.
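The idea of modelling around queries and duplicating data on write can be sketched in a few lines. The example below is illustrative only: plain Python dictionaries stand in for Cassandra tables, and the table names (`orders_by_customer`, `orders_by_product`) and order fields are hypothetical. Each "table" is keyed by the partition key of the query it serves.

```python
from collections import defaultdict

# Each "table" maps a partition key to the rows stored in that partition.
# Table names and order fields are hypothetical, chosen for illustration.
orders_by_customer = defaultdict(list)  # answers: "all orders for a customer"
orders_by_product = defaultdict(list)   # answers: "all orders for a product"


def record_order(order_id, customer_id, product_id, quantity):
    """Write the same order to both tables: two writes now, cheap reads later."""
    row = {
        "order_id": order_id,
        "customer_id": customer_id,
        "product_id": product_id,
        "quantity": quantity,
    }
    # Store independent copies, as each table holds its own duplicate of the data.
    orders_by_customer[customer_id].append(dict(row))
    orders_by_product[product_id].append(dict(row))


def get_orders_for_customer(customer_id):
    """Read a single partition; no join and no scan over other customers."""
    return orders_by_customer.get(customer_id, [])


def get_orders_for_product(product_id):
    """Read a single partition of the second, duplicated table."""
    return orders_by_product.get(product_id, [])


record_order(1, "alice", "kettle", 1)
record_order(2, "alice", "toaster", 2)
record_order(3, "bob", "kettle", 1)

print([o["order_id"] for o in get_orders_for_customer("alice")])  # [1, 2]
print([o["order_id"] for o in get_orders_for_product("kettle")])  # [1, 3]
```

Every order is written twice, but each query is answered from exactly one partition. That is the tradeoff described above: more writes and more storage in exchange for simple, single-partition reads.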
Apache Cassandra Anti Patterns
The following practices are highly ineffective when using Apache Cassandra:
- Do Not Run Cassandra on a SAN or Network Storage – SAN storage is not recommended for a Cassandra deployment for a variety of reasons. These include:
  - SAN storage is not cost effective.
  - SAN storage introduces a single point of failure.
  - External storage adds latency even when used across a high-speed network. Cassandra uses uncoordinated IO, so each node assumes it has a local disk and attempts to maximise the bandwidth it uses. If you use a SAN, you wind up hammering your SAN.
- Do not install your commit log and data directory on the same volume. The commit log and data directory have very different IO patterns, so installing them on the same volume results in conflicting IO access patterns and a slow node.
- Distributed Joins – In Cassandra you are looking to retrieve your data from a single partition. Every time we retrieve data from more than one node we are essentially doing a distributed join. This could be done explicitly in your application or can be the result of an IN query. These types of queries should be avoided, as they can easily end up querying a large number of nodes in the cluster. This will not only be slow but, in certain circumstances, will put too much pressure on the cluster.
- Mutable Data – Avoid modelling data with mutable state. Store events rather than updating cells, i.e. use event sourcing. Event sourcing is a pattern in which all changes to application state are saved as a sequence of events, so the event log can be used to reconstruct past states. In short, avoid mutable state in a distributed system.
- Not Testing Your Schema – Performance test your schema as early as possible. Leaving schema testing to the later stages of your development cycle is expensive and risky. I advocate performance-test-driven schema development, i.e. develop your performance test before developing your schema.
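To make the event sourcing advice from the Mutable Data anti-pattern more concrete, here is a minimal sketch. It is not tied to Cassandra or any driver: a Python list stands in for an append-only partition, and the `Event` type, the event kinds and the bank-account example are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An immutable record of one change (hypothetical shape)."""
    sequence: int
    kind: str      # "deposit" or "withdrawal"
    amount: int


def append_event(log, kind, amount):
    """Append a new event; existing events are never updated in place."""
    event = Event(sequence=len(log) + 1, kind=kind, amount=amount)
    log.append(event)
    return event


def balance_at(log, up_to_sequence=None):
    """Rebuild the balance by replaying events, optionally up to a past point."""
    balance = 0
    for event in log:
        if up_to_sequence is not None and event.sequence > up_to_sequence:
            break
        if event.kind == "deposit":
            balance += event.amount
        elif event.kind == "withdrawal":
            balance -= event.amount
    return balance


account_log = []
append_event(account_log, "deposit", 100)
append_event(account_log, "withdrawal", 30)
append_event(account_log, "deposit", 5)

print(balance_at(account_log))     # 75, the current state
print(balance_at(account_log, 2))  # 70, the state after the second event
```

The current balance is never stored or overwritten. It is derived from the log, which also makes any past state recoverable by replaying fewer events.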
Thanks for reading. I would love to get any feedback on the post.