A recent question on the Apache Cassandra mailing list triggered this blog post. The question revolved around the events that trigger a memtable flush. Understanding the root causes of a memtable flush is essential to a better understanding of Apache Cassandra. Another question that frequently crops up concerns the size of the SSTable that results from a memtable flush.
A memtable is created for every table or column family. There can be multiple memtables for a table, but only one of them is active; the rest are waiting to be flushed. A few properties affect a memtable's size and flushing frequency. These include:
- memtable_flush_writers – The number of threads allocated for flushing memtables to disk. This defaults to two.
- memtable_heap_space_in_mb – The total space allocated for all memtables on an Apache Cassandra node. By default, this is one-fourth of your heap size. Specifying this property sets an absolute size in MB rather than a percentage of the total JVM heap.
- memtable_cleanup_threshold – The percentage of your total available memtable space that triggers a memtable cleanup. memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1), which with the default of two flush writers comes to roughly 33% of your memtable_heap_space_in_mb. A cleanup flushes the table/column family that occupies the largest portion of memtable space, and this repeats until your memtable memory usage drops below the cleanup threshold.
- commitlog_total_space_in_mb – The total space in MB reserved for the commit log. If unspecified, this defaults to the smaller of 8192 MB and 25% of the total space of the commit log volume.
- memtable_flush_period_in_ms – This is a CQL table property that specifies the number of milliseconds after which a memtable should be flushed. This property is specified on table creation.
- memtable_allocation_type – Stipulates where Cassandra allocates and manages memtable memory: on the JVM heap or in off-heap (native) memory. This does not affect when memtables are flushed; it only determines where memtable space is allocated.
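Put together, the defaults discussed above correspond to a cassandra.yaml fragment roughly like the following. Treat this as an illustrative sketch, not a recommended configuration; the commented values are what Cassandra computes when the properties are left unset.

```yaml
# Illustrative cassandra.yaml fragment for the memtable/commitlog
# properties discussed above (values shown assume an 8 GB heap).
memtable_flush_writers: 2               # default: 2
memtable_heap_space_in_mb: 2048         # default: 1/4 of the JVM heap
memtable_cleanup_threshold: 0.33        # default: 1 / (memtable_flush_writers + 1)
memtable_allocation_type: heap_buffers  # or offheap_buffers, offheap_objects
commitlog_total_space_in_mb: 8192       # default: min(8192 MB, 25% of commitlog volume)
```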
So when does a memtable get flushed to disk? A memtable is flushed when:
- The commit log reaches its maximum size – The main aim of the commit log is to track all data that has not yet been written to disk, i.e. data in a memtable that has not been flushed. The commit log allocates space in chunks called commit log segments. When the commit log runs out of disk space (surpasses its configured threshold), it needs to recycle allocated segments, and this recycling process triggers flushing of memtables. The commit log cannot be cleared as long as it refers to data in an unflushed memtable; doing so risks data loss. Thus, when the commit log is full, the memtables are first flushed to disk and then the commit log is recycled. The org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager.maybeFlushToReclaim() method is where this recycling is done.
- Periodically – If the CQL table sets the memtable_flush_period_in_ms property then the memtable gets flushed after the configured number of milliseconds has elapsed.
- A memtable surpasses its on- and off-heap memory threshold – This is best understood with an example. Let's assume an Apache Cassandra instance with a 4 GB heap, of which only 3,925.5 MB is available to the Java runtime. Please look at the following StackOverflow question for the reasons behind this. Of this, by default, 981 MB is allocated towards memtables, i.e. one-fourth of 3,925.5 MB. Our memtable_cleanup_threshold is the default value, i.e. 33 percent of the total memtable heap and off-heap memory, which in our example comes to 327 MB. Thus, when the total space occupied by all memtables exceeds 327 MB, a memtable cleanup is triggered. The cleanup process looks for the largest memtable and flushes it to disk. The org.apache.cassandra.db.ColumnFamilyStore.FlushLargestColumnFamily class can be examined for further details.
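The arithmetic in the example above can be sketched as follows. The 3,925.5 MB usable-heap figure is taken from the example; the variable names are illustrative, not Cassandra internals.

```python
# Sketch of the default memtable flush-threshold arithmetic described above.
usable_heap_mb = 3925.5     # heap actually available to the Java runtime (from the example)
memtable_flush_writers = 2  # Cassandra default

# Default: one-fourth of the heap is reserved for memtables.
memtable_heap_space_mb = usable_heap_mb / 4

# Default: memtable_cleanup_threshold = 1 / (memtable_flush_writers + 1).
cleanup_threshold = 1 / (memtable_flush_writers + 1)

# A cleanup (a flush of the largest memtable) is triggered past this point.
flush_trigger_mb = memtable_heap_space_mb * cleanup_threshold

print(round(memtable_heap_space_mb))  # ~981 MB
print(round(flush_trigger_mb))        # ~327 MB
```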
Due to the above configuration options and varying Apache Cassandra workloads, SSTable size on disk can vary greatly. One thing to remember is that SSTables are compressed by default. SSTable compression can be turned off using the compression table property.
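Both table-level knobs mentioned in this post are set via CQL table properties. A hypothetical example (the table and column names are made up for illustration):

```sql
-- Hypothetical table: flush its memtable every hour and disable SSTable compression.
CREATE TABLE sensor_readings (
    sensor_id    uuid,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH memtable_flush_period_in_ms = 3600000
  AND compression = {'enabled': 'false'};
```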
If I am allocating 981 MB for memtables and Cassandra initiates a flush after 327 MB, that means at any point in time Cassandra will have at most 327 MB of active memtables… then what about the (981 - 327) MB = 654 MB of memtable space? What is it used for? I can see that memtables which are queued to be flushed occupy some portion of this 654 MB, but what about the rest of the space? Is it not being wasted?
Memtable memory utilization depends on the number of tables and the write throughput on each of those tables. Write heavy workloads can also result in multiple memtables (only one will be active) for each table. The default value of 1/4 the heap size is a good starting point but can be tuned according to an application’s workload. In the blog post, I am assuming a single table in order to explain a memtable’s lifecycle. In reality, most Cassandra clusters will have a fair number of tables.
Very good post.
I have one question.
commitlog_total_space_in_mb – do we have a commit log directory on every node?
Memtables from node 1 are flushed when the commitlog_total_space_in_mb threshold for node 1 is reached.
Memtables from node 2 are flushed when the commitlog_total_space_in_mb threshold for node 2 is reached.
Could you please explain how this works?
The commit log is stored by default at $CASSANDRA_HOME/data/commitlog. This can be changed using the commitlog_directory property in the Cassandra YAML configuration file. Commit logs and the flushing of memtables are independent operations on each node.
This is the best post I’ve seen explaining memtable flush. I don’t understand why they made the threshold inversely proportional to the number of flush writers. It doesn’t allow you to fully utilize your system if you happen to have a lot of CPUs and a lot of memory. Why not have an algorithm friendly to such server nodes? I want to have a lot of flush writers and still leverage a very large chunk of off-heap memory. You say, “just use fewer flush writers”… I can do that, but it makes for rougher tuning than should be possible if you thought this through… Do you know if anything is planned to improve on this?