A recent question on the Apache Cassandra mailing list triggered this blog post. The question revolved around the events that trigger a memtable flush. Understanding what causes a memtable flush is essential to understanding Apache Cassandra. Another question that frequently crops up is the size of the SSTable that results from a memtable flush.
A memtable is created for every table or column family. There can be multiple memtables for a table, but only one of them is active. The rest are waiting to be flushed. A few properties affect a memtable's size and flushing frequency. These include:
- memtable_flush_writers – This is the number of threads allocated for flushing memtables to disk. This defaults to two.
- memtable_heap_space_in_mb – This is the total allocated space for all memtables on an Apache Cassandra node. By default, this is one-fourth your heap size. Specifying this property results in an absolute heap size in MB as opposed to a percentage of the total JVM heap.
- memtable_cleanup_threshold – A fraction of your total available memtable space that, once exceeded, triggers a memtable cleanup. memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1). By default this is essentially 33% of your memtable_heap_space_in_mb. A scheduled cleanup results in flushing of the table/column family that occupies the largest portion of memtable space. This keeps happening until the memtable memory in use drops below the cleanup threshold.
- commitlog_total_space_in_mb – The total space in MB that is reserved for the commit log. If unspecified, this defaults to the smaller of 8192 MB and 25% of the total space of the commit log volume.
- memtable_flush_period_in_ms – This is a CQL table property that specifies the number of milliseconds after which a memtable should be flushed. This property is specified on table creation.
- memtable_allocation_type – Stipulates where Cassandra allocates and manages memtable memory. Memory can be allocated on the JVM heap or off-heap in native memory. This does not affect the flushing of memtables; it only determines where memtable space is allocated.
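The default relationships between these properties can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not code taken from Cassandra: the function names and their parameters are hypothetical, and the sketch assumes the defaults stated in the list (one-fourth of the heap for memtables, two flush writers, and a commit log capped at the smaller of 8192 MB and 25% of its volume).

```python
def default_memtable_settings(max_heap_mb, flush_writers=2):
    """Derive the default memtable limits from the heap size (hypothetical helper)."""
    # Default memtable space: one-fourth of the heap available to the JVM.
    heap_space_mb = max_heap_mb / 4
    # Default cleanup threshold: 1 / (memtable_flush_writers + 1).
    cleanup_threshold = 1 / (flush_writers + 1)
    return {
        "memtable_heap_space_mb": heap_space_mb,
        "memtable_cleanup_threshold": cleanup_threshold,
        # Memtable usage (in MB) above which a cleanup is triggered.
        "cleanup_trigger_mb": heap_space_mb / (flush_writers + 1),
    }


def default_commitlog_total_space_mb(commitlog_volume_mb):
    """Smaller of 8192 MB and 25% of the commit log volume (hypothetical helper)."""
    return min(8192, commitlog_volume_mb * 0.25)


if __name__ == "__main__":
    print(default_memtable_settings(3925.5))
    print(default_commitlog_total_space_mb(20000))
```

Raising memtable_flush_writers lowers the default threshold, so with more flush threads a cleanup starts at a smaller fraction of the memtable space.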
So when does a memtable get flushed to disk? A memtable is flushed when:
- The commit log reaches its maximum size – The main aim of the commit log is to track all data that has not been written to disk, i.e. data in a memtable that has not yet been flushed. The commit log forces memtables to be flushed when it runs out of disk space. The commit log allocates space in chunks called commit log segments. When the commit log runs out of disk space (surpasses its configured threshold), it needs to recycle allocated segments. This recycling process triggers flushing of all memtables. A commit log segment cannot be cleared as long as it refers to data in a memtable; doing so risks data loss. Thus, when the commit log is full, all memtables are first flushed to disk and then the commit log is recycled. The org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager.maybeFlushToReclaim() method is where this recycling is done.
- Periodically – If the CQL table sets the memtable_flush_period_in_ms property then the memtable gets flushed after the configured number of milliseconds has elapsed.
- Memtable surpasses its on- and off-heap memory threshold – This is best understood using an example. Let's assume we have an Apache Cassandra instance that has been allocated 4 GB of heap. Out of this, only 3,925.5 MB is available to the Java runtime. Please look at the following StackOverflow question for the reasons behind this. Of this, by default, we have 981 MB allocated to memtables, i.e. 1/4th of 3,925.5 MB. Our memtable_cleanup_threshold is the default value, i.e. roughly 33 percent (one third) of the total memtable heap and off-heap memory. In our example that comes to 327 MB. Thus, when the total space allocated for all memtables is greater than 327 MB, a memtable cleanup is triggered. The cleanup process looks for the largest memtable and flushes that to disk. The org.apache.cassandra.db.ColumnFamilyStore.FlushLargestColumnFamily class can be examined for further details.
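The threshold-based cleanup in the example above can be modelled with a small Python sketch. It is a toy model under stated assumptions, not how Cassandra implements it: the function name, the table names, and the per-table memtable sizes are all made up for illustration, and a flushed memtable is simply dropped from the bookkeeping rather than replaced by a new empty one.

```python
def cleanup(memtables_mb, limit_mb):
    """Flush the largest memtable until total usage is at or below limit_mb.

    memtables_mb maps a (hypothetical) table name to its memtable size in MB.
    Returns the flush order and the memtables that remain in memory.
    """
    live = dict(memtables_mb)  # copy so the caller's dict is not modified
    flushed = []
    while live and sum(live.values()) > limit_mb:
        largest = max(live, key=live.get)  # table using the most memtable space
        flushed.append(largest)
        del live[largest]  # toy model: flushing frees the memtable's memory
    return flushed, live


if __name__ == "__main__":
    # Numbers from the example: 3,925.5 MB heap, 1/4 for memtables, 1/3 threshold.
    limit_mb = 3925.5 / 4 / 3  # about 327 MB
    # Assumed sizes: 350 MB in total, which exceeds the threshold.
    sizes = {"users": 150.0, "events": 120.0, "logs": 80.0}
    flushed, remaining = cleanup(sizes, limit_mb)
    print(flushed)    # ['users']
    print(remaining)  # {'events': 120.0, 'logs': 80.0}
```

In this run a single flush of the largest memtable is enough to bring usage back under the threshold. With a more skewed or heavier workload, several flushes may be needed, which is one reason the resulting SSTables differ in size.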
Due to the above configuration options and varying Apache Cassandra workloads, SSTable size on disk can vary greatly. One thing to remember is that SSTables are compressed by default. SSTable compression can be turned off using the compression table property.