Define and optimize data partitions in Apache Cassandra

Apache Cassandra is built for speed and scalability; here's how to get the most out of those benefits.
81 readers like this
81 readers like this
Person standing in front of a giant computer screen with numbers, data

Opensource.com

Apache Cassandra is a database. But it's not just any database; it's a replicating database designed and tuned for scalability, high availability, low-latency, and performance. Cassandra can help your data survive regional outages, hardware failure, and what many admins would consider excessive amounts of data.

Having a thorough command of data partitions enables you to achieve superior Cassandra cluster design, performance, and scalability. In this article, I'll examine how to define partitions and how Cassandra uses them, as well as the most critical best practices and known issues you ought to be aware of.

To set the scene: partitions are chunks of data that serve as the atomic unit for key database-related functions like data distribution, replication, and indexing. Distributed data systems commonly distribute incoming data into these partitions, performing the partitioning with simple mathematical functions such as identity or hashing, and using a "partition key" to group data by partition. For example, consider a case where server logs arrive as incoming data. Using the "identity" partitioning function and the timestamps of each log (rounded to the hour value) for the partition key, we can partition this data such that each partition holds one hour of the logs.

Data partitions in Cassandra

Cassandra operates as a distributed system and adheres to the data partitioning principles described above. With Cassandra, data partitioning relies on an algorithm configured at the cluster level, and a partition key configured at the table level.

Cassandra data partition

Cassandra Query Language (CQL) uses the familiar SQL table, row, and column terminologies. In the example diagram above, the table configuration includes the partition key within its primary key, with the format: Primary Key = Partition Key + [Clustering Columns].

A primary key in Cassandra represents both a unique data partition and a data arrangement inside a partition. Data arrangement information is provided by optional clustering columns. Each unique partition key represents a set of table rows managed in a server, as well as all servers that manage its replicas.

Defining primary keys in CQL

The following four examples demonstrate how a primary key can be represented in CQL syntax. The sets of rows produced by these definitions are generally considered a partition.

Definition 1 (partition key: log_hour, clustering columns: none)

CREATE TABLE server_logs(
   log_hour TIMESTAMP PRIMARYKEY,
   log_level text,
   message text,
   server text
   )

Here, all rows that share a log_hour go into the same partition.

Definition 2 (partition key: log_hour, clustering columns: log_level)

CREATE TABLE server_logs(
   log_hour TIMESTAMP,
   log_level text,
   message text,
   server text,
   PRIMARY KEY (log_hour, log_level)
   )

This definition uses the same partition key as Definition 1, but here all rows in each partition are arranged in ascending order by log_level.

Definition 3 (partition key: log_hour, server, clustering columns: none)

CREATE TABLE server_logs(
   log_hour TIMESTAMP,
   log_level text,
   message text,
   server text,
   PRIMARY KEY ((log_hour, server))
   )

In this definition, all rows share a log_hour for each distinct server as a single partition.

Definition 4 (partition key: log_hour, server, clustering columns: log_level)

CREATE TABLE server_logs(
   log_hour TIMESTAMP,
   log_level text,
   message text,
   server text,
   PRIMARY KEY ((log_hour, server),log_level)
   )WITH CLUSTERING ORDER BY (column3 DESC);

This definition uses the same partition as Definition 3 but arranges the rows within a partition in descending order by log_level.

How Cassandra uses the partition key

Cassandra relies on the partition key to determine which node to store data on and where to locate data when it's needed. Cassandra performs these read and write operations by looking at a partition key in a table, and using tokens (a long value out of range -2^63 to +2^63-1) for data distribution and indexing. These tokens are mapped to partition keys by using a partitioner, which applies a partitioning function that converts any partition key to a token. Through this token mechanism, every node of a Cassandra cluster owns a set of data partitions. The partition key then enables data indexing on each node.

Cassandra cluster with 3 nodes and token-based ownership

A Cassandra cluster with three nodes and token-based ownership. This is a simplistic representation: the actual implementation uses Vnodes.

Data partition impacts on Cassandra clusters

Careful partition key design is crucial to achieving the ideal partition size for the use case. Getting it right allows for even data distribution and strong I/O performance. Partition size has several impacts on Cassandra clusters you need to be aware of:

  • Read performance—In order to find partitions in SSTables files on disk, Cassandra uses data structures that include caches, indexes, and index summaries. Partitions that are too large reduce the efficiency of maintaining these data structures – and will negatively impact performance as a result. Cassandra releases have made strides in this area: in particular, version 3.6 and above of the Cassandra engine introduce storage improvements that deliver better performance for large partitions and resilience against memory issues and crashes.
  • Memory usage— Large partitions place greater pressure on the JVM heap, increasing its size while also making the garbage collection mechanism less efficient.
  • Cassandra repairs—Large partitions make it more difficult for Cassandra to perform its repair maintenance operations, which keep data consistent by comparing data across replicas.
  • Tombstone eviction—Not as mean as it sounds, Cassandra uses unique markers known as "tombstones" to mark data for deletion. Large partitions can make that deletion process more difficult if there isn't an appropriate data deletion pattern and compaction strategy in place.

While these impacts may make it tempting to simply design partition keys that yield especially small partitions, the data access pattern is also highly influential on ideal partition size (for more information, read this in-depth guide to Cassandra data modeling). The data access pattern can be defined as how a table is queried, including all of the table's select queries. Ideally, CQL select queries should have just one partition key in the where clause—that is to say, Cassandra is most efficient when queries can get needed data from a single partition, instead of many smaller ones.

Best practices for partition key design

Following best practices for partition key design helps you get to an ideal partition size. As a rule of thumb, the maximum partition size in Cassandra should stay under 100MB. Ideally, it should be under 10MB. While Cassandra versions 3.6 and newer make larger partition sizes more viable, careful testing and benchmarking must be performed for each workload to ensure a partition key design supports desired cluster performance.

Specifically, these best practices should be considered as part of any partition key design:

  • The goal for a partition key must be to fit an ideal amount of data into each partition for supporting the needs of its access pattern.
  • A partition key should disallow unbounded partitions: those that may grow indefinitely in size over time. For instance, in the server_logs examples above, using the server column as a partition key would create unbounded partitions as the number of server logs continues to increase. In contrast, using log_hour limits each partition to an hour of data.
  • A partition key should also avoid creating a partition skew, in which partitions grow unevenly, and some are able to grow without limit over time. In the server_logs examples, using the server column in a scenario where one server generates considerably more logs than others would produce a partition skew. To avoid this, a useful technique is to introduce another attribute from the table to force an even distribution, even if it's necessary to create a dummy column to do so.
  • It's helpful to partition time-series data with a partition key that uses a time element as well as other attributes. This protects against unbounded partitions, enables access patterns to use the time attribute in querying specific data, and allows for time-bound data deletion. The examples above each demonstrate this by using the log_hour time attribute.

Several tools are available to help test, analyze, and monitor Cassandra partitions to check that a chosen schema is efficient and effective. By carefully designing partition keys to align well with the data and needs of the solution at hand, and following best practices to optimize partition size, you can utilize data partitions that more fully deliver on the scalability and performance potential of a Cassandra deployment.

What to read next
Anil Inamdar is the Head of U.S. Consulting & Delivery at Instaclustr, which provides a managed service platform of open source technologies such as Apache Cassandra, Apache Spark, Elasticsearch and Apache Kafka. Anil has 20+ years of experience in data and analytics roles.

1 Comment

Hi.

One of the data analytics company has given me an assignment of creating architecture and explaining them with diagrams. This assignment has two questions. The Q1 is related to choosing right technology and data partitioning strategy using a nosql cloud database. Also reducing the compute time so that entire compute load can finish in few hours.

Coming to Q2. A trucking company deals with lots of invoices(daily 40000). And currently all people can see all the invoices which are not related to them. How would you design a system to store all this data in a cost efficient way. How would you design a authorization system to ensure organizations can only see invoices related only to themselves.

I saw your blog on data partitioning in Cassandra. I think you can help me as you may already be knowing the solution.

Thanks
Prakash Saswadkar
Mumbai, mob: +91-981 941 5206

-- Copy pasted from word doc --
Problem1:

A large fast food chain wants you to generate forecast for 2000 restaurants of this fast food chain. Each restaurant has close to 500 items that they sell.
The fast food chain provides data for last 3 years at a store, item, day level. The ask is provide forecast out for the following year.
Assume the data is static. Data Scientist look at the problem and have figured out a solution that provides the best forecast.
The data scientist have built an algorithm that takes all data at a store level and produce forecasted output at the store level.
It takes them 15 minutes to process each store.

Questions:
1) Given the input data is static. What is the right technology to store the data and what would be the partitioning strategy?
2) Each store takes 15 minutes, how would you design the system to orchestrate the compute faster - so the entire compute can finish this in < 5hrs

Make any assumptions in your way and state them as you design the solution and do not worry about the analytic part. Assume the analytic
part is a black box.

Problem 2:

A trucking company deals with a lot of invoices close to 40,000 a day. Regulatory requirements need 7 years of data to be stored.
A trucker scans the invoice on his mobile device at the point of delivery. Image recognition program scans the invoice and adds
meta information captured from the image. Meta information will include shipped from and shipped to and other information.
The trucking company can see all its invoices, the shipped from organizations can view all invoices whose shipped from matches with theirs,
similar rules apply to shipped to.

Questions:
How would you design a system to store all this data in a cost efficient way.
How would you design a authorization system to ensure organizations can only see invoices based on rules stated above.
What would be the design considerations to make the solution globally available ?
-- --

Creative Commons LicenseThis work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.