Handling Partitions in NoSQL

By Adam Fowler

The word partition is used for two different concepts in NoSQL land. A data partition is a mechanism for ensuring that data is evenly distributed across a cluster. On the other hand, a network partition occurs when two parts of the same database cluster cannot communicate.

On very large clusters, the failure of a single piece of equipment becomes increasingly likely. If a network switch between servers in a cluster fails, you get a phenomenon referred to (in computer jargon) as split brain. In this case, individual servers are still receiving requests, but they can’t communicate with each other.

This scenario can lead to inconsistent data or simply to reduced storage capacity, as the side of the network partition with the fewest servers is removed from the cluster (or “voted off,” in true Big Brother fashion).

Tolerating partitions

You have two choices when a network partition happens:

  • Continue, at some level, to service read and write operations.

  • “Vote off” one part of the partition and fix the data later, when both parts can communicate again. This usually involves the cluster promoting a read replica to be the new master for each missing master node.

Riak allows you to determine how many times data is replicated (three copies by default; that is, n=3) and how many servers must be queried in order for a read to succeed. This means that, if the primary master for a key is on the wrong side of a network partition, read operations can still succeed as long as the other two servers are available (that is, r=2 read availability).
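For example, with the official Riak Python client, a per-request read quorum looks like the following sketch (the bucket and key names are hypothetical, and the connection settings assume a default local node):

import riak

# Connect to a local Riak node over protocol buffers (default port assumed).
client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
bucket = client.bucket('orders')

# With n=3 replicas, r=2 asks that at least two replicas answer before the
# read succeeds, so the read survives the loss of the primary replica.
obj = bucket.get('order-5001', r=2)
print(obj.data)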

Riak handles writes when the primary partition server goes down by using a system called hinted handoff. When data is originally replicated, Riak writes to the first node for a particular key, along with (by default) the next two neighboring nodes in the ring.

If the primary can’t be written to, the write goes to the next node in the ring instead; the writes are effectively handed off to that node. When the primary server comes back up, the held writes are replayed to it before it takes over primary write operations again.
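The mechanics are easier to see in a toy model. The following Python sketch is not Riak’s implementation; the ring layout, node names, and data are invented purely to illustrate the handoff-and-replay idea:

# Toy illustration of hinted handoff; not Riak's actual implementation.
ring = ['node1', 'node2', 'node3', 'node4', 'node5']
down = {'node1'}                        # pretend the primary is unreachable
store = {node: {} for node in ring}     # each node's local key-value data
hints = []                              # (intended owner, key, value) triples

def write(key, value, primary=0, n=3):
    """Store n copies on the first n reachable nodes from the primary."""
    copies, i = 0, primary
    while copies < n:
        node = ring[i % len(ring)]
        if node in down:
            hints.append((node, key, value))   # hand the write off for later
        else:
            store[node][key] = value
            copies += 1
        i += 1

def recover(node):
    """When a node comes back up, replay the writes it missed."""
    down.discard(node)
    for owner, key, value in [h for h in hints if h[0] == node]:
        store[node][key] = value
        hints.remove((owner, key, value))

write('order-5001', {'total': 134.24})   # lands on node2, node3, node4
recover('node1')                         # node1 is caught up before serving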

In both of these operations, versioning inconsistencies can happen because different replicas may be in different version states, even if only for a few milliseconds.

Riak employs yet another system, called active anti-entropy, to alleviate this problem. This system trawls through updated values and ensures that replicas are brought up to date at some point, preferably sooner rather than later. Repairing in the background like this avoids conflicts on read while maintaining a high ingest speed, because it means Riak can skip the two-phase commit used by other NoSQL databases with master-slave, shared-nothing clustering support.
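The core idea is to exchange hashes of stored values rather than the values themselves. Riak actually uses persistent Merkle-style hash trees; the flat comparison below is a deliberately stripped-down sketch of the concept, with invented data:

import hashlib

def digest(replica):
    """Hash every value so two replicas can be compared cheaply."""
    return {k: hashlib.sha1(repr(v).encode()).hexdigest()
            for k, v in replica.items()}

def anti_entropy(source, target):
    """Copy across any value whose hash differs or is missing."""
    src, tgt = digest(source), digest(target)
    for key in src:
        if tgt.get(key) != src[key]:
            target[key] = source[key]    # repair the stale replica

replica_a = {'order-5001': {'total': 134.24}}
replica_b = {'order-5001': {'total': 99.99}}   # stale copy
anti_entropy(replica_a, replica_b)
print(replica_b)                               # now matches replica_a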

If a conflict on read does happen, Riak uses read repair to attempt to return only the latest data. Eventually, though, depending on the consistency and availability settings you use, the client application may be presented with multiple versions and asked to decide for itself.

In some situations, this tradeoff is desirable, and many applications may intuitively know, based on the data presented, which version to use and which version to discard.
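With the Riak Python client, one common pattern is to register a resolver function that picks among sibling versions automatically. The sketch below keeps the most recently modified sibling, a last-write-wins policy that is assumed, for this example only, to be acceptable for the data; the bucket and key names are hypothetical:

import riak

def newest_wins(riak_object):
    """Keep only the sibling with the latest modification time."""
    riak_object.siblings = [max(riak_object.siblings,
                                key=lambda s: s.last_modified)]

client = riak.RiakClient()
client.resolver = newest_wins     # applied whenever siblings are fetched

bucket = client.bucket('orders')
obj = bucket.get('order-5001')    # any conflict is resolved transparently
print(obj.data)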

Secondary indexing

Secondary indexes are indexes on specific data within a value. Most key-value stores leave this indexing up to the application. However, Riak is different, employing a scheme called document-based partitioning that allows for secondary indexing.

Document‐based partitioning assumes that you’re writing JSON structures to the Riak database. You can then set up indexes on particular named properties within this JSON structure, as shown:

{
  "order-id": 5001,
  "customer-id": 1429857,
  "order-date": "2014-09-24",
  "total": 134.24
}

If you have an application that shows a customer’s orders for the previous month, then you want to query all records, like the one shown, where the customer-id is a fixed value (1429857) and the order-date falls within a particular range (the beginning and end of the month).

In most key-value stores, you create another bucket whose key is the combined customer number and month, and whose value is a list of order ids. In Riak, however, you simply add a secondary index on both customer-id (integer) and order-date (date). This does take up extra storage space, but it has the advantage of being transparent to the application developer (see the sketch below).
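A sketch with the Riak Python client shows both halves: writing an order with its secondary indexes and then querying a range. The bucket, key, and index names here are hypothetical, and because a Riak secondary-index query runs over one index at a time, the customer and date are also combined into a composite index so the monthly range query works in a single call:

import riak

client = riak.RiakClient()
orders = client.bucket('orders')

order = orders.new('order-5001', data={
    'order-id': 5001,
    'customer-id': 1429857,
    'order-date': '2014-09-24',
    'total': 134.24,
})
order.add_index('customerid_int', 1429857)
# Composite index: ISO dates sort lexicographically, so ranges work.
order.add_index('custdate_bin', '1429857:2014-09-24')
order.store()

# All of customer 1429857's orders for September 2014.
for key in orders.get_index('custdate_bin',
                            '1429857:2014-09-01',
                            '1429857:2014-09-30'):
    print(key)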

These indexes are also updated live — meaning there’s no lag between updating a document value in Riak and the indexes being up to date. This live access to data is more difficult to pull off than it seems. After all, if the indexes are inconsistent, you’ll never find the consistently held data!

Evaluating Riak

Basho, the commercial entity behind Riak, says that the upcoming version 2.0 of its NoSQL database always has strong consistency, a claim that other NoSQL vendors also make. Claiming to always have strong consistency is like claiming to be a strong vegetarian . . . except on Sundays, when you have roast beef.

Riak is not an ACID‐compliant database. Its configuration cannot be altered such that it runs in ACID compliance mode. Clients can get inconsistent data during normal operations or during network partitions. Riak trades absolute consistency for increased availability and partition tolerance.

Running Riak in strong consistency mode means that its read replicas are updated at the same time as the primary master. This involves a two‐phase commit — basically, the master node writing to the other nodes before it confirms that the write is complete.
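The shape of that coordination is roughly as follows. This is a generic two-phase commit sketch, not Riak’s internal code: the coordinator first asks every replica to stage the write, and only confirms it once all of them have agreed:

# Generic two-phase commit sketch; not Riak's internal implementation.
class Replica:
    def __init__(self, name):
        self.name, self.staged, self.committed = name, {}, {}

    def prepare(self, key, value):
        """Phase 1: stage the write and vote on whether it can commit."""
        self.staged[key] = value
        return True                    # a real replica could vote 'no' here

    def commit(self, key):
        """Phase 2: make the staged write durable and visible."""
        self.committed[key] = self.staged.pop(key)

def two_phase_write(replicas, key, value):
    # The write is confirmed only if every replica votes yes in phase 1.
    if all(r.prepare(key, value) for r in replicas):
        for r in replicas:
            r.commit(key)
        return True
    return False    # abort; a real coordinator would discard staged writes

nodes = [Replica('master'), Replica('replica1'), Replica('replica2')]
print(two_phase_write(nodes, 'order-5001', {'total': 134.24}))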

At the time of this writing, Riak’s strong consistency mode doesn’t support secondary indexes or complex data types (for example, JSON). Hopefully, Basho will fix this issue in upcoming releases of the database.

Riak Search (a rebranded and integrated Apache Solr search engine) uses an eventually consistent update model, and so may produce false positives when used with strong consistency. This situation occurs because data may be written and the transaction then abandoned, yet the data is still used for indexing, leaving a “false positive” search result: a result that is no longer actually valid for the search query.

Riak also uses a separate sentinel process to determine which node becomes a master in failover conditions. This process, however, isn’t itself highly available, which means that, for a few seconds while a new copy of the sentinel process is brought online, a new node cannot be added or a new master elected. You need to be aware of this possibility in high-stress failover conditions.

Riak does have some nice features for application developers, such as secondary indexing and built-in JSON value support. However, database replication for disaster recovery to other datacenters is available only in the paid-for version; rental prices are shown on Basho’s website, and perpetual license prices are given only on application.

The Riak Control cluster-monitoring tool also isn’t highly regarded, because of its lag time when monitoring clusters. Riak holds a lot of promise, and if Basho adds more enterprise-level cluster-management facilities in future versions, it will become a best-in-class product.