ACID versus BASE Data Stores - dummies

ACID versus BASE Data Stores

By Dirk deRoos

One hallmark of relational database systems is something known as ACID compliance. As you might have guessed, ACID is an acronym — the individual letters, meant to describe a characteristic of individual database transactions, can be expanded as described in this list:

  • Atomicity: The database transaction must completely succeed or completely fail. Partial success is not allowed.

  • Consistency: During the database transaction, the RDBMS progresses from one valid state to another. The state is never invalid.

  • Isolation: The client’s database transaction must occur in isolation from other clients attempting to transact with the RDBMS.

  • Durability: The data operation that was part of the transaction must be reflected in nonvolatile storage (computer memory that can retrieve stored information even when not powered — like a hard disk) and persist after the transaction successfully completes. Transaction failures cannot leave the data in a partially committed state.

Certain use cases for RDBMSs, like online transaction processing, depend on ACID-compliant transactions between the client and the RDBMS for the system to function properly. A great example of an ACID-compliant transaction is a transfer of funds from one bank account to another.

This breaks down into two database transactions, where the originating account shows a withdrawal, and the destination account shows a deposit. Obviously, these two transactions have to be tied together in order to be valid so that if either of them fail, the whole operation must fail to ensure both balances remain valid.

Hadoop itself has no concept of transactions (or even records, for that matter), so it clearly isn’t an ACID-compliant system. Thinking more specifically about data storage and processing projects in the entire Hadoop ecosystem, none of them is fully ACID-compliant, either. However, they do reflect properties that you often see in NoSQL data stores, so there is some precedent to the Hadoop approach.

One key concept behind NoSQL data stores is that not every application truly needs ACID-compliant transactions. Relaxing on certain ACID properties (and moving away from the relational model) has opened up a wealth of possibilities, which have enabled some NoSQL data stores to achieve massive scalability and performance for their niche applications.

Whereas ACID defines the key characteristics required for reliable transaction processing, the NoSQL world requires different characteristics to enable flexibility and scalability. These opposing characteristics are cleverly captured in the acronym BASE:

  • BasicallyAvailable: The system is guaranteed to be available for querying by all users. (No isolation here.)

  • Soft State: The values stored in the system may change because of the eventual consistency model, as described in the next bullet.

  • Eventually Consistent: As data is added to the system, the system’s state is gradually replicated across all nodes. For example, in Hadoop, when a file is written to the HDFS, the replicas of the data blocks are created in different data nodes after the original data blocks have been written. For the short period before the blocks are replicated, the state of the file system isn’t consistent.

The acronym BASE is a bit contrived, as most NoSQL data stores don’t completely abandon all the ACID characteristics — it’s not really the polar opposite concept that the name implies, in other words. Also, the Soft State and Eventually Consistent characteristics amount to the same thing, but the point is that by relaxing consistency, the system can horizontally scale (many nodes) and ensure availability.

No discussion of NoSQL would be complete without mentioning the CAP theorem, which represents the three kinds of guarantees that architects aim to provide in their systems:

  • Consistency: Similar to the C in ACID, all nodes in the system would have the same view of the data at any time.

  • Availability: The system always responds to requests.

  • Partition tolerance: The system remains online if network problems occur between system nodes.

The CAP theorem states that in distributed networked systems, architects have to choose two of these three guarantees — you can’t promise your users all three. That leaves you with the three possibilities shown:

  • Systems using traditional relational technologies normally aren’t partition tolerant, so they can guarantee consistency and availability. In short, if one part of these traditional relational technologies systems is offline, the whole system is offline.

  • Systems where partition tolerance and availability are of primary importance can’t guarantee consistency, because updates (that destroyer of consistency) can be made on either side of the partition. The key-value stores Dynamo and CouchDB and the column-family store Cassandra are popular examples of partition tolerant/availability (PA) systems.

  • Systems where partition tolerance and consistency are of primary importance can’t guarantee availability because the systems return errors until the partitioned state is resolved.

    Hadoop-based data stores are considered CP systems (consistent and partition tolerant). With data stored redundantly across many slave nodes, outages to large portions (partitions) of a Hadoop cluster can be tolerated. Hadoop is considered to be consistent because it has a central metadata store (the NameNode) which maintains a single, consistent view of data stored in the cluster.