Zookeeper and HBase Reliability

By Dirk deRoos

Zookeeper is a distributed cluster of servers that collectively provides reliable coordination and synchronization services for clustered applications. Admittedly, the name “Zookeeper” may seem at first to be an odd choice, but when you understand what it does for an HBase cluster, you can see the logic behind it. When you’re building and debugging distributed applications “it’s a zoo out there,” so you should put Zookeeper on your team.

HBase clusters can be huge, and coordinating the operations of the MasterServers, RegionServers, and clients can be a daunting task. That's where Zookeeper enters the picture. As with HBase, Zookeeper clusters typically run on low-cost commodity x86 servers.

Each individual x86 server runs a single Zookeeper software process (hereafter referred to as a Zookeeper server), with one Zookeeper server elected by the ensemble as the leader and the rest acting as followers. Zookeeper ensembles are governed by the principle of a majority quorum.

Configurations with one Zookeeper server are supported for test and development purposes, but if you want a reliable cluster that can tolerate server failure, you need to deploy at least three Zookeeper servers to achieve a majority quorum.


So, how many Zookeeper servers will you need? Five is the minimum recommended for production use, because you really don't want to go with the bare minimum of three. When you plan your Zookeeper ensemble, follow this simple formula: 2F + 1 = N, where F is the number of server failures you can tolerate and N is the total number of Zookeeper servers you must deploy. For example, to ride out two failures (F = 2) and still have a majority quorum, you need 2 × 2 + 1 = 5 servers.

Five is recommended because one server can be shut down for maintenance and the Zookeeper cluster can still tolerate the failure of one more server.
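To make the ensemble idea concrete, here's a minimal sketch of how a client connects, using the standard Zookeeper Java client. The client simply lists every server in the ensemble in its connection string and lets the library pick one; the host names zk1 through zk5 are placeholders for your own machines.

    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    import java.util.concurrent.CountDownLatch;

    public class EnsembleConnection {
        public static void main(String[] args) throws Exception {
            // List every server in the ensemble; the client connects to one of
            // them and fails over to another if that server becomes unavailable.
            String connectString = "zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181";

            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(connectString, 30000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });

            connected.await();  // block until the session is established
            System.out.println("Session state: " + zk.getState());
            zk.close();
        }
    }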

Zookeeper provides coordination and synchronization with what it calls znodes, which are presented as a directory tree and resemble the file path names you’d see in a Unix file system. Znodes do store data but not much to speak of — currently less than 1 MB by default.

The idea here is that Zookeeper stores znodes in memory and that these memory-based znodes provide fast client access for coordination, status, and other vital functions required by distributed applications like HBase. Zookeeper replicates znodes across the ensemble so if servers fail, the znode data is still available as long as a majority quorum of servers is still up and running.
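Here's a small sketch of working with znodes, again using the Zookeeper Java client. The /demo path and its payload are made up purely for illustration, and the host names are placeholders.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    import java.nio.charset.StandardCharsets;

    public class ZnodeBasics {
        public static void main(String[] args) throws Exception {
            // (For brevity, this skips waiting for the connection event shown earlier.)
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> { });

            // Znodes are named like Unix file paths and hold a small payload,
            // well under the roughly 1 MB default limit.
            byte[] payload = "all-regionservers-online".getBytes(StandardCharsets.UTF_8);

            if (zk.exists("/demo", false) == null) {
                zk.create("/demo", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            zk.create("/demo/cluster-status", payload,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Reads return the data plus a Stat structure that includes the version.
            Stat stat = new Stat();
            byte[] data = zk.getData("/demo/cluster-status", false, stat);
            System.out.println(new String(data, StandardCharsets.UTF_8)
                    + " (version " + stat.getVersion() + ")");

            zk.close();
        }
    }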

Another key Zookeeper concept concerns how znode reads (as opposed to writes) are handled. Any Zookeeper server can handle reads from a client, including the leader, but only the leader issues atomic znode writes, which either completely succeed or completely fail.

When a znode write request arrives at the leader, the leader broadcasts the write to the follower nodes and then waits for a majority of followers to acknowledge that the write is complete. After that acknowledgement, the leader applies the znode write itself and then reports the successful completion status to the client.
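The sketch below illustrates the majority-quorum idea in plain Java. It is not Zookeeper's actual implementation (the real protocol, known as Zab, is considerably more involved); the Follower interface and its ack method are invented here just to show how a leader can declare a write successful once a majority of the ensemble agrees.

    import java.util.List;

    // Conceptual sketch only -- not Zookeeper's real write path.
    public class QuorumWriteSketch {

        interface Follower {
            // Hypothetical: returns true if this follower accepted the proposed write.
            boolean ack(byte[] proposal);
        }

        static boolean leaderWrite(List<Follower> followers, byte[] proposal) {
            int acks = 1;                                 // the leader counts as one vote
            int quorum = (followers.size() + 1) / 2 + 1;  // majority of the whole ensemble

            for (Follower follower : followers) {
                if (follower.ack(proposal)) {
                    acks++;
                }
                if (acks >= quorum) {
                    // Majority reached: the leader applies the write itself
                    // and reports success back to the client.
                    return true;
                }
            }
            return false;  // no majority, so the write fails as a whole
        }
    }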

Znodes provide some very powerful guarantees. When a Zookeeper client (such as an HBase RegionServer) writes or reads a znode, the operation is atomic. It either completely succeeds or completely fails — there are no partial reads or writes.

No other competing client can cause the read or write operation to fail. In addition, a znode has an access control list (ACL) associated with it for security, and it supports versions, timestamps, and notifications to clients when it changes.
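The Zookeeper Java client exposes these guarantees directly. The sketch below uses a znode's version number to make a conditional update and registers a watch so the client is notified the next time the znode changes; the path passed in is hypothetical, and the znode is assumed to already exist.

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class VersionedUpdate {
        // Update a znode only if no other client has changed it since we read it.
        static void updateIfUnchanged(ZooKeeper zk, String path, byte[] newData)
                throws KeeperException, InterruptedException {
            Stat stat = new Stat();
            zk.getData(path, false, stat);   // read the current data and version

            // Ask to be notified the next time this znode changes.
            zk.exists(path, event ->
                    System.out.println("znode changed: " + event.getType()));

            try {
                // setData succeeds only if the version still matches; otherwise the
                // whole write fails with BadVersion -- there is no partial update.
                zk.setData(path, newData, stat.getVersion());
            } catch (KeeperException.BadVersionException e) {
                System.out.println("Another client updated " + path + " first");
            }
        }
    }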

Because znodes are replicated across the ensemble, a write accepted on behalf of any client must be propagated to every server, and the Zookeeper leader manages this propagation.

This znode write approach can cause followers to fall briefly behind the leader, which means that a read from a follower may return slightly stale data. Zookeeper addresses this with a synchronization command: clients that can't tolerate this temporary lag can issue a sync command before reading znodes, which brings the server they're connected to up to date with the leader.
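Here's a brief sketch of that pattern with the Zookeeper Java client: the client calls sync on a path and waits for the callback before reading, so the server it's connected to catches up with the leader first. The path is again just a placeholder.

    import org.apache.zookeeper.ZooKeeper;

    import java.util.concurrent.CountDownLatch;

    public class SyncThenRead {
        static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
            CountDownLatch synced = new CountDownLatch(1);

            // sync() is asynchronous; the callback fires once the server this
            // client is connected to has caught up with the leader for this path.
            zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
            synced.await();

            // This read now reflects every write the leader had committed
            // when the sync was issued.
            return zk.getData(path, false, null);
        }
    }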