The HBase MasterServer

By Dirk deRoos

Starting a discussion of HBase (Hadoop Database) architecture by describing RegionServers instead of the MasterServer may surprise you. The term RegionServer would seem to imply that it depends on (and is secondary to) the MasterServer and that you should therefore discuss the MasterServer first. As the old song goes, though, “it ain’t necessarily so.”

The RegionServers do depend on the MasterServer for certain functions, but not in the sense of a master-slave relationship for data storage and retrieval. In the upper-left corner of the figure, notice that the clients do not point to the MasterServer; they point instead to the ZooKeeper cluster and the RegionServers.

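[Figure: HBase architecture, with clients pointing to the ZooKeeper cluster and the RegionServers rather than to the MasterServer]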

The MasterServer isn’t in the path for data storage and access; that’s the job of the ZooKeeper cluster and the RegionServers. Take a look at the primary functions of the MasterServer, which, like a RegionServer, is a software process (or daemon). The MasterServer is there to

  • Monitor the RegionServers in the HBase cluster: The MasterServer maintains a list of active RegionServers in the HBase cluster.

  • Handle metadata operations: When a table is created or its attributes are altered (compression settings, cache settings, versioning, and more), the MasterServer handles the operation and stores the required metadata.

  • Assign regions: The MasterServer assigns regions to RegionServers.

  • Manage RegionServer failover: As with any distributed cluster, you hope that node failures don’t occur, and you plan for them anyway. When RegionServers fail, ZooKeeper notifies the MasterServer so that failover and restore operations can be initiated.

  • Oversee load balancing of regions across all available RegionServers: You may recall that tables are composed of regions, which are evenly distributed across all available RegionServers. This distribution is the work of the balancer thread (or chore, if you prefer), which the MasterServer periodically activates.

  • Manage (and clean) catalog tables: Two key catalog tables are used by the HBase system to help a client find a particular key-value pair in the system.

    The MasterServer provides management of these critical tables on behalf of the overall HBase system.

  • Clear the WAL: The MasterServer interacts with the write-ahead log (WAL) during RegionServer failover and periodically cleans up old log files.

  • Provide a coprocessor framework for observing master operations: Here’s another new term for your growing HBase glossary. Coprocessors run in the context of the MasterServer or RegionServers. For example, a MasterServer observer coprocessor allows you to change or extend the normal functionality of the server when operations such as table creation or table deletion take place. Often coprocessors are used to manage table indexes for advanced HBase applications.
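
The assignment, failover, and load-balancing duties in the list above can be pictured with a small toy model. The sketch below is not HBase code: the function names (assign_regions, rebalance, fail_over) and the round-robin and "move one region at a time" strategies are assumptions chosen for illustration, and the real balancer weighs more than region counts.

```python
def assign_regions(regions, servers):
    """Toy assignment: hand regions to live servers in round-robin order."""
    assignment = {server: [] for server in servers}
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append(region)
    return assignment


def rebalance(assignment):
    """Toy balancer: move regions from the busiest server to the idlest
    one until region counts differ by at most one."""
    assignment = {s: list(r) for s, r in assignment.items()}
    while True:
        busiest = max(assignment, key=lambda s: len(assignment[s]))
        idlest = min(assignment, key=lambda s: len(assignment[s]))
        if len(assignment[busiest]) - len(assignment[idlest]) <= 1:
            return assignment
        assignment[idlest].append(assignment[busiest].pop())


def fail_over(assignment, dead_server):
    """Toy failover: hand the dead server's regions to the survivors,
    then rebalance. Assumes at least one server survives."""
    assignment = {s: list(r) for s, r in assignment.items()}
    orphans = assignment.pop(dead_server)
    survivors = sorted(assignment)
    for i, region in enumerate(orphans):
        assignment[survivors[i % len(survivors)]].append(region)
    return rebalance(assignment)


# Hypothetical cluster: seven regions spread over three RegionServers.
initial = assign_regions([f"region-{n}" for n in range(7)], ["rs1", "rs2", "rs3"])
after_failure = fail_over(initial, "rs1")
```

In this model, every region that lived on the failed server ends up on a survivor, and no survivor carries more than one region more than any other.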

A coprocessor, which runs in the context of the MasterServer, a RegionServer, or both, can be used to enhance security, create secondary indexes, and more. You can find more information about coprocessors at an HBase community blog.
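
To give a feel for what "observing master operations" means, here is a minimal sketch of the observer idea in plain Python. It is a toy model, not the HBase API: real master coprocessors are written in Java, and every class and method name below (ToyMaster, MasterObserver, pre_delete_table, and so on) is hypothetical.

```python
class MasterObserver:
    """Hypothetical observer base class: every hook does nothing by default."""

    def pre_create_table(self, table_name):
        pass

    def post_create_table(self, table_name):
        pass

    def pre_delete_table(self, table_name):
        pass


class ToyMaster:
    """Toy master that calls registered observers around table operations."""

    def __init__(self):
        self.tables = set()
        self.observers = []

    def register(self, observer):
        self.observers.append(observer)

    def create_table(self, table_name):
        for observer in self.observers:
            observer.pre_create_table(table_name)
        self.tables.add(table_name)
        for observer in self.observers:
            observer.post_create_table(table_name)

    def delete_table(self, table_name):
        # A pre-hook may veto the operation by raising an exception.
        for observer in self.observers:
            observer.pre_delete_table(table_name)
        self.tables.discard(table_name)


class AuditLog(MasterObserver):
    """Records every table creation after it has happened."""

    def __init__(self):
        self.events = []

    def post_create_table(self, table_name):
        self.events.append(("created", table_name))


class ProtectedTables(MasterObserver):
    """Security-style observer that blocks deletion of selected tables."""

    def __init__(self, protected):
        self.protected = set(protected)

    def pre_delete_table(self, table_name):
        if table_name in self.protected:
            raise PermissionError(f"table {table_name!r} is protected")


master = ToyMaster()
audit = AuditLog()
master.register(audit)
master.register(ProtectedTables(["orders"]))
master.create_table("orders")
```

The point of the design is that the master's own code never changes: behavior is extended by registering observers, and a "pre" hook can stop an operation before it takes effect.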

As with all open source Hadoop technologies, MasterServer operations will likely change over time as the community of engineers works on innovations designed to enhance HBase. As of this writing, however, you now have a fairly thorough list that serves as a high-level reference for the MasterServer.

Finally, one more important point to make about the HBase MasterServer: There can and should be a backup MasterServer in any HBase cluster. Only one MasterServer needs to be active at any given time, so the backup MasterServer exists purely for failover purposes.
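
The active/backup arrangement can be sketched as a tiny election model. In a real cluster this coordination goes through ZooKeeper; the ToyCoordinator class below is only a stand-in, and its names and its "first candidate wins" rule are assumptions made for illustration.

```python
class ToyCoordinator:
    """Stand-in for the coordination service: tracks the waiting masters
    and guarantees that at most one of them is active at a time."""

    def __init__(self):
        self.active = None
        self.candidates = []

    def join(self, master_name):
        """A master process starts up and volunteers to be active."""
        self.candidates.append(master_name)
        self._elect()

    def session_lost(self, master_name):
        """A master process dies or loses its connection."""
        self.candidates.remove(master_name)
        if self.active == master_name:
            self.active = None
        self._elect()

    def _elect(self):
        # Promote the longest-waiting candidate only if nobody is active.
        if self.active is None and self.candidates:
            self.active = self.candidates[0]


coordinator = ToyCoordinator()
coordinator.join("master-a")   # first to arrive becomes active
coordinator.join("master-b")   # second one waits as the backup
```

If master-a's session is then lost, the model promotes master-b; if both are lost, no master is active until one rejoins.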

You may recall that the MasterServer isn’t in the data access path for HBase clients. However, you may also recall that the MasterServer is responsible for actions such as RegionServer failover and load balancing. The good news is that clients can continue to query the HBase cluster if the master goes down, but for normal cluster operations, the master should not remain down for any length of time.
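
Why reads survive a master outage can be shown with one more toy model. The sketch below assumes a made-up two-region table and hypothetical names (ToyCluster, locate, get, create_table). The only idea it carries over from the text is that finding and reading a row involves the catalog and the RegionServers, whereas metadata operations need the master.

```python
import bisect


class ToyCluster:
    """Toy cluster: a catalog of region start keys, the server hosting
    each region, and a flag saying whether a master is currently active."""

    def __init__(self):
        self.master_up = True
        # Region 0 holds row keys below "m"; region 1 holds "m" and above.
        self.region_starts = ["", "m"]
        self.region_servers = ["rs1", "rs2"]
        self.data = {"rs1": {"apple": 1}, "rs2": {"pear": 2}}

    def locate(self, row_key):
        """Catalog lookup: pick the last region whose start key <= row_key."""
        index = bisect.bisect_right(self.region_starts, row_key) - 1
        return self.region_servers[index]

    def get(self, row_key):
        """Read path: catalog lookup, then the hosting server. No master."""
        return self.data[self.locate(row_key)].get(row_key)

    def create_table(self, table_name):
        """Metadata path: this is the kind of operation that needs a master."""
        if not self.master_up:
            raise RuntimeError("no active master: metadata operations unavailable")
        return table_name


cluster = ToyCluster()
cluster.master_up = False          # simulate the master going down
apple_value = cluster.get("apple")  # reads still succeed
```

In the model, reads keep working with the master down, while create_table fails until a master is active again, which mirrors why the master should not stay down for long.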