How to Choose a Hadoop Cluster Architecture

By Dirk deRoos

Hadoop is designed to be deployed on a large cluster of networked computers, featuring master nodes (which host the services that control Hadoop’s storage and processing) and slave nodes (where the data is stored and processed). You can, however, run Hadoop on a single computer, which is a great way to learn the basics of Hadoop by experimenting in a controlled space.

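Hadoop’s two distributed deployment modes, pseudo-distributed mode and fully distributed mode, are described here. (Hadoop also offers a local, or standalone, mode that runs as a single Java process without any Hadoop daemons; it isn’t covered here.)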

Pseudo-distributed mode (single node)

A single-node Hadoop deployment is referred to as running Hadoop in pseudo-distributed mode, where all the Hadoop services, including the master and slave services, run on a single compute node. This kind of deployment is useful for quickly testing applications while you’re developing them without having to worry about using Hadoop cluster resources someone else might need.

It’s also a convenient way to experiment with Hadoop, as most of us don’t have clusters of computers at our disposal.
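
To make the single-node idea concrete, here is a minimal Python sketch that renders the two settings the Apache Hadoop single-node setup guide uses for this mode: fs.defaultFS pointing at a NameNode on localhost, and dfs.replication set to 1 because only one DataNode exists to hold each block. The helper name render_site_xml is hypothetical, and port 9000 simply follows that guide's example; neither is required by Hadoop.

```python
from xml.sax.saxutils import escape


def render_site_xml(properties):
    """Render a dict of Hadoop properties as the text of a *-site.xml file.

    Hypothetical helper for illustration; Hadoop only needs the resulting XML.
    """
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(str(value))}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


# core-site.xml: every client and service finds the NameNode on localhost.
# Port 9000 is an assumption taken from the single-node setup guide's example.
core_site = render_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})

# hdfs-site.xml: with a single DataNode, each block can have only one replica.
hdfs_site = render_site_xml({"dfs.replication": 1})

print(core_site)
print(hdfs_site)
```

The two strings correspond to etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml in a Hadoop installation. Everything points at localhost, which is what makes the deployment "pseudo" distributed: the services still communicate over the network stack, but they never leave the machine.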

Fully distributed mode (a cluster of nodes)

A Hadoop deployment where the Hadoop master and slave services run on a cluster of computers is running in what’s known as fully distributed mode. This is an appropriate mode for production clusters and development clusters. A further distinction can be made here: a development cluster usually has a small number of nodes and is used to prototype the workloads that will eventually run on a production cluster.
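
The difference between a development cluster and a production cluster can be sketched in the same terms. The Python below is an illustration under stated assumptions, not a sizing recommendation: the function plan_cluster, the example.com hostnames, and port 9000 are all hypothetical, and capping replication at the number of slave nodes is a simplifying rule of thumb. What it does rely on is that HDFS replicates each block three times by default and that a fully distributed cluster lists its slave hosts, one per line, in a file (etc/hadoop/workers in Hadoop 3, etc/hadoop/slaves in Hadoop 2).

```python
def plan_cluster(master_host, worker_hosts, port=9000):
    """Sketch the key settings for a fully distributed cluster.

    Hypothetical helper: the names and the port are illustrative assumptions.
    """
    if not worker_hosts:
        raise ValueError("a fully distributed cluster needs at least one slave node")
    return {
        # Clients reach the NameNode on the master node, not on localhost.
        "fs.defaultFS": f"hdfs://{master_host}:{port}",
        # HDFS defaults to 3 replicas; a tiny cluster cannot hold more
        # replicas of a block than it has slave nodes (rule of thumb).
        "dfs.replication": min(3, len(worker_hosts)),
        # Contents of the workers (or slaves) file: one hostname per line.
        "workers_file": "\n".join(worker_hosts) + "\n",
    }


# A small development cluster: one master and two slave nodes.
dev = plan_cluster("master.example.com",
                   ["worker1.example.com", "worker2.example.com"])

# A larger production-style cluster: one master and twenty slave nodes.
prod = plan_cluster("master.example.com",
                    [f"worker{i}.example.com" for i in range(1, 21)])

print(dev["dfs.replication"], prod["dfs.replication"])  # prints: 2 3
```

The shape of the configuration is the same in both cases; only the number of hosts changes, which is why a small development cluster is a reasonable place to prototype workloads before they move to production.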