The Origin and Design of Hadoop - dummies

The Origin and Design of Hadoop

By Dirk deRoos

So what exactly is this thing with the funny name — Hadoop? At its core, Hadoop is a framework for storing data on large clusters of commodity hardware — everyday computer hardware that is affordable and easily available — and running applications against that data. A cluster is a group of interconnected computers (known as nodes) that can work together on the same problem.

Using networks of affordable compute resources to acquire business insight is the key value proposition of Hadoop.

As for that name, Hadoop, don’t look for any major significance there; it’s simply the name that Doug Cutting’s son gave to his stuffed elephant. (Doug Cutting is, of course, the co-creator of Hadoop.) The name is unique and easy to remember — characteristics that made it a great choice.

Hadoop consists of two main components: a distributed processing framework named MapReduce (which is now supported by a component called YARN) and a distributed file system known as the Hadoop distributed file system, or HDFS.

An application that is running on Hadoop gets its work divided among the nodes (machines) in the cluster, and HDFS stores the data that will be processed. A Hadoop cluster can span thousands of machines, where HDFS stores data, and MapReduce jobs do their processing near the data, which keeps I/O costs low. MapReduce is extremely flexible, and enables the development of a wide variety of applications.

As you might have surmised, a Hadoop cluster is a form of compute cluster, a type of cluster that’s used mainly for computational purposes. In a compute cluster, many computers (compute nodes) can share computational workloads and take advantage of a very large aggregate bandwidth across the cluster.

Hadoop clusters typically consist of a few master nodes, which control the storage and processing systems in Hadoop, and many slave nodes, which store all the cluster’s data and is also where the data gets processed.