What Is Hadoop? - dummies

By Lillian Pierson

Hadoop is an open-source data processing tool that was developed by the Apache Software Foundation. Hadoop is currently the go-to program for handling huge volumes and varieties of data because it was designed to make large-scale computing more affordable and flexible. With the arrival of Hadoop, mass data processing has been introduced to significantly more people and more organizations.

Hadoop can offer you a great solution to handle, process, and group mass streams of structured, semi-structured, and unstructured data. By setting up and deploying Hadoop, you get a relatively affordable way to begin using and drawing insights from all of your organization’s data, rather than just continuing to rely solely on that transactional dataset you have sitting over in an old data warehouse somewhere.

Hadoop is one of the most popular programs available for large-scale computing requirements. Hadoop provides a map-and-reduce layer that’s capable of handling the data processing requirements of most big data projects.

Sometimes the data gets too big and fast for even Hadoop to handle. In these cases, organizations are turning to alternative, more-customized MapReduce deployments instead.

Hadoop stores data on clusters of commodity hardware. The hardware in each cluster is networked together and consists of commodity servers: low-cost, generic servers that are modest performers individually but offer powerful computing capabilities when run in parallel across a shared cluster. These commodity servers are also called nodes. Commoditized computing dramatically decreases the costs involved in handling and storing big data.

Hadoop consists of the following two components:

  • A distributed processing framework: Hadoop uses Hadoop MapReduce as its distributed processing framework. In a distributed processing framework, processing tasks are spread across clusters of nodes so that large data volumes can be processed very quickly across the system as a whole.

  • A distributed file system: Hadoop uses the Hadoop Distributed File System (HDFS) as its distributed file system.
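To make the map-and-reduce idea more concrete, here is a minimal sketch of a word count written in plain Python. It does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in memory on one machine. On a real cluster, the map and reduce steps would run on different nodes and read from and write to HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a cluster, each line (or block of lines) could be mapped on a different node.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling word_count(["big data", "Big cluster"]) counts "big" twice and the other two words once each. The point of the pattern is that each map call looks at only one piece of input and each reduce call looks at only one key, which is what allows the work to be spread across many nodes.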

The workloads of applications that run on Hadoop are divided among the nodes of the Hadoop cluster, and then the output is stored in HDFS. A Hadoop cluster can consist of thousands of nodes. To keep the costs of input/output (I/O) processes low, Hadoop MapReduce jobs are performed as close to the data as possible.

This means that map tasks are scheduled, where possible, on the nodes that already hold their input data, and that the reduce task processors are positioned as closely as possible to the outgoing map task data that needs to be processed. This design reduces the amount of data that has to move across the network and helps share the computational load in big data processing.
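The idea of running work close to the data can be sketched as a small placement rule. This is a simplified illustration, not Hadoop's actual scheduler: the function name pick_node and the node names are hypothetical, and a real scheduler weighs more factors than this one does.

```python
def pick_node(block_replicas, free_nodes):
    """Pick a node for a task, preferring one that already stores the input block.

    block_replicas: set of node names that hold a copy of the input block.
    free_nodes: list of node names that can currently accept a task.
    Returns a (node, placement) pair.
    """
    for node in free_nodes:
        if node in block_replicas:
            return node, "local"        # data is already on this node
    if free_nodes:
        return free_nodes[0], "remote"  # data must be copied over the network
    return None, "wait"                 # no capacity right now
```

For example, if a block is stored on node2 and node3, and node1 and node3 are free, the rule picks node3 so that the task reads its input from local disk instead of pulling it across the network.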

Hadoop also supports hierarchical organization. In the original Hadoop design, some nodes are classified as master nodes, and others are categorized as slaves. The master service, known as JobTracker, controls many slave services. Slave services (also called TaskTrackers) are distributed one to each node. The JobTracker controls the TaskTrackers and assigns Hadoop MapReduce tasks to them.
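The master-and-worker relationship can be pictured with a toy model. The classes below borrow the names JobTracker and TaskTracker only as labels; they are not Hadoop's real classes, and the slots parameter, the task names, and the simple round-by-round assignment are assumptions made for this sketch.

```python
from collections import deque


class TaskTracker:
    """Toy stand-in for the worker service that runs on one node."""

    def __init__(self, node, slots=2):
        self.node = node
        self.slots = slots    # assumed limit on tasks running at once
        self.running = []

    def has_free_slot(self):
        return len(self.running) < self.slots


class JobTracker:
    """Toy stand-in for the master service that hands out tasks."""

    def __init__(self, trackers):
        self.trackers = trackers
        self.pending = deque()

    def submit(self, tasks):
        self.pending.extend(tasks)

    def assign(self):
        """Give pending tasks to trackers with free slots; return the assignments."""
        assignments = []
        progress = True
        while self.pending and progress:
            progress = False
            for tracker in self.trackers:
                if self.pending and tracker.has_free_slot():
                    task = self.pending.popleft()
                    tracker.running.append(task)
                    assignments.append((task, tracker.node))
                    progress = True
        return assignments
```

With one tracker that has a single slot and another that has two, submitting four tasks assigns three of them and leaves the fourth pending until a slot frees up. That is the basic shape of a master that tracks worker capacity and hands out work accordingly.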

In a newer version of Hadoop, known as Hadoop 2, a resource manager called Hadoop YARN was added. With respect to MapReduce in Hadoop, YARN acts as an integrated system that performs the resource management and scheduling functions that the JobTracker previously handled itself.

Hadoop MapReduce processes data in batches. Consequently, if you’re working with real-time, streaming data, you won’t be able to rely on Hadoop MapReduce alone to handle your big data issues. That said, it’s very useful for solving many other types of big data problems.