Managing Big Data with Hadoop: HDFS and MapReduce

By Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman

Part of Big Data For Dummies Cheat Sheet

Hadoop, an open-source software framework, uses HDFS (the Hadoop Distributed File System) and MapReduce to analyze big data on clusters of commodity hardware—that is, in a distributed computing environment.

The Hadoop Distributed File System (HDFS) was developed to allow companies to more easily manage huge volumes of data in a simple and pragmatic way. Hadoop allows big problems to be decomposed into smaller elements so that analysis can be done quickly and cost effectively. HDFS is a versatile, resilient, clustered approach to managing files in a big data environment.

HDFS is not the final destination for files. Rather it is a data “service” that offers a unique set of capabilities needed when data volumes and velocity are high.

MapReduce is a software framework that enables developers to write programs that can process massive amounts of unstructured data in parallel across a distributed group of processors. MapReduce was designed by Google as a way of efficiently executing a set of functions against a large amount of data in batch mode.

The “map” component distributes the programming problem or tasks across a large number of systems and handles the placement of the tasks in a way that balances the load and manages recovery from failures. After the distributed computation is completed, another function called “reduce” aggregates all the elements back together to provide a result. An example of MapReduce usage would be to determine how many pages of a book are written in each of 50 different languages.