Big Data For Dummies
Book image
Explore Book Buy On Amazon

Hadoop, an open-source software framework, uses HDFS (the Hadoop Distributed File System) and MapReduce to analyze big data on clusters of commodity hardware—that is, in a distributed computing environment.

The Hadoop Distributed File System (HDFS) was developed to allow companies to more easily manage huge volumes of data in a simple and pragmatic way. Hadoop allows big problems to be decomposed into smaller elements so that analysis can be done quickly and cost effectively. HDFS is a versatile, resilient, clustered approach to managing files in a big data environment.

HDFS is not the final destination for files. Rather it is a data "service" that offers a unique set of capabilities needed when data volumes and velocity are high.

MapReduce is a software framework that enables developers to write programs that can process massive amounts of unstructured data in parallel across a distributed group of processors. MapReduce was designed by Google as a way of efficiently executing a set of functions against a large amount of data in batch mode.

The "map" component distributes the programming problem or tasks across a large number of systems and handles the placement of the tasks in a way that balances the load and manages recovery from failures. After the distributed computation is completed, another function called "reduce" aggregates all the elements back together to provide a result. An example of MapReduce usage would be to determine how many pages of a book are written in each of 50 different languages.

About This Article

This article is from the book:

About the book authors:

Judith Hurwitz is an expert in cloud computing, information management, and business strategy. Alan Nugent has extensive experience in cloud-based big data solutions. Dr. Fern Halper specializes in big data and analytics. Marcia Kaufman specializes in cloud infrastructure, information management, and analytics.

This article can be found in the category: