Input Splits in Hadoop’s MapReduce
The way HDFS has been set up, it breaks down very large files into large blocks (for example, measuring 128MB), and stores three copies of these blocks on different nodes in the cluster. HDFS has no awareness of the content of these files.
In YARN, when a MapReduce job is started, the Resource Manager (the cluster resource management and job scheduling facility) creates an Application Master daemon to look after the lifecycle of the job. (In Hadoop 1, the JobTracker monitored individual jobs as well as handling job scheduling and cluster resource management.)
One of the first things the Application Master does is determine which file blocks are needed for processing. The Application Master requests details from the NameNode on where the replicas of the needed data blocks are stored. Using the location data for the file blocks, the Application Master makes requests to the Resource Manager to have map tasks process specific blocks on the slave nodes where they’re stored.
The key to efficient MapReduce processing is that, wherever possible, data is processed locally — on the slave node where it’s stored.
Before looking at how the data blocks are processed, you need to look more closely at how Hadoop stores data. In Hadoop, files are composed of individual records, which are ultimately processed one-by-one by mapper tasks.
For example, the sample data set contains information about completed flights within the United States between 1987 and 2008.
To download the sample data set, open the Firefox browser from within the VM, and go to the dataexpo page.
You have one large file for each year, and within every file, each individual line represents a single flight. In other words, one line represents one record. Now, remember that the block size for the Hadoop cluster is 64MB, which means that the light data files are broken into chunks of exactly 64MB.
Do you see the problem? If each map task processes all records in a specific data block, what happens to those records that span block boundaries? File blocks are exactly 64MB (or whatever you set the block size to be), and because HDFS has no conception of what’s inside the file blocks, it can’t gauge when a record might spill over into another block.
To solve this problem, Hadoop uses a logical representation of the data stored in file blocks, known as input splits. When a MapReduce job client calculates the input splits, it figures out where the first whole record in a block begins and where the last record in the block ends.
In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record.
The figure shows this relationship between data blocks and input splits.
You can configure the Application Master daemon (or JobTracker, if you’re in Hadoop 1) to calculate the input splits instead of the job client, which would be faster for jobs processing a large number of data blocks.
MapReduce data processing is driven by this concept of input splits. The number of input splits that are calculated for a specific application determines the number of mapper tasks. Each of these mapper tasks is assigned, where possible, to a slave node where the input split is stored. The Resource Manager (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input splits are processed locally.