How to Launch a YARN-Based Application - dummies

How to Launch a YARN-Based Application

By Dirk deRoos

To show how the various YARN (Yet Another Resource Negotiator) components work together, you can walk through the execution of an application. For the sake of argument, it can be a MapReduce application, with the JobTracker and TaskTracker architecture.

Just remember that, with YARN, it can be any kind of application for which there’s an application framework. The figure shows the interactions, and the prose account is set down in the following step list:

  1. The client application submits an application request to the Resource Manager.

  2. The Resource Manager asks a Node Manager to create an Application Master instance for this application. The Node Manager gets a container for it and starts it up.

  3. This new Application Master initializes itself by registering itself with the Resource Manager.

  4. The Application Master figures out how many processing resources are needed to execute the entire application.

    This is done by requesting from the NameNode the names and locations of the files and data blocks the application needs and calculating how many map tasks and reduce tasks are needed to process all this data.

  5. The Application Master then requests the necessary resources from the Resource Manager.

    The Application Master sends heartbeat messages to the Resource Manager throughout its lifetime, with a standing list of requested resources and any changes (for example, a kill request).

  6. The Resource Manager accepts the resource request and queues up the specific resource requests alongside all the other resource requests that are already scheduled.

  7. As the requested resources become available on the slave nodes, the Resource Manager grants the Application Master leases for containers on specific slave nodes.

  8. The Application Master requests the assigned container from the Node Manager and sends it a Container Launch Context (CLC).

    The CLC includes everything the application task needs in order to run: environment variables, authentication tokens, local resources needed at runtime (for example, additional data files, or application logic in JARs), and the command string necessary to start the actual process. The Node Manager then creates the requested container process and starts it.

  9. The application executes while the container processes are running.

    The Application Master monitors their progress, and in the event of a container failure or a node failure, the task is restarted on the next available slot. If the same task fails after four attempts (a default value which can be customized), the whole job will fail. During this phase, the Application Master also communicates directly with the client to respond to status requests.

  10. Also, while containers are running, the Resource Manager can send a kill order to the Node Manager to terminate a specific container.

    This can be as a result of a scheduling priority change or a normal operation, such as the application itself already being completed.

  11. In the case of MapReduce applications, after the map tasks are finished, the Application Master requests resources for a round of reduce tasks to process the interim result sets from the map tasks.

  12. When all tasks are complete, the Application Master sends the result set to the client application, informs the Resource Manager that the application has successfully completed, deregisters itself from the Resource Manager, and shuts itself down.

    image0.jpg

Like the JobTracker and TaskTracker daemons and processing slots in Hadoop 1, all of the YARN daemons and containers are Java processes, running in JVMs. With YARN, you’re no longer required to define how many map and reduce slots you need — you simply decide how much memory map and reduce tasks can have. The Resource Manager will allocate containers for map or reduce tasks on the cluster based on how much memory is available.

When you’re writing Hadoop applications, you don’t need to worry about requesting resources and monitoring containers. Whatever application framework you’re using does all that for you. It’s always a good idea, however, to understand what goes on when your applications are running on the cluster. This knowledge can help you immensely when you’re monitoring application progress or debugging a failed task.