Developing Oozie Workflows in Hadoop

By Dirk deRoos

Oozie workflows are, at their core, directed graphs, where you can define actions (Hadoop applications) and data flow, but with no looping — meaning you can’t define a structure where you’d run a specific operation over and over until some condition is met (a for loop, for example).

Oozie workflows are quite flexible in that you can define condition-based decisions and forked paths for parallel execution. You can also execute a wide range of actions.

image0.jpg

In this figure, you see a workflow showing the basic capabilities of Oozie workflows. First, a Pig script is run, and is immediately followed by a decision tree. Depending on the state of the output, the control flow can either go directly to an HDFS (Hadoop Distributed File System) file operation (for example, a copyToLocal operation) or to a fork action.

If the control flow passes to the fork action, two jobs are run concurrently: a MapReduce job, and a Hive query. The control flow then goes to the HDFS operation once both the MapReduce job and Hive query are finished running. After the HDFS operation, the workflow is complete.

Oozie workflow definitions are written in XML, based on the Hadoop Process Definition Language (hPDL) schema. This particular schema is, in turn, based on the XML Process Definition Language (XPDL) schema, which is a product-independent standard for modeling business process definitions.

An Oozie workflow is composed of a series of actions, which are encoded by XML nodes. There are different kinds of nodes, representing different kinds of actions or control flow directives. Each Oozie workflow has its own XML file, where every node and its interconnections are defined.

Workflow nodes all require unique identifiers because they’re used to identify the next node to be processed in the workflow. This means that the order in which the actions are executed depends on where an action’s node appears in the workflow XML. To see how this concept would look, check out the following listing, which shows an example of the basic structure of an Oozie workflow’s XML file.

<workflow-app name="SampleWorkflow" xmlns="uri:oozie:workflow:0.1">
  <start to="firstJob"/>
  <action name="firstJob">
    <pig>...</pig>
    <ok to="secondJob"/>
    <error to="kill"/>
  </action>
  <action name="secondJob">
    <map-reduce>...</map-reduce>
    <ok to="end" />
    <error to="kill" />
  </action>
  <end name="end"/>
  <kill name="kill">
    <message>"Killed job."</message>
  </kill>
</workflow-app>

In this example, aside from the start, end, and kill nodes, you have two action nodes. Each action node represents an application or a command being executed.