
Configuring Oozie Workflows

By Dirk deRoos

As a workflow engine, Oozie enables you to run a set of Hadoop applications in a specified sequence known as a workflow. You can configure Oozie workflows in three complementary ways, depending on your particular circumstances. You can use

  • The config-default.xml file: Defines parameters that don’t change for the workflow.

  • The job.properties file: Defines parameters that are common for a particular deployment of the workflow. Definitions here override those made in the config-default.xml file.

  • The command-line parameters: Defines parameters that are specific to a particular invocation of the workflow. Definitions here override those made in the job.properties file and the config-default.xml file. (The sketch after this list shows how the three layers fit together.)
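
To make the layering concrete, here’s a minimal sketch of the last two mechanisms. Everything in it (the property names, host names, and paths) is illustrative rather than taken from the listings that follow:

# job.properties: settings for one particular deployment
nameNode=hdfs://serverName:8020
jobTracker=serverName:8021
oozie.wf.application.path=${nameNode}/user/dirk/sampleWorkflow
jobInput=/usr/dirk/flightdata
jobOutput=/usr/dirk/flightmiles

# Submit the workflow with those properties . . .
oozie job -oozie http://serverName:11000/oozie -config job.properties -run

# . . . or override a single property for this invocation only
oozie job -oozie http://serverName:11000/oozie -config job.properties -DjobOutput=/usr/dirk/test-output -run

The config-default.xml file, if you use one, lives in the workflow application’s HDFS directory next to workflow.xml, and it uses the same <configuration> and <property> XML format you see in the listings that follow.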

The configuration details differ depending on the action they’re associated with. For example, as you can see in the map-reduce action in the following listing, there’s quite a lot to configure:

<workflow-app name="SampleWorkflow" xmlns="uri:oozie:workflow:0.1">
   ...
   <action name="firstJob">
      <map-reduce>
         <!-- Addresses of the JobTracker and NameNode -->
         <job-tracker>serverName:8021</job-tracker>
         <name-node>serverName:8020</name-node>
         <!-- Delete leftover output from earlier runs before starting -->
         <prepare>
            <delete path="hdfs://serverName:8020/usr/sample/output-data"/>
         </prepare>
         <!-- Extra Hadoop settings loaded from a separate file -->
         <job-xml>jobConfig.xml</job-xml>
         <configuration>
           ...
            <property>
               <name>mapreduce.map.class</name>
               <value>dummies.oozie.FlightMilesMapper</value>
            </property>
            <property>
               <name>mapreduce.reduce.class</name>
               <value>dummies.oozie.FlightMilesReducer</value>
            </property>
            <property>
               <name>mapred.mapoutput.key.class</name>
               <value>org.apache.hadoop.io.Text</value>
            </property>
            <property>
               <name>mapred.mapoutput.value.class</name>
               <value>org.apache.hadoop.io.IntWritable</value>
            </property>
            <property>
               <name>mapred.output.key.class</name>
               <value>org.apache.hadoop.io.Text</value>
            </property>
            <property>
               <name>mapred.output.value.class</name>
               <value>org.apache.hadoop.io.IntWritable</value>
            </property>
            <property>
               <name>mapred.input.dir</name>
               <value>/usr/dirk/flightdata</value>
            </property>
            <property>
               <name>mapred.output.dir</name>
               <value>/usr/dirk/flightmiles</value>
            </property>
            ...
         </configuration>
      </map-reduce>
      <ok to="end"/>
      <error to="end"/>
   </action>
   ...
</workflow-app>
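
The job-xml element points to a separate Hadoop configuration file that Oozie loads before applying the inline configuration block; if both define the same property, the inline value wins. Here’s a minimal sketch of what a file like jobConfig.xml might contain (the queue property is just an illustration, not something the workflow above requires):

<?xml version="1.0"?>
<configuration>
   <property>
      <name>mapred.job.queue.name</name>
      <value>default</value>
   </property>
</configuration>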

Contrast that with a file system (fs) action, like the one shown here, which requires much less configuration:

<workflow-app name="SampleWorkflow" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="firstJob">
        <fs>
            <delete path="hdfs://servername:8020/usr/sample/temp-data"/>
            <mkdir path="archives/${wf:id()}"/>
            <move source="${jobInput}" target="archives/${wf:id()}/processed-input"/>
            <chmod path="${jobOutput}" permissions="-rwxrw-rw-" dir-files="true">
                <recursive/>
            </chmod>
        </fs>
        <ok to="end"/>
        <error to="end"/>
    </action>
    ...
</workflow-app>
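
Two kinds of substitution are at work in that fs action. ${wf:id()} is a built-in Oozie EL function that returns the current workflow job’s ID, so every run archives into its own directory. ${jobInput} and ${jobOutput}, on the other hand, are ordinary workflow parameters that must be supplied through one of the three configuration layers described earlier; here’s a sketch with made-up paths:

jobInput=/usr/sample/incoming-data
jobOutput=/usr/sample/output-data

If a parameter like these is left undefined, Oozie rejects the job because the variable can’t be resolved.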