How to Get Apache Oozie Set Up in Hadoop

By Dirk deRoos

Apache Oozie is included in every major Hadoop distribution, including Apache Bigtop. In your Hadoop cluster, install the Oozie server on an edge node, where you would also run other client applications against the cluster’s data, as shown.

image0.jpg

Edge nodes are designed to be a gateway between the outside network and the Hadoop cluster. This makes them ideal not only for data transfer technologies (Flume, for example) but also for client applications and other application infrastructure like Oozie. Oozie doesn’t need a dedicated server and can easily coexist with other services that are well suited to edge nodes, like Pig and Hive.

After Oozie is deployed, you’re ready to start the Oozie server. Oozie’s infrastructure is installed in the $OOZIE_HOME directory. From there, run the oozie-start.sh command to start the server. (As you might expect, you stop the server by running oozie-stop.sh.) You can check the status of your Oozie instance by running the following command (the oozie client locates the server through the OOZIE_URL environment variable or the -oozie option):

oozie admin -status
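
If you want to script that check, for example to confirm the server is up before submitting jobs, a minimal Python sketch might look like the following. It assumes that the oozie client is on your PATH, that the server listens at http://localhost:11000/oozie (a common default, but verify yours), and that the command prints a line of the form "System mode: NORMAL". The function names are illustrative only.

```python
import subprocess

# Assumed server URL; 11000 is a common default port, but verify your setup.
OOZIE_URL = "http://localhost:11000/oozie"


def parse_system_mode(output):
    """Return the mode from a line like 'System mode: NORMAL', or None."""
    for line in output.splitlines():
        if line.strip().lower().startswith("system mode:"):
            return line.split(":", 1)[1].strip()
    return None


def oozie_system_mode(url=OOZIE_URL):
    """Run 'oozie admin -oozie <url> -status' and return the reported mode."""
    result = subprocess.run(
        ["oozie", "admin", "-oozie", url, "-status"],
        capture_output=True, text=True, check=True,
    )
    return parse_system_mode(result.stdout)
```

Calling oozie_system_mode() on the edge node returns the mode string the server reports, or raises an exception if the command fails.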

After you have the Oozie server deployed and started, you can catalog and run your various workflow, coordinator, or bundle jobs. As you work with your jobs, Oozie stores the catalog definitions (the data describing all the Oozie objects: workflow, coordinator, and bundle jobs) along with their states in a dedicated database.

By default, Oozie is configured to use the embedded Derby database, but you can use MySQL, Oracle, or PostgreSQL, if you need to.
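
Switching databases comes down to a handful of JDBC settings in Oozie’s site configuration file. The sketch below generates those property entries for a MySQL database. The property names are the JPAService settings as I understand them, so check them against the oozie-default.xml that ships with your version; the host, schema, user name, and password are hypothetical placeholders, and jdbc_properties is an illustrative helper, not part of Oozie.

```python
import xml.etree.ElementTree as ET


def jdbc_properties(driver, url, username, password):
    """Build <property> elements for Oozie's JPAService JDBC settings."""
    settings = {
        "oozie.service.JPAService.jdbc.driver": driver,
        "oozie.service.JPAService.jdbc.url": url,
        "oozie.service.JPAService.jdbc.username": username,
        "oozie.service.JPAService.jdbc.password": password,
    }
    config = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return config


# Hypothetical MySQL values; substitute your own host, schema, and credentials.
config = jdbc_properties(
    driver="com.mysql.jdbc.Driver",
    url="jdbc:mysql://localhost:3306/oozie",
    username="oozie",
    password="change-me",
)
print(ET.tostring(config, encoding="unicode"))
```

The printed entries would go into oozie-site.xml, and the matching JDBC driver has to be available to the Oozie server.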

You have four options for interacting with the Oozie server:

  • The Java API: This option is useful in situations where you have your own scheduling code in Java applications, and you need to control the execution of your Oozie workflows, coordinators, or bundles from within your application.

  • The REST API: Like the Java API, this option works well when your own scheduling code needs to control Oozie workflows, coordinators, or bundles. It’s also the one to choose if you want to build your own interface, or extend an existing one, for administering the Oozie server.

  • The command line interface (CLI): This option is the traditional Linux command line for Oozie, based on the oozie command.

  • The Oozie Web Console: Okay, maybe you can’t do much interacting here, but the Oozie Web Console gives you a (read-only) view of the state of the Oozie server, which is useful for monitoring your Oozie jobs.

    image1.jpg
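
To give a feel for the REST option, here is a minimal Python sketch that lists recent workflow jobs. It assumes the server is reachable at http://localhost:11000/oozie without authentication, and that the v1 jobs endpoint returns JSON with a "workflows" list whose entries carry "id" and "status" fields, as I recall from the Oozie web services documentation; verify both against your version. The function names are illustrative, and a Kerberos-secured cluster would need authentication that this sketch doesn’t handle.

```python
import json
import urllib.parse
import urllib.request

OOZIE_URL = "http://localhost:11000/oozie"  # assumed server address


def jobs_url(base_url, job_type="wf", length=10):
    """Build the URL for listing jobs through the v1 web services API."""
    query = urllib.parse.urlencode({"jobtype": job_type, "len": length})
    return f"{base_url}/v1/jobs?{query}"


def summarize_workflows(payload):
    """Extract (id, status) pairs from a job-listing JSON document."""
    data = json.loads(payload)
    return [(job["id"], job["status"]) for job in data.get("workflows", [])]


def list_workflows(base_url=OOZIE_URL):
    """Fetch recent workflow jobs from a running Oozie server."""
    with urllib.request.urlopen(jobs_url(base_url)) as response:
        return summarize_workflows(response.read().decode("utf-8"))
```

Calling list_workflows() against a running server returns a list of (job ID, status) pairs that your own scheduling or monitoring code can act on.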

Hue, a Hadoop administration interface, provides another tool for working with Oozie. Oozie workflows, coordinators, and bundles are all defined using XML, which can be tedious to edit, especially for complex situations. Hue provides a GUI designer tool to graphically build workflows and other Oozie objects.

Under the covers, Oozie includes an embedded Tomcat web server, which serves the web console and handles the requests that come in from the APIs and the CLI.