Pig Script Interfaces in Hadoop

By Dirk deRoos

The Pig programming language is designed to handle any kind of data tossed its way: structured, semi-structured, unstructured, you name it. Pig programs can be packaged in three different ways:

  • Script: This method is nothing more than a file containing Pig Latin commands, identified by the .pig suffix (FlightData.pig, for example). Ending your Pig program with the .pig extension is a convention but not required. The commands are interpreted by the Pig Latin compiler and executed in the order determined by the Pig optimizer.

  • Grunt: Grunt acts as a command interpreter where you can interactively enter Pig Latin at the Grunt command line and immediately see the response. This method is helpful for prototyping during initial development and with what-if scenarios.

  • Embedded: Pig Latin statements can be executed within Java, Python, or JavaScript programs.
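As a rough sketch of the Script option above, the following shell snippet writes a small Pig Latin program to a file named FlightData.pig. The input file flights.csv, its two-column layout, and the relation names are assumptions made up for illustration; they are not taken from any real data set.

```shell
# Write a minimal Pig Latin script to disk.
# The input path, schema, and relation names are hypothetical.
cat > FlightData.pig <<'EOF'
-- Load a comma-separated file with an assumed two-field schema.
flights = LOAD 'flights.csv' USING PigStorage(',')
          AS (carrier:chararray, distance:int);
-- Group the records by carrier and total the miles for each group.
by_carrier = GROUP flights BY carrier;
miles = FOREACH by_carrier GENERATE group AS carrier,
        SUM(flights.distance) AS total_miles;
-- Print the result to the console.
DUMP miles;
EOF
```

The same statements could also be typed one at a time at the Grunt prompt, which is often a convenient way to try them out before saving them as a script.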

Pig scripts, commands entered in the Grunt shell, and embedded Pig programs can all run in either Local mode or MapReduce mode.

The Grunt shell provides an interactive environment where you can submit Pig commands or run Pig scripts. To start the Grunt shell in Interactive mode, just enter the command pig at your shell prompt.

To specify whether a script or the Grunt shell runs locally or in Hadoop mode, pass the mode to the -x flag of the pig command. The following example shows how you’d run your Pig script in local mode:

pig -x local milesPerCarrier.pig

Here’s how you’d run the Pig script in Hadoop mode, which is the default if you don’t specify the flag:

pig -x mapreduce milesPerCarrier.pig

By default, when you run the pig command without any parameters, it starts the Grunt shell in Hadoop mode. If you want to start the Grunt shell in local mode, just add the -x local flag to the command. Here is an example:

pig -x local
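
The mode choices described above can be collected in a small shell helper. This is only a sketch: pig_command is a hypothetical function name, it prints the command line rather than running it, and it accepts only the two modes named in this article.

```shell
# pig_command [MODE] [SCRIPT]: print the pig invocation for a given mode.
# MODE defaults to mapreduce, matching the default described above.
# With no SCRIPT, the printed command would start the Grunt shell.
pig_command() {
  mode="${1:-mapreduce}"
  script="${2:-}"
  case "$mode" in
    local|mapreduce) ;;                       # the two modes covered here
    *) echo "unsupported mode: $mode" >&2
       return 1 ;;
  esac
  if [ -n "$script" ]; then
    echo "pig -x $mode $script"
  else
    echo "pig -x $mode"
  fi
}

pig_command local milesPerCarrier.pig      # script in local mode
pig_command mapreduce milesPerCarrier.pig  # script in Hadoop mode
pig_command local                          # Grunt shell in local mode
```

Printing the command instead of executing it keeps the helper safe to try on a machine without Pig installed; on a configured system, the printed line can be run as is.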