The Pig Latin Application Flow in Hadoop

By Dirk deRoos

At its core, Pig Latin is a dataflow language, where you define a data stream and a series of transformations that are applied to the data as it flows through your application. This is in contrast to a control flow language (like C or Java), where you write a series of instructions.

In control flow languages, you use constructs like loops and conditional logic (like an if statement). You won’t find loops and if statements in Pig Latin.

If you need some convincing that working with Pig is a significantly easier row to hoe than having to write Map and Reduce programs, start by taking a look at some real Pig syntax:

A = LOAD 'data_file.txt';
...
B = GROUP ...;
...
C = FILTER ...;
...
DUMP B;
...
STORE C INTO 'Results';

Some of the text in this example actually looks like English, right? Not too scary, at least at this point. Looking at each line in turn, you can see the basic flow of a Pig program. (Note that this code can either be part of a script or issued on the interactive shell called Grunt.)

  1. Load: You first load (LOAD) the data you want to manipulate.

    As in a typical MapReduce job, that data is stored in HDFS. For a Pig program to access the data, you first tell Pig what file or files to use. For that task, you use the LOAD ‘data_file’ command.

    Here, ‘data_file’ can specify either an HDFS file or a directory. If a directory is specified, all files in that directory are loaded into the program.

    If the data is stored in a file format that isn’t natively accessible to Pig, you can add a USING clause to the LOAD statement to specify a user-defined load function that can read in (and interpret) the data.

  2. Transform: You run the data through a set of transformations that, way under the hood and far removed from anything you have to concern yourself with, are translated into a set of Map and Reduce tasks.

    The transformation logic is where all the data manipulation happens. Here, you can FILTER out rows that aren’t of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and do much, much more.

  3. Dump or store: Finally, you dump (DUMP) the results to the screen or store (STORE) the results in a file somewhere.

    You would typically use the DUMP command to send the output to the screen while you debug your programs. When your program goes into production, you simply change the DUMP call to a STORE call so that any results from running your programs are stored in a file for further processing or analysis. (A worked sketch that pulls all three steps together follows this list.)
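To see how the three steps fit together, here is a minimal sketch of a complete Pig Latin script. The file name, field names, and comma delimiter are made up for illustration; PigStorage is Pig’s built-in function for loading and storing delimited text.

-- Load comma-delimited rows from HDFS and assign a schema
-- (the file name and field names here are hypothetical)
A = LOAD 'sales_data.txt' USING PigStorage(',')
    AS (store:chararray, product:chararray, amount:int);

-- Transform: filter out rows that are not of interest ...
B = FILTER A BY amount > 0;

-- ... group the remaining rows by store ...
C = GROUP B BY store;

-- ... and build an aggregation over each group
D = FOREACH C GENERATE group AS store, SUM(B.amount) AS total_amount;

-- While debugging, dump the results to the screen
DUMP D;

-- In production, store the results in HDFS instead
STORE D INTO 'Results' USING PigStorage(',');

When a script like this runs, Pig translates the FILTER, GROUP, and FOREACH steps into one or more Map and Reduce jobs behind the scenes; you never write the Map and Reduce code yourself.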