Pig Latin in Hadoop’s Pig Programs

By Dirk deRoos

Pig Latin is the language for Pig programs. Pig translates Pig Latin scripts into MapReduce jobs that can be executed within a Hadoop cluster. When coming up with Pig Latin, the development team followed three key design principles:

  • Keep it simple. Pig Latin provides a streamlined method for interacting with Java MapReduce. It’s an abstraction, in other words, that simplifies the creation of parallel programs on the Hadoop cluster for data flows and analysis. Complex tasks may require a series of interrelated data transformations — such series are encoded as data flow sequences.

    Writing data transformation and flows as Pig Latin scripts instead of Java MapReduce programs makes these programs easier to write, understand, and maintain because a) you don’t have to write the job in Java, b) you don’t have to think in terms of MapReduce, and c) you don’t need to come up with custom code to support rich data types.

    Pig Latin provides a simpler language to exploit your Hadoop cluster, thus making it easier for more people to leverage the power of Hadoop and become productive sooner.

  • Make it smart. You may recall that the Pig Latin Compiler does the work of transforming a Pig Latin program into a series of Java MapReduce jobs. The trick is to make sure that the compiler can optimize the execution of these Java MapReduce jobs automatically, allowing the user to focus on semantics rather than on how to optimize and access the data.

    For you SQL types out there, this discussion will sound familiar. SQL is a declarative language that you use to access structured data stored in an RDBMS. The RDBMS engine parses the query, consults statistics about the underlying tables, and generates a set of candidate data access plans; its cost-based optimizer then chooses the most efficient plan for execution.

  • Don’t limit development. Make Pig extensible so that developers can add functions to address their particular business problems.
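To make the "keep it simple" principle concrete, here is a minimal sketch of a Pig Latin data flow. The file paths and field names are hypothetical; the point is that each statement describes one transformation, with no Java or MapReduce code in sight:

```
-- Hypothetical example: the HDFS paths and field names are assumptions.
-- Load tab-delimited web log records, keep only error hits,
-- and count errors per user.
logs    = LOAD '/data/weblogs' AS (user:chararray, url:chararray, time:long);
errors  = FILTER logs BY url MATCHES '.*error.*';
by_user = GROUP errors BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(errors) AS hits;
STORE counts INTO '/output/error_counts';
```

Behind the scenes, the Pig Latin compiler turns this handful of statements into one or more optimized MapReduce jobs, which is exactly the "make it smart" principle at work.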

Traditional RDBMS data warehouses make use of the ETL data processing pattern, where you extract data from outside sources, transform it to fit your operational needs, and then load it into the end target, whether that's an operational data store, a data warehouse, or another kind of database.

However, with big data, you typically want to reduce the amount of data you have moving about, so you end up bringing the processing to the data itself.

The language for Pig data flows, therefore, takes a pass on the old ETL approach, and goes with ELT instead: Extract the data from your various sources, load it into HDFS, and then transform it as necessary to prepare the data for further analysis.
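The ELT pattern described above can be sketched in Pig Latin as follows. The paths, delimiter, and field names are assumptions for illustration; the key idea is that the data is first loaded into HDFS as-is, and the transformation then runs where the data already lives:

```
-- ELT sketch (hypothetical paths and fields): raw data has already been
-- extracted and loaded into HDFS; Pig performs the "T" step in place.
raw     = LOAD '/landing/sales.csv' USING PigStorage(',')
              AS (id:int, region:chararray, amount:double);
cleaned = FILTER raw BY amount > 0;
regions = GROUP cleaned BY region;
totals  = FOREACH regions GENERATE group AS region,
              SUM(cleaned.amount) AS total_sales;
STORE totals INTO '/curated/sales_by_region';
```

No data leaves the cluster during the transformation; the processing is brought to the data rather than the other way around.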