Hadoop’s Pig Data Types and Syntax
Pig’s data types make up the data model for how Pig thinks about the structure of the data it is processing. With Pig, the data model gets defined when the data is loaded. Any data you load into Pig from disk is going to have a particular schema and structure, and Pig needs to understand that structure; as part of the loading step, the data is automatically mapped onto Pig’s data model.
Luckily for you, the Pig data model is rich enough to handle almost anything thrown its way, including table-like structures and nested hierarchical data structures. In general terms, though, Pig data types can be broken into two categories: scalar types and complex types. Scalar types contain a single value, whereas complex types contain other types, such as the Tuple, Bag, and Map types listed below.
Pig Latin has these four types in its data model:
Atom: An atom is any single value, such as a string or a number (‘Diego’, for example). Pig’s atomic values are scalar types familiar from most programming languages: int, long, float, and double, plus chararray (a string) and bytearray (a binary blob).
Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type (‘Diego’, ‘Gomez’, or 6, for example). Think of a tuple as a row in a table.
Bag: A bag is a collection of non-unique tuples. The schema of the bag is flexible — each tuple in the collection can contain an arbitrary number of fields, and each field can be of any type.
Map: A map is a collection of key-value pairs. The key must be unique and must be a chararray; the value can be of any type.
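To make the model concrete, here is a sketch of a LOAD statement whose schema uses all four types. The file name and field names are illustrative assumptions, not from the text:

```
-- Hypothetical data: each record holds an atom, a tuple,
-- a bag of tuples, and a map.
customers = LOAD 'customers.txt' AS (
    name:chararray,                                   -- atom
    address:tuple(street:chararray, city:chararray),  -- tuple
    orders:bag{order:tuple(id:long, total:double)},   -- bag of tuples
    attributes:map[chararray]                         -- map with chararray values
);
```

The bag lets each customer carry any number of orders, and the map holds optional attributes without forcing them into fixed columns.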
The figure offers some fine examples of Tuple, Bag, and Map data types, as well.
The value of all these types can also be null. The semantics for null are similar to those used in SQL: the concept of null in Pig means that the value is unknown. Nulls can show up in the data in cases where values are unreadable or unrecognizable — for example, if you were to declare the wrong data type in the LOAD statement’s schema.
Null could be used as a placeholder until data is added or as a value for a field that is optional.
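For example, a value that cannot be cast to its declared type during loading comes through as null, and, as in SQL, you test for it with IS NULL / IS NOT NULL rather than an equality check. The relation and field names here are illustrative:

```
-- 'age' becomes null for any row whose value could not be cast to int.
users = LOAD 'users.txt' AS (name:chararray, age:int);

-- Keep only the records where age is known.
known = FILTER users BY age IS NOT NULL;
```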
Pig Latin has a simple syntax with powerful semantics that you’ll use to carry out two primary operations: accessing and transforming data.
In a Hadoop context, accessing data means allowing developers to load, store, and stream data, whereas transforming data means taking advantage of Pig’s ability to group, join, combine, split, filter, and sort data. The table gives an overview of the operators associated with each operation.
| Operation | Operator | Description |
| --- | --- | --- |
| Data Access | LOAD/STORE | Read and write data to the file system |
| | DUMP | Write output to standard output (stdout) |
| | STREAM | Send all records through an external binary |
| Transformations | FOREACH | Apply an expression to each record and output one or more records |
| | FILTER | Apply a predicate and remove records that don’t meet the condition |
| | GROUP/COGROUP | Aggregate records with the same key from one or more inputs |
| | JOIN | Join two or more records based on a condition |
| | CROSS | Compute the Cartesian product of two or more inputs |
| | ORDER | Sort records based on a key |
| | DISTINCT | Remove duplicate records |
| | UNION | Merge two data sets |
| | SPLIT | Divide data into two or more bags based on a predicate |
| | LIMIT | Limit the number of records returned |
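As a sketch of how several of these operators combine in practice (the file, field names, and logic are hypothetical):

```
-- Load web logs, drop empty requests, and report the
-- ten users who transferred the most bytes.
logs    = LOAD 'web_logs.txt' AS (user:chararray, url:chararray, bytes:long);
hits    = FILTER logs BY bytes > 0;
by_user = GROUP hits BY user;
totals  = FOREACH by_user GENERATE group AS user, SUM(hits.bytes) AS total_bytes;
ranked  = ORDER totals BY total_bytes DESC;
top10   = LIMIT ranked 10;
STORE top10 INTO 'top_users';
```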
Pig also provides a few operators that are helpful for debugging and troubleshooting, as shown:
| Operation | Operator | Description |
| --- | --- | --- |
| Debug | DESCRIBE | Return the schema of a relation |
| | DUMP | Dump the contents of a relation to the screen |
| | EXPLAIN | Display the MapReduce execution plans |
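A quick sketch of the debugging operators in use, with an assumed relation:

```
-- Hypothetical relation used only to demonstrate the operators.
sales = LOAD 'sales.txt' AS (region:chararray, amount:double);

DESCRIBE sales;   -- prints the schema of the sales relation
EXPLAIN sales;    -- shows the logical, physical, and MapReduce plans
DUMP sales;       -- runs the script and writes the records to the screen
```

Note that DUMP triggers actual execution, whereas DESCRIBE and EXPLAIN only inspect the plan.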
Part of the paradigm shift of Hadoop is that you apply your schema at read time instead of at write time. According to the old way of doing things — the RDBMS way — when you load data into your database system, you must load it into a well-defined set of tables, so the schema is enforced as the data is written. Hadoop allows you to store all that raw data up front and apply the schema when you read it.
With Pig, you do this during the loading of the data, with the help of the LOAD operator.
The optional USING clause defines how to map the data structure within the file to the Pig data model — in this case, PigStorage(), a function that parses delimited text files. (This part of the USING clause is often referred to as a load function, or LoadFunc, and works in a fashion similar to a custom deserializer.)
The optional AS clause defines a schema for the data that is being mapped. If you omit the USING clause, the default load function, PigStorage, expects plain text that is tab-delimited. If you omit the AS clause, no schema is defined, so the fields have no names and must be referenced by position instead.
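The contrast looks like this in practice; the file name, delimiter, and fields are assumptions for illustration:

```
-- With USING and AS: comma-delimited input, named and typed fields.
movies = LOAD 'movies.csv' USING PigStorage(',')
         AS (title:chararray, year:int, rating:float);
good   = FILTER movies BY rating > 4.0;

-- Without AS: no schema, so fields are referenced by position
-- ($0, $1, $2, ...) and cast explicitly where a type is needed.
raw    = LOAD 'movies.csv' USING PigStorage(',');
good2  = FILTER raw BY (float)$2 > 4.0;
```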
Using an AS clause means that you have a schema in place at read time for your text files. That lets users get started quickly and keeps schema modeling agile and flexible, so you can keep adding new data to your analytics without restructuring what’s already stored.
The LOAD operator operates on the principle of lazy evaluation, also referred to as call-by-need. Now lazy doesn’t sound particularly praiseworthy, but all it means is that you delay the evaluation of an expression until you really need it.
In the context of the Pig example, that means that after the LOAD statement is executed, no data is moved — nothing gets shunted around — until a statement to write data is encountered. You can have a Pig script that is a page long filled with complex transformations, but nothing gets executed until the DUMP or STORE statement is encountered.
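A short hypothetical script makes the point; nothing before the final line causes any data movement:

```
-- No data is read while these statements are parsed.
logs   = LOAD 'web_logs.txt' AS (level:chararray, msg:chararray);
errors = FILTER logs BY level == 'ERROR';
counts = GROUP errors BY msg;

-- Only this statement triggers execution of the whole pipeline.
DUMP counts;
```

Because execution is deferred, Pig can examine the entire pipeline before running it and optimize across statement boundaries.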