Operators in Apache Pig: Part 2- Diagnostic Operators

This is the second post in our series on Apache Pig operators. It covers the diagnostic operators in Apache Pig. You can also refer to our previous post on relational operators for more information.
Let’s create two files to run the commands against, named ‘first’ and ‘second’. The first file contains three fields: user, url, and id.
The second file contains two fields: url and rating. Both are CSV files.
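As a minimal sketch, the two files could be loaded into relations like this. The relation names and the field types are assumptions made for illustration; the post only names the fields.

```pig
-- Assumed: 'first' and 'second' are comma-separated files in the working directory.
-- The field types are guesses; adjust them to match your data.
loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',') AS (url:chararray, rating:int);
```

Nothing is executed at this point. Pig only builds a plan, and the data is read once an operator such as DUMP or STORE asks for output.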

Diagnostic Operators:
The DUMP operator runs the Pig Latin statements that define a relation and displays the results on the screen. In this example, the operator prints the relation ‘loading1’ to the screen.
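A DUMP call on that relation might look like the following sketch. The LOAD statement, including the field types, is an assumption about how ‘loading1’ was defined.

```pig
-- Assumed definition of loading1; the field types are illustrative.
loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);

-- Triggers execution and prints every tuple of loading1 to the console.
DUMP loading1;
```

DUMP is meant for small, interactive checks. For large results, STORE is usually the better choice.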

DUMP Result:
Use the DESCRIBE operator to review the schema of a particular relation. The DESCRIBE operator is best used for debugging a script.
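For instance, assuming the first file was loaded with the schema below (the field types are illustrative), DESCRIBE could be used like this:

```pig
-- Assumed definition of loading1; the field types are illustrative.
loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);

-- Prints the schema only; no data is read and no job is launched.
DESCRIBE loading1;
-- Prints: loading1: {user: chararray,url: chararray,id: int}
```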
The ILLUSTRATE operator shows how data is transformed through a sequence of Pig Latin statements. The ILLUSTRATE command is your best friend when it comes to debugging a script. This command alone might be a good reason for choosing Pig over something else.
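A small pipeline over the two files gives ILLUSTRATE something to show. The join, the filter threshold, and the relation names below are hypothetical and only serve as an example.

```pig
-- Assumed definitions; the field types are illustrative.
loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',') AS (url:chararray, rating:int);

-- Hypothetical pipeline: attach ratings to visits, then keep the highly rated ones.
joined = JOIN loading1 BY url, loading2 BY url;
high_rated = FILTER joined BY rating > 3;

-- Shows sample tuples at each step (load, join, filter).
ILLUSTRATE high_rated;
```

ILLUSTRATE works on a small sample of the input rather than the full data set, so it returns quickly even when the files are large.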

The EXPLAIN operator prints the logical, physical, and MapReduce execution plans used to compute a relation.
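As a sketch, EXPLAIN can be pointed at any alias in a script. The aggregation below is a hypothetical example built on the second file; the relation names and field types are assumptions.

```pig
-- Assumed definition; the field types are illustrative.
loading2 = LOAD 'second' USING PigStorage(',') AS (url:chararray, rating:int);

-- Hypothetical aggregation: average rating per URL.
by_url = GROUP loading2 BY url;
url_avg = FOREACH by_url GENERATE group AS url, AVG(loading2.rating) AS avg_rating;

-- Prints the execution plans for url_avg without running the job.
EXPLAIN url_avg;
```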

Improvements in Apache Pig 0.12.0
At the time of writing, 0.12.0 is the current version of Apache Pig. This release includes several new features, such as the ASSERT operator, the IN operator, and the CASE expression.
ASSERT Operator:
The ASSERT operator can be used for data validation. For example, the following script will fail if any value of a0 is not greater than zero:
a = LOAD 'something' AS (a0:int, a1:int);
ASSERT a BY a0 > 0, 'a0 must be greater than 0';
IN Operator:
Previously, Pig had no IN operator. To imitate an IN operation, users had to chain several OR conditions together, as shown in the example below:
a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
b = FILTER a BY
(i == 1) OR
(i == 22) OR
(i == 333) OR
(i == 4444) OR
(i == 55555);
Now, this type of expression can be rewritten more compactly using the IN operator:
a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
b = FILTER a BY i IN (1, 22, 333, 4444, 55555);
CASE Expression:
Earlier, Pig had no support for a CASE expression. To mimic it, users often nested bincond (ternary) operators, which could become unreadable with multiple levels of nesting. The following is an example of the type of CASE expression that Pig now supports:
Case_operator = FOREACH foo GENERATE (
CASE i % 3
WHEN 0 THEN '3n'
WHEN 1 THEN '3n+1'
ELSE '3n+2'
END
);
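For comparison, the same logic written with nested bincond operators might look like the sketch below. The LOAD statement and the relation name nested_bincond are assumptions; the post does not show how ‘foo’ is defined.

```pig
-- Assumed: foo is a relation with a single int field i.
foo = LOAD '1.txt' USING PigStorage(',') AS (i:int);

-- Pre-0.12 style: one bincond nested inside another.
nested_bincond = FOREACH foo GENERATE
    (i % 3 == 0 ? '3n' : (i % 3 == 1 ? '3n+1' : '3n+2'));
```

With only three branches this is still readable, but every extra branch adds another level of parentheses, which is the problem the CASE expression avoids.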
