Pig Programming: Create Your First Apache Pig Script

In this part of our Hadoop Tutorial Series, we will learn how to create an Apache Pig script. An Apache Pig script executes a set of Apache Pig commands together, which saves the time and effort of writing and running each command manually. This blog is a step-by-step guide to help you create your first Apache Pig script.
Apache Pig Script Execution Modes
Local Mode: In ‘local mode’, Pig runs the script against the local file system. You don’t need to store the data in HDFS; you can work with the data stored on the local file system itself.
MapReduce Mode: In ‘MapReduce mode’, the data must be stored in HDFS, and the Pig script processes it there.
Apache Pig Script in MapReduce Mode
Let us say our task is to read data from a data file and to display the required contents on the terminal as output.

Save the sample data in a text file named ‘information.txt’. The file contains five columns, FirstName, LastName, MobileNo, City, and Profession, separated by tabs. Our task is to read this file from HDFS and display the FirstName, MobileNo, and Profession columns of each record.
To process this data using Pig, this file should be present in Apache Hadoop HDFS.
Command: hadoop fs -copyFromLocal /home/dsac/information.txt /dsac
Step 1: Writing a Pig script
Create and open an Apache Pig script file in an editor (e.g. gedit).
Command: sudo gedit /home/dsac/dsac.pig
This command will create a ‘dsac.pig’ file inside the home directory of the dsac user.
Let’s write a few Pig commands in the dsac.pig file.
-- Assumes /dsac is an existing HDFS directory, so the copied file is at /dsac/information.txt.
A = LOAD '/dsac/information.txt' USING PigStorage('\t') AS (FName:chararray, LName:chararray, MobileNo:chararray, City:chararray, Profession:chararray);

B = FOREACH A GENERATE FName, MobileNo, Profession;

DUMP B;
Save and close the file.
  • The first command loads the file ‘information.txt’ into relation A with an explicit schema (FName, LName, MobileNo, City, Profession).
  • The second command projects the required fields (FName, MobileNo, Profession) from relation A into relation B.
  • The third command displays the content of relation B on the terminal/console.
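To make the three statements above concrete, here is a rough plain-Java sketch of what they do to a single tab-separated line: split it on the tab character, then keep the first, third, and fifth fields. It is an illustration only and not how Pig is implemented; the class name TabProjection and the sample record are made up for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only: mimics loading one
// tab-separated line and projecting FName, MobileNo, and Profession.
final class TabProjection {

    // Splits a tab-separated record and keeps fields 0, 2, and 4.
    static List<String> project(String line) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] fields = line.split("\t", -1);
        if (fields.length < 5) {
            // A choice made for this sketch: short lines are rejected outright.
            throw new IllegalArgumentException(
                    "expected 5 tab-separated fields, got " + fields.length);
        }
        return Arrays.asList(fields[0], fields[2], fields[4]);
    }

    public static void main(String[] args) {
        // Made-up sample record; the real content of information.txt is not shown here.
        String line = "Asha\tRao\t5550100\tPune\tEngineer";
        // Formatted like a dumped tuple: prints (Asha,5550100,Engineer)
        System.out.println("(" + String.join(",", project(line)) + ")");
    }
}
```

Running the Pig script applies this kind of projection to every record in the file, with Pig handling the distribution of the work across map tasks.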
Step 2: Execute the Apache Pig Script
To execute the Pig script in MapReduce mode (Pig’s default mode), run the following command:
Command: pig /home/dsac/dsac.pig

After the execution finishes, review the result. The images below show the results along with the intermediate map and reduce functions.
The image below shows that the script executed successfully.
The image below shows the result of our script.


Steps to Create UDF in Apache Pig
This post contains the necessary steps required to create a UDF in Apache Pig. A filter UDF extends the FilterFunc class and must implement a method called exec, which takes a Tuple as its argument and returns a Boolean. In the example here, exec returns false if the Tuple is null or empty, or if its first field is null; otherwise ‘IsOfAge’ checks whether the given age is one of the accepted values and returns true or false accordingly. The logic of the User Defined Function is written in Java, then compiled and exported as a JAR file. The JAR file is later registered in the Pig script so that Pig can find the class when the script is loaded.
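import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Filter UDF: keeps a record only if its first field is one of the accepted ages.
public class IsOfAge extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        // Reject null or empty tuples.
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            // Reject a null age field.
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            // Accept only these specific ages.
            return i == 18 || i == 19 || i == 21 || i == 23 || i == 27;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}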
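The decision logic in exec can be tried out without a Pig installation. The sketch below re-implements the same checks in plain Java, using a List<Object> as a stand-in for Pig’s Tuple. The class name AgeCheck and the stand-in type are assumptions made for this example and are not part of the Pig API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone version of the IsOfAge checks; a List<Object>
// stands in for Pig's Tuple so the logic runs without the Pig library.
final class AgeCheck {

    // The same ages that the UDF accepts.
    private static final Set<Integer> ACCEPTED_AGES =
            new HashSet<>(Arrays.asList(18, 19, 21, 23, 27));

    static boolean isOfAge(List<Object> tuple) {
        // A null or empty tuple is rejected, as in the UDF.
        if (tuple == null || tuple.isEmpty()) {
            return false;
        }
        Object first = tuple.get(0);
        // A null field is rejected as well. Unlike the UDF's cast, a
        // non-Integer field is rejected here rather than raising an exception.
        if (!(first instanceof Integer)) {
            return false;
        }
        return ACCEPTED_AGES.contains(first);
    }

    public static void main(String[] args) {
        System.out.println(isOfAge(Arrays.<Object>asList(21)));       // prints true
        System.out.println(isOfAge(Arrays.<Object>asList(20)));       // prints false
        System.out.println(isOfAge(Collections.<Object>emptyList())); // prints false
    }
}
```

Keeping the accepted ages in a set makes the list easier to change than a chain of comparisons, but either form gives the same answers for integer input.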
How to Call a Pig UDF?
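Once the UDF has been compiled into a JAR file, register the JAR in the Pig script and then call the function by name. In the example below, A is assumed to be a relation that has an integer age field: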
register myudf.jar;
X = filter A by IsOfAge(age);
Steps to Create UDF in Pig:
Apache Pig ships with many predefined functions, and it also lets you create your own functions, known as User Defined Functions (UDFs). The UDF shown here is written in Java, which requires the Pig library on the classpath so that its predefined classes (for example FilterFunc and Tuple) are available. The Apache Pig library pig-0.8.0-cdh3u0-core.jar can be downloaded from the internet.
