Support Questions

How to handle errors and exceptions in Pig and shell scripts during data transformation?


Explorer

I'm very new to Big Data. I'm trying to understand some basic concepts in Pig, and I need some clarification about transferring data from local to HDFS for analytics.

1. I have Excel files in my local directory, e.g. /bdata

2. For the data transfer I use the command hadoop dfs -copyFromLocal /bdata hdfs://192.168.1.xxx:8020/hbdata

  • How can I handle any error that occurs? Is it possible to mail error or success messages to a particular mail ID? (A rough sketch of what I have in mind is after this list.)

3. After the files are moved to HDFS:

  • I need to load the data using Pig scripts.
  • At load time, how do I handle errors and mail them to a particular mail ID?
  • I need to create a set of exception-handling steps within the Pig scripts.
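
Something like the following is roughly what I have in mind for the copy step (only a sketch; the recipient address, the log file name, and the availability of a configured mail/mailx command are assumptions on my side):

#!/bin/bash
# Copy the local folder to HDFS and mail the outcome.
SRC=/bdata
DEST=hdfs://192.168.1.xxx:8020/hbdata
RECIPIENT=admin@example.com        # placeholder mail id

# -copyFromLocal returns a non-zero exit code on failure, so the result can be checked directly
if hadoop fs -copyFromLocal "$SRC" "$DEST" 2> copy_errors.log; then
    echo "Copy from $SRC to $DEST succeeded" | mail -s "HDFS copy: SUCCESS" "$RECIPIENT"
else
    mail -s "HDFS copy: FAILED" "$RECIPIENT" < copy_errors.log
fi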
4 REPLIES

Re: How to handle errors and exceptions in Pig and shell scripts during data transformation?

Explorer

Thanks in advance


Re: How to handle errors and exceptions in Pig and shell scripts during data transformation?

Hi @Iyappan Gopalakrishnan,

You have two modes for running these commands:

  • Interactive: if you are running these commands yourself from the shell, there is no need to send emails, since you see the result directly. Say you are using Pig in interactive mode through the Grunt shell: you get the result or the error of each step right after execution.
  • Scheduled: sending emails is useful in this case because your commands run automatically. Say you write a Pig script that you want to execute once per day at 3 pm. You can write an Oozie workflow with a Pig action to start the script and then decide what to do in case of success or error (see the ok to and error to transitions below). To send an email you can use the Oozie Email action; a sketch of such an action follows the workflow below.
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.2">
    ...
    <action name="[NODE-NAME]">
        <pig>
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
               ...
            </prepare>
            <script>[PIG-SCRIPT]</script>
            <param>[PARAM-VALUE]</param>
                ...
        </pig>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app> 
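
For the email itself, the error (or ok) transition above can point to an Email action like this sketch (the action name, address, and message text are placeholders):

<action name="send-error-email">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>admin@example.com</to>
        <subject>Pig job failed: ${wf:id()}</subject>
        <body>The Pig action failed with error [${wf:errorMessage(wf:lastErrorNode())}]</body>
    </email>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
</action>

The ${wf:errorMessage(wf:lastErrorNode())} EL function puts the error message of the failed node into the mail body.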

If you need more advanced monitoring options for Oozie, you can look at JMS Notifications and SLA Monitoring.


Re: How to handle errors and exceptions in Pig and shell scripts during data transformation?

@Iyappan Gopalakrishnan you can schedule your Pig job through Falcon. It will help you handle the entire data pipeline management. It supports scheduling Pig/Hive/Oozie workflows and comes with email and SNMP notifications.

Refer: http://hortonworks.com/apache/falcon/#section_1

Re: How to handle errors and exceptions in Pig and shell scripts during data transformation?

Explorer

Thanks @Abdelkrim Hadjidj, it's working fine for mail.

Some of the shell scripts are not accepted inside the Pig scripts. For example:

I need to compare first and only then copy the files from local to HDFS, so I use shell scripts like the following:

  • I have created two folders in the local file system, viz. Bdata1 and Bdata2.
  • Bdata1 is the FTP folder.
  • Compare the two folders to check whether all the files match in both folders. If not, the names of the files that do not match are stored separately in a text file called compare.txt, using this script:
  • diff -r Bdata1 Bdata2 | grep Bdata1 | awk '{print $4}' > compare.txt
  • Create a folder hbdata in HDFS.
  • Count the number of files in hbdata and store it in a variable, say n1.
  • Count the number of file names in compare.txt and store it in a variable, say n2.
  • Copy the files listed in compare.txt from the local file system to HDFS using the script below:

for i in $(cat compare.txt); do hadoop dfs -copyFromLocal Bdata1/$i hdfs://192.168.1.xxx:8020/hbdata; done

  • Count the number of files in hbdata and store it in a variable, say n3.
  • If the difference between n3 and n2 is equal to n1, then raise an alert saying the files have been copied.
  • After the files are copied, they are moved to Bdata2:
  • for i in $(cat compare.txt); do mv Bdata1/$i Bdata2; done
  • If the difference is not equal as per the above condition, then raise an alert saying the files were not copied and display the names of the files that were not copied.
  • After all of this is completed, I use the Pig load command and need to create a Hive ORC table to load the data.

Please suggest how I can write the scripts and complete the full flow.

Note: I'm not able to find a way to directly compare local and HDFS directories, so I use a few extra commands to achieve the comparison.
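
Putting the steps above together, this is roughly the flow I have so far as one script (only a sketch; the echo alerts, paths, and file-counting commands are placeholders, and it assumes file names without spaces):

#!/bin/bash
LOCAL_SRC=Bdata1        # FTP landing folder
LOCAL_DONE=Bdata2       # already-copied files
HDFS_DEST=hdfs://192.168.1.xxx:8020/hbdata

# 1. File names present only in Bdata1 go into compare.txt
diff -r "$LOCAL_SRC" "$LOCAL_DONE" | grep "$LOCAL_SRC" | awk '{print $4}' > compare.txt

# 2. n1 = files already in HDFS, n2 = files to copy
n1=$(hadoop fs -ls "$HDFS_DEST" | grep -c '^-')
n2=$(wc -l < compare.txt)

# 3. Copy each listed file to HDFS
while read -r f; do
    hadoop fs -copyFromLocal "$LOCAL_SRC/$f" "$HDFS_DEST"
done < compare.txt

# 4. n3 = files in HDFS after the copy; check that n3 - n2 equals n1
n3=$(hadoop fs -ls "$HDFS_DEST" | grep -c '^-')
if [ $((n3 - n2)) -eq "$n1" ]; then
    echo "Files have been copied"              # or send a mail here
    # 5. Move the copied files to Bdata2
    while read -r f; do
        mv "$LOCAL_SRC/$f" "$LOCAL_DONE/"
    done < compare.txt
else
    echo "Files not copied:"
    cat compare.txt
fi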
