Pig stuck at 0% -- problem configuring

by Community Manager on 03-29-2016 09:28 AM - edited on 09-27-2016 09:16 AM by Community Manager

Summary

 

In some circumstances, attempting to run the standard Pig examples results in the script getting stuck at 0% progress indefinitely.  This article describes how to resolve that issue.

 

Applies To

 

CDH 5.4, CDH 5.3.0 (Parcels), Pig, Hue

 

Instructions

 

Given the details of the reported problem, the following procedure shows how to run the Pig example script successfully in Hue, including the cluster configuration used for testing:

 

Repro details:

1) Set up a test cluster with Cloudera Express CM 5.4.1, CDH 5.3.0 (Parcels)

2) Configured the Core Hadoop services as follows (for testing):
    Master (16GB RAM): CM, NN, SNN, Hue, Sqoop, RM, JHS, Hive Gateway
    Worker 1 (8GB RAM): DN, NM, ZK, Hive Gateway, HMS
    Worker 2 (8GB RAM): DN, NM, ZK, Hive Gateway, Oozie
    Worker 3 (8GB RAM): DN, NM, ZK, Hive Gateway, HS2
3) Installed All Hue application examples as Hue admin user
4) Created regular user account in Hue
5) Logged in as regular (non-admin) user in Hue
6) Ran the test query via Hue -> Query Editors -> Pig and pasted the following script:
   data = LOAD '/user/hue/pig/examples/data/midsummer.txt' AS (text:CHARARRAY);
   upper_case = FOREACH data GENERATE UPPER(text);
   STORE upper_case INTO '$output';
 
7) Clicked on Submit
8) Output filename = test3
9) Confirmed workflow output was successful:
 
2016-02-12 09:02:21,226 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - 0% complete
2016-02-12 09:02:43,143 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - 50% complete
Heart beat
2016-02-12 09:02:46,386 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation  - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2016-02-12 09:02:46,435 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - 100% complete
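If the job never gets past the 0% line in your environment, the same script can also be run from the Pig CLI to take Hue and Oozie out of the picture. A minimal sketch, assuming a configured Pig/Hadoop client and using a placeholder output path:

```shell
# Write the script from step 6 to a local file (input path is the example's
# own; the output path passed below is a placeholder).
cat > /tmp/upper.pig <<'EOF'
data = LOAD '/user/hue/pig/examples/data/midsummer.txt' AS (text:CHARARRAY);
upper_case = FOREACH data GENERATE UPPER(text);
STORE upper_case INTO '$output';
EOF

# Run it directly with the Pig CLI, supplying the $output parameter.
# (Requires a working cluster, so it is left commented out here.)
# pig -param output=/user/hue/pig/test3 /tmp/upper.pig
```

If this hangs at 0% just as in Hue, the problem is in YARN/MapReduce rather than in Hue or Oozie.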
 
 
More Information Needed: 
To better understand the cause of the reported behavior, please provide the following:
 
1) How are all the services distributed on your cluster, and how much memory is allocated for each? 
 
2) Are you able to run a simple Pi job successfully?  
 
For Parcel installs:
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 5 5
 
For package-based installs:
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 5 5
 
3) Can you provide the Resource Manager service log along with the Application Master log of the Pig job that is reportedly hanging?
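The Application Master log requested in question 3 can be pulled with the YARN CLI. A hedged sketch (the application ID below is a placeholder for the one the first command lists; log aggregation must be enabled, and on some versions `yarn logs` only returns output after the application finishes):

```shell
# List applications that are running or still waiting for resources;
# a Pig job stuck at 0% typically sits in the ACCEPTED state.
yarn application -list -appStates ACCEPTED,RUNNING

# Fetch the aggregated container logs (including the AM log) for the
# stuck job; replace the application ID with the one listed above.
yarn logs -applicationId application_1455000000000_0001 > am.log
```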
 
Gathering the above information will help narrow down which component is at fault.  If a simple Pi job does not work, then further attention is needed on the YARN configuration, and on ensuring that the ApplicationMaster, map, and (if applicable) reduce containers are properly launched. 
 
 
 
 


Comments
by Scott Person
05-04-2016 07:30 PM - edited 05-04-2016 07:57 PM

SOLUTION: 

Increase yarn.nodemanager.resource.memory-mb to 10 GB under Cloudera Manager > YARN > Configuration. Note that this is with plain vanilla CDH 5.7.
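For reference, on clusters not managed through Cloudera Manager the same setting lives in yarn-site.xml; a sketch of the equivalent fragment (10 GB = 10240 MB):

```xml
<!-- Equivalent hand-edited yarn-site.xml entry; on CM-managed clusters
     set this through the Cloudera Manager UI instead, as described above. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>10240</value>
</property>
```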

 

Note - it boggles my mind that this doesn't work out of the box. I realize that there are limitations - you can't give away RAM that you don't have - but there has to be a way to configure this so that it doesn't require hours of experimentation to figure out.

 

Hello folks,

Is there a resolution to this?

 

Here's what I did:

* Install CDH 5.7 on a single node (RHEL 7), but it's a big one - 30 GB of RAM.

* During install I chose the "everything" option

* Open Hue

* Install the Pig example

* Run the example script in Hue - freezes

* Run just the LOAD and DUMP of the data in grunt - freezes at 0%

 

This is reproducible 100% of the time. I've used several different EC2 VM types, all with exactly the same results. I'm sure it is some YARN setting. I did raise the maximum memory to 3 GB to get rid of the allocation issue. One would think that 15 minutes of configuration work could get the default settings correct.

 

I can start posting logs, but given that it's reproducible quite easily, digging through logs would seem to be the hard way.

 

All help appreciated. I'll respond quickly with any requested info.

 

Thanks!

Disclaimer: The information contained in this article was generated by third parties and not by Cloudera or its personnel. Cloudera cannot guarantee its accuracy or efficacy. Cloudera disclaims all warranties of any kind and users of this information assume all risk associated with it and with following the advice or directions contained herein. By visiting this page, you agree to be bound by the Terms and Conditions of Site Usage, including all disclaimers and limitations contained therein.