09-04-2014 05:09 AM
When running a Pig script on a 3-node CM managed CDH cluster, I get the following error:
2014-09-04 11:56:02,411 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1409824425178_0011_r_000001_3 Info:Error: GC overhead limit exceeded
All 3 nodes (running on Amazon EC2) have 30GB of memory. The data size is trivial: I'm using 3 CSVs, of which the largest is 1GB. The data is fetched directly from Amazon S3.
This happens both when running the script in Hue and when running it on the command line.
Some background for the last question: the cluster is running on Amazon EC2. Before setting up a CDH cluster using Cloudera Manager, I ran an Amazon EMR cluster with the same hardware configuration. The same Pig script worked perfectly fine then. I switched to CDH so I could use Hue and be on the cutting edge of Hadoop-related technologies. It's a shame I'm running into this kind of problem so quickly...
09-09-2014 04:36 AM
So, I managed to fix my problem. The first hint was the "GC overhead limit exceeded" message. I quickly found out that this can be caused by a lack of heap space for the JVM. After digging a bit into the YARN configuration in Cloudera Manager, and comparing it to the settings in an Amazon Elastic MapReduce cluster (where my Pig scripts did work), I found out that, even though each node had 30GB of memory, most YARN components had very low heap space settings.
I increased the heap space for the NodeManagers, the ResourceManager, and the containers, and I also raised the maximum heap space for mappers and reducers somewhat, keeping in mind the total amount of memory available on each node (and the other services running there, like Impala). Now my Pig scripts work again!
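For anyone hitting the same wall, the relevant knobs look roughly like the fragment below (normally split across yarn-site.xml and mapred-site.xml; in Cloudera Manager you change them through the YARN service configuration pages rather than editing the files directly). The numbers here are illustrative assumptions, not the values I actually used; what fits depends on how much of each node's 30GB is left for YARN after the other services take their share.

```xml
<!-- Illustrative values only: tune to the memory actually available to YARN. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>      <!-- total memory YARN may hand out per node -->
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>       <!-- container size for each map task -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>  <!-- map JVM heap, roughly 80% of its container -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>       <!-- container size for each reduce task -->
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>  <!-- reduce JVM heap, roughly 80% of its container -->
</property>
```

The general pattern: keep each task's JVM heap (`-Xmx`) comfortably below its container size so the process isn't killed for exceeding the container, and keep the sum of containers per node below `yarn.nodemanager.resource.memory-mb`.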
Two issues I want to mention in case a Cloudera engineer reads this: