Posts: 17
Registered: 06-23-2014
Accepted Solution

Pig memory

When running a Pig script on a 3-node CM managed CDH cluster, I get the following error:

2014-09-04 11:56:02,411 [main] ERROR - ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1409824425178_0011_r_000001_3 Info:Error: GC overhead limit exceeded

All 3 nodes (running on Amazon EC2) have 30GB of memory. The data size is trivial: three CSV files, the largest of which is 1GB. The data is fetched directly from Amazon S3.

This happens both when running the script in Hue and when running it on the command line.


Three questions:

  • Why is this happening?
  • How can I fix this?
  • Isn't the whole purpose of Cloudera Manager to provide a sane configuration, based on the hardware used?

Some background for the last question: the cluster is running on Amazon EC2. Before setting up a CDH cluster using Cloudera Manager, I ran an Amazon EMR cluster with the same hardware configuration. The same Pig script worked perfectly fine then. I switched to CDH so I could use Hue and be on the cutting edge of Hadoop-related technologies. It's a shame I'm running into this kind of problem so quickly...


Re: Pig memory

So, I managed to fix my problem. The first hint was the "GC overhead limit exceeded" message. I quickly found out that this can be caused by a lack of heap space for the JVM. After digging a bit into the YARN configuration in Cloudera Manager and comparing it to the settings in an Amazon Elastic MapReduce cluster (where my Pig scripts did work), I found that, even though each node had 30GB of memory, most YARN components had very low heap space settings.


I updated the heap space for the NodeManagers, the ResourceManager, and the containers, and I also raised the maximum heap space for mappers and reducers somewhat, keeping in mind the total amount of memory available on each node (and the other services running there, like Impala). Now my Pig scripts work again!
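For anyone hitting the same error: the per-job side of these settings can also be overridden from within a Pig script via `SET`. The property names below are the standard MRv2/YARN ones; the values are illustrative placeholders, not recommendations — size them to your own nodes, and keep the JVM heap (`java.opts`) below the container size to leave room for non-heap overhead:

```pig
-- Illustrative values only; tune to your cluster's memory.
-- Container sizes requested from YARN for each map/reduce task:
SET mapreduce.map.memory.mb 2048;
SET mapreduce.reduce.memory.mb 4096;
-- JVM heap inside each container, roughly 80% of the container size:
SET mapreduce.map.java.opts '-Xmx1638m';
SET mapreduce.reduce.java.opts '-Xmx3276m';
```

The cluster-wide limits (e.g. `yarn.nodemanager.resource.memory-mb` and `yarn.scheduler.maximum-allocation-mb`) still have to be raised in Cloudera Manager as described above; a job can never request more than the scheduler's maximum allocation.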


Two issues I want to mention in case a Cloudera engineer reads this:

  • I find it a bit strange that Cloudera Manager doesn't set saner heap space amounts based on the total amount of RAM available
  • The fact that not everything runs under YARN yet makes memory harder to manage — you have to manage it manually. If Impala ran under YARN, there would be less memory management to do, I think :)