We are facing a tricky situation. We run a Pig script from the Hadoop tutorial. It works fine on the Sandbox but fails on the real cluster, where it complains about insufficient memory for the container. The message
container is running beyond physical memory limit
can be seen in the logs.
The tricky part is that the Sandbox has far less memory available than the real cluster (about a third as much). Most memory settings in the Sandbox (MapReduce memory, YARN memory, YARN container sizes) also allow much less memory than the corresponding settings in the real cluster. Still, that is sufficient for Pig on the Sandbox but not on the real cluster.
Another note: Hive queries doing a similar job also work fine; they do not complain about memory.
Apparently there is some setting somewhere that makes Pig request too much memory. Can anybody recommend which parameter should be modified to stop the Pig script from requesting so much memory?
Are you running from the command line, interactive, or the Ambari View?
Also, is this just with default settings?
Does the data exist?
Any logs? Other error messages? Can you share any more details?
Is the real cluster HDP 2.5?
The challenge we usually have on the HCC is that we don't get the complete logs and have very little input to work with :). Your logs should have a stack trace like the one below:
Container [pid=2617,containerID=container_1438923434512_12103_01_000002] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 2.9 GB of 2.1 GB virtual memory used. Killing container. Dump of the process-tree for container_1438923434512_12103_01_000002..
This indicates which limit is being set and at what threshold the error occurs. As a workaround, to test this out from the Grunt shell you can set the following and then try again:
set mapreduce.map.java.opts '-Xmx1024m'
set mapreduce.reduce.java.opts '-Xmx1024m'
set mapreduce.map.memory.mb '1536'
set mapreduce.reduce.memory.mb '1536'
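For reference, a minimal sketch of the arithmetic behind those numbers (assumptions on my part: YARN's default yarn.nodemanager.vmem-pmem-ratio of 2.1, and the common rule of thumb that the JVM heap should stay at or below roughly 80% of the container size — the function names are mine, not Hadoop APIs):

```python
# Sketch: YARN kills a container when its physical memory exceeds
# mapreduce.{map,reduce}.memory.mb, or when its virtual memory exceeds
# that limit times yarn.nodemanager.vmem-pmem-ratio (default 2.1).

def vmem_limit_gb(container_gb, vmem_pmem_ratio=2.1):
    """Virtual-memory ceiling YARN enforces for a container."""
    return container_gb * vmem_pmem_ratio

def heap_fits(xmx_mb, container_mb, headroom=0.8):
    """True if the JVM heap (-Xmx) leaves enough non-heap headroom."""
    return xmx_mb <= container_mb * headroom

# The log above: a 1 GB physical container gives a 2.1 GB virtual ceiling,
# and the process tree used 2.9 GB virtual, so the container was killed.
print(vmem_limit_gb(1))       # 2.1
print(heap_fits(1024, 1536))  # True
```

With the settings above, the 1536 MB container leaves about 512 MB of non-heap headroom over the 1024 MB heap — the kind of margin the killed container in the log did not have.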
I attached the most relevant (I think) part of the log. You were right in assuming that it was going beyond the limits of the container.
If I follow your suggestion and increase some memory parameters it may start to work, but then other processes will suffer from lack of memory.
I was looking for a different solution, though. I wanted to know how you guys at Hortonworks made the Sandbox work perfectly with much less memory available. Below are the Sandbox's values for the parameters you mentioned in your post:
mapreduce.map.java.opts -Xmx200m
mapreduce.reduce.java.opts -Xmx200m
mapreduce.map.memory.mb 250
mapreduce.reduce.memory.mb 250
As you can see, the values are way smaller than those you've suggested.
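For what it's worth, the heap-to-container proportion is similar in both sets of values; a quick sketch of the comparison (the ~80% rule of thumb is my assumption, not a documented Hortonworks default):

```python
# Compare Sandbox vs. suggested settings by heap-to-container ratio.
# Ratios only; actual kill behavior also depends on the vmem-pmem ratio.
settings = {
    "sandbox":   {"xmx_mb": 200,  "container_mb": 250},
    "suggested": {"xmx_mb": 1024, "container_mb": 1536},
}
for name, s in settings.items():
    ratio = s["xmx_mb"] / s["container_mb"]
    print(f"{name}: heap is {ratio:.0%} of the container")
# sandbox: heap is 80% of the container
# suggested: heap is 67% of the container
```

So the Sandbox settings are small in absolute terms but keep comparable headroom; they hold up only as long as each task's actual footprint fits under 250 MB.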
So how does the Sandbox's Pig work with these parameters? Why is it not failing and not complaining about low memory? What is doing the trick on the Sandbox?
>Are you running from the command line, interactive, or the Ambari View?
Running from Pig View in Ambari
>Also, is this just with default settings?
No, some settings were modified (towards reducing the required memory).
>Does the data exist?
Of course it does
>Any logs? Other error messages? Can you share any more details?
I provide them below in response to Sumesh's question.
>Is the real cluster HDP 2.5?
Yes, it is.
12 GB RAM, 1 TB hard drive