Created 07-19-2017 07:48 AM
I am facing Hive errors intermittently; garbage collection issues are indicated in the logs.
hiveserver2:
@dh01 hive]$ cat hiveserver2.log | grep 'GC'
at org.apache.tez.dag.api.client.DAGClientHandler.submitDAG(DAGClientHandler.java:118)
at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.submitDAG(DAGClientAMProtocolBlockingPBServerImpl.java:163)
at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7471)
2017-07-17 14:00:22,815 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1913ms
GC pool 'PS Scavenge' had collection(s): count=1 time=1961ms
2017-07-17 14:14:28,531 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1452ms
GC pool 'PS Scavenge' had collection(s): count=1 time=1701ms
2017-07-17 15:04:32,309 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1838ms
GC pool 'PS Scavenge' had collection(s): count=1 time=2195ms
2017-07-17 16:08:45,121 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1568ms
GC pool 'PS Scavenge' had collection(s): count=1 time=1707ms
hivemetastore:
@dh01 hive]$ cat hivemetastore.log | grep -i "GC pool"
GC pool 'PS Scavenge' had collection(s): count=1 time=3521ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=11097ms
GC pool 'PS Scavenge' had collection(s): count=1 time=37ms
@dh01 hive]$ cat hivemetastore.log | grep -i "JvmPauseMonitor"
2017-07-19 04:26:50,008 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@4f85aca0]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 3050ms
2017-07-19 11:01:32,392 WARN [org.apache.hadoop.util.JvmPauseMonitor$Monitor@4f85aca0]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(191)) - Detected pause in JVM or host machine (eg GC): pause of approximately 10915ms
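To get a feel for how severe the pauses are, I have been pulling just the durations out of the JvmPauseMonitor lines. This is only a rough sketch based on the log format shown above (swap in hivemetastore.log for the Metastore):

# List the longest JvmPauseMonitor pauses (assumes the log format shown above)
grep 'Detected pause' hiveserver2.log | sed 's/.*approximately //' | sort -n | tail -5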
HiveServer2 Heap Size = 24210 MB (had already been set)
Metastore Heap Size = 12288 MB (changed from 8 GB previously)
Client Heap Size = 2 GB (changed from 1 GB previously)
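For reference, this is roughly how those Ambari settings end up in hive-env.sh on HDP; it is a hedged sketch only, the exact template and variable names can differ by version, so verify against your own hive-env:

# Sketch of hive-env.sh heap handling (values in MB); confirm against your HDP version
if [ "$SERVICE" = "metastore" ]; then
  export HADOOP_HEAPSIZE=12288    # Metastore Heap Size
else
  export HADOOP_HEAPSIZE=24210    # HiveServer2 Heap Size
fi
export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Xmx${HADOOP_HEAPSIZE}m"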
I did read the article below and the provided links, which were helpful:
but after making the changes to the indicated heap sizes, I still had instances where the HiveServer2 or Metastore service would go into alert in Ambari for a few seconds and then come back healthy.
The logs below did not have any errors in these instances:
hive.out
hive.log
hive-server2.out
hive-server2.log
hivemetastore.log
hiveserver2.log
Am I missing something? Would setting the HiveServer2 Heap Size and the Metastore Heap Size to the same value help, i.e. setting HiveServer2 Heap Size = 12288 MB?
Environment:
Hadoop 2.7.1.2.4.0.0-169
hive-meta-store - 2.4.0.0-169
hive-server2 - 2.4.0.0-169
hive-webhcat - 2.4.0.0-169
Ambari 2.2.1.0
Created 07-19-2017 05:38 PM
How many users are connecting to your HiveServer2 concurrently? That determines your memory. From the Hortonworks recommendations, for 20 concurrent users you need a mere 6 GB. If you have 10 concurrent connections, 4 GB is enough, and for a single connection 2 GB, so you definitely don't want to go below that.
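If you are not sure how many concurrent connections you actually have, a rough way to check from the HiveServer2 host (assuming the default Thrift port 10000; adjust if you changed it) is:

# Count established client connections to HiveServer2 (assumes default port 10000)
netstat -tan | grep ':10000 ' | grep ESTABLISHED | wc -l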
When you have too much memory, you run into what are called "stop-the-world" garbage collection pauses. You can Google more on this, but basically the JVM needs to move objects and update the references to them. If it moves an object before updating the references and the running application accesses it through the old reference, there is trouble. If it updates the reference first and then tries to move the object, the updated reference is wrong until the object has actually moved, and any access in that window will cause an issue.
For both the CMS and the Parallel collector, the young-generation collection algorithm is similar and it is stop-the-world, that is, the application is stopped while the collection is happening.
When you allocate too much memory, like 24 GB, the stop-the-world pauses take longer, hence your application fails.
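If you want to see how long the collections actually take rather than inferring it from JvmPauseMonitor, one option is to turn on GC logging. This is a sketch, assuming a Java 7/8 JVM and that your HiveServer2 process picks up extra JVM options through HADOOP_OPTS in hive-env.sh; the log path is just an example:

# Hedged sketch for hive-env.sh: enable GC logging (Java 7/8 option names)
export HADOOP_OPTS="$HADOOP_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hive/hiveserver2-gc.log"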
So, your Metastore does not need to have the same memory as HiveServer2. They are two different processes. If the Metastore is also running into similar issues, you can set it to 8 GB or less; that's still a lot of memory for just the Metastore.
Created 07-21-2017 07:32 AM
@mqureshi Thanks for getting back. I have reduced the HiveServer2 Heap Size to 20 GB and am observing the behavior; I intend to reduce it to 12 GB step-wise over the coming days.