Member since: 11-11-2019
Posts: 6
Kudos Received: 2
Solutions: 0
02-21-2020
02:58 AM
Try tuning the UDP preference limit for the Kerberos clients in /etc/krb5.conf (a value of 1 effectively makes the clients prefer TCP for KDC traffic):
[libdefaults]
udp_preference_limit = 1
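If it helps to verify which transport the clients actually use after this change, MIT Kerberos can trace the exchange via the KRB5_TRACE environment variable (the principal below is just a placeholder):
KRB5_TRACE=/dev/stdout kinit someuser@EXAMPLE.COM
# the trace output shows whether each request to the KDC went over UDP or TCP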
11-12-2019
04:45 PM
Hi w@leed, thanks for replying. I did test the job with all three collectors - ParallelGC, CMS and G1GC.
I have tested the following options with G1GC:
-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
and with CMS:
-XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseParNewGC -XX:+CMSConcurrentMTEnabled -XX:ParallelCMSThreads=10 -XX:ConcGCThreads=8 -XX:ParallelGCThreads=16
(A sketch of how these flags are passed through spark-submit follows at the end of this post.)
With G1GC defaults, I could see the following:
Desired survivor size 1041235968 bytes, new threshold 5 (max 15)
[PSYoungGen: 1515304K->782022K(3053056K)] 2750361K->2017087K(6371840K), 1.5875321 secs] [Times: user=4.72 sys=0.74, real=1.59 secs]
Heap after GC invocations=9 (full 3):
 PSYoungGen total 3053056K, used 782022K [0x0000000580000000, 0x000000068ef80000, 0x0000000800000000)
  eden space 2270720K, 0% used [0x0000000580000000,0x0000000580000000,0x000000060a980000)
  from space 782336K, 99% used [0x000000065f380000,0x000000068ef31ab0,0x000000068ef80000)
  to space 1016832K, 0% used [0x0000000612d80000,0x0000000612d80000,0x0000000650e80000)
 ParOldGen total 3318784K, used 1235064K [0x0000000080000000, 0x000000014a900000, 0x0000000580000000)
  object space 3318784K, 37% used [0x0000000080000000,0x00000000cb61e318,0x000000014a900000)
 Metaspace used 55055K, capacity 55638K, committed 55896K, reserved 1097728K
  class space used 7049K, capacity 7207K, committed 7256K, reserved 1048576K
}
{Heap before GC invocations=10 (full 3):
 PSYoungGen total 3053056K, used 3052742K [0x0000000580000000, 0x000000068ef80000, 0x0000000800000000)
  eden space 2270720K, 100% used [0x0000000580000000,0x000000060a980000,0x000000060a980000)
  from space 782336K, 99% used [0x000000065f380000,0x000000068ef31ab0,0x000000068ef80000)
  to space 1016832K, 0% used [0x0000000612d80000,0x0000000612d80000,0x0000000650e80000)
 ParOldGen total 3318784K, used 1235064K [0x0000000080000000, 0x000000014a900000, 0x0000000580000000)
  object space 3318784K, 37% used [0x0000000080000000,0x00000000cb61e318,0x000000014a900000)
 Metaspace used 55108K, capacity 55702K, committed 55896K, reserved 1097728K
  class space used 7049K, capacity 7207K, committed 7256K, reserved 1048576K
42.412: [GC (Allocation Failure)
Desired survivor size 1653080064 bytes, new threshold 4 (max 15)
[PSYoungGen: 3052742K->1016800K(3422720K)] 4287807K->2985385K(6741504K), 4.0304873 secs] [Times: user=11.87 sys=1.77, real=4.03 secs]
Heap after GC invocations=10 (full 3):
 PSYoungGen total 3422720K, used 1016800K [0x0000000580000000, 0x0000000727a80000, 0x0000000800000000)
  eden space 2405888K, 0% used [0x0000000580000000,0x0000000580000000,0x0000000612d80000)
  from space 1016832K, 99% used [0x0000000612d80000,0x0000000650e78240,0x0000000650e80000)
  to space 1614336K, 0% used [0x00000006c5200000,0x00000006c5200000,0x0000000727a80000)
 ParOldGen total 3318784K, used 1968584K [0x0000000080000000, 0x000000014a900000, 0x0000000580000000)
  object space 3318784K, 59% used [0x0000000080000000,0x00000000f8272318,0x000000014a900000)
 Metaspace used 55108K, capacity 55702K, committed 55896K, reserved 1097728K
  class space used 7049K, capacity 7207K, committed 7256K, reserved 1048576K
With all the collectors, the only difference I could see was a delayed Full GC. I am considering changing the YoungGen now and will update if I see a difference.
On a parallel note:
1. I also see that some objects in memory remain persistent across GC cycles - for example scala.Tuple2 and java.lang.Long.
2. These are Java RDDs.
Regards
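For reference, a minimal sketch of how the alternative collector flags above would be passed to the executors, reusing the spark-submit command from the original post and only swapping the executor Java options (everything else in the command stays unchanged):
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
and for the CMS run:
--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSConcurrentMTEnabled -XX:ParallelCMSThreads=10 -XX:ConcGCThreads=8 -XX:ParallelGCThreads=16 -verbose:gc -XX:+PrintGCTimeStamps"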
11-12-2019
11:18 AM
I am adding a few more details for further insight into this issue:
11-11-2019
10:16 PM
Hi,
I am running a Spark Job as below:
/spark-submit --master yarn --deploy-mode client --name sparktest \
  --num-executors 500 --executor-memory 30g --driver-memory 16g --queue apps_aup \
  --conf spark.yarn.executor.memoryOverhead=2000 \
  --conf spark.driver.maxResultSize=12g \
  --conf spark.executor.cores=2 --conf spark.task.cpus=2 \
  --conf spark.network.timeout=600 --conf spark.blacklist.enabled=true \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
The job runs successfully, but every time it runs, some of the tasks run into long GC pauses.
I could see the following when troubleshooting the GC:
A.)
Observation for GC - Executor ID 2
1. GC options Used:
-verbose:gc -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
2. Executor Details :
spark.executor.memory = 30GB
Storage Memory of Executor = 17GB
spark.executor.instances = 500
Executor ID = 2
3. Task Details :
Task Time = 19 min , GC Time = 16 min
Input Data Size = 217.9 MB
Shuffle Write Size = 651.8 MB
The above shows that the total GC time on Executor ID 2 was very high, around 16 minutes.
B.) GC Analysis for Sample Executor ID 2
Full GC time:
Full GC happened 7 times.
The total time for these 7 Full GCs was 10 min 38 sec 190 ms.
Hence, the average Full GC time was 1 min 31 sec 170 ms.
The average interval between Full GCs was 1 min 59 sec 524 ms.
Full GC Size:
The data reclaimed by the Full GCs was 1.56 GB.
Minor GC:
The total Minor GC count was 20.
The total time for Minor GC was 5 min 32 sec 380 ms.
The average time per Minor GC was 16 sec 619 ms.
Each of these Minor GCs happened at an interval of approximately 1 min 631 ms.
During these 20 Minor GCs, the total data reclaimed was 33.65 GB.
C.)
The total objects created by the application added up to around 45.15 GB.
The object creation rate of the application was around 40 MB/s.
The total objects promoted from YoungGen to OldGen in each GC cycle came to around 8 GB. These objects were promoted at an average rate of 7 MB/s.
There were around 18 Allocation Failures, which indicated that the object creation rate was higher than the free heap space in the YoungGen. This led to 18 Minor GC cycles (out of the 20 total Minor GCs) to free up the YoungGen and promote objects from YoungGen to OldGen.
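As a rough sanity check on these rates (using the ~19 min task time from section A): 45.15 GB over 19 * 60 = 1,140 s is about 46,234 MB / 1,140 s ≈ 40 MB/s, which matches the reported creation rate; and if the ~8 GB promoted is read as the total across the run, 8,192 MB / 1,140 s ≈ 7 MB/s, which matches the reported promotion rate.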
D.) GC Time :
GC Time - real, user, sys
The longest Full GC was around 5 minutes :
722.599: [Full GC (Ergonomics) [PSYoungGen: 1304054K->0K(7388672K)] [ParOldGen: 9194816K->10418042K(14044160K)] 10498870K->10418042K(21432832K), [Metaspace: 56010K->56010K(1099776K)], 303.5027565 secs]
Here I see -
[Times: user=711.36 sys=35.16, real=303.50 secs]
The system time spent in GC was 35.16 seconds. The user time was 711.36 seconds; this is the time added up across the GC threads. User + sys = 746.52 seconds.
The real time was only 303.50 seconds (5.05 minutes). This was the actual wall-clock time spent in GC. This looks good from the cluster perspective, since the system time was only 35.16 seconds; the system appears to have free CPU cycles to perform the GC.
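Another way to read the same Times line: 746.52 s of user + sys CPU time spread over 303.50 s of wall-clock time works out to 746.52 / 303.50 ≈ 2.5, i.e. on average only about 2.5 of the 8 configured GC threads' worth of work was running concurrently during this collection.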
The GC log analysis is from the GCeasy link :
https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTkvMTEvMTEvLS1nY19zcGFya18zLnR4dC0tMTktNTItNQ==&channel=WEB
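For anyone who wants to run their own analysis, the raw per-executor GC output from -verbose:gc lands in the executor container logs on YARN; once the application finishes (with log aggregation enabled) it can be pulled with the yarn CLI and uploaded to a tool like GCeasy (the application ID below is a placeholder):
yarn logs -applicationId application_1234567890123_0001 > sparktest_gc_logs.txt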
The GC logs show:
426.481: [Full GC (Ergonomics) [PSYoungGen: 1520503K->200302K(5950464K)] [ParOldGen: 6657880K->7368200K(10468864K)] 8178383K->7568503K(16419328K), [Metaspace: 55722K->55722K(1097728K)], 142.3782071 secs] [Times: user=329.05 sys=12.63, real=142.38 secs]
572.138: [GC (Allocation Failure) [PSYoungGen: 3949678K->1623296K(6054400K)] 11317879K->9192449K(16523264K), 40.8130588 secs] [Times: user=99.07 sys=2.21, real=40.81 secs]
633.942: [GC (Allocation Failure) [PSYoungGen: 5372672K->1304054K(7388672K)] 12941825K->10498870K(17857536K), 88.6569824 secs] [Times: user=212.32 sys=6.40, real=88.66 secs]
722.599: [Full GC (Ergonomics) [PSYoungGen: 1304054K->0K(7388672K)] [ParOldGen: 9194816K->10418042K(14044160K)] 10498870K->10418042K(21432832K), [Metaspace: 56010K->56010K(1099776K)], 303.5027565 secs] [Times: user=711.36 sys=35.16, real=303.50 secs]
1060.380: [GC (Allocation Failure) [PSYoungGen: 4968960K->208K(7355392K)] 15387002K->10418258K(21399552K), 0.1422739 secs] [Times: user=0.23 sys=0.50, real=0.14 secs]
1090.725: [GC (Allocation Failure) [PSYoungGen: 4969168K->352K(8035840K)] 15387218K->10418410K(22080000K), 0.0677690 secs] [Times: user=0.19 sys=0.34, real=0.07 secs]
1124.474: [GC (Allocation Failure) [PSYoungGen: 5606752K->256K(5606912K)] 16024810K->10418322K(19651072K), 0.0516994 secs] [Times: user=0.20 sys=0.17, real=0.05 secs]
1157.440: [GC (Allocation Failure) [PSYoungGen: 5606656K->368K(8077312K)] 16024722K->10418434K(22121472K), 0.0649818 secs] [Times: user=0.18 sys=0.33, real=0.07 secs]
The Spark UI shows that the Input Data Size is 217 MB.
Why is the GC pause so high for an input of only 217 MB?
What does the Input refer to here?
Labels:
- Apache Spark
- Apache YARN