Member since: 11-11-2019
Posts: 6
Kudos Received: 2
Solutions: 0
02-21-2020
02:58 AM
Try tuning the UDP preference limit for the Kerberos clients in /etc/krb5.conf (a value of 1 effectively makes the clients prefer TCP for KDC traffic):
[libdefaults]
udp_preference_limit = 1
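If it helps to verify which transport the clients actually use after this change, MIT Kerberos can trace the exchange via the KRB5_TRACE environment variable (the principal below is just a placeholder):
KRB5_TRACE=/dev/stdout kinit someuser@EXAMPLE.COM
# the trace output shows whether each request to the KDC went over UDP or TCP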
11-12-2019
04:45 PM
Hi w@leed, thanks for replying. I did test the job with all three collectors - ParallelGC, CMS and G1GC.
I have tested the following options with G1GC:
-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
and with CMS:
-XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseParNewGC -XX:+CMSConcurrentMTEnabled -XX:ParallelCMSThreads=10 -XX:ConcGCThreads=8 -XX:ParallelGCThreads=16
(A sketch of how these flags are passed through spark-submit follows at the end of this post.)
With G1GC defaults, I could see the following:
Desired survivor size 1041235968 bytes, new threshold 5 (max 15)
[PSYoungGen: 1515304K->782022K(3053056K)] 2750361K->2017087K(6371840K), 1.5875321 secs] [Times: user=4.72 sys=0.74, real=1.59 secs]
Heap after GC invocations=9 (full 3):
 PSYoungGen total 3053056K, used 782022K [0x0000000580000000, 0x000000068ef80000, 0x0000000800000000)
  eden space 2270720K, 0% used [0x0000000580000000,0x0000000580000000,0x000000060a980000)
  from space 782336K, 99% used [0x000000065f380000,0x000000068ef31ab0,0x000000068ef80000)
  to space 1016832K, 0% used [0x0000000612d80000,0x0000000612d80000,0x0000000650e80000)
 ParOldGen total 3318784K, used 1235064K [0x0000000080000000, 0x000000014a900000, 0x0000000580000000)
  object space 3318784K, 37% used [0x0000000080000000,0x00000000cb61e318,0x000000014a900000)
 Metaspace used 55055K, capacity 55638K, committed 55896K, reserved 1097728K
  class space used 7049K, capacity 7207K, committed 7256K, reserved 1048576K
}
{Heap before GC invocations=10 (full 3):
 PSYoungGen total 3053056K, used 3052742K [0x0000000580000000, 0x000000068ef80000, 0x0000000800000000)
  eden space 2270720K, 100% used [0x0000000580000000,0x000000060a980000,0x000000060a980000)
  from space 782336K, 99% used [0x000000065f380000,0x000000068ef31ab0,0x000000068ef80000)
  to space 1016832K, 0% used [0x0000000612d80000,0x0000000612d80000,0x0000000650e80000)
 ParOldGen total 3318784K, used 1235064K [0x0000000080000000, 0x000000014a900000, 0x0000000580000000)
  object space 3318784K, 37% used [0x0000000080000000,0x00000000cb61e318,0x000000014a900000)
 Metaspace used 55108K, capacity 55702K, committed 55896K, reserved 1097728K
  class space used 7049K, capacity 7207K, committed 7256K, reserved 1048576K
42.412: [GC (Allocation Failure)
Desired survivor size 1653080064 bytes, new threshold 4 (max 15)
[PSYoungGen: 3052742K->1016800K(3422720K)] 4287807K->2985385K(6741504K), 4.0304873 secs] [Times: user=11.87 sys=1.77, real=4.03 secs]
Heap after GC invocations=10 (full 3):
 PSYoungGen total 3422720K, used 1016800K [0x0000000580000000, 0x0000000727a80000, 0x0000000800000000)
  eden space 2405888K, 0% used [0x0000000580000000,0x0000000580000000,0x0000000612d80000)
  from space 1016832K, 99% used [0x0000000612d80000,0x0000000650e78240,0x0000000650e80000)
  to space 1614336K, 0% used [0x00000006c5200000,0x00000006c5200000,0x0000000727a80000)
 ParOldGen total 3318784K, used 1968584K [0x0000000080000000, 0x000000014a900000, 0x0000000580000000)
  object space 3318784K, 59% used [0x0000000080000000,0x00000000f8272318,0x000000014a900000)
 Metaspace used 55108K, capacity 55702K, committed 55896K, reserved 1097728K
  class space used 7049K, capacity 7207K, committed 7256K, reserved 1048576K
With all the collectors, the only difference I could see was a delayed Full GC. I am considering changing the YoungGen now and will update if I see a difference.
On a parallel note:
1. I also see that some objects in memory remain persistent across GC cycles - for example scala.Tuple2 and java.lang.Long.
2. These are Java RDDs.
Regards
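For reference, a minimal sketch of how the alternative collector flags above would be passed to the executors, reusing the spark-submit command from the original post and only swapping the executor Java options (everything else in the command stays unchanged):
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
and for the CMS run:
--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSConcurrentMTEnabled -XX:ParallelCMSThreads=10 -XX:ConcGCThreads=8 -XX:ParallelGCThreads=16 -verbose:gc -XX:+PrintGCTimeStamps"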
11-12-2019
11:18 AM
I am adding a few more details for further insight into this issue:
11-11-2019
10:16 PM
Hi,
I am running a Spark Job as below:
/spark-submit --master yarn --deploy-mode client --name sparktest \
  --num-executors 500 --executor-memory 30g --driver-memory 16g --queue apps_aup \
  --conf spark.yarn.executor.memoryOverhead=2000 \
  --conf spark.driver.maxResultSize=12g \
  --conf spark.executor.cores=2 --conf spark.task.cpus=2 \
  --conf spark.network.timeout=600 --conf spark.blacklist.enabled=true \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
The job runs successfully, but every time it runs, some of the tasks run into long GC pauses.
I could see the following when troubleshooting the GC:
A.)
Observation for GC - Executor ID 2
1. GC options Used:
-verbose:gc -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
2. Executor Details :
spark.executor.memory = 30GB
Storage Memory of Executor = 17GB
spark.executor.instances = 500
Executor ID = 2
3. Task Details :
Task Time = 19 min , GC Time = 16 min
Input Data Size = 217.9 MB
Shuffle Write Size = 651.8 MB
The above shows that the total GC time on Executor ID 2 was very high, around 16 minutes.
B.) GC Analysis for Sample Executor ID 2
Full GC time:
Full GC happened 7 times.
The total time for these 7 Full GCs was 10 min 38 sec 190 ms.
Hence, the average Full GC time was 1 min 31 sec 170 ms.
The average interval between Full GCs was 1 min 59 sec 524 ms.
Full GC Size:
The data reclaimed by the Full GCs was 1.56 GB.
Minor GC:
The total Minor GC count was 20.
The total time for Minor GC was 5 min 32 sec 380 ms.
The average time per Minor GC was 16 sec 619 ms.
Each of these Minor GCs happened at an interval of approximately 1 min 631 ms.
During these 20 Minor GCs, the total data reclaimed was 33.65 GB.
C.)
The total objects created by the application added up to around 45.15 GB.
The object creation rate of the application was around 40 MB/s.
The total objects promoted from YoungGen to OldGen in each GC cycle came to around 8 GB. These objects were promoted at an average rate of 7 MB/s.
There were around 18 Allocation Failures, which indicated that the object creation rate was higher than the free heap space in the YoungGen. This led to 18 Minor GC cycles (out of the 20 total Minor GCs) to free up the YoungGen and promote objects from YoungGen to OldGen.
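As a rough sanity check on these rates (using the ~19 min task time from section A): 45.15 GB over 19 * 60 = 1,140 s is about 46,234 MB / 1,140 s ≈ 40 MB/s, which matches the reported creation rate; and if the ~8 GB promoted is read as the total across the run, 8,192 MB / 1,140 s ≈ 7 MB/s, which matches the reported promotion rate.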
D.) GC Time :
GC Time - real, user, sys
The longest Full GC was around 5 minutes :
722.599: [Full GC (Ergonomics) [PSYoungGen: 1304054K->0K(7388672K)] [ParOldGen: 9194816K->10418042K(14044160K)] 10498870K->10418042K(21432832K), [Metaspace: 56010K->56010K(1099776K)], 303.5027565 secs]
Here I see -
[Times: user=711.36 sys=35.16, real=303.50 secs]
The system time spent in GC was 35.16 seconds. The user time was 711.36 seconds; this is the time added up across the GC threads. User + sys = 746.52 seconds.
The real time was only 303.50 seconds (5.05 minutes). This was the actual wall-clock time spent in GC. This looks good from the cluster perspective, since the system time was only 35.16 seconds; the system appears to have free CPU cycles to perform the GC.
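Another way to read the same Times line: 746.52 s of user + sys CPU time spread over 303.50 s of wall-clock time works out to 746.52 / 303.50 ≈ 2.5, i.e. on average only about 2.5 of the 8 configured GC threads' worth of work was running concurrently during this collection.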
The GC log analysis is from the GCeasy link :
https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTkvMTEvMTEvLS1nY19zcGFya18zLnR4dC0tMTktNTItNQ==&channel=WEB
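For anyone who wants to run their own analysis, the raw per-executor GC output from -verbose:gc lands in the executor container logs on YARN; once the application finishes (with log aggregation enabled) it can be pulled with the yarn CLI and uploaded to a tool like GCeasy (the application ID below is a placeholder):
yarn logs -applicationId application_1234567890123_0001 > sparktest_gc_logs.txt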
The GC logs show:
426.481: [Full GC (Ergonomics) [PSYoungGen: 1520503K->200302K(5950464K)] [ParOldGen: 6657880K->7368200K(10468864K)] 8178383K->7568503K(16419328K), [Metaspace: 55722K->55722K(1097728K)], 142.3782071 secs] [Times: user=329.05 sys=12.63, real=142.38 secs]
572.138: [GC (Allocation Failure) [PSYoungGen: 3949678K->1623296K(6054400K)] 11317879K->9192449K(16523264K), 40.8130588 secs] [Times: user=99.07 sys=2.21, real=40.81 secs]
633.942: [GC (Allocation Failure) [PSYoungGen: 5372672K->1304054K(7388672K)] 12941825K->10498870K(17857536K), 88.6569824 secs] [Times: user=212.32 sys=6.40, real=88.66 secs]
722.599: [Full GC (Ergonomics) [PSYoungGen: 1304054K->0K(7388672K)] [ParOldGen: 9194816K->10418042K(14044160K)] 10498870K->10418042K(21432832K), [Metaspace: 56010K->56010K(1099776K)], 303.5027565 secs] [Times: user=711.36 sys=35.16, real=303.50 secs]
1060.380: [GC (Allocation Failure) [PSYoungGen: 4968960K->208K(7355392K)] 15387002K->10418258K(21399552K), 0.1422739 secs] [Times: user=0.23 sys=0.50, real=0.14 secs]
1090.725: [GC (Allocation Failure) [PSYoungGen: 4969168K->352K(8035840K)] 15387218K->10418410K(22080000K), 0.0677690 secs] [Times: user=0.19 sys=0.34, real=0.07 secs]
1124.474: [GC (Allocation Failure) [PSYoungGen: 5606752K->256K(5606912K)] 16024810K->10418322K(19651072K), 0.0516994 secs] [Times: user=0.20 sys=0.17, real=0.05 secs]
1157.440: [GC (Allocation Failure) [PSYoungGen: 5606656K->368K(8077312K)] 16024722K->10418434K(22121472K), 0.0649818 secs] [Times: user=0.18 sys=0.33, real=0.07 secs]
The Spark UI shows that the Input Data Size is 217 MB.
Why is the GC pause so high for an input of only 217 MB?
What does the Input refer to here?
Labels:
- Apache Spark
- Apache YARN