<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Spark Job long GC pauses in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-Job-long-GC-pauses/m-p/282690#M210140</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am running a Spark Job as below:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;SPAN&gt;/spark-submit --master yarn --deploy-mode client --name sparktest &lt;/SPAN&gt;&lt;SPAN&gt;--num-executors 500&lt;/SPAN&gt; &lt;SPAN&gt;--executor-memory 30g&lt;/SPAN&gt; &lt;SPAN&gt;--driver-memory 16g --queue apps_aup --conf&lt;/SPAN&gt; &lt;SPAN&gt;spark.yarn.executor.memoryOverhead=2000&lt;/SPAN&gt; &lt;SPAN&gt;--conf spark.driver.maxResultSize=12g --conf&lt;/SPAN&gt; &lt;SPAN&gt;spark.executor.cores=2&lt;/SPAN&gt; &lt;SPAN&gt;--conf&lt;/SPAN&gt; &lt;SPAN&gt;spark.task.cpus=2&lt;/SPAN&gt; &lt;SPAN&gt;--conf&lt;/SPAN&gt; &lt;A href="http://spark.network.timeout=600/" target="_blank" rel="noopener"&gt;spark.network.timeout=600&lt;/A&gt; &lt;SPAN&gt;--conf spark.blacklist.enabled=true --conf "spark.executor.extraJavaOptions=&lt;/SPAN&gt;&lt;SPAN&gt;-verbose:gc -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;BR /&gt;&lt;BR /&gt;The Job runs successfully, but everytime it runs there are some of the tasks which run into long GC Pauses.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;SPAN&gt;I could see following when troubleshooting GC:&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;A.)&amp;nbsp;&lt;/P&gt;
&lt;DIV class="panel"&gt;
&lt;DIV class="panelHeader"&gt;&lt;STRONG&gt;Observation for GC -&amp;nbsp; Executor ID 2&lt;/STRONG&gt;&lt;/DIV&gt;
&lt;DIV class="panelContent"&gt;
&lt;P&gt;1.&amp;nbsp; GC options Used:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="code panel"&gt;
&lt;DIV class="codeContent panelContent"&gt;
&lt;PRE&gt;-verbose:gc -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2. Executor Details :&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="code panel"&gt;
&lt;DIV class="codeContent panelContent"&gt;
&lt;PRE&gt;spark.executor.memory = 30GB
Storage Memory of Executor = 17GB
spark.executor.instances = 500&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="table-wrap"&gt;
&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;Executor ID&lt;/TD&gt;
&lt;TD&gt;2&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;3. Task Details :&amp;nbsp;&lt;/P&gt;
&lt;DIV class="table-wrap"&gt;
&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;Task Time = 19 min , GC Time =&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;16&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;min&lt;/TD&gt;
&lt;TD&gt;Input Data Size =&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;217.9&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;MB&lt;/TD&gt;
&lt;TD&gt;Shuffle Write Size = 651.8 MB&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;/DIV&gt;
&lt;P&gt;The above shows that the Total GC time&amp;nbsp; on Executor ID 2, was very high. it was around 16 minutes.&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;B.) &lt;STRONG&gt;GC Analysis for Sample Executor ID 2&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="panel"&gt;
&lt;DIV class="panelContent"&gt;
&lt;P&gt;&lt;STRONG&gt;Full GC time:&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="code panel"&gt;
&lt;DIV class="codeContent panelContent"&gt;
&lt;PRE&gt;The Full GC Count happened 7 times.
The&amp;nbsp;total time &lt;SPAN class="code-keyword"&gt;for&lt;/SPAN&gt; these 7&amp;nbsp;&amp;nbsp;Full GC was&amp;nbsp;10 min 38 sec 190 ms.
Hence, the&amp;nbsp; average Full GC time was&amp;nbsp;1 min 31 sec 170 ms.
The Average Interval between&amp;nbsp;Full GC Interval was&amp;nbsp;1 min 59 sec 524 ms.&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;Full GC Size :&lt;/P&gt;
&lt;DIV class="code panel"&gt;
&lt;DIV class="codeContent panelContent"&gt;
&lt;PRE&gt;The reclaimed data size&amp;nbsp; from the full GC was 1.56 gb &lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&lt;STRONG&gt;Minor GC :&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="code panel"&gt;
&lt;DIV class="codeContent panelContent"&gt;
&lt;PRE&gt;The Total Minor GC count was 20

The total time &lt;SPAN class="code-keyword"&gt;for&lt;/SPAN&gt; Minor GC was&amp;nbsp;5 min 32 sec 380 ms

The average time &lt;SPAN class="code-keyword"&gt;for&lt;/SPAN&gt; Minor GC was&amp;nbsp;16 sec 619 ms&amp;nbsp;

Each of these Minor GC happened at an interval &lt;SPAN class="code-keyword"&gt;for&lt;/SPAN&gt; approximately&amp;nbsp;1 min 631 ms 

During these 20 Minor GC the Total size of Data reclaimed was&amp;nbsp;33.65 GB&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;C.)&lt;/P&gt;
&lt;DIV class="code panel"&gt;
&lt;DIV class="codeContent panelContent"&gt;
&lt;PRE&gt;The Total objects created by application added to be around 45.15 GB.
The &lt;SPAN class="code-object"&gt;Object&lt;/SPAN&gt; creation rate by the application was around 40 MB/s.
The Total Objects which got promoted from youngGen to oldGen in each GC cycle was around 8 GB. These objects were promoted on an average rate of 7 MB/s

There was around 18 Allocation Failures, which indicated that the object creation rate was higher than the free heap space in the younGen. This lead to 18 Minor GC cycles (out of 20 total minor GC) to free up younGen and promote objects from younGen to OldGen.&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;D.) GC Time :&lt;/P&gt;
&lt;DIV class="panel"&gt;
&lt;DIV class="panelHeader"&gt;&lt;STRONG&gt;GC Time - real, user, sys&lt;/STRONG&gt;&lt;/DIV&gt;
&lt;DIV class="panelContent"&gt;
&lt;P&gt;The longest Full GC was around 5 minutes :&lt;/P&gt;
&lt;DIV class="code panel"&gt;
&lt;DIV class="codeContent panelContent"&gt;
&lt;PRE&gt;722.599: [Full GC (Ergonomics) [PSYoungGen: 1304054K-&amp;gt;0K(7388672K)] [ParOldGen: 9194816K-&amp;gt;10418042K(14044160K)] 10498870K-&amp;gt;10418042K(21432832K), [Metaspace: 56010K-&amp;gt;56010K(1099776K)], 303.5027565 secs] &lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;Here I see -&amp;nbsp;&lt;/P&gt;
&lt;DIV class="code panel"&gt;
&lt;DIV class="codeContent panelContent"&gt;
&lt;PRE&gt;[Times: user=711.36 sys=35.16, real=303.50 secs]&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;The time spend for GC for system was 35.16 second.&lt;BR /&gt;The user time was 711.36 seconds. This is the time added by each GC thread.&lt;BR /&gt;The User + Sys = 746.52 seconds.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The Real Time was only 303.50 seconds (5.05 Minutes). This was the actual time which was spend in GC.&amp;nbsp; This looks good from the Cluster Perspective since the system time was only 35.16. The system looks good (CPU cycles free) to perform the GC.&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;SPAN&gt;The GC log analysis is from the GCeasy link :&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;TABLE border="1"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;SPAN&gt;&lt;SPAN&gt;&lt;A href="https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTkvMTEvMTEvLS1nY19zcGFya18zLnR4dC0tMTktNTItNQ==&amp;amp;channel=WEB" target="_blank" rel="noopener"&gt;https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTkvMTEvMTEvLS1nY19zcGFya18zLnR4dC0tMTktNTItNQ==&amp;amp;channel=WEB&lt;/A&gt; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;SPAN&gt;The GC logs shows:&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;TABLE border="1"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;426.481: [Full GC (Ergonomics) [PSYoungGen: 1520503K-&amp;gt;200302K(5950464K)] [ParOldGen: 6657880K-&amp;gt;7368200K(10468864K)] 8178383K-&amp;gt;7568503K(16419328K), [Metaspace: 55722K-&amp;gt;55722K(1097728K)], 142.3782071 secs] [Times: user=329.05 sys=12.63, real=142.38 secs]&lt;BR /&gt;572.138: [GC (Allocation Failure) [PSYoungGen: 3949678K-&amp;gt;1623296K(6054400K)] 11317879K-&amp;gt;9192449K(16523264K), 40.8130588 secs] [Times: user=99.07 sys=2.21, real=40.81 secs]&lt;BR /&gt;633.942: [GC (Allocation Failure) [PSYoungGen: 5372672K-&amp;gt;1304054K(7388672K)] 12941825K-&amp;gt;10498870K(17857536K), 88.6569824 secs] [Times: user=212.32 sys=6.40, real=88.66 secs]&lt;BR /&gt;722.599: [Full GC (Ergonomics) [PSYoungGen: 1304054K-&amp;gt;0K(7388672K)] [ParOldGen: 9194816K-&amp;gt;10418042K(14044160K)] 10498870K-&amp;gt;10418042K(21432832K), [Metaspace: 56010K-&amp;gt;56010K(1099776K)], 303.5027565 secs] [Times: user=711.36 sys=35.16, real=303.50 secs]&lt;BR /&gt;1060.380: [GC (Allocation Failure) [PSYoungGen: 4968960K-&amp;gt;208K(7355392K)] 15387002K-&amp;gt;10418258K(21399552K), 0.1422739 secs] [Times: user=0.23 sys=0.50, real=0.14 secs]&lt;BR /&gt;1090.725: [GC (Allocation Failure) [PSYoungGen: 4969168K-&amp;gt;352K(8035840K)] 15387218K-&amp;gt;10418410K(22080000K), 0.0677690 secs] [Times: user=0.19 sys=0.34, real=0.07 secs]&lt;BR /&gt;1124.474: [GC (Allocation Failure) [PSYoungGen: 5606752K-&amp;gt;256K(5606912K)] 16024810K-&amp;gt;10418322K(19651072K), 0.0516994 secs] [Times: user=0.20 sys=0.17, real=0.05 secs]&lt;BR /&gt;1157.440: [GC (Allocation Failure) [PSYoungGen: 5606656K-&amp;gt;368K(8077312K)] 16024722K-&amp;gt;10418434K(22121472K), 0.0649818 secs] [Times: user=0.18 sys=0.33, real=0.07 secs]&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;SPAN&gt;The Spark UI shows that the Input Data Size is 217 MB:&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screen Shot 2019-11-11 at 10.10.39 PM.png" style="width: 999px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/25266iD1550C215C4C0DEB/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screen Shot 2019-11-11 at 10.10.39 PM.png" alt="Screen Shot 2019-11-11 at 10.10.39 PM.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;SPAN&gt;Why is the GC Pause so high for a Input data of only 217 MB?&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;SPAN&gt;What does the Input refer here?&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 12 Nov 2019 13:46:33 GMT</pubDate>
    <dc:creator>PARTOMIA09</dc:creator>
    <dc:date>2019-11-12T13:46:33Z</dc:date>
  </channel>
</rss>

