Created on 09-03-2014 04:42 AM - edited 09-16-2022 02:06 AM
Hi All,
Our MapReduce jobs suddenly pause while an HDFS checkpoint is occurring and return to normal once the checkpoint is done. Is there any relation between standby NameNode activity and the active JobTracker in terms of MapReduce activity? We have been looking through the log files for clues, but so far without success.
For additional information, our cluster runs CDH4u3 with HA NameNode and JobTracker.
regards,
-i9um0-
Created 09-04-2014 04:43 AM
Created 09-04-2014 05:33 AM
Hi Gautam,
Roughly about 10-15 minutes. Recently we changed dfs.image.compress from false to true because we had a problem with the fsimage size growing fast. Could that affect the MapReduce jobs? Our team will run jstack and send you the output later on.
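For reference, the relevant hdfs-site.xml properties look roughly like this (a sketch only, not an exact copy of our config; the codec line shows the Hadoop default, which we did not change):

<property>
  <!-- compress the fsimage written at checkpoint time to keep its on-disk size down -->
  <name>dfs.image.compress</name>
  <value>true</value>
</property>
<property>
  <!-- codec used when dfs.image.compress is true; this is the Hadoop default -->
  <name>dfs.image.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>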
regards,
i9um0
Created 09-04-2014 05:48 AM
Created 09-05-2014 06:01 AM
Hi Gautam,
We segregate those services onto different hosts, but we collocate the JobTracker with the JournalNode service on the same host. Following are the links to our logs on pastebin:
-activeNN-jstack-1
http://pastebin.com/zpjFwC4z
-activeNN-jstack-2
-activeNN-jstack-3
http://pastebin.com/0Nuj5zpW
-activeNN-jstack-4
http://pastebin.com/xGj8ZpJa
-activeNN-jstack-5
http://pastebin.com/jCk8BVsS
-standbyNN-jstack-1a
http://pastebin.com/Sx4iZDMD
-standbyNN-jstack-1b
http://pastebin.com/WLa4E97N
-standbyNN-jstack-2a
http://pastebin.com/KYnBfntZ
-standbyNN-jstack-2b
http://pastebin.com/JMxatCzU
-standbyNN-jstack-3a
http://pastebin.com/vb4N3AX2
-standbyNN-jstack-3b
http://pastebin.com/UFTkjMGs
-standbyNN-jstack-4a
http://pastebin.com/xMsPKAfD
-standbyNN-jstack-4b
http://pastebin.com/7ReiGmYa
-standbyNN-jstack-5a
http://pastebin.com/SMmyvbkQ
-standbyNN-jstack-5b
http://pastebin.com/cS7Q2p2T
-activeJT-jstack-1a
-activeJT-jstack-1b
http://pastebin.com/ifr2EfYW
-activeJT-jstack-2a
http://pastebin.com/Jbj6YGda
-activeJT-jstack-2b
http://pastebin.com/2bqKG8pE
-activeJT-jstack-3a
http://pastebin.com/FLznnuuj
-activeJT-jstack-3b
http://pastebin.com/5km2MbMC
Just for your information, I replied via email as well, besides posting this message.
-i9um0-
Created 09-08-2014 05:11 AM
We see a lot of these in the JobTracker jstack. So the namenode is responding.
"DataStreamer for file /tmp/hadoop-hadoop-user/7418759843_pipe_1371547789813_7CC40A5EC84074F51068D326FE4B44CD/_logs/history/job_201409040312_85799_1409897033005_hadoop-user_%5B3529B6C5248F26FE0B927AADBA7BDA41%2F7E4BD3F9FCBCBE4B block BP-2096330913-10.250.195.101-1373872395153:blk_468657822786954548_993063000" daemon prio=10 tid=0x00007f1f2a96f000 nid=0x7b56 in Object.wait() [0x00007f1ebc9e7000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)
- locked <0x0000000625121b00> (a java.util.LinkedList)
Have you noticed a large spike in the number of blocks, and have you tuned your NN heap to deal with this rise? Did the JT pauses only begin when you turned on compression of the fsimage?
Created 09-08-2014 06:21 AM
Hi Gautam,
Thanks for the quick response. As I told you before, we have a problem with the HDFS block count: it has currently reached 58 million blocks, with about 500 TB of storage occupied, which is not quite ideal. The NN host has about 64 GB of memory, and we set 32 GB for the NN heap. Recently we changed dfs.namenode.checkpoint.txns from the default of 44K to 1000K, at the same time as enabling dfs.image.compress, because we thought the frequent checkpoints were degrading the NameNode service, as CM was reporting to us by email.
Regarding your question about whether the JT pauses only began when we turned on fsimage compression: we are not sure, because the MapReduce pauses were never as long as they are now (up to 10 minutes), so we did not notice whether they happened before or not.
Will only increasing the NN heap memory help, or are there other Hadoop parameters we can tune that would reduce the load on HDFS and bring MapReduce back to normal during checkpoints? The checkpoint-related settings we are looking at are sketched below.
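These are the checkpoint-related hdfs-site.xml properties we are considering; the values below are illustrative only (the txns value is what we set, the period is the Hadoop default, and the bandwidth throttle is one we have not set and which may not be present in every CDH4 release):

<property>
  <!-- standby NN triggers a checkpoint after this many uncheckpointed transactions; we raised this from 44K to 1000K -->
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>
<property>
  <!-- ...or after this many seconds since the last checkpoint (Hadoop default: one hour) -->
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
<property>
  <!-- optional throttle on the fsimage transfer back to the active NN; 0 means unlimited -->
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>0</value>
</property>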
regards,
i9um0