Created 05-27-2014 10:58 AM
Dear folks, I am seeing a strange issue with my Secondary NameNode: checkpointing is not happening as expected.
A checkpoint should occur every hour, but for the past few days it has been delayed to every 8-12 hours.
When I checked the Secondary NameNode log, I found "Exception in doCheckpoint; java.net.SocketTimeoutException: Read timed out".
I checked resource utilization on the NameNode and did not see any issue; there are plenty of resources.
On the Secondary NameNode, however, I see 100% CPU utilization, yet the host still shows 80% idle CPU. (This is enterprise hardware with 8 CPU cores.)
I suspect this is due to massive RPC load. Is there any utility to measure RPC activity on the NameNode?
Is there any way to find out what is causing this checkpoint delay?
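On the question of measuring RPC activity: one option, sketched here under assumptions, is to read the RPC metrics that the NameNode publishes as JSON on the /jmx path of its web port (50070 in this cluster). The bean name prefix and the field names below are what Hadoop 2.x releases typically expose, but they are assumptions and may differ on a given CDH version; `rpc_metrics` and `fetch_rpc_metrics` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

# Assumed bean name prefix; the suffix is the RPC port (for example 8020).
RPC_BEAN_PREFIX = "Hadoop:service=NameNode,name=RpcActivityForPort"
# Assumed metric names; fields missing from a bean are simply skipped.
FIELDS = (
    "RpcQueueTimeAvgTime",
    "RpcProcessingTimeAvgTime",
    "CallQueueLength",
    "NumOpenConnections",
)


def rpc_metrics(jmx_json):
    """Extract selected RPC metrics from a /jmx JSON document (a string)."""
    beans = json.loads(jmx_json).get("beans", [])
    result = {}
    for bean in beans:
        name = bean.get("name", "")
        if name.startswith(RPC_BEAN_PREFIX):
            result[name] = {f: bean[f] for f in FIELDS if f in bean}
    return result


def fetch_rpc_metrics(host, port=50070, timeout=10):
    """Download /jmx from the NameNode web UI and filter it client-side."""
    url = "http://%s:%d/jmx" % (host, port)
    with urlopen(url, timeout=timeout) as resp:
        return rpc_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_rpc_metrics("stats-2409.intranet.bit")` every minute or so and watching the queue time and call queue length would show whether RPC load is actually climbing when the checkpoints stall.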
Also, the NameNode and Secondary NameNode frequently go into BAD health in CM with the message "Cloudera Manager agent is not able to communicate with this role's web server."
(We recently configured Flume and I suspect it may be causing the issue, but I am not seeing any abnormal behavior on the NameNode.)
NOTE: These issues only became visible in the past 5 days. We have been running this cluster for more than a year (CDH 4.1.1, CM Cloudera Enterprise 4.6.3).
Best Regards,
Bommuraj
1:31:51.953 PM INFO org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader
replaying edit log: 582444772/110965 transactions completed. (524891%)
11:32:35.978 PM INFO org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader
replaying edit log: 582449593/110965 transactions completed. (524895%)
11:32:36.325 PM INFO org.apache.hadoop.hdfs.server.namenode.FSImage
Edits file /mnt/sda/dfs/snn/current/edits_0000000000582340622-0000000000582451586 of size 14029328 edits # 110965 loaded in 1059 seconds.
11:33:20.847 PM INFO org.apache.hadoop.hdfs.server.namenode.FSImage
Saving image file /mnt/sda/dfs/snn/current/fsimage.ckpt_0000000000582451586 using no compression
5:20:28.285 AM INFO org.apache.hadoop.hdfs.server.namenode.FSImage
Image file of size 2034249037 saved in 20827 seconds.
5:20:28.328 AM INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager
Going to retain 2 images with txid >= 582122380
5:20:28.328 AM INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager
Purging old image FSImageFile(file=/mnt/sda/dfs/snn/current/fsimage_0000000000582003081, cpktTxId=0000000000582003081)
5:20:28.741 AM INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager
Purging old edit log EditLogFile(file=/mnt/sda/dfs/snn/current/edits_0000000000580991759-0000000000581050238,first=0000000000580991759,last=0000000000581050238,inProgress=false,hasCorruptHeader=false)
5:20:28.744 AM INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager
Purging old edit log EditLogFile(file=/mnt/sda/dfs/snn/current/edits_0000000000581050239-0000000000581116511,first=0000000000581050239,last=0000000000581116511,inProgress=false,hasCorruptHeader=false)
5:20:28.769 AM INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage
Opening connection to http://stats-2409.intranet.bit:50070/getimage?putimage=1&txid=582451586&port=50090&storageInfo=-40:9...
5:21:28.802 AM ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
Exception in doCheckpoint
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:244)
at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:222)
at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:137)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:474)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:331)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:298)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:452)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:294)
at java.lang.Thread.run(Thread.java:662)
5:23:33.050 AM INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode 
Image has not changed. Will not download image.
5:23:33.051 AM INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage
Opening connection to http://stats-2409.intranet.bit:50070/getimage?getedit=1&startTxId=582451587&endTxId=582519424&storag...
5:23:33.348 AM INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage
Transfer took 0.30s at 26855.22 KB/s
5:23:33.348 AM INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage
Downloaded file edits_0000000000582451587-0000000000582519424 size 8167790 bytes.
5:23:33.349 AM INFO org.apache.hadoop.hdfs.server.namenode.Checkpointer
Checkpointer about to load edits from 1 stream(s).
5:23:33.349 AM INFO org.apache.hadoop.hdfs.server.namenode.FSImage
Reading /mnt/sda/dfs/snn/current/edits_0000000000582451587-0000000000582519424 expecting start txid #582451587
5:24:11.819 AM INFO org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader
replaying edit log: 582454371/67838 transactions completed. (858596%)
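The log above already shows where the checkpoint time goes: replaying 110965 edits took 1059 seconds, and saving the 2034249037-byte image took 20827 seconds (about 5.8 hours, roughly 95 KiB/s). A minimal sketch for pulling these figures out of a Secondary NameNode log follows; the regular expressions are written against the message formats shown in this log only, and `checkpoint_timings` is a hypothetical helper name.

```python
import re

# Patterns match the two FSImage messages as they appear in the log above.
SAVE_RE = re.compile(r"Image file of size (\d+) saved in (\d+) seconds")
LOAD_RE = re.compile(r"edits # (\d+) loaded in (\d+) seconds")


def checkpoint_timings(log_text):
    """Return (event, seconds, rate) tuples in log order.

    rate is bytes/second for "save_image" and transactions/second for
    "load_edits"; it is None when the reported duration is 0 seconds.
    """
    events = []
    for line in log_text.splitlines():
        match = SAVE_RE.search(line)
        if match:
            size, secs = int(match.group(1)), int(match.group(2))
            events.append(("save_image", secs, size / secs if secs else None))
            continue
        match = LOAD_RE.search(line)
        if match:
            txns, secs = int(match.group(1)), int(match.group(2))
            events.append(("load_edits", secs, txns / secs if secs else None))
    return events
```

Running this over several days of logs makes it easy to see whether the image save time grew gradually or jumped at a particular point.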
Created 05-29-2014 10:17 AM
Hi, as per HARSH.J's suggestion,
I increased the heap size (from 5 GB to 8 GB) for the NameNode and Secondary NameNode.
Issue resolved!
Thank you, HARSH.
Best Regards,
Bommuraj
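Since a larger heap resolved the problem, heap pressure on the two roles is worth monitoring. As a hedged sketch: the standard JVM memory bean (`java.lang:type=Memory`, attribute `HeapMemoryUsage`) is normally included in the same /jmx JSON served on the role's web port (50070 for the NameNode and 50090 for the Secondary NameNode, per the log above). Whether /jmx is reachable on this exact CDH release is an assumption, and `heap_usage_fraction` is a hypothetical helper name.

```python
import json


def heap_usage_fraction(jmx_json):
    """Return used/max heap from a /jmx JSON document, or None if unavailable."""
    for bean in json.loads(jmx_json).get("beans", []):
        if bean.get("name") == "java.lang:type=Memory":
            heap = bean.get("HeapMemoryUsage", {})
            # max can be reported as -1 when undefined, so guard against it.
            if heap.get("max", 0) > 0 and "used" in heap:
                return heap["used"] / heap["max"]
    return None
```

A value that stays close to 1.0 suggests the JVM is spending most of its time in garbage collection, which would be one plausible reading of a single busy core on an otherwise idle host.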
