Created 01-10-2017 03:08 AM
The default is 6 hours, so for example, can I safely set dfs.blockreport.intervalMsec to 1 week or 1 month?
Also, I'd like to know what risk(s) might be caused by this change.
Created 01-10-2017 03:16 AM
@Tomomichi Hirano No, you cannot. Block reports serve an essential function that allows the NameNode to reconcile the state of the cluster: they tell the NameNode which blocks are present, which blocks are to be deleted, whether a block is under-replicated, and so on. Full block reports are expensive for the NameNode process (there are both incremental and full reports), which is why the interval is already set to a fairly long value. However, if you set it to a really long value like 1 month, your NameNode might not work correctly. Typically the only reason to change this value is if your NameNode is under severe load, so if you are not experiencing such a problem, I would suggest that you don't change this parameter.
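For reference, dfs.blockreport.intervalMsec is set in hdfs-site.xml and is expressed in milliseconds; a minimal sketch at the default value (21600000 ms = 6 hours) looks like this:

    <property>
      <name>dfs.blockreport.intervalMsec</name>
      <!-- interval between full block reports from each DataNode, in milliseconds (default 21600000 = 6 hours) -->
      <value>21600000</value>
    </property>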
Created 01-10-2017 03:52 AM
Thank you very much for your quick and kind response. Let me confirm my understanding.
In my understanding, there are two kinds of block reports: the "full block report" (default interval 6 hours) and the "differential block report" (sent in a more or less synchronous manner, I think?). In the normal case, block information between the NameNode and DataNodes is kept in sync by the differential block reports, and the full block report is there to correct inconsistencies in block information between the NameNode and DataNodes that might be caused by unexpected behavior or an unknown problem.
Actually, we are suffering from a "block report storm" on the standby NameNode. Our block report storm happens some time after checkpoint activity... We have already tried some parameters related to this problem, such as dfs.blockreport.initialDelay, but did not get a perfect solution. So now we are trying to tune dfs.blockreport.intervalMsec to decrease the load on the NameNode (especially on the standby NameNode).
We have already increased it from the default of 6 hours to 12 hours, and we are planning to try 24 hours next. I agree that 1 month is too long, but do you have any advice or best practice for this parameter if someone wants to increase it?
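For concreteness, here is a minimal hdfs-site.xml sketch of the two parameters discussed above; the values shown (a 24-hour full report interval and a 10-minute random initial delay) are only illustrative, not a recommendation:

    <property>
      <name>dfs.blockreport.intervalMsec</name>
      <!-- full block report interval in milliseconds: 24 hours = 86400000 (default is 21600000, i.e. 6 hours) -->
      <value>86400000</value>
    </property>
    <property>
      <name>dfs.blockreport.initialDelay</name>
      <!-- each DataNode delays its first full block report after startup by a random time between 0 and this many seconds -->
      <value>600</value>
    </property>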
Created 01-10-2017 06:15 PM
@Tomomichi Hirano Without understanding why you are seeing a "block report storm", it is hard to say whether increasing this parameter will help. Typically most clusters -- even very large ones -- work fine with this parameter. Would you be able to share how many DataNodes are in this cluster and how many data blocks it has?
If you have too many blocks, then block reports might slow down the active NameNode. I am surprised that you are impacted by performance issues on the standby NameNode. Could it be that you have GC issues and are seeing some kind of alerts from the standby?
Created 01-11-2017 04:48 AM
@aengineer Thank you for your help!
The number of DataNodes is around 200, and the total number of blocks in HDFS is over 200,000,000. So we can assume each DataNode has about 1,000,000 blocks.
Actually, block reports don't slow down the active NameNode at all. Only the standby NameNode suffers from the block report storm. We can see this problem in metrics such as RpcQueueTimeAvgTime and RpcProcessingTimeAvgTime on the standby NameNode.
The problem is that the standby NameNode cannot handle block reports from some DataNodes for some time after checkpoint activity. Usually there is no problem, but this block report storm happens only on the standby NameNode, some time after checkpointing. Once the standby NameNode starts failing to handle block reports from some DataNodes, it keeps failing, and in the end block reports from all DataNodes fail continuously until the standby NameNode is restarted.
Let me share some concrete figures.
- The standby NameNode handles block reports from around 16 or 17 DataNodes per hour (200 DataNodes / 12 hours = 16.67).
- Each DataNode with 12 HDDs sends 12 block reports (one per disk), so that comes to about 200 block reports per hour (16.67 * 12 = 200.04).
Usually the standby NameNode only has to handle block reports from one or two DataNodes (12 or 24 block reports) at the same time, and that is no problem at all. But a checkpoint on the standby NameNode takes about 20 minutes, so after the 20-minute wait for checkpointing it has to handle block reports from 6 or more DataNodes (72+ block reports) at the same time... and then the standby NameNode starts failing to handle them.
My idea is that if we change dfs.blockreport.intervalMsec from 12 hours to 24 hours, the number of block reports in a given time window will be cut in half. I don't think this change can solve our problem perfectly, but it should reduce how often the problem occurs.
Created 01-12-2017 02:28 AM
Could it be that the standby NameNode has too much garbage collection going on? You might want to look for GCs, and if you see a GC happening each time your checkpoint runs, that might explain why the standby NameNode is not able to keep up. If that is the case, then tuning the NameNode's memory settings might be the right approach instead of changing the block report frequency.
When we do a checkpoint we have to decode a lot of small objects, and that can create memory pressure on the standby NameNode. Can you please check whether there is a correlation between GC and checkpointing?
Created 01-12-2017 08:48 AM
Actually, we are also monitoring the counts of full GCs (CMS) and minor GCs. We did have a memory problem in the past, but now we have enough JVM heap for the NameNodes, and there are no full GCs (CMS) during checkpoint activities, although of course minor GCs happen continuously. So I believe this problem is not related to the JVM heap.
What I want to confirm now is whether my understanding below makes sense.
1. The purpose of full block reporting from DN to NN
There are two kinds of block reports: the "full block report" (default interval 6 hours) and the "differential block report" (sent in a more or less synchronous manner, I think?). In the normal case, block information between the NameNode and DataNodes is kept in sync by the differential block reports, and the full block report is there to correct inconsistencies in block information between the NameNode and DataNodes that might be caused by unexpected behavior or an unknown problem.
2. The risk of extending "dfs.blockreport.intervalMsec"
The possibility of data loss (due to lost blocks, etc.) might increase, because detection of inconsistencies in block information between the NameNode and DataNodes might be delayed.
Does my understanding make sense?
Created 01-12-2017 06:19 PM
@Tomomichi Hirano Yes, it does.
Created 01-13-2017 12:15 AM
Thanks for your answers and for your help!
Created 06-08-2017 05:30 AM
Let me update this thread just to share the outcome.
Actually, the block report problem was completely solved by upgrading from HDP 2.2 to HDP 2.4.
Thank you for your support.