
A lot of blocks missing in HDFS

Rising Star

Yesterday I added three more DataNodes to my HDFS cluster running HDP 2.6.4.

A few hours later, because of a Spark write error (No lease on ...), I increased dfs.datanode.max.xcievers to 65536, increased the heap size of the NameNode and DataNodes from 5 GB to 12 GB, and then restarted HDFS.

However, the HDFS restart stalled at the NameNode stage: the NameNode stayed in safe mode for over 10 minutes. I forced it to leave safe mode manually, and then HDFS reported that a lot of blocks were missing (more than 90%).
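
(For reference, by "forced it to leave safe mode" I mean the standard dfsadmin commands, roughly like this:)

    # check whether the NameNode is still in safe mode
    hdfs dfsadmin -safemode get
    # force the NameNode to leave safe mode
    hdfs dfsadmin -safemode leave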

I checked the DataNode and NameNode logs and found two kinds of errors:

1. In the NameNode log: Requested data length ** is longer than maximum configured RPC length **

2. In the DataNode log: End of File Exception between local host is "***", destination host is "**:8020"

So how can I recover my missing files, and what is the actual cause of this problem?


5 REPLIES

Contributor

For your first question, look at the ipc.maximum.data.length parameter; that should help. On the other hand, a value of 65536 for dfs.datanode.max.xcievers seems way too high.

Basically, I suspect your DataNode block reports are not reaching the NameNode because of that length limitation, so the NameNode is missing too many blocks to exit safe mode on its own. That also explains why it reports missing blocks after a forced safe-mode exit.
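
As a quick sanity check (standard hdfs CLI, nothing cluster-specific assumed here), you can look at the safe-mode state and at which files the NameNode currently thinks are missing or corrupt:

    # safe-mode status according to the NameNode
    hdfs dfsadmin -safemode get
    # live DataNodes and how many blocks they have reported
    hdfs dfsadmin -report
    # files with missing/corrupt blocks (can take a while on a large namespace)
    hdfs fsck / -list-corruptfileblocks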

For NameNode heap configuration, see https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_command-line-installation/content/config...

Rising Star

Thanks @KB

I have reset dfs.datanode.max.xcievers to 32768. Is that still too high?

I increased it to avoid the "No lease on file (inode 5425306)" error. What is the proper value for this property?

If I set it to a proper value, will the missing blocks be recovered automatically?

Contributor (accepted solution)

@Junfeng dfs.datanode.max.transfer.threads (i.e. dfs.datanode.max.xcievers) and DataNode memory go up together. Anything in the range of 4096-8192 should be good enough. By the way, this configuration won't fix your missing-block issue, but it is important for avoiding exceptions such as the thread limit/quota being exceeded or the DataNode running out of memory.
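
For example, a sketch of the hdfs-site.xml setting (8192 is just the upper end of the range above, not a magic number; on an Ambari-managed HDP cluster you would change it through the HDFS config screen rather than editing the file by hand):

    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>8192</value>
    </property>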

In my previous comment I forgot to mention that you need to tune the ipc.maximum.data.length parameter to make sure the NameNode actually receives the DataNode block reports. Based on your error, the reports appear to be rejected because they are most likely exceeding the default limit of 64 MB. Once you tune ipc.maximum.data.length, the missing blocks should most likely go away.
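
Something along these lines in core-site.xml, followed by a NameNode restart (134217728, i.e. 128 MB, is only an illustrative value; size it above your largest block report):

    <property>
      <name>ipc.maximum.data.length</name>
      <value>134217728</value>
    </property>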

Rising Star

I increased the IPC max length according to https://community.hortonworks.com/questions/101841/issue-requested-data-length-146629817-is-longer-t...

The HDFS service seems to be back to normal.

Rising Star

Thanks @KB

And another question:

When my Spark application writes a large amount of data to HDFS, it frequently throws an error message like the following:

No lease on /user/xx/sample_2016/_temporary/0/_temporary/attempt_201604141035_0058_m_019029_0/part-r-19029-1b93e1fa-9284-4f2c-821a-c83795ad27c1.gz.parquet: File does not exist. Holder DFSClient_NONMAPREDUCE_1239207978_115 does not have any open files.

How do I solve this problem? I searched online and others said it is related to dfs.datanode.max.xcievers.