
File Descriptor usage in Datanode climbing steadily

Contributor

Hi,

 

We just started using Cloudera Manager Express 5.9 (the same version on NameNodes and DataNodes) for our HDFS cluster. When our internal client posts logs to HDFS, the file descriptor usage on the DataNodes climbs continually until it reaches the Warning level of 50% and ultimately crosses the Critical threshold of 70% (the default limits configured in the health tests). The only way to bring the usage down is to restart the DataNode service on each of the DataNodes, which is really disruptive to our use of HDFS.

 

In the past we were using Cloudera Standard 4.6.2, and with the same setup we never saw file descriptor usage this high.

 

I checked the number of configured file descriptors in both 5.9 and 4.6.2, and it is the same 32k value in both.

 

Investigation report:

 

I used ps -ef --cols 9999 | grep hdfs to find the hdfs pid, then /usr/sbin/lsof -p [pid] | wc -l to count the open files. Here are the changes:

 

Open-file counts per DataNode:

            Datanode1  Datanode2  Datanode3
Before send     16490      16490      16486
After send      16580      16580      16576
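As a side note, the same count can be taken without lsof by listing the process's fd directory under /proc. A small sketch (Linux-specific; purely an alternative to the lsof pipeline above):

```shell
# Count open file descriptors for a pid via /proc (Linux-specific).
open_fd_count() {
  ls "/proc/$1/fd" | wc -l
}
```

Usage: open_fd_count [pid], with the pid found from ps as above.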


In all three data nodes, there are many open files like this:

java    23757 hdfs  593r   REG             202,81        59   29658600 /opt/dfs/dn/current/BP-832824084-10.189.101.91-1484719231606/current/finalized/subdir0/subdir148/blk_1073779777_70965.meta

java    23757 hdfs  594r   REG             202,81       519   29658617 /opt/dfs/dn/current/BP-832824084-10.189.101.91-1484719231606/current/finalized/subdir0/subdir148/blk_1073779790_71007.meta

java    23757 hdfs  595w   REG             202,81       119   29658629 /opt/dfs/dn/current/BP-832824084-10.189.101.91-1484719231606/current/finalized/subdir0/subdir148/blk_1073779801_71047.meta
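To confirm the leak is dominated by block .meta files, the lsof output above can be tallied by open mode. A small sketch (assuming lsof's default column layout as shown above: the 4th field is the FD, e.g. "594r", and the last field is the file path):

```shell
# Count open block .meta files, grouped by open mode (r = read, w = write).
# Reads lsof output on stdin.
count_meta_fds() {
  awk '$NF ~ /\.meta$/ { mode = substr($4, length($4)); n[mode]++ }
       END { for (m in n) printf "%s %d\n", m, n[m] }'
}
```

Usage: /usr/sbin/lsof -p [pid] | count_meta_fds. A count here that grows and never shrinks would confirm the descriptors are stuck on .meta files.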

 

The situation was the same even a couple of hours later; the number of open file descriptors did not decrease.

 

Has anyone else seen this problem and found a solution? We would be really grateful for your support.

 

Please let me know if you have any questions.

 

Thanks,

Raj

1 ACCEPTED SOLUTION

Champion
Are the number of datanodes the same? Is the block size the same? How many blocks are on each cluster?

The *.meta files are metadata files for the blocks. This may have been a change compared to Hadoop 1; I am not sure.

It is a bit weird for it to never go down, though. I have a cluster with millions of blocks and hundreds of TBs; I'll get spikes, but the open FDs average around 2k per node.

It does depend on how much work the DNs are under as well.

Can you increase the FD limits?
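Related to that last question: before raising the limit, it may be worth confirming what the DataNode process is currently allowed. A sketch (Linux-specific, reading /proc; pass the pid found with ps as above):

```shell
# Print a process's soft open-file limit from /proc/<pid>/limits.
fd_limit() {
  awk '/Max open files/ { print $4 }' "/proc/$1/limits"
}
```

Usage: fd_limit [pid]. If this already reports 32768, raising the limit only delays the alert while the underlying leak remains.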


7 REPLIES


Contributor

Hi,

 

I haven't seen the file descriptors rising since I opened this ticket. Feel free to close it.

 

Thanks for the suggestions though 🙂 

 

Thanks, Raj

Contributor

This is happening again.

 

We now have 4 large machines handling 1/6th of the load. A similar set of 4 data nodes on version 4.6.2 does not show file descriptors climbing, so it seems to be something in the 5.9 version itself.

 

Can someone please confirm the next course of action in this case?

 

Looking forward to your response.

 

Thanks,

Raj

New Contributor

We have the same issue.
We upgraded from 2.6.0 CDH 5.7.6 to 2.6.0 CDH 5.9.1.
Since then, our data nodes have been leaking open file descriptors to block .meta files.
We didn't have any issues before the upgrade.
The attached screenshot shows the change in behavior after the upgrade for one of our data nodes.
The drops occur when we restart the HDFS service.

 

sc.png

New Contributor

Downgrading from 2.6.0-cdh5.9.1 back to 2.6.0-cdh5.8.4 looks to have fixed the problem.

Our HDFS is back to being usable and stable.

 

 

Contributor

Hi nmous,

 

How did you downgrade from 5.9 to 5.8.4? Is there a link to documentation you can share?

 

Looking forward to your response.

 

Thanks, Raj

New Contributor

We are only running hdfs, so we only needed to change that.

Since it was a dev environment, we shut all of hdfs down, downloaded hadoop-2.6.0-cdh5.8.4.tar.gz from http://archive.cloudera.com/cdh5/cdh/5/, and ran with that.
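Spelled out, the sequence above looks roughly like this (a sketch under the same assumptions: a tarball install on a dev cluster where full HDFS downtime is acceptable; the actual commands are commented out since they need a live cluster):

```shell
CDH_VER="2.6.0-cdh5.8.4"
TARBALL="hadoop-${CDH_VER}.tar.gz"
URL="http://archive.cloudera.com/cdh5/cdh/5/${TARBALL}"
# stop-dfs.sh                                 # 1. shut all of hdfs down
# curl -LO "$URL"                             # 2. fetch the older build
# tar -xzf "$TARBALL"                         # 3. unpack it
# "./hadoop-${CDH_VER}/sbin/start-dfs.sh"     # 4. run with that
echo "$URL"
```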

 

(We are actually running with hdfs on mesos, so the artifacts get packaged up into an uberjar with the mesos executor, but there's no real magic there.  I think it just uses the stuff in hadoop/common and hadoop/hdfs and some of the run scripts.)