Refresh Data Node Configuration Appears to be Stuck After Data Dir Removal

New Contributor

Hi All,

 

One of the disks in our DataNode has gone bad, and I wanted to hot-swap it. I followed the instructions in the Cloudera documentation (https://www.cloudera.com/documentation/enterprise/latest/topics/admin_dn_swap.html).

 

Following the guide, I removed the affected directory from the `dfs.datanode.data.dir` configuration for that DataNode instance only and clicked Refresh Configuration in the Actions menu.
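
For context, here is roughly what I mean; the mount points below are made up, not our actual layout, and on a CM-managed host the deployed client config may not reflect the DataNode role's own override:

```
# Illustrative only -- these paths are placeholders, not our real mounts.
# Before removing the failed disk, dfs.datanode.data.dir looked something like:
#   /data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn
# After the edit (say /data/3 sits on the bad disk), only three entries remain:
#   /data/1/dfs/dn,/data/2/dfs/dn,/data/4/dfs/dn

# What the deployed client configuration reports for the property:
hdfs getconf -confKey dfs.datanode.data.dir
```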

 

It has been running for 3 hours now, and I'm getting worried. It hasn't shown any progress updates either, so I can't really tell whether anything is actually happening.

 

The role logs look like the screenshot below and keep getting updated periodically:

Screen Shot 2017-08-09 at 20.02.58.png

 

The only thing I see in the logs for the command itself is below:

Screen Shot 2017-08-09 at 20.03.05.png

 

The DataNode itself is showing a red icon, and when I click on it, I see the screenshot below. Is the lack of connectivity to the NameNode causing the Refresh Configuration process to hang?

Screen Shot 2017-08-09 at 20.06.58.png
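
In case it's relevant, this is roughly how I've been sanity-checking connectivity from the DataNode host itself; the hostname is a placeholder, 8020 is only the default NameNode RPC port, and the log path is simply where the role logs live on our hosts:

```
# All values below are assumptions -- substitute your own NameNode host/port.
ps -ef | grep -i "[d]atanode"       # is the DataNode process running at all?
nc -zv namenode.example.com 8020    # can this host reach the NameNode RPC port?
# The DataNode role logs (under /var/log/hadoop-hdfs/ on our hosts) are where
# I'd expect to see repeated connection retries if it can't reach the NameNode.
```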

 

Cloudera Configuration:

Cloudera Community Edition

Cloudera Manager v5.10.1

CDH v5.10.1

Hadoop v2.6 (the version bundled with the above CM/CDH releases)

 

I also see this on my Cluster's homepage:

Screen Shot 2017-08-09 at 20.09.34.png

 

As you can see, there's a Stale Configuration icon next to the HDFS role. However, when I click on it, it tells me that I have no stale configuration and that everything is alright.

 

Additional Information:

When I ran the `hdfs fsck / -files -blocks -locations > dfs-new-fsck-2.log` command, it reported that the filesystem health check is OK. However, it only shows 3 DataNodes connected instead of 4.
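
For reference, this is the kind of cross-check I'm using to see the live/dead DataNode count from the NameNode's side; running it as the `hdfs` user is an assumption about which account has HDFS superuser rights on your cluster:

```
# Summarise the NameNode's view of the DataNodes; "hdfs" below is an assumption --
# use whichever account has superuser rights on your cluster.
sudo -u hdfs hdfs dfsadmin -report | grep -i "datanodes"
```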

 

The cluster is in a sane state otherwise. All other roles are running fine, and HDFS itself appears to be running fine from what I can see. Our Oozie jobs are continuing to run and write data, and we are also able to access the data through Spark without any issues.

 

I'm guessing I'm in a safe position because the DataNode is not connected to the NameNode, but I can't be sure.

 

I have two questions:

1) How long does it normally take for the Refresh Data Node command to run?

2) Since I can see an Abort button on the page, can I safely abort the Refresh Data Node command without any risk of data loss?

 

Any help on this would be greatly appreciated. Thank you in advance.

 

1 REPLY

New Contributor

Okay. I monitored the logs and saw that nothing was really happening on the DataNode or its data directories, so I went ahead and aborted the configuration refresh.

 

I also ended up restarting the entire cluster afterwards and, miraculously, everything is alright now.

 

My NameNode is not reporting any issues, the DataNode configuration change has taken effect, and the NameNode is also able to see the DataNode again. I can run queries in Hive, and Spark isn't complaining either.

 

`hdfs fsck / -files -blocks -locations > dfs-new-fsck-4.log` also reported the findings below, which tell me everything is good.

 

Status: HEALTHY
 Total size:    1892542036524 B (Total open files size: 134227921 B)
 Total dirs:    745065
 Total files:   2651007
 Total symlinks:                0 (Files currently being written: 3)
 Total blocks (validated):      2652077 (avg. block size 713607 B) (Total open file blocks (not validated): 4)
 Minimally replicated blocks:   2652077 (100.0 %)
 Over-replicated blocks:        11083 (0.41789886 %)
 Under-replicated blocks:       1 (3.7706297E-5 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.994875
 Corrupt blocks:                0
 Missing replicas:              1 (1.2607865E-5 %)
 Number of data-nodes:          4
 Number of racks:               1
FSCK ended at Wed Aug 09 19:01:18 CEST 2017 in 43196 milliseconds


The filesystem under path '/' is HEALTHY

Apologies for wasting everyone's time here.