HDFS goes in bad health


Health test shows the following errors:


     The health test result for HDFS_FREE_SPACE_REMAINING has become bad: Space free in the cluster: 0 B. Capacity of the cluster: 0 B. Percentage of capacity free: 0.00%. Critical threshold: 10.00%.
     The health test result for HDFS_CANARY_HEALTH has become bad: Canary test failed to create parent directory for /tmp/.cloudera_health_monitoring_canary_files.

 

We manually verified that space isn't an issue. Connectivity tests succeed, and there are no issues with the KDC or principals.

 

Please help explain the root cause of these error messages.

14 REPLIES

New Contributor

Could you please explain in more detail how you fixed this issue?

New Contributor

Which permissions were changed to correct the issue?

Master Guru

@sparkd,

 

While we can't be sure, it is likely that some permissions were changed on the /tmp directory so that the Service Monitor (which runs the HDFS canary health check) could no longer access it. Service Monitor uses the "hue" user and principal to access other resources, so it is reasonable to assume that /tmp in HDFS did not allow the hue user or group to write to it.
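
As a rough way to check this, here is a minimal Python sketch that reads the mode of /tmp in HDFS and tests whether owner, group, and others can all write to it. It assumes the `hdfs` command-line client is on the PATH and that the caller is already authenticated. The helper names `mode_allows_all_writers` and `hdfs_path_mode` are hypothetical, and treating `drwxrwxrwt` as the expected mode is an assumption based on the usual convention for a shared temp directory.

```python
import subprocess


def mode_allows_all_writers(mode: str) -> bool:
    """Return True if an ls-style mode string such as 'drwxrwxrwt'
    grants write access to owner, group, and others."""
    # A trailing '+' marks extended ACLs; drop it before checking the bits.
    mode = mode.rstrip("+")
    if len(mode) != 10:
        raise ValueError(f"unexpected mode string: {mode!r}")
    # Positions 2, 5, and 8 hold the owner, group, and other write bits.
    return mode[2] == "w" and mode[5] == "w" and mode[8] == "w"


def hdfs_path_mode(path: str = "/tmp") -> str:
    """Run `hdfs dfs -ls -d <path>` and return the mode column.

    Assumption: the hdfs client is installed and the caller has a
    valid Kerberos ticket if the cluster is secured.
    """
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", "-d", path],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        fields = line.split()
        # The listing line for the directory itself ends with its path.
        if fields and fields[-1] == path:
            return fields[0]
    raise RuntimeError(f"no listing line found for {path}")
```

On a cluster node this could be used as `mode_allows_all_writers(hdfs_path_mode("/tmp"))`. If it returns False, an HDFS superuser can usually restore the conventional mode with `hdfs dfs -chmod 1777 /tmp`, but confirm what your cluster's policy expects before changing it.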

 

Are you having similar trouble? If so, check your Service Monitor log file for stack traces and errors related to the HDFS canary.
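
To narrow down a long log, a small filter along these lines may help. It is only a sketch: the function name `canary_problem_lines` is hypothetical, and the keyword list is a guess at what problem lines contain, not a description of the actual Service Monitor log format.

```python
from typing import Iterable, List

# Assumed keywords that suggest a problem; adjust to what your log shows.
PROBLEM_WORDS = ("error", "warn", "exception", "failed")


def canary_problem_lines(lines: Iterable[str]) -> List[str]:
    """Return log lines that mention the canary together with a problem keyword."""
    hits = []
    for line in lines:
        lower = line.lower()
        if "canary" in lower and any(word in lower for word in PROBLEM_WORDS):
            hits.append(line.rstrip("\n"))
    return hits
```

Pass it an open file object for your Service Monitor log, for example `canary_problem_lines(open(log_path))`, where `log_path` is wherever your installation writes that log.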

New Contributor

Hi, 

 

I have the same issue. I looked in the Service Monitor log, and it says it is failing to connect to the server: connection refused.

It also reports a detected pause in the JVM or host machine.

I am quite new to Cloudera Manager and HDFS, so is there a way I can check the connection and reconnect to the server?

 

Thanks,

 

Jess

Master Guru

Hi @jess, welcome to the Cloudera Community.

 

To be sure we understand the problem, please share a screenshot or two showing what you are seeing.

 

Make sure you click on the HDFS service and then look at the Instances tab to see what HDFS roles are in bad health.  Also look at the "Health Tests" section to see if anything is reported there.  Click on any roles that are in bad health to see more information about what health tests are failing.

 

Also, good job looking at the Service Monitor log for clues. Can you show us the stack trace or log messages that say "connection refused"? The Service Monitor makes connections to several servers, so it is important to know which one it was connecting to when the connection refused error occurred.
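
If the log is large, a sketch like the following can tally which host and port each refused connection was aimed at. It assumes the messages follow the shape Hadoop clients typically produce ("Call From ... to host:port failed on connection exception: ... Connection refused"); your log may word it differently, in which case the pattern needs adjusting. The name `refused_targets` and the host names in the example are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed message shape, modelled on Hadoop's wrapped ConnectException text.
REFUSED_PATTERN = re.compile(
    r"Call From \S+ to (\S+?):(\d+) failed on connection exception"
)


def refused_targets(lines: Iterable[str]) -> Counter:
    """Count (host, port) pairs that appear in 'Connection refused' messages."""
    counts = Counter()
    for line in lines:
        if "Connection refused" not in line:
            continue
        match = REFUSED_PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

For example, a line mentioning `to nn01.example.com:8020` would be counted under `("nn01.example.com", 8020)`, which tells you which role to look at on the Instances tab.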

 

Thanks!