HDFS goes in bad health


Health test shows the following errors:


     The health test result for HDFS_FREE_SPACE_REMAINING has become bad: Space free in the cluster: 0 B. Capacity of the cluster: 0 B. Percentage of capacity free: 0.00%. Critical threshold: 10.00%.
     The health test result for HDFS_CANARY_HEALTH has become bad: Canary test failed to create parent directory for /tmp/.cloudera_health_monitoring_canary_files.

 

We manually verified that space isn't an issue. Connectivity tests succeed. There are no issues with the KDC or principals.

 

Please help explain the root cause of this error message.


Expert Contributor
Hi,

What changes did you make to the cluster before you saw the message
"Space free in the cluster: 0 B"?
How did you verify that space is not the issue? Can you also verify that
the DataNodes are up?
Are there actual blocks in the DataNodes' local directories?


Thanks for responding.

 

This is a new cluster. The DataNode is up and running. I verified the space through CM as well as by logging in to the servers themselves.

Expert Contributor

Hi,

 

Have you checked the space in the NameNode's web UI? Does it look fine there?

 

 

Thanks,
Sathish (Satz)

New Contributor

I have the same issue with a brand-new Cloudera Manager install on an AWS EC2 cluster of four m4.xlarge instances, each with a 100 GiB magnetic disk.

Cloudera Manager Hosts view shows all 4 instances with a Disk Usage at 10.3-12.1 GiB / 115.6 GiB and "green" status.

 

The cluster is unusable with HDFS in the resulting RED status.

 

What was the final resolution on this?

New Contributor

I verified the space by logging on to the server and issuing the following command:

 

ubuntu@ip-172-31-29-49:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       99G  8.3G   86G   9% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.9G   12K  7.9G   1% /dev
tmpfs           1.6G  496K  1.6G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.9G     0  7.9G   0% /run/shm
none            100M     0  100M   0% /run/user
cm_processes    7.9G   14M  7.9G   1% /run/cloudera-scm-agent/process

 

As you can see there is plenty of space available.

 

What do you suggest as a next step?

Cloudera Employee

Usually this indicates that the DataNodes are not in contact with the NameNode. 0 bytes means there are no DataNodes available to write to. Check the DataNode logs under /var/log/hadoop-hdfs.

 

There will be some clues there; paste anything that stands out in your response here.
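
One quick way to see what the NameNode itself believes is the HDFS admin report. The sketch below assumes the `hdfs` client is on the PATH and that the report is requested by a user allowed to run dfsadmin (commonly the `hdfs` superuser). The function name `summarize_dfs_report` is made up for illustration, and the exact wording of the report lines can differ between Hadoop versions.

```shell
#!/usr/bin/env bash
# Keep only the capacity and DataNode-count lines from an HDFS admin report.
# The report is read from stdin, so it can come from a live cluster or a saved file.
# Per-node sections repeat some of these labels, so they are printed per node as well.
summarize_dfs_report() {
  grep -E '^(Configured Capacity|DFS Remaining|Live datanodes|Dead datanodes)'
}

# Usage on a cluster node (assumed invocation, not run here):
#   sudo -u hdfs hdfs dfsadmin -report | summarize_dfs_report
```

If the configured capacity is 0 and no live DataNodes are listed, the NameNode has no registered DataNodes, which would match the "0 B" reading in the health test.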

New Contributor
I've had the same issue. I just checked the logs on the DataNodes, and they are registering successfully with the NameNode.


Master Guru

@ditu,

 

This thread is super super old, so it would be best to confirm you are seeing the same issue.  What message do you see regarding the canary test failure?

 

Basically, the Service Monitor performs a health check of HDFS by writing out a file and making sure the write completes. If it doesn't complete, that could indicate a problem with HDFS that requires review, so it triggers a bad health state.

 

The canary test does the following:

 

  1. creates a file
  2. writes to it
  3. reads it back
  4. verifies the data
  5. deletes the file

By default, the canary files are created under:

/tmp/.cloudera_health_monitoring_canary_files
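
The five steps above can be reproduced by hand to see which one fails. This is only a rough sketch of the same sequence using the standard `hdfs dfs` commands, not the Service Monitor's actual implementation. The function name `manual_canary` and the default directory `/tmp/manual_canary_check` are hypothetical and chosen so the test does not touch the real canary directory.

```shell
#!/usr/bin/env bash
# Reproduce the canary's steps by hand: create, write, read back, verify, delete.
# The default directory is a made-up test location, not the real canary path.
manual_canary() {
  local dir="${1:-/tmp/manual_canary_check}"
  local remote="$dir/probe_$$"
  local expected="canary probe $$"
  local tmp actual status=0
  tmp="$(mktemp)" || return 1
  printf '%s\n' "$expected" > "$tmp"

  if ! hdfs dfs -mkdir -p "$dir"; then
    echo "FAIL: could not create $dir"; status=1
  elif ! hdfs dfs -put -f "$tmp" "$remote"; then
    echo "FAIL: could not write $remote"; status=1
  elif ! actual="$(hdfs dfs -cat "$remote")"; then
    echo "FAIL: could not read $remote"; status=1
  elif [ "$actual" != "$expected" ]; then
    echo "FAIL: data read back does not match"; status=1
  elif ! hdfs dfs -rm -skipTrash "$remote" > /dev/null; then
    echo "FAIL: could not delete $remote"; status=1
  else
    echo "OK: all canary steps succeeded in $dir"
  fi

  rm -f "$tmp"   # remove the local scratch file in every case
  return "$status"
}
```

Calling `manual_canary` as the account you suspect is affected shows the first step that breaks. A permissions problem would typically surface at the directory creation or write step, with the HDFS client's own error message printed alongside.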

 

It is possible that the Service Monitor log (in /var/log/cloudera-scm-firehose) has some error or exception reflecting the failure.
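
A simple way to scan that directory is to search it for canary-related errors. The sketch below uses only `grep` and `tail`; the function name `find_canary_errors` and the error keywords are assumptions for illustration, and the log file names inside the directory are not assumed.

```shell
#!/usr/bin/env bash
# Show up to the last 20 matching lines that mention the canary together with
# an error-like keyword. Matches are listed in the order grep visits the files,
# which is not necessarily chronological across files.
find_canary_errors() {
  local log_dir="${1:-/var/log/cloudera-scm-firehose}"
  grep -rih 'canary' "$log_dir" | grep -iE 'error|exception|denied' | tail -n 20
}

# Usage on the Service Monitor host (may need root to read the logs):
#   find_canary_errors
```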

 

Note that the operation of writing to a file in HDFS requires communication with the NameNode and then the DataNode that the NameNode tells the client to write the file to.  Failures could occur in various places.

 

New Contributor

It was a user permissions issue.

All fixed now.

 

Thanks 🙂