Created on 05-10-2021 06:02 AM - edited 09-16-2022 07:41 AM
When I check cluster health, I get the following error messages from time to time. Could anyone help with this?
The health test result for NAME_NODE_DIRECTORY_FAILURES has become unknown: Not enough data to test: Test of whether the NameNode has directory failures.
The health test result for NAME_NODE_SAFE_MODE has become unknown: Not enough data to test: Test of whether the NameNode is in safe mode.
The health test result for NAME_NODE_UPGRADE_STATUS has become unknown: Not enough data to test: Test of whether there is an unfinalized HDFS metadata upgrade.
The health test result for NAME_NODE_LOG_DIRECTORY_FREE_SPACE has become unknown: Not enough data to test: Test of whether this role's log directory has enough free space.
The health test result for NAME_NODE_HEAP_DUMP_DIRECTORY_FREE_SPACE has become unknown: Not enough data to test: Test of whether this role's heap dump directory has enough free space.
The health test result for NAME_NODE_DATA_DIRECTORIES_FREE_SPACE has become unknown: Not enough data to test: Test of whether the NameNode Data Directories have enough free space.
The health test result for HOST_SCM_HEALTH has become bad: This host is in contact with the Cloudera Manager Server. This host is not in contact with the Host Monitor.
The health test result for HOST_AGENT_LOG_DIRECTORY_FREE_SPACE has become unknown: Not enough data to test: Test of whether the Cloudera Manager Agent's log directory has enough free space.
The health test result for HOST_AGENT_PARCEL_DIRECTORY_FREE_SPACE has become unknown: Not enough data to test: Test of whether the Cloudera Manager Agent's parcel directory has enough free space.
The health test result for HOST_AGENT_PROCESS_DIRECTORY_FREE_SPACE has become unknown: Not enough data to test: Test of whether the Cloudera Manager Agent's process directory has enough free space.
Created 05-27-2021 10:43 PM
@Zhaojie ,
When you see these alert messages on the cluster, have you checked the load on the cluster? Also, in the future, when you see alerts related to the NameNode, please check the NameNode's performance during these alerts: is the NameNode hanging or crashing?
Also, please check the connectivity between the JournalNode and the NameNode at the time you received those alerts.
You can try restarting the JournalNode along with the NameNode once and see if this resolves the issue.
Created 07-22-2021 07:11 PM
I may have had the same problem.
Fixed by:
1. Kill the supervisord process
ps -ef |grep "supervisord" |grep "python" | awk '{print $2}' | xargs kill -9
2. Restart the Cloudera Manager agent
systemctl restart cloudera-scm-agent
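A more cautious variant of the two steps above, as a sketch only. It lists the matching PIDs so they can be reviewed first, then sends SIGTERM and falls back to SIGKILL only for processes that are still alive. The function names `supervisord_pids` and `stop_pids` and the `GRACE` variable (default 5 seconds) are made up for illustration. Note that SIGTERM asks supervisord to shut down together with its subprocesses, so role processes on that host will likely stop as well.

```shell
#!/usr/bin/env bash
# Sketch: find supervisord PIDs in `ps -ef` output and stop them gently.

# Reads `ps -ef`-style lines on stdin and prints the PID (column 2) of every
# line that mentions both supervisord and python. The [s]/[p] bracket trick
# keeps this awk process from matching its own command line.
supervisord_pids() {
  awk '/[s]upervisord/ && /[p]ython/ { print $2 }'
}

# Sends SIGTERM to each PID given as an argument, waits GRACE seconds
# (hypothetical default: 5), then sends SIGKILL to any PID still alive.
stop_pids() {
  local grace="${GRACE:-5}" pid
  [ "$#" -gt 0 ] || return 0
  kill -TERM "$@" 2>/dev/null || true
  sleep "$grace"
  for pid in "$@"; do
    if kill -0 "$pid" 2>/dev/null; then
      kill -KILL "$pid" 2>/dev/null || true
    fi
  done
}

# Usage (run as root on the affected host):
#   pids=$(ps -ef | supervisord_pids)
#   echo "$pids"                          # review before killing anything
#   stop_pids $pids                       # unquoted on purpose: one arg per PID
#   systemctl restart cloudera-scm-agent
```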
Created 07-26-2021 06:16 AM
At what point are you getting these errors?
Is this a new cluster? Can you check whether the services below are up and running fine as well:
JournalNodes
Standby Namenode
Failover Controllers
Namenode
Datanodes
Created 07-27-2021 04:08 AM
The errors to focus on here are the ones below, which are causing the unknown health alerts. This happens because the CM agent is not able to connect to the Host Monitor.
Please ensure the hosts generating the alerts have space available and are not overloaded in terms of CPU and memory. You can also try a hard restart of the CM agents if that is a feasible option, since a hard restart will stop the services running on the host and clear any thread deadlocks in cluster services.
The health test result for HOST_SCM_HEALTH has become bad: This host is in contact with the Cloudera Manager Server. This host is not in contact with the Host Monitor.
The health test result for HOST_AGENT_LOG_DIRECTORY_FREE_SPACE has become unknown: Not enough data to test: Test of whether the Cloudera Manager Agent's log directory has enough free space.
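One way to check space and load on an affected host, as a rough sketch. The function names `used_pct` and `check_dir`, the 90% threshold, and the three directory paths in the usage comment are assumptions (the paths are the usual agent defaults, so adjust them to your installation).

```shell
#!/usr/bin/env bash
# Sketch: report disk usage for agent directories, plus load and memory.

# Prints the used-space percentage (digits only) of the filesystem holding $1.
# `df -P` gives one line per filesystem; column 5 is the capacity, e.g. "42%".
used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Prints SKIP, UNKNOWN, WARN or OK for directory $1 against threshold $2 (%).
check_dir() {
  local dir="$1" limit="$2" pct
  if [ ! -d "$dir" ]; then
    echo "SKIP $dir (not found)"
    return 0
  fi
  pct=$(used_pct "$dir")
  case "$pct" in
    ''|*[!0-9]*) echo "UNKNOWN $dir"; return 0 ;;  # df gave no numeric value
  esac
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN $dir ${pct}% used"
  else
    echo "OK $dir ${pct}% used"
  fi
}

# Usage (paths are assumed defaults):
#   for d in /var/log/cloudera-scm-agent /opt/cloudera/parcels \
#            /var/run/cloudera-scm-agent/process; do
#     check_dir "$d" 90
#   done
#   uptime     # load averages
#   free -m    # memory in MiB (Linux)
```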
Hope this helps,
Paras