
DataNode daemon restarted frequently

Contributor

Hi experts:

 

There is a node on which the DataNode process is frequently restarted by supervisord. Other nodes in the cluster with the same hardware and configuration do not show this issue. We are on version CDH-5.15.2-1. Could you please advise where to look for the cause? Thank you.

 

In the log file 'hadoop-cmf-hdfs-DATANODE-compute-1-14.local.log.out', we see the following entries for today:

bash-4.1# grep -B 2 "STARTUP_MSG: Starting DataNode" hadoop-cmf-hdfs-DATANODE-compute-1-14.local.log.out

2020-09-03 02:49:08,033 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
--
2020-09-03 03:48:31,912 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
--
2020-09-03 05:25:37,999 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
--
2020-09-03 08:26:25,445 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
--
2020-09-03 08:42:48,882 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode

These correspond to entries in the supervisord log /var/log/cloudera-scm-agent/supervisord.log:

2020-09-03 02:49:06,297 INFO exited: 64450-hdfs-DATANODE (terminated by SIGKILL; not expected)
2020-09-03 02:49:07,300 INFO spawned: '64450-hdfs-DATANODE' with pid 94527
2020-09-03 02:49:07,300 INFO Increased RLIMIT_MEMLOCK limit to 4294967296
2020-09-03 02:49:27,361 INFO success: 64450-hdfs-DATANODE entered RUNNING state, process has stayed up for > than 20 seconds (startsecs)

2020-09-03 03:48:31,094 INFO exited: 64450-hdfs-DATANODE (terminated by SIGKILL; not expected)
2020-09-03 03:48:31,166 INFO spawned: '64450-hdfs-DATANODE' with pid 107591
2020-09-03 03:48:31,166 INFO Increased RLIMIT_MEMLOCK limit to 4294967296
2020-09-03 03:48:51,368 INFO success: 64450-hdfs-DATANODE entered RUNNING state, process has stayed up for > than 20 seconds (startsecs)

2020-09-03 05:25:36,275 INFO exited: 64450-hdfs-DATANODE (terminated by SIGKILL; not expected)
2020-09-03 05:25:37,277 INFO spawned: '64450-hdfs-DATANODE' with pid 127966
2020-09-03 05:25:37,278 INFO Increased RLIMIT_MEMLOCK limit to 4294967296
2020-09-03 05:25:57,338 INFO success: 64450-hdfs-DATANODE entered RUNNING state, process has stayed up for > than 20 seconds (startsecs)

2020-09-03 08:26:23,687 INFO exited: 64450-hdfs-DATANODE (terminated by SIGKILL; not expected)
2020-09-03 08:26:24,690 INFO spawned: '64450-hdfs-DATANODE' with pid 18960
2020-09-03 08:26:24,690 INFO Increased RLIMIT_MEMLOCK limit to 4294967296
2020-09-03 08:26:44,752 INFO success: 64450-hdfs-DATANODE entered RUNNING state, process has stayed up for > than 20 seconds (startsecs)

2020-09-03 08:42:47,139 INFO exited: 64450-hdfs-DATANODE (terminated by SIGKILL; not expected)
2020-09-03 08:42:48,142 INFO spawned: '64450-hdfs-DATANODE' with pid 22506
2020-09-03 08:42:48,142 INFO Increased RLIMIT_MEMLOCK limit to 4294967296
2020-09-03 08:43:08,205 INFO success: 64450-hdfs-DATANODE entered RUNNING state, process has stayed up for > than 20 seconds (startsecs)

2 REPLIES

Rising Star

@vincentD Please look at the DataNode logs for any FATAL/ERROR messages just before the shutdown. That could shed some light on the root cause of the DN failure.
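
For example, a minimal sketch of how you might pull those out on the affected host (the log file name is taken from your post; the date filter is illustrative):

# Any FATAL/ERROR messages logged today:
grep -E " (FATAL|ERROR) " hadoop-cmf-hdfs-DATANODE-compute-1-14.local.log.out | grep "^2020-09-03"

# The last lines written before each restart banner:
grep -B 15 "STARTUP_MSG: Starting DataNode" hadoop-cmf-hdfs-DATANODE-compute-1-14.local.log.out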

Expert Contributor

Hello @vincentD Please review the stdout and stderr of the DN which is going down frequently.

You can navigate to CM > HDFS > Instances > select the DN which went down > Processes > click on stdout/stderr at the bottom of the page.

 

I am asking you to verify stdout/stderr because I suspect an OOM error (the Java heap running out of memory) causing the DN to exit/shut down abruptly.
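
If you prefer the shell, here is a hedged sketch; the on-disk location below is the typical Cloudera Manager agent process directory, but verify the path on your host:

# Typical CM agent location for role stdout/stderr (path is an assumption; verify locally).
grep -i "OutOfMemoryError" /var/run/cloudera-scm-agent/process/*-hdfs-DATANODE/logs/std*.log

# A SIGKILL can also come from the kernel OOM killer; worth ruling that out too:
dmesg | grep -i -E "out of memory|killed process"

Note that a JVM heap OOM can still surface to supervisord as a SIGKILL when CM's "Kill When Out of Memory" option is enabled for the role, which would match the 'terminated by SIGKILL' messages you posted.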

 

If the DN exit is due to an OOM error, please increase the DN heap size to an adequate value to get rid of the issue. The DN heap sizing rule of thumb is 1 GB of heap memory per 1 million blocks. You can verify the block count on each DN by navigating to CM > HDFS > NN Web UI > Active NN > DataNodes; the DN stats on that page show block counts, disk usage, etc.
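
As a worked example of that rule of thumb (the block count below is hypothetical; substitute the figure from the NN web UI):

# Hypothetical figures: a DN holding ~4.5 million blocks needs roughly a 5 GB heap.
BLOCKS=4500000                                # block count from the NN web UI
HEAP_GB=$(( (BLOCKS + 999999) / 1000000 ))    # ~1 GB per 1M blocks, rounded up
echo "Suggested minimum DN heap: ${HEAP_GB} GB"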