Data node rebooting on its own


I had a single datanode spontaneously rebooting itself, sometimes more than once a day. Nothing in the syslog gave any clue about what was causing the reboots. This caused real trouble with impalad and the ETL jobs that run overnight, so I decommissioned the node, pulled and re-seated all of the memory, and ran a memory test to confirm that all of the memory was in good shape. I then rebooted the server and left it out of the Hadoop cluster for a week, with zero reboots. A week later, after I recommissioned the node, it began rebooting multiple times a day again for no apparent reason, so the node was decommissioned again.


Now a different datanode is spontaneously rebooting, and I'm unable to determine what is causing it. On Jan 2nd, the node rebooted itself five times, and the system logs give no indication of why.


Has anyone had this happen or have any clue as to what could cause this?


Rising Star

From your description, it sounds like you may be experiencing operating system crashes on these nodes, possibly resulting from a kernel panic. If this is the case, you are not likely to see anything in the logs when the kernel crashes. You may get a clue as to what is going on by looking at the system console when it crashes to see what messages are displayed.
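One rough way to check whether the reboots were crashes rather than orderly restarts is to look at the output of `last -x`, which lists boot and shutdown records newest first. A boot record whose next-older record is another boot, with no shutdown in between, suggests the earlier boot ended without a clean shutdown. The sketch below is only a heuristic under that assumption; the function name and the sample text are made up for illustration, and the exact column layout of `last -x` varies between systems.

```python
def count_unclean_boots(last_x_output: str) -> int:
    """Count boots that were not preceded by a logged shutdown.

    Expects the text printed by `last -x`, which lists entries newest first.
    """
    # Keep only the boot and shutdown records, in the order printed.
    events = []
    for line in last_x_output.splitlines():
        fields = line.split()
        if fields and fields[0] in ("reboot", "shutdown"):
            events.append(fields[0])

    unclean = 0
    # In each pair, `older` is the record printed directly after `newer`.
    for newer, older in zip(events, events[1:]):
        # Two boots in a row means no shutdown was recorded between them.
        if newer == "reboot" and older == "reboot":
            unclean += 1
    return unclean


# Made-up sample in the general shape of `last -x` output (not real data).
SAMPLE = """\
reboot   system boot  3.10.0-example  Wed Jan  2 14:05   still running
shutdown system down  3.10.0-example  Wed Jan  2 14:01 - 14:05  (00:03)
reboot   system boot  3.10.0-example  Wed Jan  2 09:30 - 14:01  (04:31)
reboot   system boot  3.10.0-example  Wed Jan  2 03:12 - 09:30  (06:18)
"""

if __name__ == "__main__":
    print(count_unclean_boots(SAMPLE))
```

On a real node you would pass in the captured output of `last -x` instead of the sample. A nonzero count does not say why the system went down, only that no shutdown was logged first, which is consistent with a panic, a power loss, or a hardware reset.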

I am unable to say what the cause of the crash could be. It could be anything from faulty memory to a buggy BIOS or device driver to an overheating CPU. It's likely that some issues will only manifest when a system is under load, which could explain how it was stable for a week but failed after being added back to the cluster.

I would reach out to your systems team or to your hardware and OS vendors to see if there are any other reports of similar behavior. It seems unlikely that the Cloudera software is causing these crashes, so I would suggest looking at your hardware first.

Also check the memory charts for that role to make sure it wasn't just hitting the memory limit, exiting, and auto-restarting. Stderr logs also usually indicate that "" was run when this happens.