This was an interesting issue we faced last week. Posting it here for a wider audience, since it might help others too.
PROBLEM
On one of the nodes, the datanode and nodemanager were not coming up. Below is the error returned when starting the datanode from Ambari.
resource_management.core.exceptions.Fail: Execution of 'ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ulimit -c unlimited ; /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start datanode'' returned 1. starting datanode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-datanode-ny-node3.hwxblr.com.out
Error: Could not find or load main class org.apache.hadoop.hdfs.server.datanode.DataNode
Because the datanode process itself never loaded, nothing was printed in the datanode logs. The only thing we saw in the .out file was:
Error: Could not find or load main class org.apache.hadoop.hdfs.server.datanode.DataNode
@nvadivelu came to the rescue. We used the small utility below to figure out which class was missing.
// Calls DataNode.main directly so that a missing dependency is reported by name.
public class Sample {
    public static void main(String[] args) {
        try {
            org.apache.hadoop.hdfs.server.datanode.DataNode.main(args);
        } catch (Throwable ex) {
            ex.printStackTrace();
        }
    }
}
Compiling the above code against the Hadoop classpath printed the exact class that could not be loaded.
/usr/jdk64/jdk1.8.0_77/bin/javac -cp `hadoop classpath` Sample.java
Sample.java:5: error: cannot access TraceAdminProtocol
org.apache.hadoop.hdfs.server.datanode.DataNode.main(args);
^
class file for org.apache.hadoop.tracing.TraceAdminProtocol not found
1 error
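The Sample utility above surfaces the missing class at compile time. A runtime variant of the same idea is sketched below, assuming only the standard Class.forName API; the class name ClassLoadCheck and its tryLoad method are hypothetical names chosen for illustration, not part of Hadoop or Ambari. Loading a class also resolves its superclasses and interfaces, so a missing dependency such as TraceAdminProtocol would be expected to show up in the returned Throwable.

```java
// ClassLoadCheck.java: hypothetical helper, not part of Hadoop or Ambari.
class ClassLoadCheck {

    // Try to load the named class without running its static initializers.
    // Returns null on success, or the Throwable that explains the failure.
    static Throwable tryLoad(String className) {
        try {
            Class.forName(className, false, ClassLoadCheck.class.getClassLoader());
            return null;
        } catch (Throwable t) {
            // Covers ClassNotFoundException as well as NoClassDefFoundError
            // and other linkage errors raised while resolving dependencies.
            return t;
        }
    }

    public static void main(String[] args) {
        // Default to the class from the error message when no argument is given.
        String name = args.length > 0
                ? args[0]
                : "org.apache.hadoop.hdfs.server.datanode.DataNode";
        Throwable failure = tryLoad(name);
        if (failure == null) {
            System.out.println("OK: " + name + " loaded");
        } else {
            failure.printStackTrace();
        }
    }
}
```

It would be compiled and run with the same classpath the daemon uses, for example by passing the output of `hadoop classpath` to -cp together with the current directory.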
The TraceAdminProtocol class is supposed to be present in the hadoop-common jar. We grepped for this class in the hadoop-common jar on the affected node and did not find it. On another host, where the datanode was running fine, the same search found the class.
We also verified that the jar on the affected node was smaller than the one on the working host.
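The grep and size checks can also be done programmatically. The following is a minimal sketch using the standard java.util.jar.JarFile API; JarClassCheck, entryName and containsClass are hypothetical names introduced here, and the jar path is supplied by the caller rather than assumed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.JarFile;

// JarClassCheck.java: hypothetical helper, not part of Hadoop or Ambari.
class JarClassCheck {

    // Convert a fully qualified class name to its entry path inside a jar.
    static String entryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Return true if the jar contains an entry for the given class.
    static boolean containsClass(Path jar, String className) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.getEntry(entryName(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java JarClassCheck <jar-path> <class-name>");
            return;
        }
        Path jar = Paths.get(args[0]);
        // Print the size too, since a truncated jar is smaller than a good one.
        System.out.println("size (bytes): " + Files.size(jar));
        System.out.println(args[1] + " present: " + containsClass(jar, args[1]));
    }
}
```

Running it on both the affected host and a working host, with the hadoop-common jar path and org.apache.hadoop.tracing.TraceAdminProtocol as arguments, gives the two values to compare.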
RESOLUTION
We copied this jar from the working host, and the datanode and nodemanager came up fine. We had no clue where the bad jar came from, even though it had the same version. Still, it was a good learning experience.
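Since the bad jar carried the same version as the good one, the file name alone could not tell them apart. One way to spot such a mismatch is to compare checksums across hosts. The sketch below assumes only the standard java.security.MessageDigest API; JarFingerprint and sha256 are hypothetical names, and this is an illustration rather than the check we actually ran.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// JarFingerprint.java: hypothetical helper, not part of Hadoop or Ambari.
class JarFingerprint {

    // Compute the SHA-256 digest of a file as a lowercase hex string.
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Print checksum, size and path for every file given on the command line.
        for (String arg : args) {
            Path file = Paths.get(arg);
            System.out.println(sha256(file) + "  " + Files.size(file) + "  " + file);
        }
    }
}
```

Running it against the same jar path on each host and comparing the output lines shows whether the files are byte-for-byte identical.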