I have a problem with new nodes added to an HDP 3.0.1 cluster. The HDFS service is fine, but the NodeManager service does not start, with these errors:
/var/lib/ambari-agent/data/errors-26141.txt
resource_management.core.exceptions.ExecutionFailed: Execution of 'ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/3.0.1.0-187/hadoop/libexec && /usr/hdp/3.0.1.0-187/hadoop-yarn/bin/yarn --config /usr/hdp/3.0.1.0-187/hadoop/conf --daemon start nodemanager' returned 1.
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
Command line is not complete. Try option "help"
TERM environment variable not set.
ERROR: Cannot set priority of nodemanager process 34389
The TERM environment variable is set.
NodeManager.log
STARTUP_MSG: java = 1.8.0_112
************************************************************/
2020-03-04 15:18:35,735 INFO nodemanager.NodeManager (LogAdapter.java:info(51)) - registered UNIX signal handlers for [TERM, HUP, INT]
2020-03-04 15:18:36,133 INFO recovery.NMLeveldbStateStoreService (NMLeveldbStateStoreService.java:openDatabase(1540)) - Using state database at /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state for recovery
2020-03-04 15:18:36,143 ERROR nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(936)) - Error starting NodeManager
java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, Permission denied]
at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182)
at org.fusesource.hawtjni.runtime.Library.load(Library.java:140)
at org.fusesource.leveldbjni.JniDBFactory.<clinit>(JniDBFactory.java:48)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:1543)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:1531)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:353)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:285)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:358)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
2020-03-04 15:18:36,149 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state STOPPED
I have reviewed the execution permissions of the temporary directories.
The java.library.path files were also copied from an old node.
The NodeManager service starts as the root user, but it does not start as the yarn user.
Created on 03-06-2020 11:15 PM - edited 03-06-2020 11:15 PM
Hi san_t_o, thanks for adding more context.
When the indicated directories are cleaned, are the libraries copied back automatically, or is it necessary to copy them manually?
--> /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/* and also on /var/lib/ambari-agent/tmp/
I was testing in my local cluster. Apologies, I meant to clear /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/; sorry for the typo in my previous comment.
Even before clearing these directories or altering the location, it would be best to review the startup with strace once. It traces all system-level calls, and reviewing the last call before the failure could give us more clues. To install strace, run:
yum -y install strace
export HADOOP_LIBEXEC_DIR=/usr/hdp/3.0.1.0-187/hadoop/libexec
strace -f -s 2000 -o problematic_node /usr/hdp/3.0.1.0-187/hadoop-yarn/bin/yarn --debug --config /usr/hdp/3.0.1.0-187/hadoop/conf --daemon start nodemanager
export HADOOP_LIBEXEC_DIR=/usr/hdp/3.0.1.0-187/hadoop/libexec
strace -f -s 2000 -o good_node /usr/hdp/3.0.1.0-187/hadoop-yarn/bin/yarn --debug --config /usr/hdp/3.0.1.0-187/hadoop/conf --daemon start nodemanager
The files problematic_node and good_node will contain the traces; please attach or paste them here.
Created 03-09-2020 11:16 AM
Hi @venkatsambath,
I attach the strace output.
problematic_node
good_node
I found these lines:
Problematic Node
49213 stat("/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/libleveldbjni-64-1-6110205147654050510.8", 0x7f19c4ef1800) = -1 ENOENT (No such file or directory)
49213 open("/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/libleveldbjni-64-1-6110205147654050510.8", O_RDWR|O_CREAT|O_EXCL, 0666) = -1 EACCES (Permission denied)
Good Node:
19290 stat("/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/libleveldbjni-64-1-280379409949290123.8", 0x7fbf276ef800) = -1 ENOENT (No such file or directory)
19290 open("/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/libleveldbjni-64-1-280379409949290123.8", O_RDWR|O_CREAT|O_EXCL, 0666) = 354
The permissions of "hadoop_java_io_tmpdir" on the problematic node:
# stat hadoop_java_io_tmpdir/
File: ‘hadoop_java_io_tmpdir/’
Size: 8192 Blocks: 24 IO Block: 4096 directory
Device: fd02h/64770d Inode: 29362223 Links: 39
Access: (1777/drwxrwxrwt) Uid: ( 1073/ hdfs) Gid: ( 1051/ hadoop)
Access: 2020-03-09 12:29:40.811107572 -0500
Modify: 2020-03-09 12:29:38.659084671 -0500
Change: 2020-03-09 12:29:38.659084671 -0500
I'll be waiting for your comments.
Regards.
Created 03-09-2020 11:38 AM
Here you have the strace outputs:
https://drive.google.com/open?id=1aktK1QqOqY0ub5-qd35XSy5bUUj2hyIt
https://drive.google.com/open?id=1cebTRxbcjdkXFtdQ2adEDMBaJlZg3SPP
Regards.
Created 03-10-2020 07:57 PM
49213 open("/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/libleveldbjni-64-1-6110205147654050510.8", O_RDWR|O_CREAT|O_EXCL, 0666) = -1 EACCES (Permission denied)
During this step, the process is trying to open and get a file descriptor for this file, and access was denied:
/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/libleveldbjni-64-1-6110205147654050510.8
So far we have inspected its parent directories and haven't seen any issues with them. Can we get details of this path too:
ls -ln /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/libleveldbjni-64-1-6110205147654050510.8
stat /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/libleveldbjni-64-1-6110205147654050510.8
id yarn
Created 03-11-2020 10:50 AM
I think the same; we have analyzed its parent directories and apparently they are fine.
# ls -ln /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/libleveldbjni-64-1-6110205147654050510.8
ls: cannot access /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/libleveldbjni-64-1-6110205147654050510.8: No such file or directory
# id yarn
uid=1075(yarn) gid=1051(hadoop) groups=1051(hadoop)
I have managed to start the node from the command line, but I have noticed that the command executed from Ambari sets the owner of the "hadoop_java_io_tmpdir" path to "hdfs:hadoop". However, I cannot identify why the yarn user lacks write and execute permissions: the permissions are 1777 and the yarn user is a member of the hadoop group.
Regards