Created 08-30-2017 04:27 PM
I'm trying to start all HDP 2.6 services via Ambari 2.5.
The following services fail to start from the Ambari server, but I can start them manually from the command prompt first, then return to Ambari and start the remaining services from the UI. How do I fix the error so that I can start the NameNode from the Ambari server?
1. NameNode
2. SNameNode
3. NodeManager
I get the following error at NameNode startup; how do I fix it?
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 367, in <module>
    NameNode().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 329, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 100, in start
    upgrade_suspended=params.upgrade_suspended, env=env)
  File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
    return fn(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 167, in namenode
    create_log_dir=True
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/utils.py", line 274, in service
    Execute(daemon_cmd, not_if=process_id_exists_command, environment=hadoop_env_exports)
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py", line 262, in action_run
    tries=self.resource.tries, try_sleep=self.resource.try_sleep)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 72, in inner
    result = function(command, **kwargs)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 102, in checked_call
    tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 150, in _call_wrapper
    result = _call(command, **kwargs_copy)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 303, in _call
    raise ExecutionFailed(err_msg, code, out, err)
resource_management.core.exceptions.ExecutionFailed: Execution of 'ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ulimit -c unlimited ; /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start namenode'' returned 1.
-bash: line 0: ulimit: core file size: cannot modify limit: Operation not permitted
starting namenode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-namenode-vlmazgrpmaster.fisdev.local.out
Created 08-30-2017 04:44 PM
Can you please try adding the following entries to "/etc/security/limits.conf" on the problematic hosts and then try again?
* soft core unlimited
* hard core unlimited
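If you want to confirm the two entries made it into the file before restarting, a minimal check is to grep for them. This is just an illustrative sketch, not part of Ambari; the LIMITS_FILE variable is introduced here only so the path can be overridden for testing.

```shell
# Sketch: verify both limits.conf entries are present.
# LIMITS_FILE is a hypothetical override; the real file is /etc/security/limits.conf.
LIMITS_FILE="${LIMITS_FILE:-/etc/security/limits.conf}"
for t in soft hard; do
  if grep -qE "^\*[[:space:]]+$t[[:space:]]+core[[:space:]]+unlimited" "$LIMITS_FILE" 2>/dev/null; then
    echo "$t core limit entry: present"
  else
    echo "$t core limit entry: MISSING"
  fi
done
```

Note that limits.conf is applied by PAM at login, so the change takes effect for new sessions of the hdfs user, not for processes already running.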
Created 08-30-2017 05:35 PM
I added the above two lines and restarted all services.
Now I get the error below at NameNode start:
2017-08-30 12:31:21,040 - Retrying after 10 seconds. Reason: Execution of '/usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs dfsadmin -fs hdfs://vlmazgrpmaster.fisdev.local:8020 -safemode get | grep 'Safe mode is OFF'' returned 1.
safemode: Call From vlmazgrpmaster.fisdev.local/10.7.192.112 to vlmazgrpmaster.fisdev.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
safemode: Call From vlmazgrpmaster.fisdev.local/10.7.192.112 to vlmazgrpmaster.fisdev.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2017-08-30 12:31:35,415 - Retrying after 10 seconds. Reason: Execution of '/usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs dfsadmin -fs hdfs://vlmazgrpmaster.fisdev.local:8020 -safemode get | grep 'Safe mode is OFF'' returned 1.
safemode: Call From vlmazgrpmaster.fisdev.local/10.7.192.112 to vlmazgrpmaster.fisdev.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Created 08-30-2017 05:41 PM
If this is a fresh cluster and you do not have much data on the NameNode, then you can refer to the following HCC thread to see if a NameNode format helps as a quick fix (note: this will cause data loss).
It will also be good to first check that the hostname is correct and that port 8020 is reachable, i.e. that there is no network issue:
# nc -v vlmazgrpmaster.fisdev.local 8020
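If nc is not installed, bash's /dev/tcp pseudo-device can do the same reachability check. This is a sketch using the host and port from the error above; substitute your own values.

```shell
# Sketch: probe the NameNode RPC port without nc, using bash's /dev/tcp.
# HOST and PORT default to the values seen in the thread; override as needed.
HOST="${HOST:-vlmazgrpmaster.fisdev.local}"
PORT="${PORT:-8020}"
if timeout 5 bash -c "exec 3<>/dev/tcp/$HOST/$PORT" 2>/dev/null; then
  echo "port $PORT on $HOST is reachable"
else
  echo "connection to $HOST:$PORT refused or timed out"
fi
```

A "Connection refused" here while the host resolves correctly usually means the NameNode process is not listening, so the NameNode log under /var/log/hadoop/hdfs/ is the next place to look.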
Also check that the /etc/hosts file has the correct entries on all hosts, and that the following command returns the correct FQDN on every host:
# hostname -f
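To go one step further, you can cross-check that the FQDN reported by the host actually resolves back to an address via /etc/hosts or DNS. A minimal sketch:

```shell
# Sketch: report the FQDN and confirm it resolves via /etc/hosts or DNS.
FQDN=$(hostname -f 2>/dev/null || hostname)
echo "FQDN reported: $FQDN"
getent hosts "$FQDN" || echo "WARNING: $FQDN does not resolve on this host"
```

If the FQDN resolves to an unexpected address (for example 127.0.0.1 instead of the host's real IP), the NameNode may bind to the wrong interface and clients will get the "Connection refused" errors shown above.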