I am attempting to automate the creation of an HDP cluster with Ansible, and have followed the documentation to set up ambari-server and ambari-agent to run as a non-root user (specifically, a user named ambari).
As soon as the ambari-agent detects the ambari-server, it connects to it and then fails with the following stack trace. It looks like a permission issue, but I cannot figure out what it is. FWIW, if ambari-agent is run as root it works fine.
INFO 2017-12-08 18:33:38,479 PingPortListener.py:50 - Ping port listener started on port: 8670
INFO 2017-12-08 18:33:38,480 main.py:437 - Connecting to Ambari server at https://ambari-server.mydomain.com:8440 (192.168.20.20)
INFO 2017-12-08 18:33:38,480 NetUtil.py:70 - Connecting to https://ambari-server.mydomain.com:8440/ca
INFO 2017-12-08 18:33:38,526 main.py:447 - Connected to Ambari server ambari-server.mydomain.com
INFO 2017-12-08 18:33:38,526 hostname.py:67 - agent:hostname_script configuration not defined thus read hostname 'ambari-server.mydomain.com' using socket.getfqdn().
INFO 2017-12-08 18:33:38,527 threadpool.py:58 - Started thread pool with 3 core threads and 20 maximum threads
WARNING 2017-12-08 18:33:38,527 AlertSchedulerHandler.py:280 - [AlertScheduler] /var/lib/ambari-agent/cache/alerts/definitions.json not found or invalid. No alerts will be scheduled until registration occurs.
INFO 2017-12-08 18:33:38,527 AlertSchedulerHandler.py:175 - [AlertScheduler] Starting <ambari_agent.apscheduler.scheduler.Scheduler object at 0x10e93d0>; currently running: False
ERROR 2017-12-08 18:33:38,528 Controller.py:506 - Controller thread failed with exception:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/Controller.py", line 486, in run
    self.actionQueue = ActionQueue(self.config, controller=self)
  File "/usr/lib/python2.6/site-packages/ambari_agent/ActionQueue.py", line 79, in __init__
    self.statusCommandResultQueue = multiprocessing.Queue() # this queue is filled by StatuCommandsExecutor.
  File "/usr/lib64/python2.6/multiprocessing/__init__.py", line 213, in Queue
    return Queue(maxsize)
  File "/usr/lib64/python2.6/multiprocessing/queues.py", line 37, in __init__
    self._rlock = Lock()
  File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 117, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1)
  File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 49, in __init__
    sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
OSError: [Errno 13] Permission denied
ERROR 2017-12-08 18:33:40,530 main.py:477 - Exiting with exception:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 472, in <module>
    main(heartbeat_stop_callback)
  File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 451, in main
    run_threads(server_hostname, heartbeat_stop_callback)
  File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 341, in run_threads
    controller.get_status_commands_executor().kill("AGENT_STOPPED", can_relaunch=False)
AttributeError: 'NoneType' object has no attribute 'kill'
INFO 2017-12-08 18:33:40,532 ExitHelper.py:56 - Performing cleanup before exiting...
INFO 2017-12-08 18:33:40,532 threadpool.py:120 - Shutting down thread pool
INFO 2017-12-08 18:33:40,532 scheduler.py:606 - Scheduler has been shut down
INFO 2017-12-08 18:33:40,533 threadpool.py:58 - Started thread pool with 3 core threads and 20 maximum threads
INFO 2017-12-08 18:33:40,533 AlertSchedulerHandler.py:185 - [AlertScheduler] Stopped the alert scheduler.
INFO 2017-12-08 18:33:40,533 threadpool.py:120 - Shutting down thread pool
INFO 2017-12-08 18:33:40,550 Controller.py:151 - Server connection disconnected.
As mentioned, all the sudo permissions listed here are present in the /etc/sudoers.d/ambari file.
I should probably add that the above happened when running in LXC containers using vagrant-lxc. It seems to be related to the permissions the Python scripts need on shared memory (/dev/shm).
However, on VMware this worked fine.
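To narrow this down, a small diagnostic along the following lines could be run as the ambari user inside the container. It is only a sketch: the function name check_shared_memory and the report keys are made up for illustration, and it assumes that /dev/shm is where the semaphores live on this system. It tries to create the same multiprocessing lock that fails in the traceback and reports whether /dev/shm is present and writable.

```python
import os
import multiprocessing


def check_shared_memory(shm_dir="/dev/shm"):
    """Report whether the current user can use shared memory.

    Hypothetical helper, not part of Ambari. The default path is an
    assumption about where POSIX semaphores are backed on this host.
    """
    result = {
        "user_id": os.getuid(),
        "shm_exists": os.path.isdir(shm_dir),
        "shm_writable": os.access(shm_dir, os.W_OK | os.X_OK),
        "lock_ok": False,
        "lock_error": None,
    }
    try:
        # Same path as in the traceback: Queue() -> Lock() -> SemLock
        multiprocessing.Lock()
        result["lock_ok"] = True
    except OSError as exc:
        result["lock_error"] = str(exc)
    return result


if __name__ == "__main__":
    for key, value in sorted(check_shared_memory().items()):
        print("%s: %s" % (key, value))
```

If lock_ok is False for the ambari user but True for root, that would point at the /dev/shm permissions in the container rather than at the agent installation.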
Check your /etc/hosts file and make sure it has entries for all hosts. Also check that you pulled the correct Ambari build for your OS (CentOS 6 vs. CentOS 7); a mismatch can cause problems with Python. It doesn't look like you have that issue, but it can happen if you are using Satellite and the wrong link is used. Lastly, check the firewalls.
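The host entry and firewall checks can be done quickly from the agent host with something like the sketch below. The function name check_server_reachable and the report keys are hypothetical; it only resolves the name and attempts a plain TCP connection, so it says nothing about SSL or agent registration.

```python
import socket


def check_server_reachable(host, port, timeout=5.0):
    """Resolve host and try a plain TCP connection to host:port.

    Hypothetical helper for illustration; not part of Ambari.
    """
    report = {
        "host": host,
        "port": port,
        "address": None,
        "connected": False,
        "error": None,
    }
    try:
        # Fails here if /etc/hosts or DNS has no entry for the host
        report["address"] = socket.gethostbyname(host)
        # Fails here if the port is closed or blocked by a firewall
        conn = socket.create_connection((host, port), timeout)
        conn.close()
        report["connected"] = True
    except socket.error as exc:
        report["error"] = str(exc)
    return report
```

For the setup in the question, calling check_server_reachable("ambari-server.mydomain.com", 8440) on the agent host would show whether the name resolves and whether the port from the log is reachable.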
  File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 341, in run_threads
    controller.get_status_commands_executor().kill("AGENT_STOPPED", can_relaunch=False)
AttributeError: 'NoneType' object has no attribute 'kill'
Based on the above error, it looks like your ambari-agent might not be installed properly, or some of its scripts might be missing or from an old version.
So please check the ambari-agent package version on the problematic host to confirm that the agent version is correct.
If needed, please reinstall the ambari-agent.
# rpm -qa | grep ambari-agent
# yum clean all
# yum reinstall ambari-agent -y