I installed Ranger on our otherwise stable HDP 2.3 cluster yesterday and am now experiencing a slew of problems with Ambari restart, stop and start. The symptoms are unfortunately quite varied:
- Restart button offered after configuration save either does absolutely nothing, or only partially restarts the service
- In some cases, the 'Restart' option on the service action pulldown fails to successfully restart components
- In some cases, the 'Start' option is grayed out on the individual service action pulldown even in the case where the service is definitely not running.
- Clients are randomly marked as requiring restart and cannot be cleared
- The metrics monitor daemon on our edge node cannot be successfully started by Ambari, but works fine when started manually on the command line at that machine.
- Fortunately, 99% of the components can be controlled / restarted by the pulldowns on the individual host pages. However, in a 32 machine cluster, hunting for things that aren't correctly started / re-started is very tedious.
What could be broken? I may not be looking in the right places, but there doesn't seem to be any smoking gun in the logs.
A few things:
root@ambarihost> ambari-server restart
And:
root@ambariagenthost> ambari-agent restart
In my experience, the Start option is grayed out when the reverse lookup from Ambari fails for a host as its components register. I'm not sure why that would change after installing Ranger, but can you check that forward and reverse name resolution still work on your hosts?
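If it helps, here is a minimal sketch of that check in Python, assuming it is run on the Ambari server host against each agent's fully qualified name. The function name `check_host` and the injectable `forward` / `reverse` parameters are my own, not anything Ambari provides; they default to the standard `socket` resolvers.

```python
import socket


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return (ok, detail) after a forward and then a reverse lookup of `name`.

    `forward` and `reverse` are hypothetical hooks so the logic can be tried
    without touching real DNS; by default they are the socket module resolvers.
    """
    try:
        ip = forward(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup of {name} failed: {exc}"
    try:
        primary, aliases, _addresses = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup of {ip} failed: {exc}"
    # The name we started from should come back as the primary name or an alias.
    known = {primary.lower()} | {alias.lower() for alias in aliases}
    if name.lower() not in known:
        return False, f"{ip} resolves back to {primary}, not {name}"
    return True, f"{name} <-> {ip}"
```

Calling `check_host(h)` for each hostname in the cluster and printing only the failures should show quickly whether any host resolves inconsistently.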
Tail /var/log/ambari-agent/ambari-agent.log and see if you can find anything under WARNING or ERROR. Note that sometimes a 'smoking gun' event is logged as INFO. You're also looking for Python errors, which will usually have a "Traceback" line. Post the log here and I can take a more in-depth look.
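With 32 hosts, reading every agent log by eye is tedious, so a small filter may help. This is only a sketch: it assumes the level names WARNING and ERROR appear as plain words in the log line and that Python errors carry the usual "Traceback (most recent call last)" header. The names `scan_log` and `scan_file` are made up for illustration.

```python
import re

# Lines worth a second look: a WARNING or ERROR level, or a Python traceback header.
PATTERN = re.compile(r"\b(?:WARNING|ERROR)\b|Traceback \(most recent call last\)")


def scan_log(lines):
    """Yield (line_number, text) for each line that matches PATTERN."""
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            yield number, line.rstrip("\n")


def scan_file(path):
    """Scan a log file on disk and return the matching lines as a list."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, errors="replace") as handle:
        return list(scan_log(handle))
```

For example, `scan_file("/var/log/ambari-agent/ambari-agent.log")` on an agent host returns the suspicious lines with their line numbers. Since a relevant event can also be logged as INFO, an empty result does not prove the log is clean.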
I restarted the server and all agents (clustershell to the rescue). It made no difference whatsoever. I see no Python tracebacks or any other obvious errors in the server log or in the couple of agent logs I spot-checked (on machines reporting client restart as required). There are no apparent DNS or reverse DNS issues on the hosts I spot-checked. If there were, I'd expect to see a lot more issues!
What is Ambari looking at to determine that a client needs a restart? And, further, why would a client ever need a restart?
In the case where a service (e.g. Oozie) is reporting 7 clients as needing a restart, the orange 'Restart' --> 'Restart all affected' pulldown does absolutely nothing. No operations are ever queued. A similar request for MapReduce2 (also showing seven clients as stale) does create an operation, but it disappears after a few seconds with no change in the restart status.
I have restarted the affected services so many times my fingers are numb :-). It's quite painful to cycle them now that the restart button fails to work correctly. Most of the time, the Service Action restart request is ignored and I have to go to each individual machine and cycle components at that level. With 32 hosts this gets old quickly.
I have tried refreshing configs many times. I'm not sure that it's actually doing anything, since no operation is ever queued.
In summary: There appears to be major breakage in the start/stop/restart scheme.
Ideally, installing Ranger shouldn't cause the issues you described! What is the Ambari version? What is the Ambari DB version? I hope you also tried another browser / system for these operations. We could clear the pending requests, but it's risky right now since it's not clear whether Ranger is correctly installed. Is the Ranger console working well?
PS: maybe you should pursue this issue with Hortonworks Technical Support.
How would I determine the Ambari DB version?
I will try from a different browser. That's a good thought, although the browser on the master node had never previously caused problems.
Ranger is working perfectly aside from persistent issues with HBase master audit logging for user actions, but that's cosmetic at this point. Access controls are working fine.
Would appreciate knowing how to forcibly clear the notifications.
If I can convince my organization to purchase a service agreement, I'll certainly take this up with Hortonworks support.