Created on 08-08-2018 11:59 AM - edited 09-16-2022 06:34 AM
Hi,
Any idea why I am getting this error on restart after installing the Druid service via Ambari on a fresh HDP 3.0 cluster? The cluster was deployed using Cloudbreak on Azure. Ambari version is 2.7.0.0.
stderr: /var/lib/ambari-agent/data/errors-82.txt
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/application_timeline_server.py", line 97, in <module>
    ApplicationTimelineServer().execute()
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 353, in execute
    method(env)
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 933, in restart
    if componentCategory and componentCategory.strip().lower() == 'CLIENT'.lower():
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/config_dictionary.py", line 73, in __getattr__
    raise Fail("Configuration parameter '" + self.name + "' was not found in configurations dictionary!")
resource_management.core.exceptions.Fail: Configuration parameter 'roleParams' was not found in configurations dictionary!
stdout: /var/lib/ambari-agent/data/output-82.txt
2018-08-08 11:35:16,639 - Stack Feature Version Info: Cluster Stack=3.0, Command Stack=None, Command Version=3.0.0.0-1334 -> 3.0.0.0-1334
2018-08-08 11:35:16,654 - Using hadoop conf dir: /usr/hdp/3.0.0.0-1334/hadoop/conf
2018-08-08 11:35:16,797 - Stack Feature Version Info: Cluster Stack=3.0, Command Stack=None, Command Version=3.0.0.0-1334 -> 3.0.0.0-1334
2018-08-08 11:35:16,801 - Using hadoop conf dir: /usr/hdp/3.0.0.0-1334/hadoop/conf
2018-08-08 11:35:16,802 - Group['hdfs'] {}
2018-08-08 11:35:16,803 - Group['hadoop'] {}
2018-08-08 11:35:16,803 - Group['users'] {}
2018-08-08 11:35:16,804 - User['hive'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,804 - User['yarn-ats'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,805 - User['druid'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,806 - User['zookeeper'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,807 - User['ams'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,808 - User['ambari-qa'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop', 'users'], 'uid': None}
2018-08-08 11:35:16,808 - User['tez'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop', 'users'], 'uid': None}
2018-08-08 11:35:16,809 - User['hdfs'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hdfs', 'hadoop'], 'uid': None}
2018-08-08 11:35:16,810 - User['yarn'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,811 - User['mapred'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,811 - File['/var/lib/ambari-agent/tmp/changeUid.sh'] {'content': StaticFile('changeToSecureUid.sh'), 'mode': 0555}
2018-08-08 11:35:16,813 - Execute['/var/lib/ambari-agent/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa 0'] {'not_if': '(test $(id -u ambari-qa) -gt 1000) || (false)'}
2018-08-08 11:35:16,817 - Skipping Execute['/var/lib/ambari-agent/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa 0'] due to not_if
2018-08-08 11:35:16,818 - Group['hdfs'] {}
2018-08-08 11:35:16,818 - User['hdfs'] {'fetch_nonlocal_groups': True, 'groups': ['hdfs', 'hadoop', u'hdfs']}
2018-08-08 11:35:16,819 - FS Type:
2018-08-08 11:35:16,819 - Directory['/etc/hadoop'] {'mode': 0755}
2018-08-08 11:35:16,832 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/hadoop-env.sh'] {'content': InlineTemplate(...), 'owner': 'hdfs', 'group': 'hadoop'}
2018-08-08 11:35:16,833 - Directory['/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir'] {'owner': 'hdfs', 'group': 'hadoop', 'mode': 01777}
2018-08-08 11:35:16,846 - Execute[('setenforce', '0')] {'not_if': '(! which getenforce ) || (which getenforce && getenforce | grep -q Disabled)', 'sudo': True, 'only_if': 'test -f /selinux/enforce'}
2018-08-08 11:35:16,853 - Skipping Execute[('setenforce', '0')] due to not_if
2018-08-08 11:35:16,853 - Directory['/var/log/hadoop'] {'owner': 'root', 'create_parents': True, 'group': 'hadoop', 'mode': 0775, 'cd_access': 'a'}
2018-08-08 11:35:16,855 - Directory['/var/run/hadoop'] {'owner': 'root', 'create_parents': True, 'group': 'root', 'cd_access': 'a'}
2018-08-08 11:35:16,855 - Changing owner for /var/run/hadoop from 1011 to root
2018-08-08 11:35:16,855 - Changing group for /var/run/hadoop from 988 to root
2018-08-08 11:35:16,855 - Directory['/tmp/hadoop-hdfs'] {'owner': 'hdfs', 'create_parents': True, 'cd_access': 'a'}
2018-08-08 11:35:16,859 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/commons-logging.properties'] {'content': Template('commons-logging.properties.j2'), 'owner': 'hdfs'}
2018-08-08 11:35:16,860 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/health_check'] {'content': Template('health_check.j2'), 'owner': 'hdfs'}
2018-08-08 11:35:16,866 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/log4j.properties'] {'content': InlineTemplate(...), 'owner': 'hdfs', 'group': 'hadoop', 'mode': 0644}
2018-08-08 11:35:16,875 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/hadoop-metrics2.properties'] {'content': InlineTemplate(...), 'owner': 'hdfs', 'group': 'hadoop'}
2018-08-08 11:35:16,875 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/task-log4j.properties'] {'content': StaticFile('task-log4j.properties'), 'mode': 0755}
2018-08-08 11:35:16,876 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/configuration.xsl'] {'owner': 'hdfs', 'group': 'hadoop'}
2018-08-08 11:35:16,880 - File['/etc/hadoop/conf/topology_mappings.data'] {'owner': 'hdfs', 'content': Template('topology_mappings.data.j2'), 'only_if': 'test -d /etc/hadoop/conf', 'group': 'hadoop', 'mode': 0644}
2018-08-08 11:35:16,884 - File['/etc/hadoop/conf/topology_script.py'] {'content': StaticFile('topology_script.py'), 'only_if': 'test -d /etc/hadoop/conf', 'mode': 0755}
2018-08-08 11:35:16,887 - Skipping unlimited key JCE policy check and setup since the Java VM is not managed by Ambari
Command failed after 1 tries
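For context, the stderr traceback boils down to this: the restart() handler in script.py reads config.roleParams, and the resource_management configuration wrapper raises Fail whenever an attribute is missing from the command JSON the server sent to the agent. A minimal sketch of that pattern (illustrative only, not the actual Ambari code):

# Simplified illustration of the failure above, not the real Ambari source:
# the command JSON is wrapped in an attribute-style dictionary that raises
# Fail for any missing key, so a command file without "roleParams" makes the
# config.roleParams lookup in script.py's restart() blow up.

class Fail(Exception):
    pass

class ConfigDict(dict):
    def __getattr__(self, name):
        if name not in self:
            raise Fail("Configuration parameter '" + name +
                       "' was not found in configurations dictionary!")
        return self[name]

config = ConfigDict({"clusterName": "example"})   # note: no "roleParams" key
try:
    config.roleParams
except Fail as e:
    print(e)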
Created 08-08-2018 12:13 PM
Hi @Paul Norris ,
I just checked what the value of 'roleParams' should be when restarting the Application Timeline Server on my local cluster.
[root@anaik1 data]# cat command-961.json | grep -i -5 roleParams
    ]
  },
  "clusterName": "asnaik",
  "commandType": "EXECUTION_COMMAND",
  "taskId": 961,
  "roleParams": {
    "component_category": "MASTER"
  },
  "componentVersionMap": {
    "HDFS": {
      "NAMENODE": "3.0.0.0-1634",
It should look like the above. Can you please verify the value in yours using the command:
cat /var/lib/ambari-agent/data/command-82.json |grep -i -5 roleParams
If it is empty, I would suggest restarting the Ambari server once and trying again.
If that does not work, watch /var/log/ambari-server/ambari-server.log while restarting the App Timeline Server; you should see exceptions that give a clue about what is going wrong.
Hope this helps you in troubleshooting.
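If the grep context is hard to read, a small throwaway script shows the same thing; the file name comes from the error output above (adjust the command number for your task):

#!/usr/bin/env python
# Print the roleParams block (if any) from the command file the agent executed.
import json

with open("/var/lib/ambari-agent/data/command-82.json") as f:
    command = json.load(f)

# Make the absence explicit instead of printing nothing.
print(json.dumps(command.get("roleParams", "MISSING: no roleParams key"), indent=2))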
Created 08-08-2018 12:33 PM
Thanks for trying to help. I find no roleParams value when I search command-82.json, and still nothing after restarting Ambari. After restarting the Ambari server I tried restarting the services again, and watching the ambari-server.log file I now see it looping on a NullPointerException. The restart in Ambari has stalled at 8% on "Restart App Timeline Server" on the master node; the worker and compute nodes have completed their restarts.
2018-08-08 12:29:31,329 INFO [agent-register-processor-5] HeartbeatController:105 - java.lang.NullPointerException
    at org.apache.ambari.server.state.host.HostImpl.calculateHostStatus(HostImpl.java:1259)
    at org.apache.ambari.server.state.host.HostImpl.restoreComponentsStatuses(HostImpl.java:1230)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:365)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:333)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:351)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:293)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:437)
    at org.apache.ambari.server.state.host.HostImpl.handleEvent(HostImpl.java:597)
    at org.apache.ambari.server.agent.HeartBeatHandler.handleRegistration(HeartBeatHandler.java:345)
    at org.apache.ambari.server.agent.stomp.HeartbeatController.lambda$register$0(HeartbeatController.java:100)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Created 08-08-2018 05:34 PM
Hi @Paul Norris ,
This NPE is due to the Ambari agent failing to register with the Ambari server (agent-register-processor-5).
I would suggest you abort the restart operation.
Check whether the heartbeat between the Ambari agent and the Ambari server has been lost (a quick REST check is sketched at the end of this reply).
Restart the Ambari agent on that node.
Make sure a service check for one of the services on this node passes.
Then try the restart again.
Without proper logs, I am afraid I cannot tell what the issue is 😞.
I would suggest raising a case with the Hortonworks Support Portal, which will help you resolve this issue.
Or you can analyse
/var/log/ambari-agent/ambari-agent.log and /var/log/ambari-server/ambari-server.log
and update your findings here; I can try to help.
Here is a similar bug in Ambari 2.7.0 which I suspect might be the root cause: https://issues.apache.org/jira/browse/AMBARI-23882
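For the heartbeat check above, the Ambari REST API also exposes the host state and last heartbeat time, so a quick query can confirm whether the server still sees the agent. A rough sketch, assuming default admin credentials and placeholder cluster/host names; the Hosts/* field names are from memory, so verify them against your Ambari version:

# Ask the Ambari server what it currently knows about the problem host.
# Server URL, cluster name, host name and credentials below are placeholders.
import base64
import json
import urllib.request

AMBARI = "http://ambari-server.example.com:8080"
CLUSTER = "mycluster"
HOST = "master-node.example.com"
AUTH = base64.b64encode(b"admin:admin").decode()

url = ("%s/api/v1/clusters/%s/hosts/%s"
       "?fields=Hosts/host_state,Hosts/host_status,Hosts/last_heartbeat_time"
       % (AMBARI, CLUSTER, HOST))
req = urllib.request.Request(url, headers={"Authorization": "Basic " + AUTH})
with urllib.request.urlopen(req) as resp:
    print(json.dumps(json.loads(resp.read())["Hosts"], indent=2))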
Created 08-09-2018 01:29 PM
Thanks again for the help. Here's where I have got to.
INFO 2018-08-09 13:27:13,492 __init__.py:49 - Event from server at /user/ (correlation_id=0): {u'status': u'OK', u'exitstatus': 1, u'id': -1}
INFO 2018-08-09 13:27:13,497 HeartbeatThread.py:128 - Registration response received
ERROR 2018-08-09 13:27:13,498 HeartbeatThread.py:104 - Exception in HeartbeatThread. Re-running the registration
Traceback (most recent call last):
  File "/usr/lib/ambari-agent/lib/ambari_agent/HeartbeatThread.py", line 91, in run
    self.register()
  File "/usr/lib/ambari-agent/lib/ambari_agent/HeartbeatThread.py", line 131, in register
    self.handle_registration_response(response)
  File "/usr/lib/ambari-agent/lib/ambari_agent/HeartbeatThread.py", line 189, in handle_registration_response
    raise Exception(error_message)
Exception: Registration failed
INFO 2018-08-09 13:27:13,499 transport.py:358 - Receiver loop ended
There are other errors about cluster id=2 configurations missing, but I assume that is related to not being able to register. These errors loop endlessly. The master node is also showing a status of UNHEALTHY in Cloudbreak.
Created 08-09-2018 01:34 PM
Hi @Paul Norris ,
As suggested, can you please run the command
ambari-agent restart
on your master node and check the logs to see whether the agent registration succeeds?
I hope that after the agent restart it will be able to register, and then you can restart your services as required.
If it fails to register, please analyse the logs as mentioned in the previous comment, or attach the log snippet here.
Hope this helps you.
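A quick way to pull just the registration-related lines out of the agent log after the restart; the marker strings are taken from the log snippets already posted in this thread:

# Filter /var/log/ambari-agent/ambari-agent.log down to registration activity.
LOG = "/var/log/ambari-agent/ambari-agent.log"
MARKERS = ("Registration response received",
           "Registration failed",
           "Re-running the registration",
           "HeartbeatThread")

with open(LOG) as f:
    for line in f:
        if any(marker in line for marker in MARKERS):
            print(line.rstrip())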
Created 08-09-2018 01:48 PM
Hi @Akhil S Naik ,
Thanks, I have run ambari-agent restart multiple times. Here is what happens:
It would appear ambari-agent is un-killable/un-stoppable. There must be a supervisor forcing a restart on failure, which causes multiple instances to run at once and leads to the ping port 8670 problem. I think I can fix that by killing the second process, but any idea what is causing the HeartbeatThread exception?
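A quick sanity check for the duplicate-agent theory; the pgrep pattern is an assumption, and 8670 is the ping port mentioned above:

# List ambari-agent-looking processes and test whether the ping port answers.
import socket
import subprocess

pids = subprocess.run(["pgrep", "-f", "ambari_agent"],
                      capture_output=True, text=True).stdout.split()
print("processes matching 'ambari_agent':", pids or "none")

sock = socket.socket()
port_in_use = sock.connect_ex(("127.0.0.1", 8670)) == 0   # 0 means something is listening
sock.close()
print("ping port 8670 in use:", port_in_use)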
Here's one loop of the log after I've cleared up the ping port issue: ambari-agent.log.
Update: Looking at the ambari-server and agent logs, it appears the agent is registering with the server (the server logs show the registration), but then there is a HeartbeatThread exception on the agent and a NullPointerException on the server, and both seem to loop without completing what they started. Here's the repeating error on ambari-server again.
2018-08-09 15:07:11,984 INFO [agent-register-processor-2] HeartBeatHandler:312 - agentOsType = centos7
2018-08-09 15:07:11,991 INFO [agent-register-processor-2] HostImpl:345 - Received host registration, host=[hostname=cs-eu-dhp-3-0-0-cluster-m2,fqdn=cs-eu-dhp-3-0-0-cluster-m2.XXX-NODE-FQDN-XXX,domain=XXX-NODE-FQDN-XXX,architecture=x86_64,processorcount=4,physicalprocessorcount=4,osname=centos,osversion=7.5.1804,osfamily=redhat,memory=28801164,uptime_hours=3,mounts=(available=22377356,mountpoint=/,used=8551792,percent=28%,size=30929148,device=/dev/sda2,type=xfs)(available=195727412,mountpoint=/mnt/resource,used=61468,percent=1%,size=206290920,device=/dev/sdb1,type=ext4)(available=97673124,mountpoint=/hadoopfs/fs1,used=148500,percent=1%,size=103080888,device=/dev/sdc,type=ext4)] , registrationTime=1533827231984, agentVersion=2.7.0.0
2018-08-09 15:07:11,991 INFO [agent-register-processor-2] TopologyManager:643 - TopologyManager.onHostRegistered: Entering
2018-08-09 15:07:11,991 INFO [agent-register-processor-2] TopologyManager:697 - Host cs-eu-dhp-3-0-0-cluster-m2.XXX-NODE-FQDN-XXX re-registered, will not be added to the available hosts list
2018-08-09 15:07:11,991 INFO [agent-register-processor-2] HeartbeatController:105 - java.lang.NullPointerException
    at org.apache.ambari.server.state.host.HostImpl.calculateHostStatus(HostImpl.java:1259)
    at org.apache.ambari.server.state.host.HostImpl.restoreComponentsStatuses(HostImpl.java:1230)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:365)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:333)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:351)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:293)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:437)
    at org.apache.ambari.server.state.host.HostImpl.handleEvent(HostImpl.java:597)
    at org.apache.ambari.server.agent.HeartBeatHandler.handleRegistration(HeartBeatHandler.java:345)
    at org.apache.ambari.server.agent.stomp.HeartbeatController.lambda$register$0(HeartbeatController.java:100)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Created 08-09-2018 03:38 PM
Hi @Paul Norris,
I suspect you are hitting https://issues.apache.org/jira/browse/AMBARI-23838.
Can you confirm your Ambari version using the command:
ambari-server --version
Created 08-09-2018 03:49 PM
Yes, that looks to be the problem I am seeing.
My version numbers are:
Ambari Server: 2.7.0.0-508
Ambari Agent: 2.7.0.0.
Paul
Created 08-09-2018 03:59 PM
Hi @Paul Norris ,
I just checked my ambari version (shipped with HDP-3.0)
[root@anaik1 ~]# ambari-server --version
2.7.0.0-897
You might need to upgrade your Ambari to this version; refer to: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.0.0/bk_ambari-upgrade/content/ambari_upgrade_gu...
I verified that it ships with 2.7.0.0-897, which has the fix.
You will get the repo info from that page; as far as I can see from the code, the issue is fixed in that version.
Workaround: reading the code, I understand that uninstalling DRUID will help.
Code reference: https://github.com/kasakrisz/ambari/blob/f55e7277fb2f78e02f6df8a68c063206862ef3a6/ambari-server/src/...
(It may be that Druid does not belong to any category.)
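To make that suspicion concrete: the failing method walks the components reported for the host and branches on each component's category to work out the host health status. Below is a rough Python paraphrase of that pattern (the real code is Java, see the link above, so treat this as an illustration rather than the actual implementation); a component that comes back with no category fails at exactly the spot the stack trace points at:

# Illustrative paraphrase only, not HostImpl.calculateHostStatus itself.
def calculate_host_status(components):
    masters_down = 0
    slaves_down = 0
    for comp in components:
        category = comp.get("category")        # None for a component with no category
        if category.upper() == "MASTER":       # the rough equivalent of the NullPointerException
            if comp.get("state") != "STARTED":
                masters_down += 1
        elif category.upper() == "SLAVE":
            if comp.get("state") != "STARTED":
                slaves_down += 1
    if masters_down:
        return "UNHEALTHY"
    return "ALERT" if slaves_down else "HEALTHY"

# Example: a component registered without a category, as suspected for Druid.
try:
    calculate_host_status([{"name": "DRUID_BROKER", "state": "STARTED"}])
except AttributeError as err:
    print("category was None ->", err)   # on the Java side this surfaces as the NPE above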
Please accept answer as helpful if this helps you 🙂