Created on 08-08-2018 11:59 AM - edited 09-16-2022 06:34 AM
Hi,
Any idea why I am getting this error on restart after installing the Druid service via Ambari on a fresh HDP 3.0 cluster? The cluster was deployed using Cloudbreak on Azure. Ambari version is 2.7.0.0.
stderr: /var/lib/ambari-agent/data/errors-82.txt
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/application_timeline_server.py", line 97, in <module>
    ApplicationTimelineServer().execute()
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 353, in execute
    method(env)
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 933, in restart
    if componentCategory and componentCategory.strip().lower() == 'CLIENT'.lower():
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/config_dictionary.py", line 73, in __getattr__
    raise Fail("Configuration parameter '" + self.name + "' was not found in configurations dictionary!")
resource_management.core.exceptions.Fail: Configuration parameter 'roleParams' was not found in configurations dictionary!
stdout: /var/lib/ambari-agent/data/output-82.txt
2018-08-08 11:35:16,639 - Stack Feature Version Info: Cluster Stack=3.0, Command Stack=None, Command Version=3.0.0.0-1334 -> 3.0.0.0-1334
2018-08-08 11:35:16,654 - Using hadoop conf dir: /usr/hdp/3.0.0.0-1334/hadoop/conf
2018-08-08 11:35:16,797 - Stack Feature Version Info: Cluster Stack=3.0, Command Stack=None, Command Version=3.0.0.0-1334 -> 3.0.0.0-1334
2018-08-08 11:35:16,801 - Using hadoop conf dir: /usr/hdp/3.0.0.0-1334/hadoop/conf
2018-08-08 11:35:16,802 - Group['hdfs'] {}
2018-08-08 11:35:16,803 - Group['hadoop'] {}
2018-08-08 11:35:16,803 - Group['users'] {}
2018-08-08 11:35:16,804 - User['hive'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,804 - User['yarn-ats'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,805 - User['druid'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,806 - User['zookeeper'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,807 - User['ams'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,808 - User['ambari-qa'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop', 'users'], 'uid': None}
2018-08-08 11:35:16,808 - User['tez'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop', 'users'], 'uid': None}
2018-08-08 11:35:16,809 - User['hdfs'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hdfs', 'hadoop'], 'uid': None}
2018-08-08 11:35:16,810 - User['yarn'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,811 - User['mapred'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2018-08-08 11:35:16,811 - File['/var/lib/ambari-agent/tmp/changeUid.sh'] {'content': StaticFile('changeToSecureUid.sh'), 'mode': 0555}
2018-08-08 11:35:16,813 - Execute['/var/lib/ambari-agent/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa 0'] {'not_if': '(test $(id -u ambari-qa) -gt 1000) || (false)'}
2018-08-08 11:35:16,817 - Skipping Execute['/var/lib/ambari-agent/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa 0'] due to not_if
2018-08-08 11:35:16,818 - Group['hdfs'] {}
2018-08-08 11:35:16,818 - User['hdfs'] {'fetch_nonlocal_groups': True, 'groups': ['hdfs', 'hadoop', u'hdfs']}
2018-08-08 11:35:16,819 - FS Type:
2018-08-08 11:35:16,819 - Directory['/etc/hadoop'] {'mode': 0755}
2018-08-08 11:35:16,832 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/hadoop-env.sh'] {'content': InlineTemplate(...), 'owner': 'hdfs', 'group': 'hadoop'}
2018-08-08 11:35:16,833 - Directory['/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir'] {'owner': 'hdfs', 'group': 'hadoop', 'mode': 01777}
2018-08-08 11:35:16,846 - Execute[('setenforce', '0')] {'not_if': '(! which getenforce ) || (which getenforce && getenforce | grep -q Disabled)', 'sudo': True, 'only_if': 'test -f /selinux/enforce'}
2018-08-08 11:35:16,853 - Skipping Execute[('setenforce', '0')] due to not_if
2018-08-08 11:35:16,853 - Directory['/var/log/hadoop'] {'owner': 'root', 'create_parents': True, 'group': 'hadoop', 'mode': 0775, 'cd_access': 'a'}
2018-08-08 11:35:16,855 - Directory['/var/run/hadoop'] {'owner': 'root', 'create_parents': True, 'group': 'root', 'cd_access': 'a'}
2018-08-08 11:35:16,855 - Changing owner for /var/run/hadoop from 1011 to root
2018-08-08 11:35:16,855 - Changing group for /var/run/hadoop from 988 to root
2018-08-08 11:35:16,855 - Directory['/tmp/hadoop-hdfs'] {'owner': 'hdfs', 'create_parents': True, 'cd_access': 'a'}
2018-08-08 11:35:16,859 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/commons-logging.properties'] {'content': Template('commons-logging.properties.j2'), 'owner': 'hdfs'}
2018-08-08 11:35:16,860 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/health_check'] {'content': Template('health_check.j2'), 'owner': 'hdfs'}
2018-08-08 11:35:16,866 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/log4j.properties'] {'content': InlineTemplate(...), 'owner': 'hdfs', 'group': 'hadoop', 'mode': 0644}
2018-08-08 11:35:16,875 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/hadoop-metrics2.properties'] {'content': InlineTemplate(...), 'owner': 'hdfs', 'group': 'hadoop'}
2018-08-08 11:35:16,875 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/task-log4j.properties'] {'content': StaticFile('task-log4j.properties'), 'mode': 0755}
2018-08-08 11:35:16,876 - File['/usr/hdp/3.0.0.0-1334/hadoop/conf/configuration.xsl'] {'owner': 'hdfs', 'group': 'hadoop'}
2018-08-08 11:35:16,880 - File['/etc/hadoop/conf/topology_mappings.data'] {'owner': 'hdfs', 'content': Template('topology_mappings.data.j2'), 'only_if': 'test -d /etc/hadoop/conf', 'group': 'hadoop', 'mode': 0644}
2018-08-08 11:35:16,884 - File['/etc/hadoop/conf/topology_script.py'] {'content': StaticFile('topology_script.py'), 'only_if': 'test -d /etc/hadoop/conf', 'mode': 0755}
2018-08-08 11:35:16,887 - Skipping unlimited key JCE policy check and setup since the Java VM is not managed by Ambari
Command failed after 1 tries
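For context, the stderr traceback boils down to this: the restart() handler in script.py reads config.roleParams, and the resource_management configuration wrapper raises Fail whenever an attribute is missing from the command JSON the server sent to the agent. A minimal sketch of that pattern (illustrative only, not the actual Ambari code):

# Simplified illustration of the failure above, not the real Ambari source:
# the command JSON is wrapped in an attribute-style dictionary that raises
# Fail for any missing key, so a command file without "roleParams" makes the
# config.roleParams lookup in script.py's restart() blow up.

class Fail(Exception):
    pass

class ConfigDict(dict):
    def __getattr__(self, name):
        if name not in self:
            raise Fail("Configuration parameter '" + name +
                       "' was not found in configurations dictionary!")
        return self[name]

config = ConfigDict({"clusterName": "example"})   # note: no "roleParams" key
try:
    config.roleParams
except Fail as e:
    print(e)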
Created 08-08-2018 12:13 PM
Hi @Paul Norris ,
I just checked what the value of 'roleParams' should be when restarting the Application Timeline Server on my local cluster.
[root@anaik1 data]# cat command-961.json | grep -i -5 roleParams
    ]
  },
  "clusterName": "asnaik",
  "commandType": "EXECUTION_COMMAND",
  "taskId": 961,
  "roleParams": {
    "component_category": "MASTER"
  },
  "componentVersionMap": {
    "HDFS": {
      "NAMENODE": "3.0.0.0-1634",
It should look like the above. Can you please verify the value in yours using the command:
cat /var/lib/ambari-agent/data/command-82.json |grep -i -5 roleParams
If it is empty, I would suggest restarting the Ambari server once and trying again.
If that does not work, watch /var/log/ambari-server/ambari-server.log while restarting the App Timeline Server; you should see exceptions that give a clue about what is going wrong.
Hope this helps you in troubleshooting.
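If the grep context is hard to read, a small throwaway script shows the same thing; the file name comes from the error output above (adjust the command number for your task):

#!/usr/bin/env python
# Print the roleParams block (if any) from the command file the agent executed.
import json

with open("/var/lib/ambari-agent/data/command-82.json") as f:
    command = json.load(f)

# Make the absence explicit instead of printing nothing.
print(json.dumps(command.get("roleParams", "MISSING: no roleParams key"), indent=2))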
Created 08-08-2018 12:33 PM
Thanks for trying to help. I find no roleParams value when I search command-82.json, and still nothing after restarting Ambari. After restarting the Ambari server I tried restarting the services again, and watching the ambari-server.log file I now see it looping on a NullPointerException. The restart in Ambari has stalled at 8% on "Restart App Timeline Server" on the master node; the worker and compute nodes have completed their restarts.
2018-08-08 12:29:31,329 INFO [agent-register-processor-5] HeartbeatController:105 - java.lang.NullPointerException
    at org.apache.ambari.server.state.host.HostImpl.calculateHostStatus(HostImpl.java:1259)
    at org.apache.ambari.server.state.host.HostImpl.restoreComponentsStatuses(HostImpl.java:1230)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:365)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:333)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:351)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:293)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:437)
    at org.apache.ambari.server.state.host.HostImpl.handleEvent(HostImpl.java:597)
    at org.apache.ambari.server.agent.HeartBeatHandler.handleRegistration(HeartBeatHandler.java:345)
    at org.apache.ambari.server.agent.stomp.HeartbeatController.lambda$register$0(HeartbeatController.java:100)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Created 08-08-2018 05:34 PM
Hi @Paul Norris ,
This NPE is due to the Ambari agent failing to register with the Ambari server (agent-register-processor-5).
I would suggest you abort the restart operation.
Check whether the heartbeat between the Ambari agent and the Ambari server has been lost (a quick REST check is sketched at the end of this reply).
Restart the Ambari agent on that node.
Make sure a service check for one of the services on this node passes.
Then try the restart again.
Without proper logs, I am afraid I cannot tell what the issue is 😞.
I would suggest raising a case with the Hortonworks Support Portal, which will help you resolve this issue.
Or you can analyse
/var/log/ambari-agent/ambari-agent.log and /var/log/ambari-server/ambari-server.log
and update your findings here; I can try to help.
Here is a similar bug in Ambari 2.7.0 which I suspect might be the root cause: https://issues.apache.org/jira/browse/AMBARI-23882
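For the heartbeat check above, the Ambari REST API also exposes the host state and last heartbeat time, so a quick query can confirm whether the server still sees the agent. A rough sketch, assuming default admin credentials and placeholder cluster/host names; the Hosts/* field names are from memory, so verify them against your Ambari version:

# Ask the Ambari server what it currently knows about the problem host.
# Server URL, cluster name, host name and credentials below are placeholders.
import base64
import json
import urllib.request

AMBARI = "http://ambari-server.example.com:8080"
CLUSTER = "mycluster"
HOST = "master-node.example.com"
AUTH = base64.b64encode(b"admin:admin").decode()

url = ("%s/api/v1/clusters/%s/hosts/%s"
       "?fields=Hosts/host_state,Hosts/host_status,Hosts/last_heartbeat_time"
       % (AMBARI, CLUSTER, HOST))
req = urllib.request.Request(url, headers={"Authorization": "Basic " + AUTH})
with urllib.request.urlopen(req) as resp:
    print(json.dumps(json.loads(resp.read())["Hosts"], indent=2))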
Created 08-09-2018 01:29 PM
Thanks again for the help. Here's where I have got to.
INFO 2018-08-09 13:27:13,492 __init__.py:49 - Event from server at /user/ (correlation_id=0): {u'status': u'OK', u'exitstatus': 1, u'id': -1}
INFO 2018-08-09 13:27:13,497 HeartbeatThread.py:128 - Registration response received
ERROR 2018-08-09 13:27:13,498 HeartbeatThread.py:104 - Exception in HeartbeatThread. Re-running the registration
Traceback (most recent call last):
  File "/usr/lib/ambari-agent/lib/ambari_agent/HeartbeatThread.py", line 91, in run
    self.register()
  File "/usr/lib/ambari-agent/lib/ambari_agent/HeartbeatThread.py", line 131, in register
    self.handle_registration_response(response)
  File "/usr/lib/ambari-agent/lib/ambari_agent/HeartbeatThread.py", line 189, in handle_registration_response
    raise Exception(error_message)
Exception: Registration failed
INFO 2018-08-09 13:27:13,499 transport.py:358 - Receiver loop ended
There are other errors about cluster id=2 configurations missing, but I assume that is related to not being able to register. These errors loop endlessly. The master node is also showing a status of UNHEALTHY in Cloudbreak.
Created 08-09-2018 01:34 PM
Hi @Paul Norris ,
As suggested, can you please run the command
ambari-agent restart
on your master node and check the logs to see whether the agent registration succeeds?
I hope that after the agent restart it will be able to register, and then you can restart your services as required.
If it fails to register, please analyse the logs as mentioned in the previous comment, or attach the log snippet here.
Hope this helps you.
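A quick way to pull just the registration-related lines out of the agent log after the restart; the marker strings are taken from the log snippets already posted in this thread:

# Filter /var/log/ambari-agent/ambari-agent.log down to registration activity.
LOG = "/var/log/ambari-agent/ambari-agent.log"
MARKERS = ("Registration response received",
           "Registration failed",
           "Re-running the registration",
           "HeartbeatThread")

with open(LOG) as f:
    for line in f:
        if any(marker in line for marker in MARKERS):
            print(line.rstrip())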
Created 08-09-2018 01:48 PM
Hi @Akhil S Naik ,
Thanks, I have run ambari-agent restart multiple times. Here is what happens:
It would appear ambari-agent is un-killable/un-stoppable. There must be a supervisor forcing a restart on failure, which causes multiple instances to run at once and leads to the ping port 8670 problem. I think I can fix that by killing the second process, but any idea what is causing the HeartbeatThread exception?
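A quick sanity check for the duplicate-agent theory; the pgrep pattern is an assumption, and 8670 is the ping port mentioned above:

# List ambari-agent-looking processes and test whether the ping port answers.
import socket
import subprocess

pids = subprocess.run(["pgrep", "-f", "ambari_agent"],
                      capture_output=True, text=True).stdout.split()
print("processes matching 'ambari_agent':", pids or "none")

sock = socket.socket()
port_in_use = sock.connect_ex(("127.0.0.1", 8670)) == 0   # 0 means something is listening
sock.close()
print("ping port 8670 in use:", port_in_use)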
Here's one loop of the log after I've cleared up the ping port issue: ambari-agent.log.
Update: Looking at the ambari-server and agent logs, it appears the agent is registering with the server (the server logs show the registration), but then there is a HeartbeatThread exception on the agent and a NullPointerException on the server, and both seem to loop without completing what they started. Here's the repeating error on ambari-server again.
2018-08-09 15:07:11,984 INFO [agent-register-processor-2] HeartBeatHandler:312 - agentOsType = centos7
2018-08-09 15:07:11,991 INFO [agent-register-processor-2] HostImpl:345 - Received host registration, host=[hostname=cs-eu-dhp-3-0-0-cluster-m2,fqdn=cs-eu-dhp-3-0-0-cluster-m2.XXX-NODE-FQDN-XXX,domain=XXX-NODE-FQDN-XXX,architecture=x86_64,processorcount=4,physicalprocessorcount=4,osname=centos,osversion=7.5.1804,osfamily=redhat,memory=28801164,uptime_hours=3,mounts=(available=22377356,mountpoint=/,used=8551792,percent=28%,size=30929148,device=/dev/sda2,type=xfs)(available=195727412,mountpoint=/mnt/resource,used=61468,percent=1%,size=206290920,device=/dev/sdb1,type=ext4)(available=97673124,mountpoint=/hadoopfs/fs1,used=148500,percent=1%,size=103080888,device=/dev/sdc,type=ext4)] , registrationTime=1533827231984, agentVersion=2.7.0.0
2018-08-09 15:07:11,991 INFO [agent-register-processor-2] TopologyManager:643 - TopologyManager.onHostRegistered: Entering
2018-08-09 15:07:11,991 INFO [agent-register-processor-2] TopologyManager:697 - Host cs-eu-dhp-3-0-0-cluster-m2.XXX-NODE-FQDN-XXX re-registered, will not be added to the available hosts list
2018-08-09 15:07:11,991 INFO [agent-register-processor-2] HeartbeatController:105 - java.lang.NullPointerException
    at org.apache.ambari.server.state.host.HostImpl.calculateHostStatus(HostImpl.java:1259)
    at org.apache.ambari.server.state.host.HostImpl.restoreComponentsStatuses(HostImpl.java:1230)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:365)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:333)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:351)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:293)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:437)
    at org.apache.ambari.server.state.host.HostImpl.handleEvent(HostImpl.java:597)
    at org.apache.ambari.server.agent.HeartBeatHandler.handleRegistration(HeartBeatHandler.java:345)
    at org.apache.ambari.server.agent.stomp.HeartbeatController.lambda$register$0(HeartbeatController.java:100)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Created 08-09-2018 03:38 PM
Hi @Paul Norris,
I suspect you are hitting https://issues.apache.org/jira/browse/AMBARI-23838.
Can you confirm your Ambari version using the command:
ambari-server --version
Created 08-09-2018 03:49 PM
Yes, that looks to be the problem I am seeing.
My version numbers are:
Ambari Server: 2.7.0.0-508
Ambari Agent: 2.7.0.0.
Paul
Created 08-09-2018 03:59 PM
Hi @Paul Norris ,
I just checked my ambari version (shipped with HDP-3.0)
[root@anaik1 ~]# ambari-server --version
2.7.0.0-897
You might need to upgrade your Ambari to this version; refer to: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.0.0/bk_ambari-upgrade/content/ambari_upgrade_gu...
I verified that it ships with 2.7.0.0-897, which has the fix.
You will get the repo info from that page; as far as I can see from the code, the issue is fixed in that version.
Workaround: reading the code, I understand that uninstalling DRUID will help.
Code reference: https://github.com/kasakrisz/ambari/blob/f55e7277fb2f78e02f6df8a68c063206862ef3a6/ambari-server/src/...
(It may be that Druid does not belong to any category.)
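To make that suspicion concrete: the failing method walks the components reported for the host and branches on each component's category to work out the host health status. Below is a rough Python paraphrase of that pattern (the real code is Java, see the link above, so treat this as an illustration rather than the actual implementation); a component that comes back with no category fails at exactly the spot the stack trace points at:

# Illustrative paraphrase only, not HostImpl.calculateHostStatus itself.
def calculate_host_status(components):
    masters_down = 0
    slaves_down = 0
    for comp in components:
        category = comp.get("category")        # None for a component with no category
        if category.upper() == "MASTER":       # the rough equivalent of the NullPointerException
            if comp.get("state") != "STARTED":
                masters_down += 1
        elif category.upper() == "SLAVE":
            if comp.get("state") != "STARTED":
                slaves_down += 1
    if masters_down:
        return "UNHEALTHY"
    return "ALERT" if slaves_down else "HEALTHY"

# Example: a component registered without a category, as suspected for Druid.
try:
    calculate_host_status([{"name": "DRUID_BROKER", "state": "STARTED"}])
except AttributeError as err:
    print("category was None ->", err)   # on the Java side this surfaces as the NPE above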
Please accept answer as helpful if this helps you 🙂