Ambari - error 500 - getAllRequests after upgrade

Explorer

Hi All,

I've upgraded Ambari from version 2.1.2-377 to version 2.2.1.0-161. After upgrading the server and agents, upgrading the database, and starting everything back up, I keep seeing the following error in the server logs:

08 Mar 2016 10:07:05,087  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,088  INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 3
08 Mar 2016 10:07:05,120  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,120  INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 5
08 Mar 2016 10:07:05,134  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,134  INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 8
08 Mar 2016 10:07:05,147  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,148  INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 7
08 Mar 2016 10:07:05,158  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,158  INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 6
08 Mar 2016 10:07:05,170  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,170  INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 2
08 Mar 2016 10:07:05,184  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,185  INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 1
08 Mar 2016 10:07:05,194  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,194  INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 4
08 Mar 2016 10:07:05,290  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-21.node.example
08 Mar 2016 10:07:05,328  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-51.node.example
08 Mar 2016 10:07:05,384  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-11.node.example
08 Mar 2016 10:07:05,428  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-41.node.example
08 Mar 2016 10:07:05,507  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-31.node.example
08 Mar 2016 10:07:05,575  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-53.node.example
08 Mar 2016 10:07:05,627  INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-52.node.example
08 Mar 2016 10:07:05,644  WARN [qtp-ambari-agent-55] ServletHandler:563 - /agent/v1/register/ambdevtestdc2host-group-51.node.example
java.lang.NullPointerException
        at org.apache.ambari.server.topology.PersistedStateImpl.getAllRequests(PersistedStateImpl.java:157)
        at org.apache.ambari.server.topology.TopologyManager.ensureInitialized(TopologyManager.java:131)
        at org.apache.ambari.server.topology.TopologyManager.onHostRegistered(TopologyManager.java:315)
        at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:301)
        at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:266)
        at org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:354)
        at org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:294)
        at org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
        at org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:440)
        at org.apache.ambari.server.state.host.HostImpl.handleEvent(HostImpl.java:570)
        at org.apache.ambari.server.agent.HeartBeatHandler.handleRegistration(HeartBeatHandler.java:966)
        at org.apache.ambari.server.agent.rest.AgentResource.register(AgentResource.java:95)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
        at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
        at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
        at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
        at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
        at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
        at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
        at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
        at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:540)
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:715)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496)
        at org.apache.ambari.server.security.SecurityFilter.doFilter(SecurityFilter.java:67)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
        at org.apache.ambari.server.api.AmbariPersistFilter.doFilter(AmbariPersistFilter.java:47)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
        at org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:82)
        at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:294)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:370)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
        at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SslConnection.handle(SslConnection.java:196)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Thread.java:745)


This is not specific to host group ambdevtestdc2host-group-51.node.example; it is happening for all host groups. On the agents I see the following:

<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 500 Server Error</title>
</head>
<body>
<h2>HTTP ERROR: 500</h2>
<p>Problem accessing /agent/v1/register/ambdevtestdc2host-group-51.node.example Reason:
<pre>    Server Error</pre></p>
<hr /><i><small>Powered by Jetty://</small></i>


Is there a workaround for this? It's just a test cluster, but it would be good to know how to work around this, as I've seen it a number of times now. Is there anything that can be modified in the database to resolve it?

Thanks!


10 REPLIES

Super Collaborator

That's very odd, especially since the upgrade doesn't touch the topology tables. Are you using MySQL by any chance? If so, please check that your database engine is InnoDB and not MyISAM. You have an integrity violation here, which doesn't seem possible unless you're using a database engine that doesn't support foreign key constraints.
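If it is MySQL, a query along these lines will list any Ambari tables not using InnoDB (the schema name 'ambari' is an assumption; adjust it to your setup):

-- List Ambari tables whose storage engine is not InnoDB (MySQL only)
SELECT table_name, engine
FROM information_schema.tables
WHERE table_schema = 'ambari'
  AND engine <> 'InnoDB';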

Explorer

Hi Jonathan,

Many thanks for getting back to me. I am using Postgres, just the default install that comes with the ambari-server setup.

Just some more information: I am using a blueprint to set up the cluster. I have also destroyed a number of servers in the cluster and re-created them, for testing purposes, to make sure we could recover from node failure, reinstalling the components using the DELETE/POST methods via the API. This all seemed to work fine; everything was nice and green prior to running through the upgrade.

I'm happy to run commands on the database to pull back any info you need if it can help diagnose the state the db is in.

Thanks!


@CS User

Something seems to be off for node ambdevtestdc2host-group-51.node.example. Please check the hosts / hoststate and other host-related tables for this node and look for discrepancies. It is possible one of the prior API calls to delete/add nodes messed up the db.
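For example, a check along these lines (a sketch only, assuming the standard hosts / hoststate tables and their host_id, host_name and current_state columns) would show duplicate or orphaned rows for that node:

-- More than one row, or a NULL current_state, would be suspicious
SELECT h.host_id, h.host_name, hs.current_state
FROM hosts h
LEFT JOIN hoststate hs ON hs.host_id = h.host_id
WHERE h.host_name = 'ambdevtestdc2host-group-51.node.example';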

Super Collaborator

I don't think this is the result of deleting or creating hosts via the API. What's odd is that the topology manager seems to want to create new work for a request that is already completed. It's able to do this successfully from other threads, but fails on a particular host.

Can you provide the results of the following database queries?

SELECT * FROM topology_request

SELECT * FROM topology_logical_request

SELECT * FROM topology_logical_task, host_role_command WHERE topology_logical_task.physical_task_id = host_role_command.task_id AND host_role_command.status != 'COMPLETED'

Explorer

Hi, many thanks for the help once more. As mentioned, this doesn't appear to be limited to node 51; there are errors in the logs for all the nodes. Here are the results of the queries:

ambari=> SELECT * FROM topology_request;
 id |  action   | cluster_id |  bp_name   | cluster_properties | cluster_attributes |              description
----+-----------+------------+------------+--------------------+--------------------+---------------------------------------
  1 | PROVISION |          2 | testcluster | {}                 | {}                 | Provision Cluster 'testcluster'
  2 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  3 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  4 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  5 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  6 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  7 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  8 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  9 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
 10 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
(10 rows)

ambari=> SELECT * FROM topology_logical_request;
 id | request_id |                      description
----+------------+--------------------------------------------------------
  1 |          1 | Logical Request: Provision Cluster 'testcluster'
  4 |          2 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
  5 |          3 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
  6 |          4 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
  7 |          5 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
  8 |          6 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
 19 |          7 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
 20 |          8 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
(8 rows)


ambari=> SELECT * FROM topology_logical_task, host_role_command WHERE topology_logical_task.physical_task_id = host_role_command.task_id AND host_role_command.status != 'COMPLETED';
 id | host_task_id | physical_task_id | component | task_id | attempt_count | retry_allowed | event | exitcode | host_id | last_attempt_time | request_id | role | stage_id | start_time | end_time | status | auto_skip_on_failure | std_error | std_out | output_log | error_log | structured_out | role_command | command_detail | custom_command_name
----+--------------+------------------+-----------+---------+---------------+---------------+-------+----------+---------+-------------------+------------+------+----------+------------+----------+--------+----------------------+-----------+---------+------------+-----------+----------------+--------------+----------------+---------------------
(0 rows)


I checked, and these tables look pretty much exactly the same as in a second cluster we have, which is working perfectly fine. To test, I stopped the management server in the working cluster and restarted all the agents. All still seems fine.

Thanks!

Explorer

A few differences between the DBs on the working and non-working clusters:

Broken:

ambari=> select * from requestoperationlevel;
 operation_level_id | request_id | level_name | cluster_name | service_name | host_component_name | host_id
--------------------+------------+------------+--------------+--------------+---------------------+---------
                  2 |         38 | Host       | testcluster   |              |                     |
                  3 |         49 | Host       | testcluster   |              |                     |
                  4 |         51 | Service    | testcluster   | HDFS         |                     |
                  5 |         52 | Service    | testcluster   | MAPREDUCE2   |                     |
                  6 |         63 | Service    | testcluster   | HDFS         |                     |
(5 rows)

Working:

ambari=> select * from requestoperationlevel;
 operation_level_id | request_id | level_name | cluster_name | service_name | host_component_name | host_id
--------------------+------------+------------+--------------+--------------+---------------------+---------
(0 rows)


Also, in the broken cluster the requestresourcefilter table has a bunch of rows in it, but it's empty in the working db.

Thanks!

Explorer

More info from digging a little deeper: it looks like it has scheduled a restart of everything. I might try deleting everything from these two tables to see if it will start correctly (rough sketch after the query output below).

ambari=> select request_context,encode(a.hosts,'escape') from requestresourcefilter a,request b where a.request_id = b.request_id AND a.request_id IN (select request_id from requestoperationlevel);
                          request_context                          |                                                                                                                                                    encode


-------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example | ambdevtestdc2host-group-11.node.example
 Restart all components for HDFS                                   | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example,ambdevtestdc2host-group-41.node.dc2.consul,ambdevtestdc2host-group-51.node.example,ambdevtestdc2host-group-52.node.example,ambdevtestdc2host-group-53.node.example
 Restart all components for HDFS                                   | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example
 Restart all components for HDFS                                   | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example
 Restart all components for HDFS                                   | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example,ambdevtestdc2host-group-41.node.dc2.consul,ambdevtestdc2host-group-51.node.example,ambdevtestdc2host-group-52.node.example,ambdevtestdc2host-group-53.node.example
 Restart all components for HDFS                                   | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example
 Restart all components for MAPREDUCE2                             | ambdevtestdc2host-group-21.node.example
 Restart all components for MAPREDUCE2                             | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example,ambdevtestdc2host-group-41.node.dc2.consul,ambdevtestdc2host-group-51.node.example,ambdevtestdc2host-group-52.node.example,ambdevtestdc2host-group-53.node.example
 Restart all components for HDFS                                   | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example,ambdevtestdc2host-group-41.node.dc2.consul,ambdevtestdc2host-group-51.node.example,ambdevtestdc2host-group-52.node.example,ambdevtestdc2host-group-53.node.example
 Restart all components for HDFS                                   | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example
 Restart all components for HDFS                                   | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example
 Restart all components for HDFS                                   | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example
 Restart all components for HDFS                                   | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example,ambdevtestdc2host-group-41.node.dc2.consul,ambdevtestdc2host-group-51.node.example,ambdevtestdc2host-group-52.node.example,ambdevtestdc2host-group-53.node.example
(24 rows)



If you copy and paste the above into a text editor it will look a bit prettier 🙂
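Roughly what I have in mind is the following (a sketch only; I'd take a database backup first and wrap it in a transaction so it can be rolled back):

BEGIN;
-- clear the stale "Restart all ..." requests shown above
DELETE FROM requestresourcefilter;
DELETE FROM requestoperationlevel;
-- verify the deleted row counts look sane, then COMMIT (or ROLLBACK to back out)
COMMIT;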

Explorer

So I was able to get past this error by removing rows 9 and 10 from the table below. It appears that when two hosts I had deleted came back (in effect totally new hosts, but with the same hostnames), a number of duplicate rows were created in the various topology tables. I deleted the duplicates from several of these tables, but deleting the final two rows below is what fixed it for me. I don't have a copy of how those tables looked, but some of them contained duplicate rows with the node names I had deleted and restored listed twice. Perhaps someone can shed some light on what may have caused this?

Just to clarify: I have 7 hosts, so this table should contain 8 rows, one for the cluster and the remaining seven for the hosts. When things were failing it contained 10 rows.

ambari=> select * from topology_request;
 id |  action   | cluster_id |  bp_name   | cluster_properties | cluster_attributes |              description
----+-----------+------------+------------+--------------------+--------------------+---------------------------------------
  1 | PROVISION |          2 | testcluster | {}                 | {}                 | Provision Cluster 'testcluster'
  2 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  3 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  4 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  5 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  6 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  7 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  8 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  9 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
 10 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
(10 rows)
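For anyone hitting the same thing, the cleanup was roughly along these lines (reconstructed from memory, so treat it as a sketch; the topology_* table names are the standard Ambari ones, dependent rows have to go before the topology_request rows, and back up the database first):

BEGIN;
-- dependents of the two orphaned SCALE requests (ids 9 and 10); some of these may match nothing
-- (if any topology_host_task / topology_logical_task rows reference these host requests, delete those first)
DELETE FROM topology_host_info
  WHERE group_id IN (SELECT id FROM topology_hostgroup WHERE request_id IN (9, 10));
DELETE FROM topology_host_request
  WHERE logical_request_id IN (SELECT id FROM topology_logical_request WHERE request_id IN (9, 10));
DELETE FROM topology_hostgroup WHERE request_id IN (9, 10);
DELETE FROM topology_logical_request WHERE request_id IN (9, 10);
-- finally the duplicate request rows themselves
DELETE FROM topology_request WHERE id IN (9, 10);
COMMIT;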

Contributor

I had a similar issue where the Ambari server got stuck in a weird state. It was technically running, but it could not collect any stats from the agents, and in turn the UI showed the nodes as not running. I spent a couple of days looking for a solution. Then, based on the suggestion by @CS User above, I took a leap of faith and deleted all requests and the corresponding data in the Ambari schema. Upon restarting ambari-server, everything came back to normal. Thank you for the tip.

Error in ambari-server.log:

--------

07 Nov 2016 19:09:56,536 ERROR [qtp-ambari-agent-253] ContainerResponse:419 - The RuntimeException could not be mapped to a response, re-throwing to the HTTP container
java.lang.NullPointerException
        at java.lang.String.replace(String.java:2240)
        at org.apache.ambari.server.topology.HostRequest.getLogicalTasks(HostRequest.java:303)
        at org.apache.ambari.server.topology.LogicalRequest.getCommands(LogicalRequest.java:158)
        at org.apache.ambari.server.topology.LogicalRequest.getRequestStatus(LogicalRequest.java:231)
        at org.apache.ambari.server.topology.TopologyManager.isLogicalRequestFinished(TopologyManager.java:812)
        at org.apache.ambari.server.topology.TopologyManager.replayRequests(TopologyManager.java:766)
        at org.apache.ambari.server.topology.TopologyManager.ensureInitialized(TopologyManager.java:150)
        at org.apache.ambari.server.topology.TopologyManager.onHostRegistered(TopologyManager.java:407)
        at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:313)
        at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:275)
        at org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:354)
        at org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:294)
        at org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
        at org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:440)
        at org.apache.ambari.server.state.host.HostImpl.handleEvent(HostImpl.java:584)
        at org.apache.ambari.server.agent.HeartBeatHandler.handleRegistration(HeartBeatHandler.java:464)
        at org.apache.ambari.server.agent.rest.AgentResource.register(AgentResource.java:95)
        at sun.reflect.GeneratedMethodAccessor188.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
        at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
        at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
        at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
        at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
        at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
        at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
        at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
        at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1507)
        at org.apache.ambari.server.security.SecurityFilter.doFilter(SecurityFilter.java:67)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)
        at org.apache.ambari.server.api.AmbariPersistFilter.doFilter(AmbariPersistFilter.java:47)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)
        at org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:82)
        at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:294)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:427)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:370)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
        at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:984)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1045)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:236)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SslConnection.handle(SslConnection.java:196)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Thread.java:745)

--------

Error on ambari-agent:

-----------

Unable to connect to: https://<ambari-server-fqdn>:8441/agent/v1/register/<ambari-agent-fqdn>;
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/Controller.py", line 165, in registerWithServer
    ret = self.sendRequest(self.registerUrl, data)
  File "/usr/lib/python2.6/site-packages/ambari_agent/Controller.py", line 499, in sendRequest
    + '; Response: ' + str(response))

<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 500 Server Error</title>
</head>
<body>
<h2>HTTP ERROR: 500</h2>
<p>Problem accessing /agent/v1/register/<ambari-agent-fqdn>. Reason:
<pre>    Server Error</pre></p>
<hr /><i><small>Powered by Jetty:// 8.1.19.v20160209</small></i>

-----------

Solution:

Here are the (PostgreSQL) queries that I used. You have to run these as the "ambari" user on the Ambari db.

HDP version: 2.5

Ambari version: 2.4.0.1

Caution: since you are touching the Ambari db directly, you are on your own.

Note: I had to write individual queries to delete records from the dependent tables first, because the CASCADE DELETE option was not turned on for them.

  • delete from execution_command where task_id in (select task_id from host_role_command where stage_id in (select stage_id from stage where request_id in (select request_id from request)));
  • delete from topology_logical_task where physical_task_id in (select task_id from host_role_command where stage_id in (select stage_id from stage where request_id in (select request_id from request)));
  • delete from host_role_command where stage_id in (select stage_id from stage where request_id in (select request_id from request));
  • delete from stage where request_id in (select request_id from request);
  • delete from requestresourcefilter where request_id in (select request_id from request);
  • delete from requestoperationlevel where request_id in (select request_id from request);
  • delete from request;
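After running these (and before restarting ambari-server), a quick sanity check along these lines confirms the request-related tables are actually empty:

-- all four counts should be 0; then restart with: ambari-server restart
SELECT (SELECT count(*) FROM request)           AS requests,
       (SELECT count(*) FROM stage)             AS stages,
       (SELECT count(*) FROM host_role_command) AS host_role_commands,
       (SELECT count(*) FROM execution_command) AS execution_commands;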