Created 03-08-2016 02:37 PM
Hi All,
I've upgraded Ambari from version 2.1.2-377 to version 2.2.1.0-161. After performing the upgrade on the server and agents, upgrading the database, and starting everything back up, I keep seeing the following error in the server logs:
08 Mar 2016 10:07:05,087 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,088 INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 3
08 Mar 2016 10:07:05,120 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,120 INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 5
08 Mar 2016 10:07:05,134 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,134 INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 8
08 Mar 2016 10:07:05,147 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,148 INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 7
08 Mar 2016 10:07:05,158 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,158 INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 6
08 Mar 2016 10:07:05,170 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,170 INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 2
08 Mar 2016 10:07:05,184 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,185 INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 1
08 Mar 2016 10:07:05,194 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: Host Assignment Pending
08 Mar 2016 10:07:05,194 INFO [qtp-ambari-agent-55] LogicalRequest:420 - LogicalRequest.createHostRequests: created new outstanding host request ID = 4
08 Mar 2016 10:07:05,290 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-21.node.example
08 Mar 2016 10:07:05,328 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-51.node.example
08 Mar 2016 10:07:05,384 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-11.node.example
08 Mar 2016 10:07:05,428 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-41.node.example
08 Mar 2016 10:07:05,507 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-31.node.example
08 Mar 2016 10:07:05,575 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-53.node.example
08 Mar 2016 10:07:05,627 INFO [qtp-ambari-agent-55] HostRequest:125 - HostRequest: Successfully recovered host request for host: ambdevtestdc2host-group-52.node.example
08 Mar 2016 10:07:05,644 WARN [qtp-ambari-agent-55] ServletHandler:563 - /agent/v1/register/ambdevtestdc2host-group-51.node.example
java.lang.NullPointerException
    at org.apache.ambari.server.topology.PersistedStateImpl.getAllRequests(PersistedStateImpl.java:157)
    at org.apache.ambari.server.topology.TopologyManager.ensureInitialized(TopologyManager.java:131)
    at org.apache.ambari.server.topology.TopologyManager.onHostRegistered(TopologyManager.java:315)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:301)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:266)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:354)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:294)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:440)
    at org.apache.ambari.server.state.host.HostImpl.handleEvent(HostImpl.java:570)
    at org.apache.ambari.server.agent.HeartBeatHandler.handleRegistration(HeartBeatHandler.java:966)
    at org.apache.ambari.server.agent.rest.AgentResource.register(AgentResource.java:95)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:540)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:715)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496)
    at org.apache.ambari.server.security.SecurityFilter.doFilter(SecurityFilter.java:67)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
    at org.apache.ambari.server.api.AmbariPersistFilter.doFilter(AmbariPersistFilter.java:47)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
    at org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:82)
    at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:294)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SslConnection.handle(SslConnection.java:196)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
This is not specific to host group ambdevtestdc2host-group-51.node.example; it is happening for all host groups. On the agents I see the following:
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 500 Server Error</title>
</head>
<body>
<h2>HTTP ERROR: 500</h2>
<p>Problem accessing /agent/v1/register/ambdevtestdc2host-group-51.node.example Reason:
<pre>    Server Error</pre></p>
<hr /><i><small>Powered by Jetty://</small></i>
Is there a workaround for this? It's just a test cluster, but it would be good to know how to recover, as I've seen this a number of times now. Is there anything that can be modified in the database to resolve it?
Thanks!
Created 03-08-2016 02:58 PM
That's very odd, especially since the upgrade doesn't touch the topology tables. Are you using MySQL by any chance? If so, can you check to make sure that your database engine is InnoDB and not MyISAM. You have an integrity violation here, which doesn't seem possible unless you're using a database engine that doesn't support foreign key constraints.
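For what it's worth, a quick way to check that on MySQL is a standard information_schema query (this assumes the Ambari schema is named 'ambari'; adjust if yours differs):

-- List any Ambari tables not using InnoDB; such tables cannot
-- enforce foreign key constraints.
SELECT table_name, engine
FROM information_schema.tables
WHERE table_schema = 'ambari'
  AND engine <> 'InnoDB';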
Created 03-08-2016 03:28 PM
Hi Jonathan,
Many thanks for getting back to me. I am using Postgres, just the default install that comes with the ambari-server setup.
Just some more information: I am using a blueprint to set up the cluster. I have also destroyed a number of servers in the cluster and re-created them, for testing purposes, to make sure we could recover from node failure, reinstalling the components using the DELETE/POST method via the API. This all seemed to work fine, and everything was nice and green prior to running through the upgrade.
I'm happy to run commands on the database to pull back any info you need if it can help diagnose the state the db is in.
Thanks!
Created 03-08-2016 04:00 PM
Something seems to be off for node ambdevtestdc2host-group-51.node.example. Please check the hosts / hoststate and other host tables for this node and look for discrepancies. It is possible one of the prior API calls to delete / add nodes might have messed up the db.
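For example, something along these lines (a sketch only; the column names are taken from the Ambari 2.2-era schema, so verify them against your DB):

-- Look for duplicate or inconsistent rows for the problem node.
SELECT h.host_id, h.host_name, hs.current_state, hs.health_status
FROM hosts h
LEFT JOIN hoststate hs ON hs.host_id = h.host_id
WHERE h.host_name = 'ambdevtestdc2host-group-51.node.example';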
Created 03-08-2016 05:32 PM
I don't think this is the result of deleting or creating hosts via the API. What's odd is that the topology manager seems to want to create new work for a request which is already completed. And it's able to do this successfully from other threads, but fails on a particular host.
Can you provide the results of the following database queries:
SELECT * FROM topology_request;

SELECT * FROM topology_logical_request;

SELECT * FROM topology_logical_task, host_role_command
WHERE topology_logical_task.physical_task_id = host_role_command.task_id
AND host_role_command.status != 'COMPLETED';
Created 03-09-2016 07:42 AM
Hi, many thanks for the help once more. As mentioned, this doesn't appear to be limited to node 51; there are errors in the logs for all the nodes. Here are the results of the queries:
ambari=> SELECT * FROM topology_request;
 id |  action   | cluster_id |   bp_name   | cluster_properties | cluster_attributes |              description
----+-----------+------------+-------------+--------------------+--------------------+----------------------------------------
  1 | PROVISION |          2 | testcluster | {}                 | {}                 | Provision Cluster 'testcluster'
  2 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  3 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  4 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  5 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  6 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  7 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  8 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  9 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
 10 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
(10 rows)
ambari=> SELECT * FROM topology_logical_request;
 id | request_id |                       description
----+------------+----------------------------------------------------------
  1 |          1 | Logical Request: Provision Cluster 'testcluster'
  4 |          2 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
  5 |          3 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
  6 |          4 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
  7 |          5 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
  8 |          6 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
 19 |          7 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
 20 |          8 | Logical Request: Scale Cluster 'testcluster' (+1 hosts)
(8 rows)
ambari=> SELECT * FROM topology_logical_task, host_role_command
         WHERE topology_logical_task.physical_task_id = host_role_command.task_id
         AND host_role_command.status != 'COMPLETED';
 id | host_task_id | physical_task_id | component | task_id | attempt_count | retry_allowed | event | exitcode | host_id | last_attempt_time | request_id | role | stage_id | start_time | end_time | status | auto_skip_on_failure | std_error | std_out | output_log | error_log | structured_out | role_command | command_detail | custom_command_name
----+--------------+------------------+-----------+---------+---------------+---------------+-------+----------+---------+-------------------+------------+------+----------+------------+----------+--------+----------------------+-----------+---------+------------+-----------+----------------+--------------+----------------+---------------------
(0 rows)
I checked, and these tables look pretty much exactly the same as in a second cluster we have, which is working perfectly fine. To test, I stopped the management server in the working cluster and restarted all the agents. All still seems fine there.
Thanks!
Created 03-09-2016 10:04 AM
A few differences between the DBs on the working and non-working clusters.
Broken:
ambari=> select * from requestoperationlevel;
 operation_level_id | request_id | level_name | cluster_name | service_name | host_component_name | host_id
--------------------+------------+------------+--------------+--------------+---------------------+---------
                  2 |         38 | Host       | testcluster  |              |                     |
                  3 |         49 | Host       | testcluster  |              |                     |
                  4 |         51 | Service    | testcluster  | HDFS         |                     |
                  5 |         52 | Service    | testcluster  | MAPREDUCE2   |                     |
                  6 |         63 | Service    | testcluster  | HDFS         |                     |
(5 rows)
Working:
ambari=> select * from requestoperationlevel;
 operation_level_id | request_id | level_name | cluster_name | service_name | host_component_name | host_id
--------------------+------------+------------+--------------+--------------+---------------------+---------
(0 rows)
Also, in the broken cluster the requestresourcefilter table has a bunch of rows in it, but it's empty on the working DB.
Thanks!
Created 03-09-2016 12:05 PM
More info from digging a little deeper: it looks like it has scheduled a restart of everything. I might try deleting everything from these two tables to see if it will start correctly.
ambari=> select request_context, encode(a.hosts,'escape')
         from requestresourcefilter a, request b
         where a.request_id = b.request_id
         and a.request_id in (select request_id from requestoperationlevel);
                         request_context                          | encode
------------------------------------------------------------------+--------------------------------------------------------------
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all clients on ambdevtestdc2host-group-11.node.example   | ambdevtestdc2host-group-11.node.example
 Restart all components for HDFS                                  | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example,ambdevtestdc2host-group-41.node.dc2.consul,ambdevtestdc2host-group-51.node.example,ambdevtestdc2host-group-52.node.example,ambdevtestdc2host-group-53.node.example
 Restart all components for HDFS                                  | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example
 Restart all components for HDFS                                  | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example
 Restart all components for HDFS                                  | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example,ambdevtestdc2host-group-41.node.dc2.consul,ambdevtestdc2host-group-51.node.example,ambdevtestdc2host-group-52.node.example,ambdevtestdc2host-group-53.node.example
 Restart all components for HDFS                                  | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example
 Restart all components for MAPREDUCE2                            | ambdevtestdc2host-group-21.node.example
 Restart all components for MAPREDUCE2                            | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example,ambdevtestdc2host-group-41.node.dc2.consul,ambdevtestdc2host-group-51.node.example,ambdevtestdc2host-group-52.node.example,ambdevtestdc2host-group-53.node.example
 Restart all components for HDFS                                  | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example,ambdevtestdc2host-group-41.node.dc2.consul,ambdevtestdc2host-group-51.node.example,ambdevtestdc2host-group-52.node.example,ambdevtestdc2host-group-53.node.example
 Restart all components for HDFS                                  | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example
 Restart all components for HDFS                                  | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example
 Restart all components for HDFS                                  | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example
 Restart all components for HDFS                                  | ambdevtestdc2host-group-11.node.example,ambdevtestdc2host-group-21.node.example,ambdevtestdc2host-group-31.node.example,ambdevtestdc2host-group-41.node.dc2.consul,ambdevtestdc2host-group-51.node.example,ambdevtestdc2host-group-52.node.example,ambdevtestdc2host-group-53.node.example
(24 rows)
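For reference, deleting everything from those two tables would just be the following (back up the DB first; this assumes every row in both tables belongs to these stale restart operations, as the output above suggests):

-- Clear the scheduled restart operations out of both tables.
-- Both tables are children of request, so dropping their rows
-- should not orphan anything else.
DELETE FROM requestresourcefilter;
DELETE FROM requestoperationlevel;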
Created 03-09-2016 07:08 PM
So I was able to get past this error by removing rows 9 and 10 from the table below. It appears that when two hosts I had deleted came back (in effect totally new hosts, but with the same hostnames), a number of duplicate rows were created in the various topology tables. I deleted the duplicates from a number of these tables, but deleting the final two rows below is what fixed it for me. I don't have a copy of how the other tables looked, but some of them contained duplicate rows with the node names I had deleted and restored listed twice. Perhaps someone can shed some light on what may have caused this?
Just to clarify: I have 7 hosts, so this table should contain 8 rows, 1 for the cluster provision and the remaining 7 for the hosts added by scaling. When things were failing it contained 10 rows.
ambari=> select * from topology_request;
 id |  action   | cluster_id |   bp_name   | cluster_properties | cluster_attributes |              description
----+-----------+------------+-------------+--------------------+--------------------+----------------------------------------
  1 | PROVISION |          2 | testcluster | {}                 | {}                 | Provision Cluster 'testcluster'
  2 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  3 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  4 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  5 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  6 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  7 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  8 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
  9 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
 10 | SCALE     |          2 | testcluster | {}                 | {}                 | Scale Cluster 'testcluster' (+1 hosts)
(10 rows)
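Roughly, the cleanup looked like this (reconstructed after the fact rather than copied from my history, so treat it as a sketch; table and column names follow the Ambari 2.2-era topology DDL, so verify them against your own schema and take a backup first):

-- Remove the two orphaned SCALE requests (ids 9 and 10), child tables
-- first so the foreign keys back to topology_request are not violated.
DELETE FROM topology_logical_task
WHERE host_task_id IN (
  SELECT id FROM topology_host_task
  WHERE host_request_id IN (
    SELECT id FROM topology_host_request
    WHERE logical_request_id IN (
      SELECT id FROM topology_logical_request WHERE request_id IN (9, 10))));
DELETE FROM topology_host_task
WHERE host_request_id IN (
  SELECT id FROM topology_host_request
  WHERE logical_request_id IN (
    SELECT id FROM topology_logical_request WHERE request_id IN (9, 10)));
DELETE FROM topology_host_request
WHERE logical_request_id IN (
  SELECT id FROM topology_logical_request WHERE request_id IN (9, 10));
DELETE FROM topology_logical_request WHERE request_id IN (9, 10);
DELETE FROM topology_host_info
WHERE group_id IN (SELECT id FROM topology_hostgroup WHERE request_id IN (9, 10));
DELETE FROM topology_hostgroup WHERE request_id IN (9, 10);
DELETE FROM topology_request WHERE id IN (9, 10);

Note that requests 9 and 10 have no rows in topology_logical_request above, so some of these deletes are no-ops; they are included for completeness.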
Created 11-08-2016 12:46 AM
I had a similar issue where the Ambari server got stuck in a weird state. It was technically running, but could not collect any stats from the agents, and in turn the UI showed that the nodes were not running. I spent a couple of days looking for a solution. Then, based on the suggestion by @CS User above, I took a leap of faith and deleted all requests and corresponding data in the ambari schema. Upon restarting ambari-server, everything came back to normal. Thank you for the tip.
Error in ambari-server.log:
--------
07 Nov 2016 19:09:56,536 ERROR [qtp-ambari-agent-253] ContainerResponse:419 - The RuntimeException could not be mapped to a response, re-throwing to the HTTP container
java.lang.NullPointerException
    at java.lang.String.replace(String.java:2240)
    at org.apache.ambari.server.topology.HostRequest.getLogicalTasks(HostRequest.java:303)
    at org.apache.ambari.server.topology.LogicalRequest.getCommands(LogicalRequest.java:158)
    at org.apache.ambari.server.topology.LogicalRequest.getRequestStatus(LogicalRequest.java:231)
    at org.apache.ambari.server.topology.TopologyManager.isLogicalRequestFinished(TopologyManager.java:812)
    at org.apache.ambari.server.topology.TopologyManager.replayRequests(TopologyManager.java:766)
    at org.apache.ambari.server.topology.TopologyManager.ensureInitialized(TopologyManager.java:150)
    at org.apache.ambari.server.topology.TopologyManager.onHostRegistered(TopologyManager.java:407)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:313)
    at org.apache.ambari.server.state.host.HostImpl$HostRegistrationReceived.transition(HostImpl.java:275)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:354)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:294)
    at org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
    at org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:440)
    at org.apache.ambari.server.state.host.HostImpl.handleEvent(HostImpl.java:584)
    at org.apache.ambari.server.agent.HeartBeatHandler.handleRegistration(HeartBeatHandler.java:464)
    at org.apache.ambari.server.agent.rest.AgentResource.register(AgentResource.java:95)
    at sun.reflect.GeneratedMethodAccessor188.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1507)
    at org.apache.ambari.server.security.SecurityFilter.doFilter(SecurityFilter.java:67)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)
    at org.apache.ambari.server.api.AmbariPersistFilter.doFilter(AmbariPersistFilter.java:47)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)
    at org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:82)
    at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:294)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:427)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:984)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1045)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:236)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SslConnection.handle(SslConnection.java:196)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
--------
Error in the ambari-agent log:
-----------
Unable to connect to: https://<ambari-server-fqdn>:8441/agent/v1/register/<ambari-agent-fqdn>;
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/Controller.py", line 165, in registerWithServer
    ret = self.sendRequest(self.registerUrl, data)
  File "/usr/lib/python2.6/site-packages/ambari_agent/Controller.py", line 499, in sendRequest
    + '; Response: ' + str(response))
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 500 Server Error</title>
</head>
<body>
<h2>HTTP ERROR: 500</h2>
<p>Problem accessing /agent/v1/register/<ambari-agent-fqdn>. Reason:
<pre>    Server Error</pre></p>
<hr /><i><small>Powered by Jetty:// 8.1.19.v20160209</small></i>
-----------
Solution:
Here are the (PostgreSQL) queries that I used. You have to run these as the "ambari" user on the Ambari DB.
HDP version: 2.5
Ambari version: 2.4.0.1
Caution: since you are touching the Ambari DB directly, you are on your own.
Note: I had to write individual queries to delete records from the dependent tables first, because the CASCADE DELETE option was not enabled on them.
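The queries themselves didn't survive the post, but based on the tables involved they would look something along these lines (an illustrative sketch only, not the poster's exact statements; table names are from the Ambari 2.4-era schema, so back up the DB and verify the foreign keys in your own schema before running anything like this):

-- Illustrative only: wipe all recorded requests and their dependent
-- rows, children first, so no foreign key constraint is violated.
DELETE FROM topology_logical_task;   -- physical_task_id references host_role_command.task_id
DELETE FROM execution_command;       -- references host_role_command
DELETE FROM host_role_command;       -- references stage
DELETE FROM role_success_criteria;   -- references stage
DELETE FROM stage;                   -- references request
DELETE FROM requestresourcefilter;   -- references request
DELETE FROM requestoperationlevel;   -- references request
DELETE FROM request;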