Created 12-24-2018 10:51 AM
I have HDP-126.96.36.199-258 running across 8 VMs on 2 Dell G10 servers. (Each G10 server hosts 4 VMs; services are duplicated across the 2 servers for high availability.) We have a Spark streaming job that takes data from Kafka and inserts it into HBase.
It's a requirement of the system that it be fully fault tolerant: any one of the G10 servers can be shut down (or, in prod, fail) and the system must continue running as normal. But we have encountered the following issue:
(1) If we shut down the services on the VMs but leave the VMs running (either with the kill command or via the Ambari UI), everything is fine and the remaining server continues as normal - fully fault tolerant, with no loss of data and no delay in processing new data.
(2) But if we shut down the entire VM (which causes its IP to become invalid, as it no longer exists on the network), the whole HDP stack hangs with NoRouteToHost exceptions. The Spark streaming job just hangs on the insert to HBase. I tracked down the NoRouteToHost exceptions in the HDFS logs.
One hack I tried during testing was to edit the now-invalid IP address in the /etc/hosts file and change it to a valid IP of another server. Once the current timeout occurred, the new IP was picked up and the system actually returned to normal (it didn't matter that the new IP didn't actually have any HDP services running on it).
Is there a proper solution to this issue, other than the hack I just mentioned? When a server or VM is shut down and an IP address becomes invalid, is it possible for the remaining servers to handle this NoRouteToHost exception instead of just hanging?
Hi @Pat ODowd,
What exactly do you mean by the IP address becoming invalid? Will another machine get that IP address? And does the machine get a new IP address after a restart? No route to host is how IP is supposed to respond when the host with that IP disappears from the network.
In any case, bringing the machine down will have an impact on currently running jobs, which is different from shutting down the service in the Ambari UI, where running jobs finish and new jobs are handled by other nodes. Bringing down the machine will interrupt currently active jobs, though they are supposed to recover after the TCP timeout.
The cluster control needs to become aware of the 'lost' node as well before letting other nodes handle the jobs. UDP communication is simply lost, while a TCP connection goes through several retransmissions before it is declared broken. This takes some time (depending on the settings, typically between 30 and 90 seconds). Determining that a port isn't listening on an active machine while trying to connect is much faster than determining that an established TCP connection is broken, due to the specifics of TCP.
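The worst-case time for an established TCP connection to be declared broken can be sketched with a back-of-the-envelope calculation: on Linux, each retransmission roughly doubles the retry timeout up to a cap, and net.ipv4.tcp_retries2 (default 15) bounds the number of retransmissions. The 200 ms initial RTO and 120 s cap below are illustrative assumptions, not measured values:

```shell
# Approximate worst case before TCP gives up on an established connection:
# exponential backoff from an assumed 200 ms initial RTO, capped at 120 s,
# over the default 15 retransmissions (net.ipv4.tcp_retries2).
rto_ms=200
total_ms=0
i=1
while [ "$i" -le 15 ]; do
  total_ms=$((total_ms + rto_ms))   # wait for this retransmission
  rto_ms=$((rto_ms * 2))            # exponential backoff
  [ "$rto_ms" -gt 120000 ] && rto_ms=120000   # cap at TCP's max RTO
  i=$((i + 1))
done
echo "TCP gives up after ~$((total_ms / 1000)) s (~$((total_ms / 60000)) min)"
```

Under these assumptions the connection is only abandoned after roughly 13 minutes, which is why a job holding an established connection to a vanished host can appear to hang far longer than the 30-90 second figure for new connection attempts.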
So I guess the main point is: why don't your jobs recover without intervention? How long did you wait?
Thanks for your very quick reply
When the 2 servers/8 VMS are running - everything is fine.
The system needs to be fully fault tolerant if one server fails - the testers simulate this by shutting down one of the G10 servers.
The system needs to continue as normal for an extended period while the server is down (it will not be rebooted straight away - this simulates a failure in prod, where the server might be offline for a few hours for whatever reason).
When the G10 server is shut down, the IPs of its 4 VMs no longer exist on the network - that is what I mean by an invalid IP. The HDFS services on the 4 VMs on the remaining server hang with NoRouteToHost exceptions, as they can no longer route TCP traffic to the VMs that were shut down.
I left it for a few hours, so at the moment the system does not recover from these NoRouteToHost exceptions.
When I edited the IPs in the /etc/hosts file and changed them to valid IPs (of another test server running on the network), it took between 3 and 15 minutes for the system to return to normal.
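That 3-15 minute window is in the same ballpark as the default HDFS/IPC client retry settings, which is worth knowing about when tuning how quickly clients give up on a dead node. A sketch of the commonly tuned knobs - the key names are from the Hadoop configuration docs, but the values here are purely illustrative, not recommendations:

```xml
<!-- core-site.xml / hdfs-site.xml: illustrative values only -->
<property>
  <name>ipc.client.connect.timeout</name>
  <value>10000</value>  <!-- ms per connect attempt (default 20000) -->
</property>
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>3</value>      <!-- retries on connect timeout (default 45) -->
</property>
<property>
  <name>dfs.client.socket-timeout</name>
  <value>30000</value>  <!-- ms for DFS client sockets (default 60000) -->
</property>
```

With the defaults, a client can spend many minutes cycling through connect attempts to an unreachable datanode before failing over, which would line up with the delays described above.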
So the Spark jobs have not finished during those few hours? They have all been hanging? Or did they finish, though NoRouteToHost errors were logged?
And you shut down a full physical server, which means you 'lose' 4 of the 8 VMs at the same time, not just one VM? To make sure that you don't lose data in this event (it's a 50% loss), you will need to make your cluster 'rack' aware, so that replicas are guaranteed to be created in both racks (set the rack correctly in your host overview in Ambari; don't leave it at 'default-rack'). Otherwise the default cluster config is only guaranteed to continue without data loss when at most 2 data nodes are lost, as the replication factor is 3. Losing 4 machines can leave you with data losses that the cluster can't recover; some files/data might be located entirely on the machines being shut down.
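Rack awareness is usually driven by a small script registered via net.topology.script.file.name in core-site.xml: HDFS invokes it with one or more host names/IPs and expects one rack path per host on stdout. A minimal sketch, with made-up VM host names (hdp-vm1 through hdp-vm8 are hypothetical):

```shell
# Hypothetical HDFS rack topology helper: maps each host to a rack so
# that replicas are spread across both G10 servers. Host names are
# invented for illustration.
rack_for() {
  case "$1" in
    hdp-vm[1-4]*) echo "/rack-g10-a" ;;   # VMs on the first G10 server
    hdp-vm[5-8]*) echo "/rack-g10-b" ;;   # VMs on the second G10 server
    *)            echo "/default-rack" ;; # anything unrecognised
  esac
}

# HDFS passes hosts as arguments; print one rack path per host.
for host in "$@"; do
  rack_for "$host"
done
```

With both racks defined, the default block placement policy puts at least one replica on each rack, so losing one G10 server should still leave a complete copy of the data on the other.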
Thanks again for replying.
It's a Spark streaming job, so it runs 24/7.
Also, there is no issue with data loss - before the shutdown the replication is working correctly between the 2 servers, and this has been verified.
The 4 VMs on each of the G10 servers are running different services - not the same! So, for example, there are only 2 HBase VMs, one on each G10 server - so if one server is shut down you will only lose one HBase VM, and the data is fully replicated across both HBase VMs.
Again, every service on each G10 server is also replicated on the other - it's a deliberate design that ALL services are fully replicated across both G10 servers for high availability.
It hangs on an HBase insert in the Spark streaming job and just doesn't move any further - unless I turn the shutdown VMs back on, at which point everything returns to normal and the job actually "unhangs" and continues from where it was (which is great). There is obviously then a backlog of Spark batches to process.
When the VMs are shut down and the Spark streaming job is hanging, there are no errors being logged in either the Spark logs or the HBase logs.
But I tracked the NoRouteToHostExceptions down to the HDFS logs - they are not logged just once; they keep being logged at regular intervals until I turn the shutdown VM back on.
I was trying to keep my description concise, to avoid making it sound overly complicated, but you don't have to shut down the entire G10 server to replicate the issue - if you shut down just one of the HBase VMs, the same thing happens.
I'm not actually encountering any loss of data; the only issue is that when a node in the Hadoop cluster is shut down and its IP no longer exists on the network, the remaining nodes hang indefinitely with NoRouteToHostExceptions.
So my issue can be summarised down to this:
When a node in a Hadoop (HDFS) cluster is shut down (and its IP address no longer exists) and HDFS on the remaining node hangs indefinitely with NoRouteToHostExceptions - is there a way to fix this?
(Or maybe not, and that is OK too, as at least then I can explain what the issue is and why it can't be fixed.)
Answering my own question here, as it might help others
Turns out there was a solution:
On the server that was being shut down, I needed to shut down all the services via the Ambari UI first, and only then issue the shutdown command. Once I did this, HDFS no longer hangs after the server is shut down. It's strange that failing to shut down the services cleanly on the server being shut down was what caused the remaining server to hang afterwards - but that was the case.
So I'm down to just one issue now:
The data is written correctly to HBase, but the insert to Elasticsearch is still encountering the NoRouteToHost exceptions.
Is it possible to update spark.es.nodes at run-time to remove the IP of the failed node? Or is there a way to get "saveToEs" to work even when one IP in the list is no longer valid?
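Rather than rewriting spark.es.nodes at run-time, elasticsearch-hadoop has its own settings for handling unreachable nodes: es.nodes is only a seed list, and with node discovery enabled the connector can learn the rest of the cluster and retry around a dead node. The key names below are from the elasticsearch-hadoop configuration docs; the values and host names are illustrative only:

```properties
# Illustrative elasticsearch-hadoop settings (hypothetical hosts/values)
es.nodes                   = es-vm1,es-vm2   # seed list only, not a fixed set
es.nodes.discovery         = true            # discover other nodes in the cluster
es.http.timeout            = 30s             # per-request HTTP timeout
es.http.retries            = 3               # retries per node before moving on
es.batch.write.retry.count = 3               # bulk write retries
es.batch.write.retry.wait  = 10s             # wait between bulk retries
```

If the connector is pinned to a node whose IP has vanished, shortening es.http.timeout should at least turn the indefinite-looking hang into a bounded retry-and-fail-over, similar to tuning the HDFS client timeouts.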