01-15-2019
03:54 PM
Answering my own question here, as it might help others. Turns out there was a solution: on the server that was being shut down, I needed to stop all the services via the Ambari UI first and then issue the shutdown command after that. Once I did this, HDFS no longer hung after the server was shut down. Strange that not stopping the services correctly on the server being shut down was what caused the remaining server to hang afterwards - but that was the case.

So I'm down to just one issue now: the data is written correctly to HBase, but the insert to Elasticsearch is still hitting the NoRouteToHost exceptions, on this line of code: JavaEsSpark.saveToEs. Is it possible to update spark.es.nodes at run-time to remove the IP of the failed node? Or is there a way to get "saveToEs" to work even when one IP in the list is no longer valid?
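One possibility I'm looking at: elasticsearch-hadoop accepts a per-call settings map on saveToEs that overrides the job-wide spark.es.* settings, so the node list could be rebuilt from whatever still answers before each write. This is only a sketch - the reachability check, IPs, and index name below are hypothetical and untested:

// Minimal sketch: filter the configured ES nodes down to the ones that
// still accept a TCP connection, then pass the result as a per-call
// override to saveToEs. All hosts/ports/index names are placeholders.
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.spark.api.java.JavaRDD;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class EsNodeFilter {

    // Keep only the nodes that accept a TCP connection on the given port.
    static String reachableNodes(String[] candidates, int port) {
        return Arrays.stream(candidates)
                .filter(host -> {
                    try (Socket s = new Socket()) {
                        s.connect(new InetSocketAddress(host, port), 2000); // 2s connect timeout
                        return true;
                    } catch (Exception e) {
                        return false; // dead/unroutable nodes are dropped
                    }
                })
                .collect(Collectors.joining(","));
    }

    static void writeBatch(JavaRDD<Map<String, Object>> rdd) {
        String[] allNodes = {"10.0.0.1", "10.0.0.2"}; // hypothetical IPs
        Map<String, String> cfg = new HashMap<>();
        cfg.put("es.nodes", reachableNodes(allNodes, 9200));
        JavaEsSpark.saveToEs(rdd, "myindex/mytype", cfg); // placeholder index/type
    }
}

Two caveats: the reachability check runs on the driver while the executors do the actual writes, and the connector's own es.nodes.discovery setting (on by default, as far as I know) may already route around a dead node once it can reach any live one.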
12-28-2018
03:49 AM
Thanks again for replying. It's a Spark streaming job, so it runs 24/7. Also, there is no issue with data loss - before the shutdown the replication is working correctly between the 2 servers, and this has been verified.

The 4 VMs on each of the G10 servers are running different services - not the same! So, for example, there are only 2 HBase VMs, one on each G10 server - if one server is shut down you will only lose one HBase VM, and the data is fully replicated across both HBase VMs. Again, every service on each G10 server is also replicated on the other - it's a deliberate design that ALL services are fully replicated across both G10 servers for high availability.

The Spark streaming job hangs on an HBase insert and just doesn't move any further - unless I turn the shutdown VMs back on, at which point everything returns to normal and the job actually "unhangs" and continues from where it was (which is great). There is obviously then a backlog of Spark batches to process.

When the VMs are shut down and the Spark streaming job is hanging, there are no errors being logged in either the Spark logs or the HBase logs. But I tracked the NoRouteToHostExceptions down to the HDFS logs - the exception is not logged just once, it keeps being logged at intervals until I turn the shutdown VM back on.

I was trying to keep my description concise, to avoid making it sound overly complicated, but you don't have to shut down the entire G10 server to reproduce the issue - if you just shut down one of the HBase VMs, the same thing happens. I'm not actually encountering any loss of data; the only issue is that when a node in the Hadoop cluster is shut down and its IP no longer exists on the network, the remaining nodes hang indefinitely with NoRouteToHostExceptions.

So my issue can be summarised as: when a node in a Hadoop (HDFS) cluster is shut down (and its IP address no longer exists), HDFS on the remaining node hangs indefinitely with NoRouteToHostExceptions - is there a way to fix this? (Or maybe not, and that is OK too, as at least then I can explain what the issue is and why it can't be fixed.) Thanks
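For reference, the retry loop visible in the HDFS logs appears to be governed by the Hadoop IPC client settings below. I'm showing them with their stock defaults only as a sketch of where to look (changing them is untested here); in HDP they would be edited through Ambari's core-site section:

<!-- core-site.xml (managed through Ambari); values shown are the stock defaults -->
<property>
  <name>ipc.client.connect.timeout</name>
  <value>20000</value> <!-- ms per connection attempt -->
</property>
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>10</value> <!-- retries when the connect fails outright (e.g. no route) -->
</property>
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>45</value> <!-- retries when the connect attempt times out -->
</property>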
12-24-2018
11:52 AM
Thanks for your very quick reply. When the 2 servers/8 VMs are running, everything is fine. The system needs to be fully fault tolerant if one server fails - the testers simulate this by shutting down one of the G10 servers. The system needs to continue as normal for an extended period of time while the server is shut down. (It will not be rebooted straight away - this is to simulate a failure in prod, where the server might be offline for a few hours for whatever reason.)

When the G10 server is shut down, the IPs of its 4 VMs no longer exist on the network, and that is what I mean by invalid IP - the HDFS services on the 4 VMs on the remaining server hang with NoRouteToHost exceptions, as they can no longer route TCP traffic to the VMs that were shut down. I left it for a few hours, so at the moment the system does not recover from these NoRouteToHost exceptions.

When I edited the IPs in the /etc/hosts file and changed them to valid IPs (of another test server running on the network), it took between 3 and 15 minutes for the system to return to normal.
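In case it helps anyone reproduce the workaround: the edit was simply remapping the dead VM's hostname to a live address in /etc/hosts on the surviving VMs (all names and addresses below are made up):

# /etc/hosts on each surviving VM - hypothetical entries
# Before: the VM that was shut down
#10.0.1.14    hdp-node4.example.com    hdp-node4
# After: remapped to a reachable test server, so connection attempts get
# refused quickly instead of failing with NoRouteToHost
10.0.9.50    hdp-node4.example.com    hdp-node4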
12-24-2018
10:51 AM
I have HDP-2.4.2.0-258 running across 8 VMs on 2 Dell G10 servers. (Each G10 server has 4 VMs - services are duplicated across the 2 servers for high availability.) We have a Spark streaming job that takes data from Kafka and inserts it into HBase. It's a requirement of the system that it be fully fault tolerant, so that either one of the G10 servers can be shut down (or, in prod, fail) and the system continue running as normal. But we have encountered the following issue:

(1) If we shut down the services on the VMs but leave the VMs running (either by using the kill command or by using the Ambari UI to stop them), everything is fine and the remaining server continues as normal - fully fault tolerant, with no loss of data and no delay in processing new data, etc.

(2) But if we shut down the entire VM (which causes its IP to become invalid, as it no longer exists on the network), the whole HDP stack hangs with NoRouteToHost exceptions. The Spark streaming job just hangs on the insert to HBase. I tracked down the NoRouteToHost exceptions in the HDFS logs. https://wiki.apache.org/hadoop/NoRouteToHost

One hack I did during testing was to edit the now-invalid IP address in the /etc/hosts file and change it to the valid IP of another server. Once the current timeout occurred, the new IP was picked up and the system actually returned to normal (it didn't matter that the new IP didn't actually have any HDP services running on it).

Is there a proper solution to this issue, other than the hack I just mentioned above? When a server or VM is shut down and an IP address becomes invalid, is it possible for the remaining servers to handle this NoRouteToHost exception instead of just hanging?
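One place that may be worth looking (a sketch of where to look, not a verified fix for this hang) is HDFS's stale-DataNode handling, which tells the NameNode to stop routing reads and write pipelines to a DataNode that has missed heartbeats. It is configured in hdfs-site via Ambari:

<!-- hdfs-site.xml (managed through Ambari); illustrative values -->
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value> <!-- skip stale DataNodes when serving read locations -->
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value> <!-- skip stale DataNodes when building write pipelines -->
</property>
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value> <!-- ms without a heartbeat before a DataNode is marked stale -->
</property>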
12-21-2018
10:07 AM
Thanks for the quick reply. I meant calling a script to shut down the Ambari components after the server is issued a shutdown command, but before it actually shuts down! But I found a solution to the issue anyway - I just needed to add an "ExecStop=" directive to the systemd service files, and all seems to work fine now. Thanks again for your quick reply.
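For anyone who finds this later, here is a minimal sketch of that idea. The unit name, script path, cluster name, and credentials are all hypothetical and untested; the Ambari REST call is the standard way to stop every component on a host by setting its state to INSTALLED:

# /etc/systemd/system/hdp-graceful-stop.service (hypothetical unit name)
[Unit]
Description=Stop Ambari-managed components before the host shuts down
# Ordering after network-online means our ExecStop runs while the network is still up
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
# ExecStop fires when the unit is stopped, i.e. during host shutdown
ExecStop=/usr/local/bin/stop-host-components.sh
TimeoutStopSec=600

[Install]
WantedBy=multi-user.target

# /usr/local/bin/stop-host-components.sh (hypothetical path and credentials)
#!/bin/bash
# Ask the Ambari server to stop all components on this host
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Stopping components for host shutdown"},"Body":{"HostRoles":{"state":"INSTALLED"}}}' \
  "http://ambari-server.example.com:8080/api/v1/clusters/MYCLUSTER/hosts/$(hostname -f)/host_components"

The unit has to be enabled and started while the host is up, so that systemd has something to stop (and hence an ExecStop to run) at shutdown time.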
12-20-2018
02:49 PM
Is it possible to call a script on the host directly to do this shutdown? When my server is issued a shutdown command, I want to call a script directly to shut down all the Ambari components on that host. Thanks