I have a 3-node HBase/HDFS cluster: 2 name nodes and 3 data nodes, with full replication. The cluster should be able to continue running as normal if any 1 node fails (after a failure there will be at least 1 name node and 2 data nodes remaining). I have a Spark Streaming job that inserts data into HBase in 1-minute batches. The batch execution time is always under 1 minute; when there are zero events it varies from 5 to 20 seconds.
When I simulate a failure, HBase does continue to work as normal and I can verify that replication has also worked correctly. The one issue is that the inserts into HBase from Spark Streaming slow way down (everything still works correctly, just very slowly): even with 0 events, the batch execution time goes up to around 1.5 minutes. As the batch interval is only 1 minute, you can see that a huge backlog will quickly build up. I can see from the Spark UI that the delay is with the inserts into HBase.
Is there a way to fix this? I can't work out what is slowing HBase down like this, and I can't see any errors in the HBase logs. In the name node log I can see that it keeps trying to connect to the failed node and logging messages. I set the max tries to 10, but varying this value does not seem to have any effect.
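In case it helps, here is the kind of client-side setting I mean. I am assuming `ipc.client.connect.max.retries` is the relevant "max tries" property and that `dfs.client.socket-timeout` controls how long the DFS client waits on a dead node; the values below are just what I have been trying, not a known fix:

```xml
<!-- Client-side core-site.xml / hdfs-site.xml (illustrative values only) -->
<configuration>
  <!-- The "max tries" I mentioned above: how many times the IPC client
       retries a connection to an unreachable node before giving up. -->
  <property>
    <name>ipc.client.connect.max.retries</name>
    <value>10</value>
  </property>
  <!-- Socket-level timeout (ms) for DFS client reads/writes; my assumption
       is that lowering this should make the client abandon the dead
       data node faster. -->
  <property>
    <name>dfs.client.socket-timeout</name>
    <value>10000</value>
  </property>
</configuration>
```

Changing these (and restarting the client) has not changed the ~1.5-minute batch times, so either they are the wrong knobs or something else is retrying against the failed node.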