This is a general "best practices" type of question.
We have a requirement to be able to withstand, within our code, any outages that may happen to HDFS.
Our code performs N different types of operations via the HDFS API, and any one of these N method calls may throw an exception. Presumably, what we want to do is catch any IOException and retry the operation. I assume there are a few very specific subclasses of IOException which we may want to specifically look for.
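To make the "specific subclasses" idea concrete, here is a minimal sketch of a check that decides whether a failure looks like a transient network problem. The class name TransientErrors and the choice of exception types are assumptions for illustration only. Which exceptions the HDFS client actually surfaces during an outage should be confirmed from real stack traces, since some failures arrive as a plain java.io.IOException and would not match a subclass check like this one.

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;

/** Hypothetical helper (not part of the HDFS API): guesses whether an error is transient. */
final class TransientErrors {
    private TransientErrors() {}

    /**
     * Returns true if the error, or one of its causes, is a JDK network
     * exception that typically indicates a temporary connectivity problem.
     * The list of types is an assumption and should be tuned to what is
     * actually observed in production logs.
     */
    static boolean looksTransient(Throwable error) {
        int hops = 0;
        // Walk the cause chain, since low-level network errors are often wrapped.
        // The hop limit guards against a cyclic cause chain.
        for (Throwable t = error; t != null && hops < 20; t = t.getCause(), hops++) {
            if (t instanceof ConnectException
                    || t instanceof SocketTimeoutException
                    || t instanceof NoRouteToHostException) {
                return true;
            }
        }
        return false;
    }
}
```

A catch block could call TransientErrors.looksTransient(e) and retry only when it returns true, rethrowing everything else.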
We're on CDH 5.3 (HDFS 2.5.0).
So, I have code that wraps each of the N operations which invoke HDFS API methods (isFile(), exists(), getOutputStream(), etc.) in a try/catch.
The first problem is, if I just retry the operation, I get exactly the same error. Does that imply that the HDFS client is holding on to some cached state and is not aware that HDFS may have come back online? The code retries for quite some time but is never able to recover.
The second problem is, let's say I want to refresh the client. I get a fresh instance and now it really is retrying. However, now I hit the other issue I wrote about, which is this one: http://community.cloudera.com/t5/Storage-Random-Access-HDFS/How-to-append-files-to-HDFS-with-Java-qu....
My client is now a new instance, but the files I append data to are still the same. And as far as HDFS is concerned, these files are still leased to the previous instance of the client.
Is it possible to work around issue #1 and/or issue #2?
Better yet, are there built-in ways in the HDFS API to recover, or attempt to recover, from a transient network failure? Or a way to keep retrying an operation: either X times with Y milliseconds between retries, or indefinitely with Y milliseconds between retries?
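For reference, this is roughly the hand-rolled "retry X times with Y milliseconds in between" wrapper I have in mind. It is only a sketch: HdfsRetry, IoOperation and withRetries are made-up names, not anything from the HDFS API, and it retries on every IOException without distinguishing between them.

```java
import java.io.IOException;

/** Hypothetical helper (not part of the HDFS API): fixed-delay retry around an I/O call. */
final class HdfsRetry {

    /** An operation that may fail with an IOException, e.g. a call into the HDFS client. */
    interface IoOperation<T> {
        T run() throws IOException;
    }

    private HdfsRetry() {}

    /**
     * Runs op up to maxAttempts times, sleeping delayMillis between failed attempts.
     * Returns the first successful result, or rethrows the last IOException
     * once all attempts are used up.
     */
    static <T> T withRetries(IoOperation<T> op, int maxAttempts, long delayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.run();
            } catch (IOException e) {
                last = e;
                // Do not sleep after the final attempt; just fall through and rethrow.
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }
}
```

Usage would look like HdfsRetry.withRetries(() -> fs.exists(path), 5, 2000L), where fs and path are the caller's own FileSystem and Path. Passing Integer.MAX_VALUE as maxAttempts approximates "retry indefinitely". A wrapper like this only helps if the retried call can actually succeed once HDFS is back, which is exactly what the first problem above puts in doubt.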
Last but not least, has anyone else dealt with these issues? Is any of this fixed / does any of this work differently in later versions of Hadoop/HDFS?
I may have alleviated issue #2 described above by holding on to the file writers and not refreshing them.
However, I just bounced HDFS, and even though I keep refreshing my HDFS client instances, I keep getting these exceptions:
java.io.IOException: All datanodes XXX.XX.XXX.XX:50010 are bad. Aborting...
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1137) ~[hadoop-hdfs-2.5.0.jar:?]
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:933) ~[hadoop-hdfs-2.5.0.jar:?]
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487) ~[hadoop-hdfs-2.5.0.jar:?]