What should be the strategy for loading very large files (more than 1 TB each) into HDFS in a reliable, fail-safe manner?
Flume provides fail-safety and reliability, but it is ideally meant for ingesting regularly generated files into HDFS. My understanding is that it works well for ingesting a large number of files, i.e. scenarios where data is generated in mini-batches, but might not be efficient for transferring a single large file into HDFS. Please let me know if I am wrong here.
Also, the hadoop fs -put command does not provide fail-safety: if the transfer fails, it won't resume the upload.
Can Apache NiFi be considered for such scenarios?
You can write a Java program using the HDFS API to upload files to HDFS with your own retry logic. NiFi has a good reputation, but I don't have information about files of the size you're speaking of. Here's a sample: http://tutorials.techmytalk.com/2014/08/16/hadoop-hdfs-java-api/
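To sketch what "your own retry logic" could look like: below is a minimal, self-contained retry wrapper with exponential backoff. The class name `RetryingUpload` and the simulated failing operation are hypothetical; in a real program the `Callable` would wrap a Hadoop client call such as `FileSystem.copyFromLocalFile(localPath, hdfsPath)`, which is omitted here so the sketch compiles without the Hadoop jars.

```java
import java.util.concurrent.Callable;

public class RetryingUpload {

    /**
     * Runs op, retrying up to maxAttempts times with exponential backoff.
     * Returns the number of attempts used; rethrows the last failure if
     * all attempts are exhausted.
     */
    public static int runWithRetry(Callable<Void> op, int maxAttempts, long baseBackoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.call();                    // in real use: the HDFS upload
                return attempt;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // back off 1x, 2x, 4x, ... before the next attempt
                    Thread.sleep(baseBackoffMs * (1L << (attempt - 1)));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Demo: an operation that fails twice, then succeeds on the third try.
        int[] calls = {0};
        int attempts = runWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated transfer failure");
            }
            return null;
        }, 5, 10);
        System.out.println("succeeded after " + attempts + " attempts");
        // prints: succeeded after 3 attempts
    }
}
```

Note that a plain retry restarts the whole upload from byte zero, which is painful for 1 TB files; for true resumption you would need to track how much was written (e.g. via file status) and append the remainder.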
This might be the easiest way. You get 300GB/s and you can just restart in case of failure. Of course there is a risk of redoing 3 hours of work. However, a local hadoop client is definitely the most robust way to do it, with the least likelihood of something going wrong.
I am not sure whether Flume or NiFi would help you. Flume expects files to be immutable and, AFAIK, cannot resume reading from a file. Not sure about NiFi. Does anybody know?
Just realized I had a wrong link there all along; fixed the link.
Please look into WebHDFS https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
A Complete HDFS Interface: WebHDFS supports all HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming. In contrast, HFTP (a previous version of HTTP protocol heavily used at Yahoo!) only supports the read operations but not the write operations. Read operations are the operations that do not change the status of HDFS including the namespace tree and the file contents.
HTTP REST API: WebHDFS defines a public HTTP REST API, which permits clients to access Hadoop from multiple languages without installing Hadoop. You can use common tools like curl/wget to access HDFS.
Wire Compatibility: the REST API will be maintained for wire compatibility. Thus, WebHDFS clients can talk to clusters with different Hadoop versions.
Secure Authentication: The core Hadoop uses Kerberos and Hadoop delegation tokens for security. WebHDFS also uses Kerberos (SPNEGO) and Hadoop delegation tokens for authentication.
Data Locality: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data.
An HDFS Built-in Component: WebHDFS is a first-class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore it can use all HDFS functionalities. It is a part of HDFS; there are no additional servers to install.
Apache Open Source: All the source code and documentation have been committed to the Hadoop code base. It will be released with Hadoop 1.0. For more information, read the preliminary version of the WebHDFS REST API specification.
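Since WebHDFS is just HTTP, the curl-based workflow mentioned above can be sketched roughly as follows. Hostnames, ports, and paths here are placeholders (9870/9864 are the Hadoop 3 defaults; older clusters use 50070/50075), and writes are a two-step redirect dance: the NameNode tells you which DataNode to stream to.

```shell
# Step 1: ask the NameNode to create the file; it answers with a
# 307 redirect whose Location header points at a DataNode.
curl -i -X PUT \
  "http://NAMENODE:9870/webhdfs/v1/data/big-file.dat?op=CREATE&overwrite=false"

# Step 2: stream the local file to the Location URL from step 1.
curl -i -X PUT -T big-file.dat "http://DATANODE:9864/webhdfs/v1/..."

# If the transfer dies partway, check how many bytes actually landed
# before deciding whether to delete and restart or append the rest.
curl -s "http://NAMENODE:9870/webhdfs/v1/data/big-file.dat?op=GETFILESTATUS"
```

For resumption, WebHDFS also offers op=APPEND (an HTTP POST with the same redirect pattern), so in principle you can skip the bytes already written locally and append only the remainder rather than re-uploading the whole terabyte.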