New Contributor
Posts: 5
Registered: ‎06-03-2016

File load strategy for large files (per file volume greater than 1 TB)

Hi,

What should be the strategy for loading files (volume per file greater than 1 TB) into HDFS in a reliable, fail-safe manner?

 

Flume provides fail-safety and reliability, but it is ideally meant for regularly generated files. My understanding is that it works well for ingesting a large number of files into HDFS and is best suited to scenarios where data is generated in mini-batches, but it might not be efficient for transferring a single large file into HDFS. Please let me know if I am wrong here.

Also, the hadoop fs -put command cannot provide fail-safety: if the transfer fails, it won't restart the process.

 

Regards,

Rajib

Cloudera Employee
Posts: 275
Registered: ‎01-09-2014

Re: File load strategy for large files (per file volume greater than 1 TB)

I would suggest using Oozie with an ssh or shell action (depending on where these files are). You can create a script that pushes these files into HDFS with an 'hdfs dfs -put' command, and if that fails, you can set up the Oozie workflow to send notifications.
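
A minimal sketch of such a wrapper script, assuming placeholder paths (/data/incoming/bigfile.dat and /ingest/bigfile.dat); the notification itself would come from the Oozie workflow's error transition, not from the script:

#!/bin/bash
# Sketch: push one large file into HDFS with a few retries.
# SRC and DEST are placeholders; adjust for your environment.
SRC=/data/incoming/bigfile.dat
DEST=/ingest/bigfile.dat
MAX_RETRIES=3

for attempt in $(seq 1 $MAX_RETRIES); do
  # -f overwrites any partial file left behind by a previous failed attempt
  if hdfs dfs -put -f "$SRC" "$DEST"; then
    # Sanity check: compare local and HDFS byte counts before declaring success
    local_size=$(stat -c %s "$SRC")
    hdfs_size=$(hdfs dfs -stat %b "$DEST")
    if [ "$local_size" -eq "$hdfs_size" ]; then
      echo "Upload succeeded on attempt $attempt"
      exit 0
    fi
  fi
  echo "Attempt $attempt failed, retrying..." >&2
done

# A non-zero exit lets the Oozie shell/ssh action transition to its error node,
# which can be wired to an email notification.
exit 1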

Alternatively, you could mount HDFS via NFS and have a cron job that copies the files, putting all of your retry logic in there.
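
A rough sketch of that cron-driven approach, assuming HDFS is already NFS-mounted at /mnt/hdfs (the mount point and file paths are placeholders):

#!/bin/bash
# Sketch: copy a large file onto an NFS-mounted HDFS path with basic retry logic.
SRC=/data/incoming/bigfile.dat
DEST=/mnt/hdfs/ingest/bigfile.dat

for attempt in 1 2 3; do
  # Copy to a temp name first, then rename, so readers never see a partial file
  if cp "$SRC" "$DEST.tmp" && \
     [ "$(stat -c %s "$SRC")" -eq "$(stat -c %s "$DEST.tmp")" ] && \
     mv "$DEST.tmp" "$DEST"; then
    echo "Copy succeeded on attempt $attempt"
    exit 0
  fi
  rm -f "$DEST.tmp"
  sleep 60
done

echo "Copy failed after 3 attempts" >&2   # surfaced via cron mail or a monitoring hook
exit 1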

-pd
New Contributor
Posts: 5
Registered: ‎06-03-2016

Re: File load strategy for large files (per file volume greater than 1 TB)

Thanks for your reply. Can we use Apache NiFi for data loading? Do you foresee any performance issues if we use Apache NiFi?

Cloudera Employee
Posts: 39
Registered: ‎01-07-2019

Re: File load strategy for large files (per file volume greater than 1 TB)

As mentioned before, if you need to operate on the whole file, a flow with a few retries/notifications, something like Oozie around hadoop fs -put, makes sense.

 

If you have more flexibility, you could look into a NiFi-based solution where you grab the file piece by piece with TailFile as it is written. (NiFi can scale to any volume of files, but it shines most with files/pieces that are somewhat smaller than 1 TB.)
