
Copy large number of massive files from local file server to HDFS

Expert Contributor

Hello Team,

Looking for your advice for this issue.

We have a use case where almost 50 TB of files need to be moved from a local file server to HDFS. The files are kept under multiple folders on the local file system, and we need to maintain a similar folder structure in HDFS. Looking for suggestions on any utilities through which we can achieve this objective.

Let me know in case you need any more information.

Thanks and Regards,

Rajdip

4 REPLIES


Hi @rajdip chaudhuri

Have you considered NiFi? You have out-of-the-box processors to list/fetch files and to write to HDFS. You can also use a NiFi cluster if you want to distribute the load across several nodes.
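As a rough sketch only (the local and HDFS paths below are made-up examples, and the exact property values depend on your environment), such a flow can be as simple as chaining three standard processors:

    ListFile   - Input Directory: /mnt/fileserver/export        (example source folder)
                 Recurse Subdirectories: true
    FetchFile  - File to Fetch: ${absolute.path}/${filename}    (the default)
    PutHDFS    - Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
                 Directory: /ingest/${path}                     (re-creates the source sub-folder layout)

ListFile records each file's directory (relative to the Input Directory) in the path attribute, which is what lets PutHDFS rebuild the same folder structure on the HDFS side.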


@rajdip chaudhuri As mentioned by @Abdelkrim Hadjidj, NiFi is a great candidate for solving these kinds of issues! He mentioned processors like ListFile and FetchFile. As the names suggest, they list and fetch the data without you having to write any code, and they also give you the attributes attached to those files, for example the directory structure, which is a very important need you mentioned in your use case.

Here are a bunch of advantages you get from using NiFi for this use case:

  1. No need to write any code
  2. Advanced functionality that helps you "maintain state": only files that are new since the last run will be listed. This also lets you keep the "file server" ingestion running in real time if needed.
  3. Very high throughput with low latency.
  4. Rapid data acquisition pipeline development without writing a lot of code.
  5. Provides a highly concurrent model without a developer having to worry about the typical complexities of concurrency.
  6. Is inherently asynchronous, which allows for very high throughput and natural buffering even as processing and flow rates fluctuate.
  7. The resource-constrained connections make critical functions such as back-pressure and pressure release very natural and intuitive.
  8. The points at which data enters and exits the system, as well as how it flows through it, are well understood and easily tracked.
  9. And biggest of all, OPEN SOURCE.

Let me know if you need any other help!

@rajdip chaudhuri

Did the answer help resolve your query? If so, please close the thread by marking the answer as Accepted!


If this is a one-off, and that file server is visible to all nodes in the cluster, you can actually use distcp with the source being a file://store/path URL and the destination hdfs://hdfsserver:port/path. Use the -bandwidth option to limit the maximum bandwidth of every mapper so that the (mappers * bandwidth) value is less than the bandwidth of the file server.
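For illustration only (the mount point, NameNode address, and numbers are placeholder values, not ones from this thread), the command could look something like this:

    # Pull from the locally mounted file server into HDFS, keeping the folder layout.
    # -m is the number of map tasks; -bandwidth caps each map in MB/s,
    # so the total read from the file server stays around 20 x 10 = 200 MB/s here.
    hadoop distcp \
        -m 20 \
        -bandwidth 10 \
        file:///mnt/fileserver/data \
        hdfs://namenode.example.com:8020/data

Since distcp copies directories recursively, the sub-folder structure under the source path is preserved on the HDFS side.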