
Copy large number of massive files from local file server to HDFS

Expert Contributor

Hello Team,

Looking for your advice for this issue.

We have a use case where almost 50 TB of files need to be moved from a local file server to HDFS. The files are kept under multiple folders on the local file system, and we need to maintain a similar folder structure in HDFS. Looking for suggestions on any utilities through which we can achieve this objective.

Let me know in case you need any more information.

Thanks and Regards,

Rajdip

4 REPLIES


Hi @rajdip chaudhuri

Have you considered NiFi? You have out-of-the-box processors to list/fetch files and to write to HDFS. You can also use a NiFi cluster if you want to distribute the load across several nodes.
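As a rough sketch only (the local and HDFS paths below are made-up examples, and the exact property values depend on your environment), such a flow can be as simple as chaining three standard processors:

    ListFile   - Input Directory: /mnt/fileserver/export        (example source folder)
                 Recurse Subdirectories: true
    FetchFile  - File to Fetch: ${absolute.path}/${filename}    (the default)
    PutHDFS    - Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
                 Directory: /ingest/${path}                     (re-creates the source sub-folder layout)

ListFile records each file's directory (relative to the Input Directory) in the path attribute, which is what lets PutHDFS rebuild the same folder structure on the HDFS side.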


@rajdip chaudhuri As mentioned by @Abdelkrim Hadjidj, NiFi is a great candidate for solving these kinds of issues! He mentioned processors like ListFile and FetchFile. As the names suggest, they list and fetch the data without you having to write any code, and they also give you the attributes attached to those files, for example the directory structure, which is a very important need you mentioned in your use case.

Here are a bunch of advantages you get from using NiFi for this use case:

  1. No need to write any code
  2. Advanced functionality that helps you "maintain state": only files that are new since the last run will be listed. This also lets you keep the "file server" ingestion running in real time if needed.
  3. Very high throughput with low latency.
  4. Rapid data acquisition pipeline development without writing a lot of code.
  5. Provides a highly concurrent model without a developer having to worry about the typical complexities of concurrency.
  6. Is inherently asynchronous, which allows for very high throughput and natural buffering even as processing and flow rates fluctuate.
  7. The resource-constrained connections make critical functions such as back-pressure and pressure release very natural and intuitive.
  8. The points at which data enters and exits the system, as well as how it flows through it, are well understood and easily tracked.
  9. And biggest of all, OPEN SOURCE.

Let me know if you need any other help!

@rajdip chaudhuri

Did the answer help resolve your query? If so, please close the thread by marking the answer as Accepted!


If this is a one-off, and that file server is visible to all nodes in the cluster, you can actually use distcp with the source being a file://store/path URL and the destination hdfs://hdfsserver:port/path. Use the -bandwidth option to limit the maximum bandwidth of every mapper so that the (mappers * bandwidth) value is less than the bandwidth of the file server.
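For illustration only (the mount point, NameNode address, and numbers are placeholder values, not ones from this thread), the command could look something like this:

    # Pull from the locally mounted file server into HDFS, keeping the folder layout.
    # -m is the number of map tasks; -bandwidth caps each map in MB/s,
    # so the total read from the file server stays around 20 x 10 = 200 MB/s here.
    hadoop distcp \
        -m 20 \
        -bandwidth 10 \
        file:///mnt/fileserver/data \
        hdfs://namenode.example.com:8020/data

Since distcp copies directories recursively, the sub-folder structure under the source path is preserved on the HDFS side.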