How to speed up GetHDFS in DataFlow

New Contributor

I'm reading about a million small files (< 10 kB each) from HDFS so that NiFi can merge them. NiFi walks through the directories (about 20) recursively, and it takes several hours to work through all the files. Once NiFi is done reading them, the files are removed from HDFS (i.e. "Keep source file" is set to false).

Is there a way to speed up GetHDFS? The batch size does not seem to have a huge effect as I increased it from 100 to 1000 without any significant difference in the stats.

Thanks!

6 REPLIES

Re: How to speed up GetHDFS in DataFlow

Mentor

Reading a directory with millions of files in HDFS will take a long time, with or without NiFi. You need to either pack the files into a Hadoop archive (HAR) first or figure out on the HDFS side how to keep your file count to a minimum. I don't think a NiFi processor can help you here. It's probably better to write a MapReduce job to merge the files.
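For readers who want to try the HAR route mentioned above, here is a minimal sketch that only assembles the `hadoop archive` command line and prints it, without running anything. The archive name, parent directory, source directories and destination are hypothetical placeholders, and the helper `build_har_command` is made up for this illustration. As far as I know, creating the archive runs as a MapReduce job and leaves the source files in place, so removing them afterwards would still be a separate step.

```python
import shlex


def build_har_command(archive_name, parent_dir, sources, dest_dir):
    """Assemble the argv for: hadoop archive -archiveName NAME -p PARENT SRC... DEST.

    All arguments are placeholders chosen by the caller; nothing is executed here.
    """
    if not archive_name.endswith(".har"):
        # The archive name is expected to carry a .har extension.
        raise ValueError("archive name should end with .har")
    if not sources:
        raise ValueError("at least one source directory is required")
    return [
        "hadoop", "archive",
        "-archiveName", archive_name,
        "-p", parent_dir,   # parent path that the sources are relative to
        *sources,
        dest_dir,
    ]


# Hypothetical paths, for illustration only.
argv = build_har_command(
    "small-files.har", "/data/incoming", ["dir01", "dir02"], "/data/archives"
)
print(shlex.join(argv))
```

The resulting list could be handed to `subprocess.run(argv, check=True)` on a machine with a configured Hadoop client.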

Re: How to speed up GetHDFS in DataFlow

Do you have a NiFi cluster or single node?

Sounds like a single node, but if you had a cluster you could parallelize the process by using ListHDFS + FetchHDFS rather than GetHDFS. Listing 1 million files may still take a while though, I'm not really sure.

The parallelized fetching in a cluster is described here:

https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
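To make the idea concrete: one node produces the listing, and the listed paths are then spread over the cluster so that each node fetches only its share. The toy sketch below is not NiFi code. It simulates the split with a plain round-robin, which is an assumption made for illustration; the node names and file paths are hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def distribute(paths, nodes):
    """Assign each listed path to a node in round-robin order."""
    if not nodes:
        raise ValueError("need at least one node")
    assignment = defaultdict(list)
    # cycle() repeats the node list for as long as there are paths left.
    for path, node in zip(paths, cycle(nodes)):
        assignment[node].append(path)
    return dict(assignment)


# Hypothetical listing and cluster, for illustration only.
listing = [f"/data/incoming/file{i:04d}" for i in range(6)]
for node, share in distribute(listing, ["node-1", "node-2", "node-3"]).items():
    print(node, share)
```

With three nodes each one ends up fetching roughly a third of the files, which is where the speed-up over a single GetHDFS would come from. The listing step itself stays on one node.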

Re: How to speed up GetHDFS in DataFlow

New Contributor

Problem is: I need to delete the original files, which ListHDFS and FetchHDFS don't do AFAIK.

Re: How to speed up GetHDFS in DataFlow

Mentor

They have a property that lets you choose whether to delete or keep the source file.

Re: How to speed up GetHDFS in DataFlow

New Contributor

No they don't:

The ListHDFS description states: "Unlike GetHDFS, this Processor does not delete any data from HDFS." Quite clear, I'd say. Moreover, in FetchHDFS you can read: "The file in HDFS is left intact without any changes being made to it."

So, only GetHDFS is an option.

Re: How to speed up GetHDFS in DataFlow

Mentor

@Ian Hellström you are correct, I confused ListFile and FetchFile with ListHDFS and FetchHDFS. FetchFile has a Completion Strategy that can be set to Delete File; FetchHDFS does not. I'm talking to the NiFi team about adding that in a future release, no promises yet.
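The behaviour asked for in this thread, fetching a file and deleting the original only once the content has been handed over safely, can be sketched in a few lines. This is an illustration on the local filesystem and not how any NiFi processor is implemented. The function name `fetch_then_delete` and the `deliver` callback are hypothetical.

```python
from pathlib import Path
from typing import Callable


def fetch_then_delete(source: Path, deliver: Callable[[bytes], None]) -> None:
    """Read a file, pass its content on, and delete it only if that succeeded."""
    data = source.read_bytes()
    # If deliver() raises, the exception propagates and the source is kept,
    # so the file can be retried later.
    deliver(data)
    source.unlink()
```

The ordering is the point of the sketch: deleting before the content is safely delivered would risk data loss if the downstream step fails.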