I'm reading about a million small files (< 10 kB each) from HDFS so that NiFi can merge them. NiFi traverses the directories (about 20) recursively, and it takes several hours to work through all the files. Once NiFi has read them, the files are removed (i.e. "Keep Source File" is set to false).
Is there a way to speed up GetHDFS? The batch size does not seem to have a huge effect as I increased it from 100 to 1000 without any significant difference in the stats.
Reading a directory with millions of files in HDFS will take a long time, with or without NiFi. You either need to create a Hadoop Archive (HAR) first, or figure out on the HDFS side how to keep the file count to a minimum. I don't think a NiFi processor can help you here; it's probably better to write a MapReduce job to merge the files.
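The merge idea above can be sketched without a full MapReduce job. On HDFS itself you'd reach for `hdfs dfs -getmerge` or a small merge job; here is a hedged local-filesystem stand-in that shows the same "many tiny files in, one big file out" logic (the `merge_small_files` helper and all paths are hypothetical, not a NiFi or Hadoop API):

```python
from pathlib import Path

def merge_small_files(src_dir: str, dest_file: str) -> int:
    """Concatenate every file under src_dir into one output file.

    Mimics, on the local filesystem, what `hdfs dfs -getmerge` or a
    simple merge job would do on HDFS: fewer, larger files are far
    cheaper to list and read than millions of tiny ones.
    Returns the number of files merged.
    """
    count = 0
    with open(dest_file, "wb") as out:
        # Sort for a deterministic merge order.
        for path in sorted(Path(src_dir).rglob("*")):
            if path.is_file():
                out.write(path.read_bytes())
                count += 1
    return count
```

Note that `dest_file` should live outside `src_dir`, otherwise the output would be picked up by the recursive listing.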
Do you have a NiFi cluster or single node?
Sounds like a single node, but if you had a cluster you could parallelize the process using ListHDFS + FetchHDFS rather than GetHDFS. It may still take a while to list 1 million files, though; I'm not sure.
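A minimal sketch of that flow, assuming the standard property names (the directory path is just an example):

```
ListHDFS  (scheduled on Primary node only)
  Directory:              /data/incoming    # example path
  Recurse Subdirectories: true
      |
      v   distribute the listed flowfiles across the cluster,
          e.g. via a load-balanced connection or a Remote Process Group
FetchHDFS
  HDFS Filename: ${path}/${filename}
```

ListHDFS emits one flowfile per HDFS file (metadata only), so the heavy reads in FetchHDFS can run on every node in parallel.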
The parallelized fetching in a cluster is described here:
No, they don't:
ListHDFS also states in the description: Unlike GetHDFS, this Processor does not delete any data from HDFS. Quite clear, I'd say. Moreover, in FetchHDFS you can read: The file in HDFS is left intact without any changes being made to it.
So, only GetHDFS is an option.
@Ian Hellström you are correct, I confused ListFile/FetchFile with ListHDFS/FetchHDFS. FetchFile has a Completion Strategy that can be set to Delete File; FetchHDFS does not. I'm talking to the NiFi team about adding that in a future release, but no promises yet.