When the "Group Results" property is set to "All", the GetHDFSFileInfo processor produces one output FlowFile containing a complete listing of all files and directories found based on the configured "Full path" and "Recurse Subdirectories" property settings.
Since you want a single FlowFile for each object listed from HDFS, you will want to set the "Group Results" property to "None". You should then see a separate FlowFile produced for each object found.
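For reference, a minimal GetHDFSFileInfo configuration for this pattern might look like the following (the "/data/landing" path is just a placeholder for your target directory, and I am assuming "Destination" is set to "Attributes" so the object metadata lands in FlowFile attributes rather than the content):

GetHDFSFileInfo
    Full path              = /data/landing    (placeholder path)
    Recurse Subdirectories = true
    Group Results          = None
    Destination            = Attributes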
Then, in your FetchHDFS processor, set the "HDFS Filename" property to "${hdfs.path}/${hdfs.objectName}".
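As a quick sketch, using the attribute names GetHDFSFileInfo writes when its "Destination" is "Attributes":

FetchHDFS
    HDFS Filename = ${hdfs.path}/${hdfs.objectName}

Here the "hdfs.path" attribute holds the parent directory and "hdfs.objectName" holds the object's name, so the expression reassembles the absolute HDFS path of each listed object.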
You may also find that you need to insert a RouteOnAttribute processor between your GetHDFSFileInfo and FetchHDFS processors to filter out any FlowFiles that GetHDFSFileInfo produces for directory objects (not files). Simply add a dynamic property that routes any FlowFile with the attribute "hdfs.type" set to "file" on to the FetchHDFS processor, and send all other FlowFiles to the "unmatched" relationship, which you can auto-terminate.
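A minimal RouteOnAttribute setup for that might look like this (the "is-file" dynamic property name is arbitrary; call it whatever you like):

RouteOnAttribute
    Routing Strategy = Route to Property name
    is-file          = ${hdfs.type:equals('file')}    (dynamic property)

Connect the resulting "is-file" relationship to FetchHDFS and auto-terminate "unmatched".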
Other things to consider:
1. Keep in mind that the GetHDFSFileInfo processor does not maintain any state, so every time it executes it will list all files/directories from the target, regardless of whether they were listed before. The ListHDFS processor, by contrast, does keep state and only lists objects it has not seen before.
2. If you are running your dataflow in a NiFi multi-node cluster, every node in your cluster will perform the same listing (which may not be what you want). If you only want the target files/directories listed by one node, configure the GetHDFSFileInfo processor to execute on "Primary node" only (set from the processor's "Scheduling" tab). You can then use a load-balancing configuration on the connection out of the GetHDFSFileInfo processor to redistribute the produced FlowFiles across all nodes in your cluster before they are processed by the FetchHDFS processor; see the sketch below.
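As a sketch, that combination would look something like this (the connection shown is whichever one immediately follows GetHDFSFileInfo in your flow, e.g. the feed into RouteOnAttribute):

GetHDFSFileInfo -> Scheduling tab
    Execution = Primary node
Connection out of GetHDFSFileInfo
    Load Balance Strategy = Round robin

With this setup only the primary node performs the listing, and the resulting FlowFiles are then spread round-robin across all nodes so the fetch work is distributed.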
Hope this helps,
Matt