How to get latest loaded filename in HDFS directory in NIFI

New Contributor

Hello ,

I am trying to get the filename of the most recently loaded file in an HDFS directory, and I only want to retrieve it once. Can anyone help me find the best approach for this using NiFi processors?

Thanks and regards,

Re: How to get latest loaded filename in HDFS directory in NIFI

Master Guru

@Pr1 

I'm not sure how often new files are written to your HDFS directory, but you may want to look into using the ListHDFS and FetchHDFS processors.

The ListHDFS processor lists the files in HDFS that match the processor's configuration. It then retains state based on the timestamp of the last execution and uses that timestamp so that only files that are new since the previous execution are listed during the next one. ListHDFS creates only a 0-byte FlowFile for each file it lists. The FlowFile carries attributes/metadata about the HDFS file so that its content can later be fetched by the FetchHDFS processor.
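
To make the "only new files since the previous execution" idea concrete, here is a rough Python sketch of that kind of timestamp-based state. It is only a simplified model and not NiFi's actual implementation: the names HdfsEntry and IncrementalLister are made up for illustration, and I am assuming the stored state is the newest modification timestamp seen so far.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HdfsEntry:
    """Hypothetical stand-in for one file in a directory listing."""
    path: str
    modified_ms: int  # modification time in epoch milliseconds


class IncrementalLister:
    """Simplified model of a listing that remembers a timestamp between runs."""

    def __init__(self):
        # Assumed state: newest modification time seen so far (-1 = nothing yet).
        self.last_seen_ms = -1

    def list_new(self, entries):
        # Keep only entries modified after the stored timestamp.
        new = [e for e in entries if e.modified_ms > self.last_seen_ms]
        if new:
            # Advance the state so the next run skips these entries.
            self.last_seen_ms = max(e.modified_ms for e in new)
        # Oldest first, so the last element is the most recently modified file.
        return sorted(new, key=lambda e: e.modified_ms)


lister = IncrementalLister()
directory = [HdfsEntry("/data/a.csv", 100), HdfsEntry("/data/b.csv", 200)]
first_run = lister.list_new(directory)    # both files are listed
directory.append(HdfsEntry("/data/c.csv", 300))
second_run = lister.list_new(directory)   # only the file added since the first run
```

The real processor has to deal with details this sketch ignores, such as several files sharing the same timestamp or state being stored across a cluster.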

The FlowFiles produced by ListHDFS then need to be routed to a FetchHDFS processor to retrieve the actual content from HDFS. This model is designed so that ListHDFS can be configured to run on the Primary Node only in a NiFi cluster, and you can then distribute these 0-byte FlowFiles across all nodes in your cluster before actually fetching the content. This provides better performance by spreading the load across multiple servers.

On the first run you may get more files than you wanted, but there are some ListHDFS properties you can set to avoid or limit this. Consider using the "Maximum File Age" property.
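
As a rough illustration of what an age limit does to that first listing, here is a small Python sketch. It is not NiFi code: within_max_age and newest are hypothetical helpers, the entries are plain (path, modified_ms) tuples, and I am assuming "age" means the current time minus the file's modification time.

```python
def within_max_age(entries, now_ms, max_age_ms):
    """Keep only (path, modified_ms) pairs no older than max_age_ms."""
    return [(path, mod) for path, mod in entries if now_ms - mod <= max_age_ms]


def newest(entries):
    """Return the path with the largest modification time, or None if empty."""
    if not entries:
        return None
    return max(entries, key=lambda entry: entry[1])[0]


# Made-up listing: one old file and two recent ones.
listing = [
    ("/data/old.csv", 1_000),
    ("/data/new.csv", 9_000),
    ("/data/newer.csv", 9_500),
]

# With "now" at 10_000 ms and a 2_000 ms age limit, the old file is dropped.
recent = within_max_age(listing, now_ms=10_000, max_age_ms=2_000)
latest_path = newest(recent)
```

If you only care about the single latest file, picking the entry with the largest modification time from the filtered listing (as newest does here) is one way to think about it.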

Hope this helps,

Matt
