
How to get files from latest directory based on name

New Contributor

I have folders with dates as their names:

./20141002
./20141009
./20141016
./20141023

How do I get files from the latest folder based on the names above?

I know this command

find . ! -path . -type d | sort -nr | head -1

works but how do I do it with a NiFi processor?
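For reference, extending that command to also list the files inside the newest directory might look like this (a plain-shell sketch; it relies on YYYYMMDD names sorting lexicographically in date order):

```shell
#!/bin/sh
# Find the newest date-named directory; reverse sort puts the
# largest (newest) YYYYMMDD name first.
latest=$(find . ! -path . -type d | sort -nr | head -1)

# List the files inside that directory.
find "$latest" -maxdepth 1 -type f
```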

2 ACCEPTED SOLUTIONS

Super Mentor

@Pavitran 

This does present a challenge. Typically the ListFile processor is used to list files from a local file system. That processor is designed to record state (by default based on last-modified timestamp) so that only newer files are consumed, but the first run would list all files by default. Also, looking at your example, your latest directory does not correspond to the current day.

ListFile does not actually consume the content; it generates a 0 byte FlowFile for each file listed, along with attributes/metadata about the source file. The FetchFile processor is then used to fetch the actual content. With a large listing, this lets you redistribute the 0 byte FlowFiles across all nodes in your cluster before consuming the content (provided the same local file system is mounted on all nodes; if each node has different files, do not load balance between the processors).

So you could make a first run that lists everything and simply delete those 0 byte FlowFiles. That would establish state, and from that point on ListFile would only list newly created files.
Pros:
1. State allows this processor to be unaffected by outages; after an outage it will still consume all files that were not previously listed.
Cons:
1. The initial run could create a lot of 0 byte FlowFiles that must be discarded in order to establish state.
2. After an extended outage, on restart the flow may consume more than just the latest files, since it will consume every file with a timestamp newer than the one last stored in state.

Other options:
A: The ListFile processor has an optional "Maximum File Age" property which limits the listing to files no older than a set amount of time.
Pros to setting this property:
1. Reduces or eliminates the massive listing on the first run.
Cons to setting this property:
1. During an outage that exceeds the configured "Maximum File Age", a file you wanted listed may be skipped.

B: Since FetchFile uses attributes/metadata from the incoming FlowFile to fetch the actual content, you could craft a source FlowFile yourself and send it to the FetchFile processor. For example, use an ExecuteStreamCommand processor to execute a bash script on disk that lists the files from only the latest directory. Then use UpdateAttribute to add the other attributes FetchFile requires to get the actual content, and use SplitText to split that listing of files into individual FlowFiles before the FetchFile processor.
Pros:
1. You are in control of what is being listed.
Cons:
1. Depending on how often a new directory is created and how often your ExecuteStreamCommand processor runs, you may end up listing the same source files again, since ExecuteStreamCommand has no state option. You may be able to handle this with a DetectDuplicate processor in your flow design.
2. If a new file is added to the listed directory after the previous ExecuteStreamCommand run, the next run will list all of the previous files again along with the new ones for that directory. Again, this might be handled with a DetectDuplicate processor.
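A minimal sketch of such a script (the base path /data/incoming is a placeholder; ExecuteStreamCommand would run the script, and the resulting listing would then be split and fed to FetchFile):

```shell
#!/bin/sh
# Sketch for option B: print one absolute file path per line from the
# newest date-named directory under a base path.
# NOTE: /data/incoming is a placeholder; point this at your real base dir.
base=/data/incoming

# YYYYMMDD directory names sort lexicographically in date order,
# so the last sorted entry is the newest directory.
latest=$(ls -d "$base"/[0-9]*/ | sort | tail -1)
latest=${latest%/}   # drop the trailing slash left by the glob

# Emit absolute paths; FetchFile needs the full path to each file.
find "$latest" -maxdepth 1 -type f
```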


Hope this helps give you some ideas,
Matt


New Contributor

Thank you; the first solution was the best one for me.

