Support Questions

thierry_vernhet · ‎06-14-2017

Hi everybody,

I use Nifi 1.0.0 on AIX server.

My ListFile processor gives the same file in two different dataflows. It schedules every 15 seconds.

The file O27853044.1135 begins to fill at 11:35 and ends at 11:45.

Is it normal that the processor creates a dataflow at 11:42 ?

How avoid ListFile processor to create a dataflow before the end of file's update ?

Thanks for you help

MattWho · ‎06-14-2017

@Thierry Vernhet

The ListFile processor will list all non-hidden file it sees in the target directory. It then will record the latest timestamp of batch of files it listed in state management. This timestamp is what is used to determine what new files to list in next run. Since the timestamp has changed, the same file will be listed again.

A few suggestion in preferred order would be:

1. Change how files are being written to this directory.

- The ListFile processor will ignore and hidden files. So File being written as ".myfile.txt" will be ignored until the filename has changed to just "myfile.txt".

2. Change the "Minimum File Age" setting on the processor to a high enough value to allows source system to complete file writes to this directory.

3. Add a detectDuplicate processor after your listFile processor to detect duplicate listed files and remove them from the your dataflow before the FetchFile processor.

Thanks,

Matt

View solution in original post

MattWho · ‎06-14-2017

@Thierry Vernhet

The ListFile processor will list all non-hidden file it sees in the target directory. It then will record the latest timestamp of batch of files it listed in state management. This timestamp is what is used to determine what new files to list in next run. Since the timestamp has changed, the same file will be listed again.

A few suggestion in preferred order would be:

1. Change how files are being written to this directory.

- The ListFile processor will ignore and hidden files. So File being written as ".myfile.txt" will be ignored until the filename has changed to just "myfile.txt".

2. Change the "Minimum File Age" setting on the processor to a high enough value to allows source system to complete file writes to this directory.

3. Add a detectDuplicate processor after your listFile processor to detect duplicate listed files and remove them from the your dataflow before the FetchFile processor.

Thanks,

Matt

thierry_vernhet · ‎06-14-2017

@Matt Clarke

Thanks for these suggestions.

I'm going to try number 2.

And could you give me an example of properties for the number 3 and detectduplicate processor ?

Thanks, TV

MattWho · ‎06-14-2017

@Thierry Vernhet

With number 3, I am assuming that every file has a unique filename from which to determine if the same filename has ever been listed more then once. If that is not the case, then you would need to use detectDuplicate after fetching the actual data (less desirable since you will have wasted the resources to potential fetch the same files twice before deleting the duplicate.

Let assume every file has a unique filename. If so the detect duplicate flow would look like this:

with the DetectDuplicate configured as follows:

You will also need to add two controller services to your NiFi:

- DistributedMapCacheServer

- DistributedMapCacheClientService

The value associated to the "filename" attribute on the FlowFile is checked against entries in the DistributedMapCacheServer. If filename does not exist, it is added. If it exists already then FlowFile is routed to duplicate relationship.

In scenario 2 where filenames may be reused we need to detect if the content after fetch is a duplicate or not. IN this case the flow may look like this:

After fetching the content of a FlowFile, the "HashContent" processor is used to create a hash of the content and write it to a FlowFile attribute (default is hash.value). The detectDuplicate processor then configured to look for FlowFile with the same hash.value to determine if they are duplicates.

FlowFiles where the content hash already exist in the distributedMapCacheServer, those FlowFile are routed to duplicate where you can delete them if you like.

If you found this answer addressed your original question, please mark it as accepted by clicking under the answer.

Thanks,

Matt

tarun_kumar1 · ‎12-18-2017

Thank You Matt, I too was facing similar issue and your suggestion worked.

thierry_vernhet · ‎06-14-2017

@Matt Clarke

The second suggestion works as well.

I kepp the third one for a next usage.

Thanks for all Matt

TV.

Cloudera Community

Support Questions

Files detected twice with ListFile processor