<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Files detected twice with ListFile processor in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210041#M62891</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10685/thierryvernhet.html" nodeid="10685"&gt;@Thierry Vernhet&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The ListFile processor lists all non-hidden files it sees in the target directory. It then records in state management the latest timestamp from the batch of files it listed. That timestamp is used to determine which new files to list on the next run. Because your file is still being written after it is first listed, its timestamp has changed by the next run, so the same file is listed again.&lt;/P&gt;&lt;P&gt;A few suggestions, in preferred order:&lt;/P&gt;&lt;P&gt;1. Change how files are being written to this directory.&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;- The ListFile processor ignores hidden files. A file being written as ".myfile.txt" will be ignored until it is renamed to "myfile.txt".&lt;/P&gt;&lt;P&gt;2. Change the "Minimum File Age" setting on the processor to a value high enough to allow the source system to complete its file writes to this directory.&lt;/P&gt;&lt;P&gt;3. Add a DetectDuplicate processor after your ListFile processor to detect files that have been listed more than once and remove them from your dataflow before the FetchFile processor.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Matt&lt;/P&gt;</description>
    <pubDate>Wed, 14 Jun 2017 19:30:40 GMT</pubDate>
    <dc:creator>MattWho</dc:creator>
    <dc:date>2017-06-14T19:30:40Z</dc:date>
    <item>
      <title>Files detected twice with ListFile processor</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210040#M62890</link>
      <description>&lt;P&gt;Hi everybody,&lt;/P&gt;&lt;P&gt;I use NiFi 1.0.0 on an AIX server.&lt;/P&gt;&lt;P&gt;My ListFile processor emits the same file in two different dataflows. It is scheduled to run every 15 seconds.&lt;/P&gt;&lt;P&gt;The file O27853044.1135 begins to fill at 11:35 and is complete at 11:45.&lt;/P&gt;&lt;P&gt;Is it normal that the processor creates a dataflow at 11:42?&lt;/P&gt;&lt;P&gt;How can I prevent the ListFile processor from creating a dataflow before the file has finished updating?&lt;/P&gt;&lt;P&gt;Thanks for your help&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="16371-im01.png" style="width: 689px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/16762iD90B148208C8E6DB/image-size/medium?v=v2&amp;amp;px=400" role="button" title="16371-im01.png" alt="16371-im01.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="16372-im02.png" style="width: 691px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/16763i15D58B5C03DA7839/image-size/medium?v=v2&amp;amp;px=400" role="button" title="16372-im02.png" alt="16372-im02.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="16373-im03.png" style="width: 539px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/16764iC6F47D25F151B3F1/image-size/medium?v=v2&amp;amp;px=400" role="button" title="16373-im03.png" alt="16373-im03.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Aug 2019 04:11:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210040#M62890</guid>
      <dc:creator>thierry_vernhet</dc:creator>
      <dc:date>2019-08-18T04:11:02Z</dc:date>
    </item>
    <item>
      <title>Re: Files detected twice with ListFile processor</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210041#M62891</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10685/thierryvernhet.html" nodeid="10685"&gt;@Thierry Vernhet&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The ListFile processor lists all non-hidden files it sees in the target directory. It then records in state management the latest timestamp from the batch of files it listed. That timestamp is used to determine which new files to list on the next run. Because your file is still being written after it is first listed, its timestamp has changed by the next run, so the same file is listed again.&lt;/P&gt;&lt;P&gt;A few suggestions, in preferred order:&lt;/P&gt;&lt;P&gt;1. Change how files are being written to this directory.&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;- The ListFile processor ignores hidden files. A file being written as ".myfile.txt" will be ignored until it is renamed to "myfile.txt".&lt;/P&gt;&lt;P&gt;2. Change the "Minimum File Age" setting on the processor to a value high enough to allow the source system to complete its file writes to this directory.&lt;/P&gt;&lt;P&gt;3. Add a DetectDuplicate processor after your ListFile processor to detect files that have been listed more than once and remove them from your dataflow before the FetchFile processor.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Matt&lt;/P&gt;</description>
      <pubDate>Wed, 14 Jun 2017 19:30:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210041#M62891</guid>
      <dc:creator>MattWho</dc:creator>
      <dc:date>2017-06-14T19:30:40Z</dc:date>
    </item>
    <item>
      <title>Re: Files detected twice with ListFile processor</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210042#M62892</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/525/mclark.html" nodeid="525"&gt;@Matt Clarke&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Thanks for these suggestions.&lt;/P&gt;&lt;P&gt;I'm going to try number 2.&lt;/P&gt;&lt;P&gt;Could you also give me an example of the properties for number 3, the DetectDuplicate processor?&lt;/P&gt;&lt;P&gt;Thanks, TV&lt;/P&gt;</description>
      <pubDate>Wed, 14 Jun 2017 19:59:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210042#M62892</guid>
      <dc:creator>thierry_vernhet</dc:creator>
      <dc:date>2017-06-14T19:59:19Z</dc:date>
    </item>
    <item>
      <title>Re: Files detected twice with ListFile processor</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210043#M62893</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10685/thierryvernhet.html" nodeid="10685" target="_blank"&gt;@Thierry Vernhet&lt;/A&gt; &lt;/P&gt;&lt;P&gt;With number 3, I am assuming that every file has a unique filename, which can be used to determine whether the same filename has been listed more than once. If that is not the case, you would need to use DetectDuplicate after fetching the actual data (less desirable, since you will have wasted resources potentially fetching the same file twice before deleting the duplicate).&lt;/P&gt;&lt;P&gt;Let's assume every file has a unique filename. If so, the DetectDuplicate flow would look like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="16362-screen-shot-2017-06-14-at-94637-am.png" style="width: 502px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/16757i3F4D887296323D62/image-size/medium?v=v2&amp;amp;px=400" role="button" title="16362-screen-shot-2017-06-14-at-94637-am.png" alt="16362-screen-shot-2017-06-14-at-94637-am.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;with the DetectDuplicate processor configured as follows:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="16363-screen-shot-2017-06-14-at-94703-am.png" style="width: 528px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/16758iEBE52B741EEAFEB4/image-size/medium?v=v2&amp;amp;px=400" role="button" title="16363-screen-shot-2017-06-14-at-94703-am.png" alt="16363-screen-shot-2017-06-14-at-94703-am.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;You will also need to add two controller services to your NiFi:&lt;/P&gt;&lt;P&gt;- DistributedMapCacheServer&lt;/P&gt;&lt;P&gt;- DistributedMapCacheClientService&lt;/P&gt;&lt;P&gt;The value of the "filename" attribute on the FlowFile is checked against the entries in the DistributedMapCacheServer. If the filename does not exist, it is added. If it already exists, the FlowFile is routed to the "duplicate" relationship.&lt;/P&gt;&lt;P&gt;In the second scenario, where filenames may be reused, we need to detect whether the content is a duplicate after it has been fetched. In this case the flow may look like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="16364-screen-shot-2017-06-14-at-95255-am.png" style="width: 512px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/16759iEFADD80666946D6F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="16364-screen-shot-2017-06-14-at-95255-am.png" alt="16364-screen-shot-2017-06-14-at-95255-am.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;After fetching the content of a FlowFile, the "HashContent" processor is used to create a hash of the content and write it to a FlowFile attribute (default is hash.value). The DetectDuplicate processor is then configured to look for FlowFiles with the same hash.value to determine whether they are duplicates.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="16365-screen-shot-2017-06-14-at-95617-am.png" style="width: 506px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/16760i95FA5A77CE2D3BC2/image-size/medium?v=v2&amp;amp;px=400" role="button" title="16365-screen-shot-2017-06-14-at-95617-am.png" alt="16365-screen-shot-2017-06-14-at-95617-am.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;FlowFiles whose content hash already exists in the DistributedMapCacheServer are routed to the "duplicate" relationship, where you can delete them if you like.&lt;/P&gt;&lt;P&gt;If you found this answer addressed your original question, please mark it as accepted by clicking &lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="16366-accept.png" style="width: 69px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/16761i6637AF9D9494D337/image-size/medium?v=v2&amp;amp;px=400" role="button" title="16366-accept.png" alt="16366-accept.png" /&gt;&lt;/span&gt; under the answer.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Matt&lt;/P&gt;</description>
      <pubDate>Sun, 18 Aug 2019 04:10:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210043#M62893</guid>
      <dc:creator>MattWho</dc:creator>
      <dc:date>2019-08-18T04:10:43Z</dc:date>
    </item>
    <item>
      <title>Re: Files detected twice with ListFile processor</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210044#M62894</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/525/mclark.html" nodeid="525"&gt;@Matt Clarke&lt;/A&gt; &lt;/P&gt;&lt;P&gt;The second suggestion works well.&lt;/P&gt;&lt;P&gt;I'll keep the third one for future use.&lt;/P&gt;&lt;P&gt;Thanks for everything, Matt&lt;/P&gt;&lt;P&gt;TV.&lt;/P&gt;</description>
      <pubDate>Wed, 14 Jun 2017 21:35:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210044#M62894</guid>
      <dc:creator>thierry_vernhet</dc:creator>
      <dc:date>2017-06-14T21:35:36Z</dc:date>
    </item>
    <item>
      <title>Re: Files detected twice with ListFile processor</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210045#M62895</link>
      <description>&lt;P&gt;Thank you Matt, I was facing a similar issue too and your suggestion worked.&lt;/P&gt;</description>
      <pubDate>Mon, 18 Dec 2017 22:00:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Files-detected-twice-with-ListFile-processor/m-p/210045#M62895</guid>
      <dc:creator>tarun_kumar1</dc:creator>
      <dc:date>2017-12-18T22:00:18Z</dc:date>
    </item>
  </channel>
</rss>

