<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Listsftp taking a long time, in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Listsftp-taking-a-long-time/m-p/125727#M88471</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/15395/blimbu.html" nodeid="15395"&gt;@bhumi limbu&lt;/A&gt;&lt;/P&gt;&lt;P&gt;NiFi FlowFile attributes/metadata live in heap.  The List-based processors retrieve a complete listing from the target and then create a FlowFile for each file in that returned listing. The FlowFiles being created are not committed to the list processor's success relationship until all of them have been created, so you run out of NiFi JVM heap memory before that can happen because of the size of your listing.&lt;/P&gt;&lt;P&gt;As NiFi stands now, the only option is to use multiple list processors, each producing a listing of a subset of the total files on your source system.  You could use the "Remote Path", "Path Filter Regex" and/or "File Filter Regex" properties in ListSFTP to filter which data is selected and so reduce heap usage.&lt;/P&gt;&lt;P&gt;
You could also increase the heap available to your NiFi JVM in the bootstrap.conf file; however, given the number of FlowFiles you are listing, I find it likely you would still run out of heap memory.&lt;/P&gt;&lt;P&gt;I logged a Jira in Apache NiFi suggesting a change to how these processors produce FlowFiles from the returned listing:&lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/NIFI-3423" target="_blank"&gt;https://issues.apache.org/jira/browse/NIFI-3423&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Matt&lt;/P&gt;</description>
    <pubDate>Wed, 01 Feb 2017 03:44:49 GMT</pubDate>
    <dc:creator>MattWho</dc:creator>
    <dc:date>2017-02-01T03:44:49Z</dc:date>
    <item>
      <title>Listsftp taking a long time,</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Listsftp-taking-a-long-time/m-p/125723#M88467</link>
      <description>&lt;P&gt;There is a need to load 3 terabytes of historical Unix files into HDFS. I am using the ListSFTP, FetchSFTP, UpdateAttribute and PutHDFS processors for this. There are 16 directories with 3 subdirectories each, and each of those contains 350 subdirectories. I have set Search Recursively to true in ListSFTP. The dataflow works for a smaller dataset when I point to a specific directory/subdirectory/subdirectory, but when I try the whole parent directory the ListSFTP processor does not perform. This is a one-time historical load. Is there a way I could process only one directory/subdirectory/subdirectory at a time? Has anyone come across this issue? Thank you for your help.&lt;/P&gt;</description>
      <pubDate>Thu, 12 Jan 2017 07:13:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Listsftp-taking-a-long-time/m-p/125723#M88467</guid>
      <dc:creator>blimbu</dc:creator>
      <dc:date>2017-01-12T07:13:40Z</dc:date>
    </item>
    <item>
      <title>Re: Listsftp taking a long time,</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Listsftp-taking-a-long-time/m-p/125724#M88468</link>
      <description>&lt;P&gt;Do you get an error?   Error logs?&lt;/P&gt;&lt;P&gt;You may need more error detail.&lt;/P&gt;</description>
      <pubDate>Thu, 12 Jan 2017 07:16:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Listsftp-taking-a-long-time/m-p/125724#M88468</guid>
      <dc:creator>TimothySpann</dc:creator>
      <dc:date>2017-01-12T07:16:00Z</dc:date>
    </item>
    <item>
      <title>Re: Listsftp taking a long time,</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Listsftp-taking-a-long-time/m-p/125725#M88469</link>
      <description>&lt;P&gt;It seems to me that it gets stuck in the first processor itself for a long time, because I don't see any data being pushed over to the next processor, FetchSFTP; but I don't see any errors.&lt;/P&gt;</description>
      <pubDate>Thu, 12 Jan 2017 23:14:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Listsftp-taking-a-long-time/m-p/125725#M88469</guid>
      <dc:creator>blimbu</dc:creator>
      <dc:date>2017-01-12T23:14:13Z</dc:date>
    </item>
    <item>
      <title>Re: Listsftp taking a long time,</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Listsftp-taking-a-long-time/m-p/125726#M88470</link>
      <description>&lt;P&gt;Hi Timothy, this is the error I get:&lt;/P&gt;&lt;P&gt;ERROR [Timer-Driven Process Thread-2] o.a.nifi.processors.standard.ListSFTP 
java.lang.OutOfMemoryError: Java heap space&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jan 2017 03:34:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Listsftp-taking-a-long-time/m-p/125726#M88470</guid>
      <dc:creator>blimbu</dc:creator>
      <dc:date>2017-01-13T03:34:19Z</dc:date>
    </item>
    <item>
      <title>Re: Listsftp taking a long time,</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Listsftp-taking-a-long-time/m-p/125727#M88471</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/15395/blimbu.html" nodeid="15395"&gt;@bhumi limbu&lt;/A&gt;&lt;/P&gt;&lt;P&gt;NiFi FlowFile attributes/metadata live in heap.  The List-based processors retrieve a complete listing from the target and then create a FlowFile for each file in that returned listing. The FlowFiles being created are not committed to the list processor's success relationship until all of them have been created, so you run out of NiFi JVM heap memory before that can happen because of the size of your listing.&lt;/P&gt;&lt;P&gt;As NiFi stands now, the only option is to use multiple list processors, each producing a listing of a subset of the total files on your source system.  You could use the "Remote Path", "Path Filter Regex" and/or "File Filter Regex" properties in ListSFTP to filter which data is selected and so reduce heap usage.&lt;/P&gt;&lt;P&gt;
You could also increase the heap available to your NiFi JVM in the bootstrap.conf file; however, given the number of FlowFiles you are listing, I find it likely you would still run out of heap memory.&lt;/P&gt;&lt;P&gt;I logged a Jira in Apache NiFi suggesting a change to how these processors produce FlowFiles from the returned listing:&lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/NIFI-3423" target="_blank"&gt;https://issues.apache.org/jira/browse/NIFI-3423&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Matt&lt;/P&gt;</description>
      <pubDate>Wed, 01 Feb 2017 03:44:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Listsftp-taking-a-long-time/m-p/125727#M88471</guid>
      <dc:creator>MattWho</dc:creator>
      <dc:date>2017-02-01T03:44:49Z</dc:date>
    </item>
  </channel>
</rss>

