
ListSFTP taking a long time

Contributor

There is a need to load 3 terabytes of historical Unix files into HDFS. I am using the ListSFTP, FetchSFTP, UpdateAttribute, and PutHDFS processors for this. There are 16 directories, each with 3 subdirectories, and each of those has 350 subdirectories. I have set Search Recursively to true in ListSFTP. The dataflow works for a smaller dataset when I point to a specific directory/subdirectory/subdirectory, but when I try the whole parent directory, the ListSFTP processor doesn't perform. This is a one-time historical load. Is there a way I could process only one directory/subdirectory/subdirectory at a time? Has anyone come across this issue? Thank you for your help.


4 REPLIES

Master Guru

Do you get an error? Error logs?

You may need to share more of the error output.

Contributor

It seems to me that it gets stuck in the first processor itself for a long time, because I don't see any data being pushed over to the next processor, FetchSFTP; but I don't see any errors.

Contributor

Hi Timothy, this is the error I get:

ERROR [Timer-Driven Process Thread-2] o.a.nifi.processors.standard.ListSFTP java.lang.OutOfMemoryError: Java heap space

Super Mentor (Accepted Solution)

@bhumi limbu

NiFi FlowFile attributes/metadata live in heap. The list-based processors retrieve a complete listing from the target and then create a FlowFile for each file in that listing. None of those FlowFiles are committed to the list processor's success relationship until all of them have been created, so with a listing this large you run out of NiFi JVM heap memory before that can happen.
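To put rough numbers on it: your tree has 16 × 3 × 350 = 16,800 leaf directories. If each one held even 100 files (a made-up figure, just for scale), a single recursive listing would be trying to hold roughly 1.7 million FlowFiles' worth of attributes in heap before committing any of them.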

As NiFi stands now, the only option is to use multiple list processors, each producing a listing of a subset of the total files on your source system. You can use the "Remote Path", "Path Filter Regex", and/or "File Filter Regex" properties in ListSFTP to narrow what each processor lists, which keeps the per-listing heap usage down; see the sketch below.
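For illustration, one ListSFTP instance could be scoped to a single directory/subdirectory branch with settings along these lines (the /data/historical path and the regex patterns are hypothetical placeholders, not values from this thread):

    Remote Path:        /data/historical/dir01
    Search Recursively: true
    Path Filter Regex:  sub01(/.*)?
    File Filter Regex:  .*

You would then repeat this per branch, changing only Remote Path and the filter regexes, and run one branch (or a few) at a time so that no single listing exceeds what the heap can hold.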

You could also increase the heap available to NiFi's JVM in the bootstrap.conf file; however, given the number of FlowFiles your listing would generate, you will most likely still run out of heap memory.
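If you do want to raise the heap, the JVM memory arguments live in NiFi's conf/bootstrap.conf; the 8 GB figure below is only an example value, not a recommendation from this thread:

    # JVM memory settings (NiFi ships with 512 MB defaults)
    java.arg.2=-Xms8g
    java.arg.3=-Xmx8g

NiFi must be restarted for bootstrap.conf changes to take effect.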

I logged a Jira in Apache NiFi suggesting a change to how these list-based processors produce FlowFiles from the returned listing:

https://issues.apache.org/jira/browse/NIFI-3423

Thanks,

Matt