Created 01-11-2017 11:13 PM
There is a need to load 3 terabytes of historical Unix files into HDFS. I am using the ListSFTP, FetchSFTP, UpdateAttribute, and PutHDFS processors for this. There are 16 directories, each with 3 subdirectories, each of which has 350 subdirectories. I have set Search Recursively to true in ListSFTP. The dataflow works for a smaller dataset when I point to a specific directory/subdirectory/subdirectory, but when I try to run it against the whole parent directory, the ListSFTP processor produces no output. This is a one-time historical load. Is there a way I could process only one directory/subdirectory/subdirectory at a time? Has anyone come across this issue? Thank you for your help.
Created 01-11-2017 11:16 PM
Do you get an error? Anything in the error logs? You may need to share more error details.
Created 01-12-2017 03:14 PM
It seems to me that it gets stuck in the first processor for a long time, because I don't see any data being pushed over to the next processor, FetchSFTP; but I don't see any errors.
Created 01-12-2017 07:34 PM
Hi Timothy, this is the error I get:
ERROR [Timer-Driven Process Thread-2] o.a.nifi.processors.standard.ListSFTP java.lang.OutOfMemoryError: Java heap space
Created 01-31-2017 07:44 PM
NiFi FlowFile attributes/metadata live in heap. The list-based processors return a complete listing from the target and then create a FlowFile for each file in that listing. The FlowFiles being created are not committed to the list processor's success relationship until all of them have been created, so with a listing of this size you run out of NiFi JVM heap memory before that can happen.
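The directory counts from the question give a sense of the scale: 16 x 3 x 350 = 16,800 leaf directories. The per-directory file counts aren't stated, but even at an assumed 100 files per directory that is roughly 1.7 million FlowFiles whose attributes must all be held in heap at once before the listing commits.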
As NiFi stands now, the only option is to use multiple list processors, each producing a listing of a subset of the total files on your source system. You could use the "Remote Path", "Path Filter Regex", and/or "File Filter Regex" properties in ListSFTP to filter which data is selected and reduce heap usage; a sketch of one such configuration follows.
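For example, one ListSFTP instance per top-level directory could be configured roughly like this (the path and regex values below are made-up placeholders, not from your environment; adjust them to your actual layout):

Remote Path: /data/historical/dir01
Search Recursively: true
Path Filter Regex: subdir1(/.*)?
File Filter Regex: .*

Running these processors one at a time, or scheduling them well apart, keeps only one listing's FlowFiles in heap at any given moment.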
You could also increase the heap available to your NiFi JVM in the bootstrap.conf file; however, given the number of FlowFiles you are listing, I find it likely you would still run out of heap memory.
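For reference, the heap settings live in conf/bootstrap.conf and look something like the following (the 8g value is only an illustration, not a recommendation for your environment):

# JVM heap settings in conf/bootstrap.conf (defaults are 512m)
java.arg.2=-Xms8g
java.arg.3=-Xmx8g

NiFi must be restarted for bootstrap.conf changes to take effect.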
I logged a Jira in Apache NiFi suggesting a change to how these list-type processors produce FlowFiles from the returned listing:
https://issues.apache.org/jira/browse/NIFI-3423
Thanks,
Matt