Support Questions


Nifi SplitText Big File

Explorer

Hello,

I am trying to split a 2 GB file with NiFi 1.3 using the SplitText processor.

[Screenshot: 28402-nifi.png] There is no error, but it is not working and I have to restart NiFi (it freezes).

When I run "service nifi status" on my server, I see the following message:

"2017-08-16 14:13:21,844 ERROR [main] org.apache.nifi.bootstrap.Command Failed to send shutdown command to port 54120 due to java.net.SocketTimeoutException: Read timed out. Will kill the NiFi Process with PID 14752."

Do you know if the file is too large?

Thank you

1 ACCEPTED SOLUTION


Re: Nifi SplitText Big File

Master Guru
@Pierre Leroy

Splitting such a large file may result in Out Of Memory (OOM) errors in NiFi. NiFi must create every split FlowFile before committing those splits to the "splits" relationship. During that process, NiFi holds the FlowFile attributes (metadata) for all of the FlowFiles being produced in heap memory.

What your image above shows is that you issued a stop on the processor. This means you have stopped the processor's scheduler from triggering again, but the processor will still allow any existing running threads to complete. The small number "2" in the upper-right corner indicates the number of threads still active on this processor. If you have run out of memory, for example, this process will probably never complete. A restart of NiFi will kill off these threads.

When splitting very large files, it is common practice to use multiple SplitText processors in series with one another. The first SplitText is configured to split the incoming file into large chunks (say, every 10,000 to 20,000 lines). The second SplitText processor then splits those chunks into the final desired size. This greatly reduces the heap memory footprint.
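The memory trade-off behind this two-stage approach can be sketched in plain Python (the function names and line counts here are illustrative, not NiFi code): only one coarse chunk's worth of split data is in flight at a time, instead of every final split at once.

```python
def split_lines(lines, size):
    """Split a list of lines into chunks of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def two_stage_split(lines, big, small):
    """First split into big chunks, then split each chunk into small ones.
    Only one big chunk is expanded at a time, mirroring how chained
    SplitText processors bound the number of splits held in heap."""
    finals = []
    for chunk in split_lines(lines, big):         # stage 1: coarse split
        finals.extend(split_lines(chunk, small))  # stage 2: fine split
    return finals

lines = [f"line {i}" for i in range(25_000)]
parts = two_stage_split(lines, big=10_000, small=1_000)
print(len(parts))  # 25 final splits of 1,000 lines each
```

The single-stage equivalent would materialize all 25 final splits from the whole input in one pass; with millions of lines, that is the metadata that exhausts NiFi's heap.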

Thanks,

Matt


5 REPLIES


Re: Nifi SplitText Big File

Super Guru

In later versions of NiFi, you may also consider using the "record-aware" processors and their associated Record Readers/Writers. These were developed to avoid this multiple-split problem, as well as the volume of provenance events generated by each split FlowFile in the flow.
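The idea behind record-oriented processing can be illustrated with a generic Python sketch (this is not NiFi's Record API, just the underlying pattern): records are streamed one at a time rather than materializing every split up front.

```python
import io

def split_all_at_once(src, n):
    """SplitText-style: build every split before emitting any.
    Memory grows with the total input size."""
    lines = src.readlines()
    return [lines[i:i + n] for i in range(0, len(lines), n)]

def stream_records(src):
    """Record-reader-style: yield one record at a time.
    Memory stays constant regardless of input size."""
    for line in src:
        yield line.rstrip("\n")

data = "\n".join(f"rec{i}" for i in range(5)) + "\n"
splits = split_all_at_once(io.StringIO(data), 2)   # 3 splits held at once
records = list(stream_records(io.StringIO(data)))  # consumed one by one
print(len(splits), records[0])
```

Because the streaming path never holds more than one record, it also produces a single provenance trail per file rather than one per split.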


Re: Nifi SplitText Big File

Explorer

Thank you for the explanation.

Using multiple SplitText processors in series does the job.


Re: Nifi SplitText Big File

New Contributor

Thanks, using multiple splits works well when combined with suitable back pressure.

Re: Nifi SplitText Big File

Explorer

I did something similar for pushing a CSV file with a few million rows to Kafka, using the same concept of multiple splits:
https://community.hortonworks.com/content/kbentry/144771/ingesting-a-big-csv-file-into-kafka-using-a...
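The chunked-ingest pattern from that article can be sketched as follows. To keep the example self-contained, a simple list-backed class stands in for a real Kafka producer, and `send` is a stand-in for the producer call a real client library would provide:

```python
import csv
import io

class FakeProducer:
    """Stand-in for a Kafka producer; a real flow would use a Kafka client."""
    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))

def ingest_csv_in_chunks(reader, producer, topic, chunk_size):
    """Read CSV rows and send them in bounded chunks, so memory stays flat
    no matter how many rows the file contains."""
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            producer.send(topic, chunk)
            chunk = []
    if chunk:  # flush the final partial chunk
        producer.send(topic, chunk)

data = io.StringIO("a,1\nb,2\nc,3\nd,4\ne,5\n")
prod = FakeProducer()
ingest_csv_in_chunks(csv.reader(data), prod, "lines", chunk_size=2)
print(len(prod.sent))  # 3 chunks: 2 + 2 + 1 rows
```

Back pressure in NiFi plays the same role as the bounded `chunk` list here: it caps how much data sits queued between processors at any moment.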
