Nifi SplitText Big File

Explorer

Hello,

I am trying to split a 2 GB file with the SplitText processor in NiFi 1.3.

[Image: 28402-nifi.png]

I get no error, but it is not working and I have to restart NiFi (it freezes).

When I run "service nifi status" on my server, I get the following message:

"2017-08-16 14:13:21,844 ERROR [main] org.apache.nifi.bootstrap.Command Failed to send shutdown command to port 54120 due to java.net.SocketTimeoutException: Read timed out. Will kill the NiFi Process with PID 14752."

Do you know if the file is too large?

Thank you

1 ACCEPTED SOLUTION

Master Mentor
@Pierre Leroy

Splitting such a large file may result in Out Of Memory (OOM) errors in NiFi. NiFi must create every split FlowFile before committing those splits to the "splits" relationship. During that process, NiFi holds the FlowFile attributes (metadata) of all the FlowFiles being produced in heap memory.

Your image above shows that you issued a stop on the processor. This means you have stopped the processor's scheduler from triggering again. The processor will still allow any threads that are already running to complete. The small number "2" in the upper right corner indicates the number of threads still active on this processor. If NiFi has run out of memory, for example, these threads will probably never complete. A restart of NiFi will kill off these threads.

When splitting very large files, it is common practice to use multiple SplitText processors in series. The first SplitText is configured to split the incoming file into large chunks (say, every 10,000 to 20,000 lines). The second SplitText processor then splits those chunks into the final desired size. This greatly reduces the heap memory footprint.
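
To make the saving concrete, here is a rough back-of-the-envelope model in Python. It is not NiFi code. It only counts how many split FlowFiles a single SplitText invocation would have to create before it can commit, assuming one incoming FlowFile per invocation and ignoring concurrent tasks and anything already queued downstream. The function names and the 20,000,000-line figure are made up for illustration.

```python
import math


def peak_splits_single(total_lines: int, lines_per_split: int) -> int:
    """Splits created in one go when a single SplitText does all the work."""
    return math.ceil(total_lines / lines_per_split)


def peak_splits_two_stage(total_lines: int, chunk_lines: int, lines_per_split: int) -> int:
    """Largest number of splits any one invocation creates in a two-stage flow."""
    # Stage 1: the whole file is cut into large chunks.
    first_stage = math.ceil(total_lines / chunk_lines)
    # Stage 2: each chunk is cut into the final size, one chunk per invocation.
    second_stage = math.ceil(min(chunk_lines, total_lines) / lines_per_split)
    return max(first_stage, second_stage)


total = 20_000_000  # hypothetical line count for a 2 GB file

print(peak_splits_single(total, 1))             # 20000000
print(peak_splits_two_stage(total, 10_000, 1))  # 10000
```

Under these assumptions, the single processor has to track millions of FlowFiles at once, while in the two-stage flow no invocation creates more than about 10,000.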

Thanks,

Matt


5 REPLIES

Master Guru

In later versions of NiFi, you may also consider using the "record-aware" processors and their associated Record Readers/Writers. These were developed to avoid this multiple-split problem, as well as the volume of provenance data generated by each split FlowFile in the flow.
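
As a loose analogy only (this is plain Python, not NiFi's record API), the idea can be sketched as streaming records from a reader to a writer inside one file, rather than creating one object per line. The function name `transform_records` and the sample data are hypothetical.

```python
import csv
import io


def transform_records(src, dst, transform):
    """Stream CSV records from src to dst one at a time; return the record count."""
    reader = csv.DictReader(src)
    writer = None
    count = 0
    for row in reader:
        out = transform(row)
        if writer is None:
            # Create the writer lazily so the header matches the transformed record.
            writer = csv.DictWriter(dst, fieldnames=list(out.keys()))
            writer.writeheader()
        writer.writerow(out)
        count += 1
    return count


# Small in-memory example; a real input would be an open file handle.
src = io.StringIO("id,name\n1,ada\n2,grace\n")
dst = io.StringIO()
n = transform_records(src, dst, lambda r: {**r, "name": r["name"].upper()})
print(n)
```

Only one record is held in memory at a time here, and the output stays a single file, which is roughly why a record-oriented flow avoids the per-split overhead described above.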

Explorer

Thank you for the explanation.

Using multiple SplitText processors in series does the job.

New Contributor

Thanks. Using multiple splits works well, together with suitable back pressure.

Contributor

I did something similar, pushing data to Kafka from a CSV file with a few million rows, using the same concept of multiple splits:
https://community.hortonworks.com/content/kbentry/144771/ingesting-a-big-csv-file-into-kafka-using-a...