Created on 08-16-2017 12:47 PM - edited 08-17-2019 07:14 PM
Hello,
I am trying to split a 2 GB file with the SplitText processor in NiFi 1.3.
I get no error, but it does not work: NiFi freezes and I have to restart it.
When I run "service nifi status" on my server, I get the following message:
"2017-08-16 14:13:21,844 ERROR [main] org.apache.nifi.bootstrap.Command Failed to send shutdown command to port 54120 due to java.net.SocketTimeoutException: Read timed out. Will kill the NiFi Process with PID 14752."
Do you know whether the file is too large?
Thank you
Created 08-16-2017 01:08 PM
Splitting such a large file may result in Out Of Memory (OOM) errors in NiFi. NiFi must create every split FlowFile before committing those splits to the "splits" relationship. During that process, NiFi holds the FlowFile attributes (metadata) of all the FlowFiles being produced in heap memory.
What your image above shows is that you issued a stop on the processor. This means you have stopped the processor's scheduler from triggering again. The processor will still allow any threads that are already running to complete. The small number "2" in the upper right corner indicates the number of threads still active on this processor. If you have run out of memory, for example, these threads will probably never complete. A restart of NiFi will kill off these threads.
When splitting very large files, it is common practice to use multiple SplitText processors in series. The first SplitText is configured to split the incoming file into large chunks (say, every 10,000 to 20,000 lines). The second SplitText processor then splits those chunks into the final desired size. This greatly reduces the heap memory footprint, because no single processor run has to hold the attributes of every final split at once.
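To make the difference concrete, here is a rough back-of-the-envelope sketch in Python. It only counts how many split FlowFiles one processor run would have to create before committing, which is what drives the heap usage described above. The function name and the 20 million line count are assumptions for illustration; the thread only states that the file is 2 GB, and this is not NiFi code.

```python
import math

def flowfiles_per_run(total_lines, first_chunk_lines, final_lines):
    """Count the split FlowFiles created in one run of each processor.

    Returns a tuple (single_stage, stage_one, stage_two):
      single_stage: one SplitText going straight to the final size
      stage_one:    first SplitText producing large chunks
      stage_two:    second SplitText, per incoming chunk
    """
    single_stage = math.ceil(total_lines / final_lines)
    stage_one = math.ceil(total_lines / first_chunk_lines)
    stage_two = math.ceil(first_chunk_lines / final_lines)
    return single_stage, stage_one, stage_two

# Assumed numbers, for illustration only: 20 million lines in the file,
# 10,000-line chunks in the first stage, 1 line per final split.
single, one, two = flowfiles_per_run(20_000_000, 10_000, 1)
print(single, one, two)  # 20000000 2000 10000
```

With these assumed numbers, a single SplitText would have to track 20,000,000 FlowFiles before committing, while the two-stage flow never tracks more than 10,000 per run.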
Thanks,
Matt
Created 10-30-2017 06:23 PM
In later versions of NiFi, you may also consider using the "record-aware" processors and their associated Record Readers/Writers. These were developed to avoid this multiple-split problem, as well as the volume of provenance data generated by each split FlowFile in the flow.
Created 08-16-2017 01:24 PM
Thank you for the explanation,
Using multiple SplitText processors in series does the job.
Created 09-07-2017 11:37 AM
Thanks, using multiple splits works well, especially combined with suitable back pressure.
Created 10-30-2017 11:29 AM
I did something similar to push a CSV file with a few million rows to Kafka, using the same concept of multiple splits:
https://community.hortonworks.com/content/kbentry/144771/ingesting-a-big-csv-file-into-kafka-using-a...