Support Questions


Nifi SplitText Big File

Explorer

Hello,

I am trying to split a 2 GB file with NiFi 1.3 using the SplitText processor.

[Screenshot: 28402-nifi.png] There is no error, but it is not working and I have to restart NiFi (it freezes).

When I run "service nifi status" on my server, I see the following message:

"2017-08-16 14:13:21,844 ERROR [main] org.apache.nifi.bootstrap.Command Failed to send shutdown command to port 54120 due to java.net.SocketTimeoutException: Read timed out. Will kill the NiFi Process with PID 14752."

Do you know if the file is too large?

Thank you

1 ACCEPTED SOLUTION


Re: Nifi SplitText Big File

Master Guru
@Pierre Leroy

Splitting such a large file may result in Out Of Memory (OOM) errors in NiFi. NiFi must create every split FlowFile before committing those splits to the "splits" relationship. During that process, NiFi holds the FlowFile attributes (metadata) for all of the FlowFiles being produced in heap memory.

What your image above shows is that you issued a stop on the processor. This means you have stopped the processor's scheduler from triggering again, but the processor will still allow any existing running threads to complete. The small number "2" in the upper-right corner indicates the number of threads still active on this processor. If you have run out of memory, for example, this process will probably never complete. A restart of NiFi will kill off these threads.

When splitting very large files, it is common practice to use multiple SplitText processors in series with one another. The first SplitText is configured to split the incoming file into large chunks (say, every 10,000 to 20,000 lines). The second SplitText processor then splits those chunks into the final desired size. This greatly reduces the heap memory footprint.
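The memory trade-off behind this two-stage approach can be sketched in plain Python (the function names and line counts here are illustrative, not NiFi code): only one coarse chunk's worth of split data is in flight at a time, instead of every final split at once.

```python
def split_lines(lines, size):
    """Split a list of lines into chunks of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def two_stage_split(lines, big, small):
    """First split into big chunks, then split each chunk into small ones.
    Only one big chunk is expanded at a time, mirroring how chained
    SplitText processors bound the number of splits held in heap."""
    finals = []
    for chunk in split_lines(lines, big):         # stage 1: coarse split
        finals.extend(split_lines(chunk, small))  # stage 2: fine split
    return finals

lines = [f"line {i}" for i in range(25_000)]
parts = two_stage_split(lines, big=10_000, small=1_000)
print(len(parts))  # 25 final splits of 1,000 lines each
```

The single-stage equivalent would materialize all 25 final splits from the whole input in one pass; with millions of lines, that is the metadata that exhausts NiFi's heap.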

Thanks,

Matt


5 REPLIES


Re: Nifi SplitText Big File

Super Guru

In later versions of NiFi, you may also consider using the "record-aware" processors and their associated Record Readers/Writers. These were developed to avoid this multiple-split problem, as well as the volume of provenance events generated by each split FlowFile in the flow.
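The idea behind record-oriented processing can be illustrated with a generic Python sketch (this is not NiFi's Record API, just the underlying pattern): records are streamed one at a time rather than materializing every split up front.

```python
import io

def split_all_at_once(src, n):
    """SplitText-style: build every split before emitting any.
    Memory grows with the total input size."""
    lines = src.readlines()
    return [lines[i:i + n] for i in range(0, len(lines), n)]

def stream_records(src):
    """Record-reader-style: yield one record at a time.
    Memory stays constant regardless of input size."""
    for line in src:
        yield line.rstrip("\n")

data = "\n".join(f"rec{i}" for i in range(5)) + "\n"
splits = split_all_at_once(io.StringIO(data), 2)   # 3 splits held at once
records = list(stream_records(io.StringIO(data)))  # consumed one by one
print(len(splits), records[0])
```

Because the streaming path never holds more than one record, it also produces a single provenance trail per file rather than one per split.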


Re: Nifi SplitText Big File

Explorer

Thank you for the explanation.

Using multiple SplitText processors in series does the job.


Re: Nifi SplitText Big File

New Contributor

Thanks, using multiple splits works well when combined with suitable back pressure.

Re: Nifi SplitText Big File

Explorer

I did something similar for pushing a CSV file with a few million rows to Kafka, using the same concept of multiple splits:
https://community.hortonworks.com/content/kbentry/144771/ingesting-a-big-csv-file-into-kafka-using-a...
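The chunked-ingest pattern from that article can be sketched as follows. To keep the example self-contained, a simple list-backed class stands in for a real Kafka producer, and `send` is a stand-in for the producer call a real client library would provide:

```python
import csv
import io

class FakeProducer:
    """Stand-in for a Kafka producer; a real flow would use a Kafka client."""
    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))

def ingest_csv_in_chunks(reader, producer, topic, chunk_size):
    """Read CSV rows and send them in bounded chunks, so memory stays flat
    no matter how many rows the file contains."""
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            producer.send(topic, chunk)
            chunk = []
    if chunk:  # flush the final partial chunk
        producer.send(topic, chunk)

data = io.StringIO("a,1\nb,2\nc,3\nd,4\ne,5\n")
prod = FakeProducer()
ingest_csv_in_chunks(csv.reader(data), prod, "lines", chunk_size=2)
print(len(prod.sent))  # 3 chunks: 2 + 2 + 1 rows
```

Back pressure in NiFi plays the same role as the bounded `chunk` list here: it caps how much data sits queued between processors at any moment.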
