Created on 10-30-2017 08:05 AM - edited 08-17-2019 10:25 AM
Introduction
The purpose of this article is to use Apache NiFi to ingest a huge CSV file (a few million rows) into a Kafka topic.
The tricky part of dealing with a big file (millions of rows) is that splitting it row by row in NiFi in a single step generates a flow file for every row, which can produce a Java memory error. As a workaround, we limit the number of flow files generated at each step by splitting the rows in multiple stages. In this article, the file had around 7 million rows, and 6 splitting stages (100K, 10K, 1K, 100, 10, then 1 rows per split) were used to limit the number of generated flow files.
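The effect of the staging can be illustrated with a small Python sketch (this is plain arithmetic, not NiFi code). It assumes that each split step emits all of its output flow files for one incoming flow file at once; the function name `max_splits_per_invocation` is hypothetical and used only for illustration.

```python
import math

def max_splits_per_invocation(total_rows, stage_sizes):
    """For each stage, return the largest number of flow files that a
    single split step emits for one incoming flow file (assumed model)."""
    result = []
    incoming_rows = total_rows  # rows in the largest flow file entering the stage
    for size in stage_sizes:
        result.append(math.ceil(incoming_rows / size))
        # The largest flow file leaving this stage holds at most `size` rows.
        incoming_rows = min(incoming_rows, size)
    return result

TOTAL_ROWS = 7_000_000  # approximate row count of the file in this article

single_stage = max_splits_per_invocation(TOTAL_ROWS, [1])
staged = max_splits_per_invocation(TOTAL_ROWS, [100_000, 10_000, 1_000, 100, 10, 1])

print(single_stage)  # [7000000]
print(staged)        # [70, 10, 10, 10, 10, 10]
```

The total number of single-row flow files is the same in both cases. Under the assumption above, what the staging changes is how many flow files any one split step has to create at a time: at most 70 instead of 7 million.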
The other trick is to increase the “Back Pressure Data Size Threshold” of the “splits” queues to a size adequate for the file being handled. The default is 1 GB; in this article, 2 GB is used.
Assumptions & Design
A standalone NiFi instance and a standalone Kafka instance are used for this exercise.
The following NiFi flow is used to split the workload of ingesting the multi-million-row CSV file across multiple stages.
Figure 1: the NiFi flow
Figure 2: Properties for “SplitText-100000”
Figure 3: Properties for “SplitText-10000”
Figure 4: Properties for “SplitText-1000”
Figure 5: Properties for “SplitText-100”
Figure 6: Properties for “SplitText-10”
Figure 7: Properties for “SplitText-1”
Figure 8: Properties for the six “splits” queues
Results
The CSV rows were ingested properly into the Kafka topic. The only drawback of this flow is that it took almost 30 minutes to ingest a CSV file of around 7 million rows. This could be improved by using multiple NiFi instances and a clustered Kafka, which should be tested in the future.
Future Work
I’ll try the same exercise with a bigger file, using bigger NiFi and Kafka clusters with higher hardware specs, to validate the same conclusion.
Nice article! You could also use the "Message Demarcator" property in PublishKafka (set to a new-line). This way you never have to split up your flow file: it will stream the large flow file and read based on the demarcator, so you still get each line sent as an individual message to Kafka.
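The idea in this comment can be sketched in Python, outside NiFi: read the file as a stream and emit one message per demarcated record, so no intermediate per-row files are created. This is only an illustration of the concept, not how PublishKafka is implemented; `publish_by_demarcator` and the `send` callback (a stand-in for a Kafka producer call) are hypothetical names.

```python
import io

def publish_by_demarcator(stream, send, demarcator="\n", chunk_size=8192):
    """Read `stream` in chunks and call `send` once per demarcated record.
    Returns the number of records sent."""
    buffer = ""
    count = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last demarcator is complete; keep the rest.
        *records, buffer = buffer.split(demarcator)
        for record in records:
            send(record)
            count += 1
    if buffer:  # trailing record without a final demarcator
        send(buffer)
        count += 1
    return count

# Usage: a tiny in-memory "file" and a list standing in for the Kafka topic.
sent = []
rows = io.StringIO("id,name\n1,alice\n2,bob\n")
n = publish_by_demarcator(rows, sent.append, chunk_size=5)
print(n, sent)  # 3 ['id,name', '1,alice', '2,bob']
```

Only one chunk plus one partial record is held in memory at a time, which is why this approach avoids the per-row flow file overhead discussed in the article.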
A couple things. 1. I have no idea what Kafka is. 2. 30 minutes for 7 million records is great, as my flow is 40 minutes for a meager 70k records.
In regards to the above multi SplitText usage my question is regarding the Settings tab for the SplitText. How should it be set?
My flow still does not execute PutFile until everything has gone through. I currently have 2 SplitTexts and am about to put in a 3rd to see if that helps, but the processing of the data is just slow.