Community Articles

msabri · 10-30-2017

Introduction

The purpose of this article is to use Apache NiFi to ingest a large CSV file (a few million rows) into a Kafka topic.

The tricky part of handling a big file (millions of rows) is that splitting it row by row in a single NiFi step generates one flow file per row all at once, which can cause a Java memory error. As a workaround, we limit the number of flow files generated at any one step by splitting the rows in multiple stages. In this article, the file had around 7 million rows, and 6 splitting stages (100K, 10K, 1K, 100, 10, then 1 rows per split) were used to limit the number of flow files generated at each stage.
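The arithmetic behind the staged split can be sketched in a few lines of Python. This is only an illustration of the counts, not NiFi code: the function name `stage_fanout` is made up for this example, and it assumes each stage size evenly divides the previous one, as the sizes in this article do.

```python
import math

def stage_fanout(total_rows, stage_sizes):
    """For each stage, return (rows per split,
    splits produced from ONE incoming flow file,
    total flow files leaving the stage).

    Assumes each stage size divides the previous one evenly.
    """
    report = []
    rows_per_incoming = total_rows  # stage 1 receives the whole file
    for size in stage_sizes:
        per_incoming = math.ceil(rows_per_incoming / size)
        total_outgoing = math.ceil(total_rows / size)
        report.append((size, per_incoming, total_outgoing))
        # Each outgoing flow file holds at most `size` rows.
        rows_per_incoming = min(size, rows_per_incoming)
    return report

if __name__ == "__main__":
    stages = [100_000, 10_000, 1_000, 100, 10, 1]
    for size, per_incoming, total in stage_fanout(7_000_000, stages):
        print(f"split by {size:>7}: {per_incoming:>3} per incoming file, {total:>9} in total")
```

With these numbers, no single split step produces more than 70 flow files from one incoming flow file, whereas a single split straight to 1 row would produce all 7,000,000 from one incoming file.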

The other trick is to increase the “Back Pressure Data Size Threshold” of the “splits” queues to a size adequate for the file being handled. The default is 1 GB; in this article, 2 GB is used.

Assumptions & Design

A standalone NiFi instance and a standalone Kafka instance are used for this exercise.

The following NiFi flow spreads the workload of ingesting the multi-million-row CSV file across multiple splitting stages.

Figure 1: the NiFi flow

Figure 2: Properties for “SplitText-100000”

Figure 3: Properties for “SplitText-10000”

Figure 4: Properties for “SplitText-1000”

Figure 5: Properties for “SplitText-100”

Figure 6: Properties for “SplitText-10”

Figure 7: Properties for “SplitText-1”

Figure 8: Properties for the six “splits” queues
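Since the figures are not reproduced here, the behaviour of the chained processors can be sketched in plain Python. This is a toy model under stated assumptions, not NiFi code: it assumes each SplitText processor's "Line Split Count" matches the number in its name, and the helpers `split_lines` and `cascade` are hypothetical names used only for this illustration.

```python
from typing import Iterator, List

def split_lines(flow_file: List[str], line_split_count: int) -> Iterator[List[str]]:
    """Model one SplitText step: cut a flow file (a list of rows)
    into chunks of at most `line_split_count` rows."""
    for start in range(0, len(flow_file), line_split_count):
        yield flow_file[start:start + line_split_count]

def cascade(flow_file: List[str], stage_sizes: List[int]) -> List[List[str]]:
    """Feed the output of each split stage into the next one."""
    current = [flow_file]
    for size in stage_sizes:
        current = [chunk for ff in current for chunk in split_lines(ff, size)]
    return current

if __name__ == "__main__":
    rows = [f"row-{i}" for i in range(2345)]      # small stand-in for the CSV
    singles = cascade(rows, [1000, 100, 10, 1])   # shortened stage list
    print(len(singles), "single-row flow files")
```

The end result is the same set of single-row flow files, in the same order, that a direct split by 1 would give; the difference is only in how many flow files each individual step has to create at once.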

Results

The CSV rows were ingested properly into the Kafka topic. The only drawback of this flow is that it took almost 30 minutes to ingest a CSV file of around 7 million rows. This could probably be improved by using multiple NiFi instances and a clustered Kafka, which should be tested in the future.

Future Work

I’ll try the same exercise with a bigger file, using bigger NiFi and Kafka clusters with higher hardware specs, to validate the same conclusion.

References:

https://community.hortonworks.com/questions/122858/nifi-splittext-big-file.html

https://kafka.apache.org/documentation/

bbende · 10-30-2017

Nice article! You could also use the "Message Demarcator" property in PublishKafka (set to a newline). That way you never have to split up your flow file: PublishKafka streams the large flow file and reads it based on the demarcator, so each line is still sent to Kafka as an individual message.
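The idea of reading by demarcator can be illustrated with a small Python sketch. It is not how PublishKafka is implemented; it only shows a large input being streamed in chunks and emitted one record per demarcator. The function `publish_demarcated` and the `send` callback are hypothetical stand-ins for a real Kafka producer call, and skipping empty records is a choice made for this sketch.

```python
import io
from typing import BinaryIO, Callable

def publish_demarcated(stream: BinaryIO,
                       send: Callable[[bytes], None],
                       demarcator: bytes = b"\n",
                       chunk_size: int = 64 * 1024) -> int:
    """Read `stream` in chunks and call `send` once per demarcated record.

    Only one chunk plus one partial record is held in memory at a time.
    Returns the number of records sent.
    """
    sent = 0
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last demarcator is complete;
        # the remainder may be a partial record.
        *complete, buffer = buffer.split(demarcator)
        for record in complete:
            if record:  # skip empty lines (sketch-level choice)
                send(record)
                sent += 1
    if buffer:  # final record without a trailing demarcator
        send(buffer)
        sent += 1
    return sent

if __name__ == "__main__":
    messages = []
    count = publish_demarcated(io.BytesIO(b"a,1\nb,2\nc,3"), messages.append)
    print(count, messages)
```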

edjm1971 · 11-06-2018

A couple of things. 1. I have no idea what Kafka is. 2. 30 minutes for 7 million records is great, as my flow is 40 minutes for a meager 70k records.

Regarding the multi-stage SplitText usage above, my question is about the Settings tab for SplitText. How should it be set?

My flow still does not execute PutFile until everything has gone through. I currently have 2 SplitTexts and am about to add a 3rd to see if that helps, but the data processing is just slow.

The overall flow I have is:

Source -> SplitText (5000) -> SplitText (250) -> Processing -> PutFile

Any tips greatly appreciated.

Ingesting a Big CSV file into Kafka using a multi-stages SplitText NiFi Processor

Apache Kafka

Apache NiFi