We have huge files (about 400 MB each) arriving on an FTP server at 5-minute intervals. I would like to process the files and check each event for anomalies. What would be the best toolset for this? NiFi and Flink are some of the tools I am using.
I tried splitting the file, but it takes more than 5 minutes to split. Is there a faster way?
So are the GetFile and SplitText processors taking 5 minutes?
What is the usage pattern of the data after the file is split?
Also, it took me about a second to split a 400 MB file into 5 files (split by line count).
Hi @Adrian Oprea, actually I am splitting the file into 100,000-line chunks, then 10,000, then 1,000, and so on, since the number of lines per flowfile needs to be small before it is sent to Kafka. A single file takes 5 minutes or more. What is the most appropriate way to convert the file into a stream?
Your bottleneck is the FlowFile repository configuration parameters.
Things to look for:
1 - Are your repositories sharing the same disk volume?
I give each major repository its own disk so they don't fight for I/O, and none of them sit on the same disk as the NiFi install.
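For reference, those locations live in nifi.properties. A minimal sketch assuming three separate mount points (the paths are illustrative; the property names are the stock ones):

```
# nifi.properties - point each major repository at its own volume
nifi.flowfile.repository.directory=/mnt/disk1/flowfile_repository
nifi.content.repository.directory.default=/mnt/disk2/content_repository
nifi.provenance.repository.directory.default=/mnt/disk3/provenance_repository
```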
2 - What is the value of the nifi.queue.swap.threshold parameter? The default is 20000; beyond this, NiFi will swap queued flowfiles to disk even with RAM available.
E.g., if your max queue could reach 4,000,000 lines, make sure this is set accordingly in nifi.properties (see the excerpt after the note below).
Note: Your JVM memory settings in bootstrap.conf have to comply with the volume allowed by nifi.queue.swap.threshold, so if you have multiple data flows with 400 MB each plus whatever else, you need to sum all of it and make sure it fits in your JVM heap.
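A sketch of both settings together (the 4000000 threshold and 4 GB heap are example values sized for the scenario above, not general recommendations):

```
# nifi.properties - raise the swap threshold above the largest queue you expect
nifi.queue.swap.threshold=4000000

# bootstrap.conf - size the JVM heap to hold that flowfile volume
java.arg.2=-Xms4g
java.arg.3=-Xmx4g
```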
3 - What is the Concurrent Tasks value on the SplitText by 10? Make sure this is about 25% of the SplitText by 1 value, e.g. 2.
4 - What is the Concurrent Tasks value on the SplitText by 1? E.g. 8.
I run a similar data flow and can do 5,000,000 rows in 40 seconds. This does not include the Kafka push.
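As for converting the file into a stream: if you want a baseline outside NiFi, a minimal Python sketch (assuming the kafka-python client; the broker address, topic name, and file path are placeholders) would skip the cascade of splits and publish the file line by line:

```python
from kafka import KafkaProducer  # assumes the kafka-python package

BROKER = "localhost:9092"   # placeholder broker address
TOPIC = "events"            # placeholder topic name
FLUSH_EVERY = 10000         # lines between producer flushes

producer = KafkaProducer(bootstrap_servers=BROKER)

def stream_file(path):
    """Publish a large file to Kafka one line at a time.

    The file is read lazily, so memory use stays flat no matter how
    big the file is, and no intermediate split files are created.
    """
    with open(path, "rb") as f:
        for count, line in enumerate(f, start=1):
            producer.send(TOPIC, line.rstrip(b"\r\n"))
            if count % FLUSH_EVERY == 0:
                producer.flush()  # bound the in-flight buffer
    producer.flush()              # push out the final partial batch

stream_file("/data/incoming/big_file.csv")  # hypothetical path
```

Reading lazily and flushing periodically is the same idea the split cascade approximates: small units in flight, never the whole 400 MB in memory at once.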
I hope this helps.