Support Questions
Find answers, ask questions, and share your expertise

How to process huge files from FTP arriving at high velocity?

Contributor

We have huge files (~400 MB each) arriving on an FTP server at 5-minute intervals. I would like to process the files and check each event for anomalies. What would be the best toolset for this? NiFi and Flink are some of the tools I am using.

I tried splitting the file, but it takes more than 5 minutes to split. Is there a faster way?

3 Replies

Re: How to process huge files from FTP arriving at high velocity?

Contributor

Hi,

So are the GetFile and SplitText processors taking 5 minutes?

What is the usage pattern of the data after the file is split?

Also, it took me about a second to split a 400 MB file into 5 files (split by line count).

Re: How to process huge files from FTP arriving at high velocity?

Contributor

Hi @Adrian Oprea, actually I am splitting the file in stages: into 100,000-line chunks, then into 10,000, then into 1,000, and so on, since the number of lines needs to be small before sending to Kafka. A single file takes 5 minutes or more. What is the most appropriate way to convert the file into a stream?
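As an alternative to multi-stage splitting on disk, the file can be read line by line and sent to Kafka in small batches directly. A minimal sketch (the topic name, file path, and broker address are hypothetical; the producer part assumes the kafka-python package and a running broker, so it is left commented):

```python
# Sketch: stream a large file to Kafka in batches instead of splitting it
# on disk first. Reading line by line keeps memory flat regardless of
# file size, so there is no 400 MB split step at all.

def batched_lines(path, batch_size=1000):
    """Yield lists of at most batch_size lines without loading the whole file."""
    batch = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # flush the final partial batch
        yield batch

# Example producer loop (hypothetical names; requires kafka-python and a broker):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# for batch in batched_lines("/data/ftp/incoming.log"):
#     for line in batch:
#         producer.send("events", line.encode("utf-8"))
# producer.flush()
```

This avoids writing thousands of intermediate split files; the batch size plays the same role as the final SplitText line count.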


Re: How to process huge files from FTP arriving at high velocity?

Contributor

Hi,

Your bottleneck is the FlowFile repository configuration.

Things to look for:

1 - Are your repositories sharing the same disk volume?

I give each major repository its own disk so they don't fight over I/O, and none of them are on the same disk as the NiFi install.
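For reference, the repository locations are set in nifi.properties. A sketch of a split-disk layout (the mount points are examples; adjust them to your own volumes):

```properties
# nifi.properties -- put each repository on its own volume
nifi.flowfile.repository.directory=/disk1/nifi/flowfile_repository
nifi.content.repository.directory.default=/disk2/nifi/content_repository
nifi.provenance.repository.directory.default=/disk3/nifi/provenance_repository
```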

2 - What is the value of the nifi.queue.swap.threshold parameter? The default is 20000; beyond this, NiFi will swap flowfiles to disk even with RAM available.

E.g., if your max queue would be 4,000,000 lines, make sure this is set accordingly in nifi.properties.

Note: your JVM memory settings in bootstrap.conf have to comply with the volume allowed by nifi.queue.swap.threshold. So if you have multiple data flows with 400 MB plus whatever else, sum it all up so it fits in your JVM heap.
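The two settings above look like this (values are illustrative only; the right numbers depend on your queue depth and flowfile size):

```properties
# nifi.properties -- raise the swap threshold if queues legitimately
# hold more flowfiles than the 20000 default
nifi.queue.swap.threshold=50000

# bootstrap.conf -- heap must be large enough to hold what the
# threshold allows to stay in memory
java.arg.2=-Xms4g
java.arg.3=-Xmx8g
```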

3 - What is the Concurrent Tasks value on the SplitText by 10? (Make sure this is 25% of the Split-by-1 value.) E.g., 2.

4 - What is the Concurrent Tasks value on the SplitText by 1? E.g., 8.

I run a similar data flow and can do 5,000,000 rows in 40 seconds. This does not include the Kafka push.

I hope this helps.
