I am currently using an Oozie workflow on my cluster, but
would like to migrate to NiFi. My current workflow is as follows: Sqoop
queries a DB2 server every 15 minutes, and the result is placed in a directory
on HDFS. A Hive table points to that directory, and analysts will make queries
to that table.
I was thinking of running NiFi on the namenode of the cluster,
using the QueryDatabaseTable processor to get the data and PutHDFS to write it out.
But what will happen if the QueryDatabaseTable processor gets a huge batch that
uses up the CPU/memory/disk of the namenode? Will this result in unexpected behavior
because the datanodes won't be able to communicate with the namenode (or their
communication will be delayed), stalling or delaying the whole cluster? I have been experimenting with this setup in my sandbox, pulling a batch of 1.1 GB. Ambari Metrics tells me that this uses 50% of the CPU.
I'm aware of the new processor in the making, GenerateTableFetch.
Would a better solution be to fetch the data in small portions using GenerateTableFetch,
followed by ExecuteSQL and PutHDFS (on the namenode)?
If you are ingesting a lot of data, I would recommend running NiFi on a dedicated host, or at least on an edge node.
Also, if a single NiFi instance will ingest a lot of data, you can use GenerateTableFetch (coming in NiFi 1.0) to divide your import into several chunks and distribute them across several NiFi nodes. This processor generates several FlowFiles based on its Partition Size property, where each FlowFile contains a query that fetches one part of the data.
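To illustrate the idea behind that partitioning, here is a minimal Python sketch of how a table fetch can be split into per-partition queries by a partition size, similar in spirit to what GenerateTableFetch does. The table name, order column, and row count are illustrative assumptions, and the LIMIT/OFFSET syntax is just one pagination style (NiFi actually adapts the paging SQL to the target database):

```python
import math

def generate_fetch_queries(table, order_column, row_count, partition_size):
    """Return one SELECT statement per partition of the table.

    Each query pages through the table in partition_size chunks,
    ordered by order_column so the pages do not overlap.
    """
    num_partitions = math.ceil(row_count / partition_size)
    queries = []
    for i in range(num_partitions):
        offset = i * partition_size
        queries.append(
            f"SELECT * FROM {table} ORDER BY {order_column} "
            f"LIMIT {partition_size} OFFSET {offset}"
        )
    return queries

# Hypothetical example: 25,000 rows split into 10,000-row partitions
# yields three queries, which could each be run on a different node.
for q in generate_fetch_queries("mydb.mytable", "id", 25000, 10000):
    print(q)
```

Each generated query corresponds to one FlowFile's worth of work, so the heavy fetches can be spread across nodes instead of landing on a single host in one large batch.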