
Using NiFi to query an RDBMS

New Contributor

Hello

I am currently using an Oozie workflow on my cluster, but would like to migrate to NiFi. My current workflow is as follows: Sqoop queries a DB2 server every 15 minutes, and the result is placed in a directory on HDFS. A Hive table points to that directory, and analysts run queries against that table.

I was thinking of running NiFi on the namenode of the cluster, using the QueryDatabaseTable processor to get the data and PutHDFS to store it. But what will happen if the QueryDatabaseTable processor gets a huge batch that uses up the CPU/memory/disk of the namenode? Will this result in unexpected behavior, because the datanodes won't be able to communicate with the namenode (or have their communication delayed), which would stall the whole cluster? I have been experimenting with this setup in my sandbox, pulling a batch of 1.1 GB; Ambari Metrics tells me that this uses 50% of the CPU.

I’m aware of a new processor in the making: GenerateTableFetch. Would a good solution be to fetch the data in small portions using GenerateTableFetch, then ExecuteSQL and PutHDFS (on the namenode)?

1 ACCEPTED SOLUTION

Master Guru

Hi @Andread B,

Why do you want to run NiFi on the NameNode?

If you are ingesting a lot of data, I would recommend running NiFi on a dedicated host, or at least on an edge node.

Also, if you will be ingesting more data than a single NiFi instance can handle, you can use GenerateTableFetch (coming in NiFi 1.0) to divide your import into several chunks and distribute them across several NiFi nodes. This processor generates multiple FlowFiles based on the Partition Size property, where each FlowFile contains a query that fetches one part of the data; a rough sketch of the idea follows below.
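To make that concrete, here is a minimal Python sketch of the partitioning scheme. This is an illustration of the concept, not NiFi's implementation: the table name, column name, row count, and LIMIT/OFFSET paging style are all assumptions (the exact SQL that GenerateTableFetch emits depends on the database dialect).

# Minimal sketch (not NiFi source code) of the partitioning idea behind
# GenerateTableFetch: split a table read into fixed-size pages, with one
# SQL statement per page. "mytable" and "id" are hypothetical placeholders.

def generate_table_fetch(table, order_col, row_count, partition_size):
    """Yield one paged SELECT per partition of the table."""
    for offset in range(0, row_count, partition_size):
        yield (f"SELECT * FROM {table} ORDER BY {order_col} "
               f"LIMIT {partition_size} OFFSET {offset}")

# Example: a 1,000,000-row table split into ten 100,000-row fetches.
# In a NiFi flow, each generated statement would become one FlowFile,
# which can be load-balanced to a node and executed by ExecuteSQL
# before PutHDFS writes the result.
for query in generate_table_fetch("mytable", "id", 1_000_000, 100_000):
    print(query)

The point is that no single node ever runs the full 1.1 GB query: each FlowFile carries one small query, so the work can be spread across the cluster.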

You can try this by downloading the NiFi 1.0 Beta: https://nifi.apache.org/download.html