
Nifi as complete data flow, start to end? Is it possible?

Explorer

I am pretty new to NiFi. The solution we have designed uses NiFi for data ingestion and Oozie for scheduling. Using Oozie, Hive tables are loaded and merged, and a Hive-to-Hive schema copy is performed.

1) Is it possible/recommended to achieve this completely using NiFi?

2) What areas need to be taken care of with this approach?

3) The flow, once developed in NiFi, ends up with many common processors (UpdateAttribute, PutEmail, etc.). Is there any way to set these common processors up as a separate template and reuse them wherever necessary?

Note: The data flow is scheduled to run once a day.

1 ACCEPTED SOLUTION

Master Guru

Yes, you can do this, but first you will need to install the Oozie client on the NiFi nodes. This will become easier once there is one Ambari managing both HDF and HDP. However, I would recommend using NiFi to ingest and stream data into Hive tables (using the Hive streaming processor), or to just use PutHiveQL. Why? The operational capabilities in NiFi (back pressure, data lineage, event replay, performance stats) are things you simply don't get with Oozie. Lastly, you can reuse this common processing logic, or isolate it, using a NiFi process group.
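
For the PutHiveQL route, here is one minimal sketch (not from this thread): an ExecuteScript (Jython) step that writes a HiveQL statement into the FlowFile content, which a downstream PutHiveQL then executes. The table names and the statement itself are hypothetical placeholders.

```
# ExecuteScript (Jython) body. NiFi binds 'session' and the
# REL_SUCCESS relationship for us.
from org.apache.nifi.processor.io import OutputStreamCallback

class WriteHql(OutputStreamCallback):
    # Writes the given HiveQL statement into the FlowFile content.
    def __init__(self, hql):
        self.hql = hql
    def process(self, outputStream):
        outputStream.write(bytearray(self.hql.encode('utf-8')))

# Hypothetical daily load from a staging table into a refined table.
hql = ("INSERT INTO refined.events "
       "SELECT * FROM staging.events WHERE load_date = CURRENT_DATE")

flowFile = session.get()
if flowFile is None:
    flowFile = session.create()

flowFile = session.write(flowFile, WriteHql(hql))
session.transfer(flowFile, REL_SUCCESS)
```

PutHiveQL executes the content of each incoming FlowFile as a HiveQL statement, so whatever builds that content (ExecuteScript here, or GenerateFlowFile/ReplaceText) carries the load logic.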


4 REPLIES


Explorer

@Sunile Manjee

Thanks for the recommendations. I am planning to use NiFi for data ingestion alone and Oozie for the rest of the flow. We are building a system similar to ETL (staging and refine layers), so multiple Hive tables will be created and merged, and the resulting data will be sent to external systems. The plan is NiFi for data ingestion and Oozie for workflow scheduling. Coming to the common processing logic, correct me if I am wrong: PutEmail would sit in one separate process group, and the entire flow would link to that group whenever a failure occurs.

Expert Contributor

This might be something you are interested in. As Sunile pointed out, you might use NiFi to get data loaded into your Hadoop cluster, then use the NiFi ExecuteScript processor, or create a custom NiFi processor, to launch your Oozie workflow job. Think of it this way: use NiFi to get data from sources outside your Hadoop cluster, then use Falcon processes or Oozie workflow jobs to handle work scheduling inside the cluster.
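
A minimal sketch of that idea, assuming the Oozie client is installed on the NiFi node: an ExecuteScript (Jython) body that shells out to the oozie CLI. The Oozie URL and the /opt/oozie/job.properties path are hypothetical placeholders.

```
# ExecuteScript (Jython) body: launch an Oozie workflow via the oozie CLI.
# 'session', 'log', REL_SUCCESS and REL_FAILURE are bound by NiFi.
import subprocess

flowFile = session.get()
if flowFile is not None:
    cmd = ['oozie', 'job',
           '-oozie', 'http://oozie-host:11000/oozie',   # hypothetical URL
           '-config', '/opt/oozie/job.properties',      # hypothetical path
           '-run']
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode == 0:
        # Keep the CLI output (which includes the job id) as an attribute.
        flowFile = session.putAttribute(flowFile, 'oozie.job.output',
                                        out.strip())
        session.transfer(flowFile, REL_SUCCESS)
    else:
        log.error('Oozie submission failed: ' + err)
        session.transfer(flowFile, REL_FAILURE)
```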

Explorer

@dsun Thanks. ExecuteStreamCommand is the processor we are going to use. NiFi's role will extend up to creating a data mart from all the sources, and Oozie will take care of the rest of the flow.
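
For reference, ExecuteStreamCommand could hand off to the same oozie CLI with a configuration along these lines (command path, URL, and properties path again hypothetical):

```
Command Path:       oozie
Command Arguments:  job;-oozie;http://oozie-host:11000/oozie;-config;/opt/oozie/job.properties;-run
Argument Delimiter: ;
```

Unlike ExecuteScript, ExecuteStreamCommand runs a fixed external command and routes its stdout and exit status, which is a good fit when the Oozie invocation itself never changes between runs.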