Support Questions


workflow scheduler for ETL

New Contributor

Hi,

I have been using Oozie as a workflow scheduler for a while, and I would like to switch to a more modern one.

I like Airflow since it has a nicer UI, a task dependency graph, and programmatic scheduling.

Spring XD is also interesting for the number of connectors and the standardisation it offers.

What are your experiences with these tools, and what would you recommend for a generic ETL pipeline?

Thanks!

1 ACCEPTED SOLUTION

Super Guru

@Andi Chirita

If you are already experienced with Oozie and your work is in Hadoop, then look at Apache Falcon. It is part of the Hortonworks Data Platform as well. I really like a tool to do more than just scheduling: I want the tool to execute various tasks, or delegate execution, not only on a timetable but also on events or when conditions are met. Both Apache Falcon and Apache NiFi can help with that. They are not just specialized schedulers; they are more than that. Falcon can satisfy your requirements if you live in Hadoop. If you want to do more than that, then look at Apache NiFi.

"Falcon simplifies the development and management of data processing pipelines with a higher layer of abstraction, taking the complex coding out of data processing applications by providing out-of-the-box data management services. This simplifies the configuration and orchestration of data motion, disaster recovery and data retention workflows. The Falcon framework can also leverage other HDP components, such as Pig, HDFS, and Oozie. Falcon enables this simplified management by providing a framework to define, deploy, and manage data pipelines".

Check here for what Falcon does, how it works etc: http://hortonworks.com/apache/falcon

If you want to know more about this project backed by Hortonworks, go to: https://falcon.apache.org/

Apache NiFi is a tool to build a dataflow pipeline (a flow of data from edge devices to the datacenter). NiFi has a lot of built-in connectors (known as processors in the NiFi world), so it can get/put data from/to HDFS, Hive, RDBMSs, Kafka, etc. out of the box. It also has a really cool, user-friendly interface that can be used to build a dataflow in minutes by dragging and dropping processors. NiFi is an alternative with more support and customer adoption: it has been used heavily at the NSA, and it is part of Hortonworks DataFlow.

To learn more about Apache NiFi go here: https://nifi.apache.org/

NiFi tutorials here: http://hortonworks.com/hadoop-tutorial/learning-ropes-apache-nifi/

Falcon tutorials here: http://hortonworks.com/apache/falcon/#tutorials

They do much more than a scheduler does. They help build true pipelines, which is the usual use case. I evaluated Airflow, and while it is a promising project, it is still in the incubator phase and not enterprise ready - there are still many issues, and it is more like a traditional scheduler. It depends on your use case, but with either Falcon or NiFi you can achieve what a scheduler does and more. I just love NiFi because I can use it for both Hadoop and non-Hadoop workloads.

Let me know if you want to see a demo of NiFi and I can set you up.


3 REPLIES


Contributor

@Andi Chirita : We evaluated Spring XD vs. NiFi a few months back; at that time both were GA but not enterprise ready.

Spring XD:

Spring XD comes with multiple connectors, especially in the field of IoT (the MQTT protocol) and REST APIs, and it also has a module to perform transformations on the fly, but its shortcoming is that it has many issues while running on YARN. The UI was not mature enough and was still in beta.

NiFi + Falcon:

As @Constantin Stanca mentioned, Apache NiFi + Apache Falcon is another good option; we tried it with the HDP 2.4.x version.

It was not production ready, and Falcon has multiple issues and limitations in the lower versions.

I hope this helps you in deciding on a tool for your ETL pipeline.

New Contributor

Airflow maintainer here. I know this question is a bit dated, but it still turns up in searches. Airflow and NiFi both have their strengths and weaknesses. Let me list some of the great things about Airflow that set it apart.

1. Configuration as code. Airflow uses Python for the definition of DAGs (i.e., workflows). This gives you the full power and flexibility of a programming language, with a wealth of modules.

2. DAGs are testable and versionable. As they are code, you can integrate your workflow definitions into your CI/CD pipeline.

3. Ease of setup and local development. While Airflow gives you horizontal and vertical scalability, it also allows your developers to test and run locally, all from a single pip install apache-airflow. This greatly enhances productivity and reproducibility.

4. Real data sucks. Airflow knows that, so we have features for retries and SLAs.

5. Changing history. After a year you find out that you need to put a task into a DAG, but it needs to run 'in the past'. Airflow allows you to do backfills, giving you the opportunity to rewrite history. And guess what: you need that more often than you think.

6. Great debuggability. There are logs for everything, nicely tied to the unit of work they belong to: scheduler logs, DAG parsing/processing logs, task logs. Being in Python, the hurdle is quite low to jump in and make a fix yourself if needed.

7. A wealth of connectors that allow you to run tasks on Kubernetes, Docker, Spark, Hive, Presto, Druid, etc.

8. A very active community.
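To make points 1, 4, and 5 concrete, here is a minimal sketch of what an Airflow DAG file looks like, assuming the Airflow 2.x API. The dag_id, schedule, and bash commands are illustrative placeholders, not taken from this thread:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG file is plain Python configuration: it declares tasks and their
# dependencies, which the Airflow scheduler then executes.
with DAG(
    dag_id="daily_etl",                        # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args={
        "retries": 2,                          # point 4: automatic retries
        "retry_delay": timedelta(minutes=5),
    },
    catchup=True,                              # point 5: fill in missed runs
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Dependencies are set in code, so the graph is versionable in git
    # and testable in CI (point 2).
    extract >> transform >> load
```

Because the DAG is just a Python object, a CI test can import this file and assert on its tasks and dependencies, and historical backfills (point 5) can be triggered from the Airflow command-line interface.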