Has anyone integrated Apache Airflow and HDP?
It looks interesting.
Is there any best practice or installation guide out there by Hortonworks to set up Airflow within HDP and start random jobs?
I have seen there are some operators available and the rest could be managed via shell.
Hortonworks does not support Airflow as of yet. It's in pretty early incubation.
Perhaps @Chris Nauroth can shed some light.
You might want to try out HDF (Apache NiFi) for job running
Anything that works with Apache Hadoop will work with Hortonworks as HDP is pure 100% open source Apache Hadoop.
This is Airbnb's project for the most part, so check out their info.
macros.random might assist you
What's your use case?
Thanks for your quick reply. The point is that I am quite unhappy with Oozie. Well, it does its job, but handling the XMLs is not my favourite. So I was looking for something more sophisticated where I can have a dependency between different job packages (i.e. a coordinator in Oozie).
I thought Airflow could be my solution.
1. Coordinate the Jobs inside Spark
2. Coordinate the Jobs with Apache NiFi (I have done Sqoop, Hive, HBase, Pig, Spark, Python and Deep Learning jobs with it)
3. Manage Oozie with Falcon http://hortonworks.com/apache/falcon/
4. HUE is part of HDP, http://gethue.com/scheduling/
5. Luigi - I used it a few times, seemed okay https://blog.kupstaitis-dunkler.com/2016/07/19/how-to-create-a-data-pipeline-using-luigi/
Thanks, I will have a look into it. Especially controlling jobs with Spark sounds interesting. I haven't heard of it before. Do you have a source? Thanks again!
Okay, I had a closer look into it. To me it looks like Apache NiFi (Hortonworks DataFlow) is more or less a tool for piping your data from a non-Hadoop system (RDBMS, IoT, ...) into Hadoop. Thereafter, another tool is needed to manage the data. Here, Apache Falcon has its strength.
Airflow, Luigi and Azkaban are solutions for broader scheduling tasks and need more effort to be installed next to your cluster.
Quickly dipping my toe into scheduling with Spark, I didn't come up with many resources.
Last but not least, Oozie (e.g. managed via Hue) seems like the easiest fit to manage all kinds of workflows (Sqoop, Hive, Shell, Spark, ...) within a cluster. Of course, I have dependencies between single actions, whereas dependencies between single coordinators are missing. In my humble opinion this functionality can be added with flag files.
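To make the flag-file idea concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the function names wait_for_flag and mark_done are made up, and it uses the local filesystem as a stand-in, whereas on a cluster the flag would more likely live in HDFS. The upstream job writes an empty marker file when it finishes, and the downstream job polls for it before starting.

```python
import time
from pathlib import Path


def mark_done(flag_path):
    """Create an empty flag file signalling that the upstream job finished.

    Hypothetical helper: the path and naming scheme are assumptions.
    """
    path = Path(flag_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()


def wait_for_flag(flag_path, timeout_s=60.0, poll_s=1.0):
    """Return True once flag_path exists, or False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if Path(flag_path).exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

The downstream job would call wait_for_flag(...) as its first step and only continue when it returns True; the upstream job calls mark_done(...) as its last step.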
I think Oozie is still the best fit, although it is cumbersome to handle via XML files. Of course there is the Eclipse plugin to visualize workflows and create them as well.
Feel free to correct my views. Thanks!
Falcon will manage Oozie. And a Web UI instead of XML should be available soon if you don't find one out in the wild that you like. A lot of companies are running Oozie with lots of different jobs and it works well. If you are doing Sqoop, Pig and Hive it's the way to go. With NiFi I run Sqoop, Pig, Spark, Python, TensorFlow and MXNet jobs and connect them. I run them with cron timers and reactively when something happens (files appear, directories change, a Kafka message arrives, an MQTT message arrives, ...).
https://community.hortonworks.com/articles/59349/hdf-20-flow-for-ingesting-real-time-tweets-from-st....
https://community.hortonworks.com/articles/61180/streaming-ingest-of-google-sheets-into-a-connected....
https://community.hortonworks.com/articles/72420/ingesting-remote-sensor-feeds-into-apache-phoenix.h...
Airflow maintainer here. Let me list some of the great things about Airflow that set it apart.
1. Configuration as code. Airflow uses Python for the definitions of DAGs (i.e. workflows). This gives you the full power and flexibility of a programming language with a wealth of modules.
2. DAGs are testable and versionable. As they are in code, you can integrate your workflow definitions into your CI/CD pipeline.
3. Ease of setup, local development. While Airflow gives you horizontal and vertical scalability, it also allows your developers to test and run locally, all from a single pip install apache-airflow. This greatly enhances productivity and reproducibility.
4. Real data sucks. Airflow knows that, so we have features for retrying and SLAs.
5. Changing history. After a year you find out that you need to put a task into a DAG, but it needs to run 'in the past'. Airflow allows you to do backfills, giving you the opportunity to rewrite history. And guess what, you need it more often than you think.
6. Great debuggability. There are logs for everything, but nicely tied to the unit of work they are doing: scheduler logs, DAG parsing/processing logs, task logs. Being in Python, the hurdle is quite low to jump in and do a fix yourself if needed.
7. A wealth of connectors that allow you to run tasks on Kubernetes, Docker, Spark, Hive, Presto, Druid, etc.
8. A very active community.
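To illustrate points 1, 4 and 5 without requiring an Airflow installation, here is a rough sketch in plain Python. This is not Airflow's API: the task names and the helpers run_order, run_with_retries and backfill_dates are made up for illustration. It only shows the underlying ideas: dependencies declared as code, a failed task being retried, and one run per past date for a backfill.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "sqoop_import": set(),
    "hive_transform": {"sqoop_import"},
    "spark_report": {"hive_transform"},
}


def run_order(dependencies):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def run_with_retries(task, retries=2):
    """Call task(); on failure try again, up to `retries` extra attempts."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of attempts, surface the last error


def backfill_dates(start, end):
    """List one run date per day from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=n) for n in range(days + 1)]
```

Because the workflow is ordinary Python data and functions, a unit test in a CI pipeline can assert on the run order or on the dates a backfill would cover, which is the kind of check point 2 refers to.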
As to your question: there is no particular dependency between HDP and Airflow. If you make Ambari deploy the client libraries on your Airflow workers, it will work just fine.
A new version has been released. Now you're able to integrate Airflow with a virtual environment.
Also I wrote an article about Airflow integration: