Thanks for your quick reply. The point is that I am quite unhappy with Oozie. Well, it does its job, but handling the XML files is not my favourite. So I was looking for something more sophisticated where I can have a dependency between different job packages (i.e. between coordinators in Oozie).
Okay, I had a closer look into it. To me it looks like Apache NiFi (Hortonworks DataFlow) is more or less a tool for piping your data from a non-Hadoop system (RDBMS, IoT, ...) into Hadoop. After that, another tool is needed to manage the data. This is where Apache Falcon has its strength.
Airflow, Luigi and Azkaban are solutions for broader scheduling tasks and take more effort to install on (or next to) your cluster.
Quickly dipping my toe into scheduling with Spark itself, I didn't find many resources.
Last but not least, Oozie (e.g. managed via Hue) seems like the easiest fit for managing all kinds of workflows (Sqoop, Hive, Shell, Spark, ...) within a cluster. Of course, I have dependencies between single actions, whereas dependencies between separate coordinators are missing. In my humble opinion, this functionality can be added with flag files.
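To make the flag-file idea concrete, here is a minimal sketch in plain Python. It assumes a shared directory that both job packages can see (a local path stands in for HDFS here), and the names mark_done, wait_for_flag, the job name and the flag naming scheme are all hypothetical, not part of Oozie.

```python
import tempfile
import time
from pathlib import Path


def mark_done(flag_dir: Path, job_name: str, run_date: str) -> Path:
    """Upstream job: write an empty flag file once it has finished for run_date."""
    flag_dir.mkdir(parents=True, exist_ok=True)
    flag = flag_dir / f"{job_name}_{run_date}._SUCCESS"  # hypothetical naming scheme
    flag.touch()
    return flag


def wait_for_flag(flag: Path, timeout_s: float = 5.0, poll_s: float = 0.1) -> bool:
    """Downstream job: poll until the flag exists or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if flag.exists():
            return True
        time.sleep(poll_s)
    return flag.exists()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        flag = mark_done(Path(tmp) / "flags", "ingest", "2018-01-01")
        print("downstream may start:", wait_for_flag(flag, timeout_s=1.0))
```

Within Oozie itself the same pattern can probably be expressed without custom code: a coordinator dataset can name a done-flag file, and a second coordinator can list that dataset as an input event so that it waits for the file to appear.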
I think Oozie is still the best fit, although it is cumbersome to handle via XML files. Of course, there is the Eclipse plugin to visualize workflows and create them as well.
Falcon will manage Oozie. And a Web UI instead of XML should be available soon if you don't find one out in the wild that you like. A lot of companies are running Oozie with lots of different jobs and it works well. If you are doing Sqoop, Pig and Hive, it's the way to go. With NiFi I run Sqoop, Pig, Spark, Python, TensorFlow and MXNet jobs and connect them. I run them on cron timers and reactively when something happens (files appear, directories change, a Kafka message arrives, an MQTT message arrives, ...).
Airflow maintainer here. Let me list some of the great things about Airflow that set it apart.
1. Configuration as code. Airflow uses Python for the definitions of DAGs (i.e. workflows). This gives you the full power and flexibility of a programming language with a wealth of modules.
2. DAGs are testable and versionable. As they are in code you can integrate your workflow definitions into your CI/CD pipeline.
3. Ease of setup, local development. While Airflow gives you horizontal and vertical scalability, it also allows your developers to test and run locally, all from a single pip install apache-airflow. This greatly enhances productivity and reproducibility.
4. Real data sucks. Airflow knows that, so we have features for retrying and SLAs.
5. Changing history. After a year you find out that you need to put a task into a DAG, but it needs to run ‘in the past’. Airflow allows you to do backfills, giving you the opportunity to rewrite history. And guess what, you need it more often than you think.
6. Great debuggability. There are logs for everything, nicely tied to the unit of work they belong to: scheduler logs, DAG parsing/processing logs, task logs. Because it is all Python, the hurdle to jump in and do a fix yourself if needed is quite low.
7. A wealth of connectors that allow you to run tasks on Kubernetes, Docker, Spark, Hive, Presto, Druid, etc.
8. A very active community.
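To illustrate points 1, 2, 4 and 5, here is a small sketch in plain standard-library Python. It is deliberately not Airflow's API: the task names, the WORKFLOW mapping and the helper functions are hypothetical and only show what "workflow as code" with dependency ordering, retries and backfill dates can look like.

```python
from datetime import date, timedelta
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
WORKFLOW = {
    "sqoop_import": set(),
    "hive_transform": {"sqoop_import"},
    "spark_aggregate": {"hive_transform"},
    "export_report": {"hive_transform", "spark_aggregate"},
}


def run_order(workflow):
    """Return tasks in dependency order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(workflow).static_order())


def run_with_retries(task, retries=2):
    """Call task(); on an exception retry up to `retries` more times, then re-raise."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise


def backfill_dates(start, end):
    """List every daily run date from start to end inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


if __name__ == "__main__":
    print(run_order(WORKFLOW))
    print(backfill_dates(date(2018, 1, 30), date(2018, 2, 2)))
```

Because the workflow is ordinary code, a check such as "the graph has no cycle" or "the import runs before the transform" can live in a normal unit test and run in a CI pipeline.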
As to your question: there is no particular dependency between HDP and Airflow. If you make Ambari deploy the client libraries on your Airflow workers, it will work just fine.