Created 12-08-2016 07:19 AM
We have a two-node cluster: HDF (NiFi) is installed on one node and Spark on the other.
In a single-node cluster I was able to trigger Spark from NiFi using the 'ExecuteStreamCommand' processor, with the spark-submit command placed in a shell script.
Can you please share guidelines for triggering Spark from NiFi in a multi-node cluster for the scenario described above?
Created 12-08-2016 12:49 PM
Sure. Make sure the Spark CLI dependencies are available on every node, i.e. you are able to submit your Spark job from any node in the NiFi cluster.
Next, assuming you'd like the job submitted only once within the cluster, configure ExecuteStreamCommand by going to its Scheduling tab and selecting 'On Primary Node' in the strategy dropdown. This ensures it runs as a cluster-wide singleton. Note that you can't pin which node is primary, for failover reasons: the role is elected automatically by the cluster and may change over its lifecycle if there's a recovery event, etc.
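As a minimal sketch, the script that ExecuteStreamCommand points at (via its Command Path property) could look like the following. The master, deploy mode, class name, and jar path are placeholders you'd adjust for your cluster; the script must exist at the same path on every NiFi node, since any of them could become primary.

```shell
#!/bin/bash
# Wrapper script invoked by NiFi's ExecuteStreamCommand processor.
# Placeholders below (class name, jar path) are examples, not real artifacts.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyJob \
  /opt/jobs/my-app.jar "$@"
```

ExecuteStreamCommand passes its Command Arguments to the script, which `"$@"` forwards to spark-submit, so you can parameterize the job from the flow.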
Created 12-20-2016 01:11 PM
Thanks @Andrew Grande. We are now planning to use the Livy job server (http://livy.io/). Can anyone please guide me through this? I tried searching for documentation but couldn't find anything useful.
Created 12-20-2016 02:52 PM
Here is a good article on calling Livy over REST, with curl examples. It is very easy to move those over to NiFi:
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-livy-rest-interface
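For reference, a minimal sketch of submitting a batch job through Livy's REST API (the `/batches` endpoint and JSON fields are part of the Livy API; the host, jar path, and class name below are placeholder assumptions):

```shell
# Submit a Spark batch job to Livy (default port 8998).
# Replace livy-host, the jar path, and the class name with your own values.
curl -s -X POST \
  -H "Content-Type: application/json" \
  -d '{"file": "/path/to/my-app.jar", "className": "com.example.MyJob"}' \
  http://livy-host:8998/batches

# Poll the batch state, substituting the id returned by the call above:
curl -s http://livy-host:8998/batches/0
```

In NiFi, the same calls map naturally onto an InvokeHTTP processor: POST the JSON body to submit, then GET the batch URL to check status.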
Created 12-20-2016 02:53 PM
What I like to do is run Spark Streaming rather than batch. You can feed it via Site-to-Site or Kafka; then the job is always running and ready to process data as it arrives, with no per-job submission overhead.
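To make the streaming approach concrete, a hedged sketch of launching a long-running Spark Streaming job that consumes from a Kafka topic NiFi publishes to (the class, jar, broker, topic names, and package version are illustrative assumptions, not a tested setup):

```shell
# Launch a long-running Spark Streaming job fed by Kafka.
# NiFi publishes FlowFiles to the topic (e.g. via PublishKafka);
# the streaming job consumes continuously, so nothing triggers it per-file.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.0.0 \
  --class com.example.StreamingJob \
  /opt/jobs/my-streaming-app.jar kafka-broker:9092 nifi-topic
```

The trade-off versus the spark-submit-per-flow approach above: the streaming job holds cluster resources permanently, but latency drops because the Spark context is already warm.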