I have installed Anaconda and built a virtual environment on a node outside of the HDP cluster. To use this environment with Spark, we need to make it available on the HDP cluster. I have a couple of questions around this.
1) Do we need to install Anaconda on all the nodes? We would prefer to avoid this, as we do not have internet access from the cluster and the Anaconda installation would require downloading libraries during installation. We did not find officially supported repos for offline Linux installations.
2) If we need to distribute the environment by copying it to all the nodes before starting any Spark applications, then when submitting the Spark job from the edge node, how do we make sure the job actually uses the Anaconda virtual environment? (On a single node this is easy, as we can simply switch Anaconda environments.)
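One approach we have seen suggested (not yet verified on our cluster) is to pack the edge-node environment into an archive, ship it to the YARN containers with the job, and point PYSPARK_PYTHON at the unpacked copy. A minimal sketch of that idea is below; all paths, names, and the `conda pack` archive are placeholders, not something we have working:

```python
# Sketch only: pack the edge-node conda env (e.g. with `conda pack`), put the
# archive somewhere the cluster can read (HDFS path below is a placeholder),
# and have YARN unpack it next to every container.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("conda-env-test")
    # Archive is distributed to each container and unpacked under the alias
    # given after the '#'.
    .config("spark.yarn.dist.archives",
            "hdfs:///user/me/anaconda_env.tar.gz#anaconda_env")
    # Driver (in yarn-cluster mode) and executors use the Python from the archive.
    .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "./anaconda_env/bin/python")
    .config("spark.executorEnv.PYSPARK_PYTHON", "./anaconda_env/bin/python")
    .getOrCreate()
)

# Quick check: which Python binary do the executors actually run?
print(spark.sparkContext.parallelize([0], 1)
      .map(lambda _: __import__("sys").executable)
      .collect())
```

Is this the recommended way on HDP, or is there a better-supported mechanism?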
But I think I have a conceptual problem and am not understanding how everything works. With these settings the %spark2.pyspark interpreter is not working as expected and does not find packages like pandas.
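For reference, a minimal check like the sketch below (run in a %spark2.pyspark paragraph; I am assuming `sc` is the SparkContext the interpreter provides) should show which Python binary the driver and the executors actually pick up, which is where I suspect the mismatch is:

```python
# Sanity check: print the Python executable used by the driver and by the
# executors, to see whether either one points at the Anaconda environment.
import sys

print("driver python:  ", sys.executable)
print("executor python:",
      sc.parallelize([0], 1).map(lambda _: __import__("sys").executable).collect())
```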
My next try was to modify the spark2 interpreter as follows: