
Best practices to install Anaconda package distribution for pyspark on Hortonworks Data Cloud (HDCloud)

What are the required (recommended) steps to set up Anaconda packages on Hortonworks Data Cloud (HDCloud)?

I am evaluating several ML algorithms in pyspark and am missing libraries such as numpy (available via Anaconda).

Also, is this Anaconda cluster setup recommended?:


@Robert Hryniewicz

I don't believe there are any established best practices for HDCloud, particularly with regard to Anaconda. The Anaconda Cluster approach is viable. However, the free version limits the number of nodes you can manage.

An alternative would be to use something like Ansible (which is Python-based) to push out Anaconda. It requires a little more work to set up (playbooks, etc.), but deployment is straightforward. The difficulty is that the cluster nodes use dynamic IP addresses, which makes a static host inventory hard to maintain.
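One way to cope with the changing addresses might be an Ansible dynamic inventory, where a script prints the current hosts as JSON when Ansible calls it with `--list`. Below is a minimal sketch of that idea, not a tested HDCloud recipe. The group name `hdcloud_workers`, the SSH user `cloudbreak`, and the hard-coded addresses are all assumptions for illustration; how you look up the live addresses is left open.

```python
import json


def build_inventory(worker_ips, group="hdcloud_workers", ssh_user="cloudbreak"):
    """Return an Ansible dynamic-inventory dict for the given node addresses.

    The group name and SSH user defaults are hypothetical placeholders.
    """
    hosts = sorted(set(worker_ips))  # de-duplicate and keep the output stable
    return {
        group: {
            "hosts": hosts,
            "vars": {"ansible_user": ssh_user},
        },
        # Supplying _meta.hostvars lets Ansible skip a per-host --host call.
        "_meta": {"hostvars": {}},
    }


if __name__ == "__main__":
    # Placeholder addresses; in practice these would be looked up at run time.
    current_ips = ["10.0.0.12", "10.0.0.11", "10.0.0.12"]
    print(json.dumps(build_inventory(current_ips), indent=2))
```

Saved as an executable file, a script like this could be passed to `ansible-playbook -i`, so the playbook that installs Anaconda always targets whatever addresses the script reports at that moment.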

I think the best answer is to enable Ambari to push out Anaconda. I would love to see that functionality.