Best practices to install Anaconda package distribution for pyspark on Hortonworks Data Cloud (HDCloud)

What are the required (recommended) steps to set up Anaconda packages on Hortonworks Data Cloud (HDCloud)?

I'm evaluating several ML algorithms in PySpark and am missing libraries such as numpy (available via Anaconda).

Also, is the Anaconda Cluster setup described here recommended? https://docs.continuum.io/anaconda-cluster/

1 REPLY

Re: Best practices to install Anaconda package distribution for pyspark on Hortonworks Data Cloud (HDCloud)

@Robert Hryniewicz

I don't believe any best practices have been established for HDCloud yet, particularly with regard to Anaconda. The Anaconda Cluster approach is viable; however, the free version limits the number of nodes you can manage.

An alternative would be to use something like Ansible (which is Python based) to push out Anaconda. It takes a little more work to set up (playbooks, an inventory, etc.), but deployment itself is straightforward. The main difficulty is that the cluster nodes use dynamic IP addresses.
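
To illustrate the Ansible route, here is a minimal dynamic-inventory sketch (the IP addresses, group name, and install prefix are placeholders, not anything HDCloud-specific). Ansible can run an executable inventory script with --list and use the JSON it prints as the host list, which sidesteps hard-coding the dynamic IPs in a static hosts file:

#!/usr/bin/env python
# Minimal Ansible dynamic-inventory sketch. Ansible runs this script with
# --list and reads the JSON it prints, so node addresses can be resolved
# at run time instead of being hard-coded in a static hosts file.
import json
import sys

def get_cluster_nodes():
    # Placeholder: replace with a real lookup of the cluster's current
    # node IPs (e.g. from the cloud controller's API or an exported list).
    return ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "--list":
        print(json.dumps({
            "cluster_nodes": {
                "hosts": get_cluster_nodes(),
                "vars": {"anaconda_prefix": "/opt/anaconda"},
            },
            "_meta": {"hostvars": {}},
        }))
    else:
        # Ansible may also call the script with --host <name>; no per-host vars here.
        print(json.dumps({}))

A playbook that downloads and runs the Anaconda installer on each host could then be run with something like ansible-playbook -i inventory.py install_anaconda.yml (file names hypothetical).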

I think the best answer is to enable Ambari to push out Anaconda. I would love to see that functionality.
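
Whichever install path you take, PySpark still needs to be pointed at the Anaconda interpreter so that libraries like numpy resolve on the executors. A minimal sketch, assuming Anaconda is installed at /opt/anaconda on every node (the path and app name are assumptions):

# Sketch only: point PySpark at the Anaconda interpreter (assumed to be
# at /opt/anaconda/bin/python on every node) and verify that numpy
# imports on the executors.
import os

# One common approach: export PYSPARK_PYTHON before the session starts
# (it can also be set in spark-env.sh or passed via spark-submit --conf).
os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("anaconda-numpy-check")
    .config("spark.pyspark.python", "/opt/anaconda/bin/python")
    .getOrCreate()
)

def numpy_version(_):
    # Runs on the executors, so it exercises the Anaconda install there.
    import numpy
    return numpy.__version__

versions = (spark.sparkContext
            .parallelize(range(4), 4)
            .map(numpy_version)
            .distinct()
            .collect())
print("numpy versions seen on executors:", versions)
spark.stop()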
