netlib-java and Anaconda for Spark ML?


Expert Contributor

Guys,

I was going through articles on Spark ML and found references suggesting that netlib-java is needed when setting up Spark MLlib to run ML applications in Java/Scala.

Other posts/articles suggest installing the Anaconda libraries to use Spark with Python. I have run simple programs and used Spark SQL without Anaconda, so I was wondering: do we really need the Anaconda packages to use MLlib from Python?

It would be great if someone could kindly comment on the netlib-java and Anaconda dependencies with respect to Spark and Spark MLlib use cases.

Thanks,

SS

1 ACCEPTED SOLUTION

Re: netlib-java and Anaconda for Spark ML?

For example, k-means clustering in a Spark ML pipeline with Python requires numpy to be installed on every node. Anaconda is a convenient way to get the full Python scientific stack (including numpy) installed without worrying about the details. However, using Anaconda instead of the operating system's Python means you need to set the paths correctly for Spark and Zeppelin.
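One common way to point Spark at an Anaconda interpreter is via spark-env.sh — a minimal sketch, assuming Anaconda is installed under /opt/anaconda (a hypothetical path; adjust to your install location):

```sh
# spark-env.sh -- make Spark workers and the driver use Anaconda's Python
# (hypothetical install path; PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON are
# standard Spark environment variables)
export PYSPARK_PYTHON=/opt/anaconda/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/anaconda/bin/python
```

Zeppelin has its own interpreter settings, so the Python path needs to be set there separately as well.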

Alternatively, I have just run "apt-get install python-numpy" on all of my Ubuntu 14.04 based HDP nodes; with numpy available, k-means works (I assume there are other algorithms that also need numpy). An equivalent package should be available on Red Hat based systems too.
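Since the requirement is simply that numpy be importable by the Python interpreter on every node, a quick sanity check is possible without starting Spark at all — a minimal sketch you could run on each host (e.g. via ssh or a spark-submit job; the helper name is my own):

```python
import importlib.util

def has_module(name):
    """Return True if the named top-level module can be imported on this host."""
    return importlib.util.find_spec(name) is not None

# Check the dependency Spark ML's Python API needs on every node.
print("numpy available:", has_module("numpy"))
```

If this prints False on any node, PySpark ML algorithms such as k-means will fail there at runtime.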

I have never installed netlib-java manually. Spark ML is built on Breeze, which uses netlib-java, and netlib-java is already included in the Spark assembly jar.

So numpy is a must if you want to use Spark ML with Python; netlib-java should already be there.



Re: netlib-java and Anaconda for Spark ML?

The above holds for Spark 1.6.2; I haven't checked Spark 2.0.
