
netlib-java and Anaconda for Spark ML?


Guys,

I was going through articles on Spark ML and found references suggesting that netlib-java should be set up for Spark MLlib if we plan to run ML applications in Java/Scala.

Other posts/articles suggest installing the Anaconda libraries for using Spark with Python. I have run simple programs and used Spark SQL without Anaconda, so I was wondering: do we really need the Anaconda packages to use MLlib from Python?

It would be great if someone could kindly comment on the netlib-java and Anaconda dependencies with respect to Spark and Spark MLlib use cases.

Thanks,

SS

1 ACCEPTED SOLUTION


For example, k-means clustering in a Spark ML pipeline with Python requires "numpy" to be installed on every node. Anaconda is a nice way to get the full Python scientific stack (including numpy) installed without caring about the details. However, using Anaconda instead of the operating system's Python means you need to set the paths correctly for Spark and Zeppelin.
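To make that concrete, here is a minimal PySpark sketch of a k-means fit, assuming Spark 1.6-style APIs (SQLContext and pyspark.mllib.linalg.Vectors); the Anaconda path in the first comment is only a placeholder, not a real install location.

# If Spark should use Anaconda's Python instead of the operating system's,
# point PYSPARK_PYTHON at it (e.g. in spark-env.sh) before starting:
#   export PYSPARK_PYTHON=/opt/anaconda/bin/python   # path is an example only
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.ml.clustering import KMeans

sc = SparkContext(appName="kmeans-numpy-demo")
sqlContext = SQLContext(sc)

# pyspark's vector types are backed by numpy, so numpy must be importable
# wherever this Python code runs.
df = sqlContext.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),),
     (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),),
     (Vectors.dense([8.0, 9.0]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(df)
print(model.clusterCenters())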

Alternatively, I have just run "apt-get install python-numpy" on all of my Ubuntu 14.04-based HDP nodes; then numpy is available and k-means works (I guess there are other algorithms that also need numpy). The package should be available on Red Hat-based systems too.
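If you want to check that numpy really is installed cluster-wide (whether it came from Anaconda or from the OS package), one quick sketch is a job that imports numpy inside each executor; it assumes an existing SparkContext named sc (e.g. in the pyspark shell or Zeppelin).

# Each partition is handled by an executor's Python worker, so a successful
# import there shows numpy is present on that node.
def numpy_version(partition):
    import numpy
    return [numpy.__version__]

print(sc.parallelize(range(8), 8).mapPartitions(numpy_version).collect())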

I have never installed netlib-java manually. Spark's MLlib is based on Breeze, which uses netlib, and netlib is already in the Spark assembly jar.
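If you are curious which netlib-java BLAS backend the JVM side actually picked up, the sketch below goes through PySpark's Py4J gateway; it assumes an existing SparkContext named sc and that the netlib classes are on the driver classpath (they come with the Spark assembly, as noted above).

# sc._jvm exposes the driver JVM via Py4J. This prints e.g.
# com.github.fommil.netlib.F2jBLAS when only the pure-Java fallback is used,
# or a Native*BLAS class when native libraries were found.
blas = sc._jvm.com.github.fommil.netlib.BLAS.getInstance()
print(blas.getClass().getName())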

So numpy is a must if you want to use Spark ML with Python; netlib-java should already be there.


2 REPLIES



The above holds for Spark 1.6.2; I haven't checked Spark 2.0.