Support Questions

What language should I use to learn Spark?

avatar

I have been a Java developer with reasonable experience in the Java stack.

Since I have never learned either Python or Scala, I am not able to choose a path.

What are the pros and cons of using either of these languages to learn Spark and consequently, which one should I choose?

1 ACCEPTED SOLUTION

avatar

@Dinesh Chitlangia

Based on your background, I'd recommend that you go with Scala. Spark is written in Scala, so it gives you full access to all the latest APIs and features, and you should be able to pick up Scala quickly. You can also incorporate Java into your Scala code if you'd like. In case you didn't know, there is also a Spark Java API: http://spark.apache.org/docs/latest/api/java/index.html
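
To make the interop point concrete, here is a rough word-count sketch in Scala (it assumes a Spark 2.x SparkSession and a local master, purely for illustration); note how a plain Java class such as java.time.Instant can be used directly from the Scala code:

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local session just for trying things out; point this at a real cluster later.
        val spark = SparkSession.builder()
          .appName("WordCount")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Plain Java classes work directly from Scala, e.g. java.time.Instant.
        val startedAt = java.time.Instant.now()

        val lines = Seq("spark with scala", "spark with java").toDS()
        val counts = lines
          .flatMap(_.split("\\s+"))
          .groupBy("value")
          .count()

        counts.show()
        println(s"Finished job started at $startedAt")
        spark.stop()
      }
    }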

I personally like Python and would like to convert you 🙂 but in this situation I really don't see any advantage in going that route, given your Java background.

From a performance perspective, this is a great presentation to view: http://www.slideshare.net/databricks/2015-0616-spark-summit. Slide 16 shows performance comparisons; the takeaway is that you should use DataFrames (or, preferably, the new Datasets) instead of RDDs, regardless of which language you go with.
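
For a concrete sense of the DataFrame-vs-RDD advice, here is a small sketch (again assuming a Spark 2.x local session; the names are purely illustrative) of the same aggregation written both ways; the DataFrame version is planned by the Catalyst optimizer, which is where most of the gap in those slides comes from:

    import org.apache.spark.sql.SparkSession

    object RddVsDataFrame {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("RddVsDataFrame")
          .master("local[*]") // illustrative local run only
          .getOrCreate()
        import spark.implicits._

        val sales = Seq(("books", 12.0), ("books", 8.5), ("games", 30.0))

        // RDD version: opaque lambdas, no query optimization.
        val rddTotals = spark.sparkContext
          .parallelize(sales)
          .reduceByKey(_ + _)
          .collect()

        // DataFrame version: declarative, planned by Catalyst.
        val dfTotals = sales.toDF("category", "amount")
          .groupBy("category")
          .sum("amount")
          .collect()

        rddTotals.foreach(println)
        dfTotals.foreach(println)
        spark.stop()
      }
    }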


9 REPLIES

avatar

In terms of memory usage, processing power, and performance, do you think there is any difference between Scala and Python? I am asking because I work on Enterprise Solutions, and eventually these are the questions I would have to answer when making a choice.

Thank you.

avatar

Thank you so much for such detailed insight.

avatar
Super Collaborator

@Dinesh Chitlangia I'd also ask about your goals. If you plan to focus more on analytics, Python has broader support for statistical packages and libraries. There is also a Java API for Spark, which might get you started with Spark constructs more quickly; see https://spark.apache.org/docs/0.9.1/java-programming-guide.html. When I was weighing a similar question, the following article was helpful: https://datasciencevademecum.wordpress.com/2016/01/28/6-points-to-compare-python-and-scala-for-data-...

avatar

Thank you for a different perspective! Upvoted your response.

avatar

Python is easier to learn; Scala is a more complex language. But as a Java developer, having some Scala knowledge may be good for your resume, and learning it in a notebook is an easier way to pick up the language than writing a complex program.

One way to learn is to start with very small amounts of data, write tests with ScalaTest, and run them from Maven. That way you can stay in the build-and-test workflow you are used to. But interactive notebooks are a great way to experiment and iterate rapidly without running builds.
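
As a rough sketch of that test-driven approach (this assumes ScalaTest 3.x's AnyFunSuite and Spark on the test classpath; the names are just illustrative):

    import org.apache.spark.sql.SparkSession
    import org.scalatest.funsuite.AnyFunSuite

    class WordCountSuite extends AnyFunSuite {

      test("counts words in a tiny in-memory dataset") {
        // Local session so the test runs without a cluster.
        val spark = SparkSession.builder()
          .appName("WordCountSuite")
          .master("local[2]")
          .getOrCreate()
        import spark.implicits._

        val counts = Seq("spark spark scala").toDS()
          .flatMap(_.split(" "))
          .groupBy("value")
          .count()
          .collect()
          .map(r => r.getString(0) -> r.getLong(1))
          .toMap

        assert(counts("spark") == 2)
        assert(counts("scala") == 1)

        spark.stop()
      }
    }

With the ScalaTest Maven plugin configured, mvn test runs this against a tiny in-memory dataset, no cluster required.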

avatar
Contributor

Scala and Python are both easy to program with and help data experts become productive quickly. Data scientists often prefer to learn both Scala and Python for Spark, but Python is usually the second-favourite language for Apache Spark, since Scala came first.

avatar
Explorer

Agree that Python is likely the easiest, and that with a Java background you could pick up Scala quickly. With a Java background, using the Java API should also be straightforward, if more verbose.

To get the basic concepts of data parallelism down, Python seemed really fast for implementing ideas in Spark, although performance-wise I believe RDDs are slower in Python than in Scala or Java, while DataFrames are only slightly slower in Python than in the other two options.
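
One way to see why the DataFrame gap stays small: a DataFrame query is compiled by Catalyst into the same kind of optimized plan whichever language you write it in, which you can inspect with explain(). A minimal Scala sketch (local session, illustrative only):

    import org.apache.spark.sql.SparkSession

    object PlanDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("PlanDemo")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val df = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "tag")

        // The physical plan printed here is produced by Catalyst/Tungsten;
        // an equivalent PySpark DataFrame query compiles to the same kind of
        // plan, which is why the language overhead mostly disappears.
        df.filter($"id" > 1).groupBy("tag").count().explain()

        spark.stop()
      }
    }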

avatar
Super Collaborator

You can opt for Python, Scala, or Java. Lately the industry is moving towards Scala and Python. If you have prior experience with Python, it is better to go with Python; if you are curious to learn Scala, that's good too. Data scientists often prefer to use Scala.