What language should I use to learn Spark?

I am a Java developer with reasonable experience in the Java stack.

Since I have never learned either Python or Scala, I am not able to choose a path.

What are the pros and cons of using either of these languages to learn Spark and consequently, which one should I choose?

1 ACCEPTED SOLUTION

Re: What language should I use to learn Spark?

@Dinesh Chitlangia

Based on your background, I'd recommend that you go with Scala. Spark is written in Scala and it will give you full access to all the latest APIs, features, etc. You should be able to pick up Scala quickly and can also incorporate Java into your Scala code (if you'd like). In case you didn't know, you can also use the Spark Java API: http://spark.apache.org/docs/latest/api/java/index.html

I personally like Python and would like to convert you :) but in this situation I really don't see any advantage in going that route, given your history with Java.

From a performance perspective, this is a great presentation to view: http://www.slideshare.net/databricks/2015-0616-spark-summit. Slide 16 shows performance comparisons. The takeaway is that you should use DataFrames (or preferably the new Datasets) instead of RDDs, regardless of which language you go with.

9 REPLIES

Re: What language should I use to learn Spark?

In terms of memory usage, processing power, and performance, do you think there is any difference between Scala and Python? I am asking because I work for Enterprise Solutions, and eventually these are the questions I will have to answer when I make a choice.

Thank you.

Re: What language should I use to learn Spark?

Thank you so much for such detailed insight.

Re: What language should I use to learn Spark?

Expert Contributor

@Dinesh Chitlangia I'd also ask about your goals. If you plan to focus more on analytics, Python offers a wider range of statistical packages and libraries. There is also a Java API for Spark, which might get you started with Spark constructs more quickly; see https://spark.apache.org/docs/0.9.1/java-programming-guide.html (note that this link points to the Spark 0.9.1 documentation). When I was thinking about a similar question, the following article was helpful: https://datasciencevademecum.wordpress.com/2016/01/28/6-points-to-compare-python-and-scala-for-data-...

Re: What language should I use to learn Spark?

Thank you for a different perspective! Upvoted your response.

Re: What language should I use to learn Spark?

Python is easier to learn; Scala is a complex language. But as a Java developer, having some Scala knowledge may be good for your resume, and learning the language in a notebook is easier than learning it by writing a complex program.

One way to learn is to start with very small amounts of data, write tests in ScalaTest, and run them from Maven. That way you can use the tooling you are used to. But interactive notebooks are a great way to experiment and iterate rapidly without running builds.

Re: What language should I use to learn Spark?

Explorer

Scala and Python are both easy to program in and help data experts get productive fast. Data scientists often prefer to learn both Scala and Python for Spark, but Python is usually the second favourite language for Apache Spark, as Scala was there first.

Re: What language should I use to learn Spark?

New Contributor

I agree that Python is likely the easiest, and that with a Java background you could pick up Scala quickly. Given that background, using the Java API should also be straightforward, although the code is more verbose.

To get the basic concepts of data parallelism down, Python made it really fast to implement the ideas in Spark. On performance, though, I believe RDDs are slower in Python than in Scala or Java, while DataFrames are only slightly slower in Python than in the other two.
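
For anyone new to the idea, here is a rough sketch of the data-parallel pattern mentioned above (split the data into partitions, process each partition independently, then merge the partial results). It is plain Python using only the standard library, not the Spark API, and every function name in it is made up for illustration. Threads stand in for Spark executors purely to show the structure; they are not meant to demonstrate real speedups.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Distribute records round-robin across num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Map step: count words in one partition, independently of the others."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def parallel_word_count(lines, num_partitions=2):
    """Hypothetical helper: word count expressed as partition, map, reduce."""
    partitions = split_into_partitions(lines, num_partitions)
    # Each partition is handed to a worker; a cluster framework would ship
    # it to a different machine instead of a local thread.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    return reduce(merge_counts, partial_counts, Counter())


lines = [
    "spark is written in scala",
    "python is easy to learn",
    "scala runs on the jvm",
]
totals = parallel_word_count(lines)
print(totals["scala"], totals["is"])  # prints: 2 2
```

The result does not depend on how many partitions are used, which is the property that lets a framework like Spark spread the same logic over many machines.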

Re: What language should I use to learn Spark?

Expert Contributor

You can opt for Python, Scala, or Java. Lately the industry has been moving towards Scala and Python. If you have prior experience with Python, it is better to go with Python. If you are curious to learn Scala, that's good too. Data scientists often prefer to use Scala.
