Created 02-21-2017 08:31 PM
I am a Java developer with reasonable experience in the Java stack.
Since I have never learned either Python or Scala, I am unsure which path to take.
What are the pros and cons of using each of these languages to learn Spark, and consequently, which one should I choose?
Created 02-21-2017 08:53 PM
Based on your background, I'd recommend that you go with Scala. Spark is written in Scala, so Scala gives you full access to all the latest APIs, features, etc. You should be able to pick up Scala quickly and can also incorporate Java into your Scala code (if you'd like). In case you didn't know, you can also use the Spark Java API: http://spark.apache.org/docs/latest/api/java/index.html
I personally like Python and would like to convert you 🙂, but in this situation I really don't see any advantage in going that route given your history with Java.
From a performance perspective, this is a great presentation to view: http://www.slideshare.net/databricks/2015-0616-spark-summit. Slide 16 shows performance comparisons. The takeaway is that you should use DataFrames (or preferably the new Datasets) instead of RDDs, regardless of which language you go with.
Created 02-21-2017 09:07 PM
In terms of memory usage, processing power, and performance, do you think there is any difference between Scala and Python? I am asking because I work on enterprise solutions, and eventually these are the questions I will have to answer when I make a choice.
Thank you.
Created 02-21-2017 09:15 PM
Thank you so much for such detailed insight.
Created 02-22-2017 11:30 PM
@Dinesh Chitlangia I'd also ask about your goals. If you plan to focus more on analytics, Python should support more statistical packages/libraries. There is also a Java API for Spark, which might get you started with Spark constructs more quickly; see https://spark.apache.org/docs/0.9.1/java-programming-guide.html. When I was thinking about a similar question the following article was helpful: https://datasciencevademecum.wordpress.com/2016/01/28/6-points-to-compare-python-and-scala-for-data-...
Created 03-21-2017 06:50 PM
Thank you for a different perspective! Upvoted your response.
Created 02-23-2017 09:26 PM
Python is easier to learn; Scala is a complex language. But as a Java developer, having some Scala knowledge may be good for your resume, and learning it in a notebook is easier than learning it by writing a complex program.
One way to learn is to start with very small amounts of data, write tests in ScalaTest, and run them from Maven. That way you can use the tooling you are used to. But the interactive notebooks are a great way to play fast and iterate rapidly without running builds.
Created 03-09-2017 05:31 PM
Scala and Python are both easy to program in and help data experts get productive fast. Data scientists often prefer to learn both Scala and Python for Spark, but Python is usually the second favourite language for Apache Spark, as Scala was there first.
Created 03-15-2017 04:37 PM
I agree that Python is likely the easiest, and that with a Java background you could pick up Scala quickly. With a Java background, it should also be straightforward to use the Java API, though the code will be more verbose.
For getting the basic concepts of data parallelism down, I found Python really fast for implementing the ideas in Spark. As for performance, I believe RDDs are slower in Python than in Scala or Java, while DataFrames are only slightly slower in Python than in the other two.
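For anyone new to the data-parallelism idea mentioned above, here is a minimal sketch in plain Python (no Spark involved, standard library only). It only mimics the split / per-partition work / merge pattern that Spark applies across a cluster; the function names `partition`, `count_words`, and `merge_counts` and the sample lines are made up for illustration and are not part of any Spark API.

```python
from collections import Counter
from functools import reduce


def partition(records, num_partitions):
    """Split records into chunks, roughly how a cluster spreads data across workers."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(lines):
    """'Map' side: each partition counts its own words independently."""
    return Counter(word for line in lines for word in line.lower().split())


def merge_counts(left, right):
    """'Reduce' side: combine two partial results into one."""
    return left + right


# Hypothetical sample data, just for the illustration.
lines = [
    "spark is written in scala",
    "python is easy to learn",
    "scala and python both work with spark",
]

partitions = partition(lines, num_partitions=2)

# Each partition is processed on its own; on a cluster these would run in parallel.
partial_counts = [count_words(p) for p in partitions]

# Partial results are merged pairwise into the final answer.
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts.most_common(4))
```

The point is only that the per-partition step never looks at another partition's data, which is what lets a framework like Spark distribute it. In real Spark code you would express the same thing through the RDD or DataFrame API rather than hand-rolled helpers like these.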
Created 03-21-2017 07:00 PM
You can opt for Python, Scala or Java. Lately the industry is moving towards Scala and Python. If you have prior experience with Python, better go with Python. If you are curious to learn Scala, that's good too. Data scientists often prefer to use Scala.