I set up an HDP 2.6 cluster, where I created a large Hive table (50 million rows, about 500 GB of ORC files).
Now I'm playing around in Zeppelin and querying the data, and I discovered the following behavior:
When I run a query with a LIMIT clause and print the resulting DataFrame, it takes ~3 seconds to finish. The Spark History UI shows 1 task.
%spark2
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
df.show()
Afterwards, I added a second DataFrame, just to play around with new columns, nothing specific (see query below). Now the query takes minutes, and the Spark History UI shows ~4000 tasks (still with the LIMIT clause).
%spark2
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
// With these rows it takes minutes!
val df2 = df.withColumn("valuelist_array", df.col("valuelist"))
df2.show()
The behavior is the same when I just call the cache method before showing the DataFrame: it takes minutes and about 4000 tasks (still with the LIMIT clause):
%spark2
// It also takes minutes now, when I add the cache call!
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10").cache()
df.show()
Why does Spark run ~4000 tasks (it looks like a full table scan) when I reuse the DataFrame or want to cache it?
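One way to see what changes between the two cases is to compare the physical plans with `explain()`. This is a sketch assuming the same Zeppelin session and table names as above; the exact plan node names depend on your Spark version. Typically the plain LIMIT query ends in a `CollectLimit`, which lets `show()` stop after a handful of rows, while `cache()` wraps the plan in an `InMemoryRelation` that is materialized across all input partitions when first used.

```scala
%spark2
// Sketch: inspect the physical plans of the fast and slow variants.
// Plan node names may differ between Spark versions.
val fast = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
fast.explain()  // expect a CollectLimit above the ORC scan, so show()
                // can stop as soon as 10 rows have been produced

val slow = fast.cache()
slow.explain()  // expect an InMemoryRelation; materializing the cache
                // evaluates the underlying scan over all partitions
```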
Thanks for your helpful explanations!
@Daniel Müller have you tried the same with Spark 2.2? A number of improvements for ORC have been introduced in Spark 2.2. You can read more here:
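For reference, a couple of ORC-related settings are worth checking before and after upgrading. This is a sketch, not verified on your cluster; whether these options exist and what they default to depends on your Spark/HDP build, so consult the configuration docs for your version.

```scala
%spark2
// Hedged sketch: ORC-related settings that commonly affect scan performance.
spark.conf.set("spark.sql.orc.filterPushdown", "true")       // push predicates down into the ORC reader
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true") // use Spark's ORC reader for Hive ORC tables
```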