I set up an HDP 2.6 cluster, where I created a large Hive table (50 million rows, about 500 GB of ORC files).
Now I'm playing around in Zeppelin and querying the data, and I discovered the following behavior:
When I run a query with a LIMIT clause and print the resulting DataFrame, it takes ~3 seconds to finish. The Spark History UI shows 1 task.
%spark2
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
df.show()
Afterwards, I added a second DataFrame, just to play around with new columns, nothing specific (see query below). Now the query takes minutes, and the Spark History UI shows ~4000 tasks (still with the LIMIT clause).
%spark2
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
// With these rows it takes minutes!
val df2 = df.withColumn("valuelist_array", df.col("valuelist"))
df2.show()
The behavior is the same when I just call the cache method before showing the DataFrame: it takes minutes and about 4000 tasks (still with the LIMIT clause):
%spark2
// It also takes minutes now, when I add the cache call!
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10").cache()
df.show()
Why does Spark run ~4000 tasks (it looks like a full table scan) when I reuse the DataFrame or want to cache it?
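One way to see what changes between the two cases is to compare the physical plans with `explain()`. This is a sketch assuming the same Zeppelin session and table names as above; the exact plan node names depend on your Spark version. Typically the plain LIMIT query ends in a `CollectLimit`, which lets `show()` stop after a handful of rows, while `cache()` wraps the plan in an `InMemoryRelation` that is materialized across all input partitions when first used.

```scala
%spark2
// Sketch: inspect the physical plans of the fast and slow variants.
// Plan node names may differ between Spark versions.
val fast = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
fast.explain()  // expect a CollectLimit above the ORC scan, so show()
                // can stop as soon as 10 rows have been produced

val slow = fast.cache()
slow.explain()  // expect an InMemoryRelation; materializing the cache
                // evaluates the underlying scan over all partitions
```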
Thanks for your helpful explanations!
@Daniel Müller have you tried the same with Spark 2.2? A number of improvements for ORC have been introduced in Spark 2.2. You can read more here:
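For reference, a couple of ORC-related settings are worth checking before and after upgrading. This is a sketch, not verified on your cluster; whether these options exist and what they default to depends on your Spark/HDP build, so consult the configuration docs for your version.

```scala
%spark2
// Hedged sketch: ORC-related settings that commonly affect scan performance.
spark.conf.set("spark.sql.orc.filterPushdown", "true")       // push predicates down into the ORC reader
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true") // use Spark's ORC reader for Hive ORC tables
```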