
SparkSQL: Understanding different runtimes


I set up an HDP 2.6 cluster, where I created a huge Hive table (50 million rows, about 500 GB of ORC files).

Now I'm playing around in Zeppelin and querying the data, and I discovered the following behavior:

When I run a query with a LIMIT clause and show the resulting DataFrame, it takes ~3 seconds to finish. The Spark History UI shows 1 task.

%spark2
// Clear any cached data first so the timings are comparable between runs
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
df.show()
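
Out of curiosity I also printed the physical plan of the limited query (just a sketch, assuming the standard explain() output; I'd expect the final LIMIT to show up as a single CollectLimit step, which would fit the 1-task behavior):

%spark2
// Print the physical plan of the limited query. If the LIMIT is planned as a
// CollectLimit step, show() can stop after reading only a few partitions.
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
df.explain(true)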

Afterwards, I derived a second DataFrame from it, just to play around with new columns, nothing specific (see the query below). Now the query takes minutes, and the Spark History UI shows ~4000 tasks (still with the LIMIT clause).

%spark2
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
// With these rows it takes minutes!
val df2 = df.withColumn("valuelist_array", df.col("valuelist"))
df2.show()
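
Comparing the physical plans of the two DataFrames might show what changed (again only a sketch; my assumption is that the limit gets planned differently once another operation sits on top of it):

%spark2
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
val df2 = df.withColumn("valuelist_array", df.col("valuelist"))
df.explain()  // limit is the final step of the plan here
df2.explain() // limit sits below the new projection; presumably planned differently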

I get the same behavior when I just call the cache method before showing the DataFrame: it takes minutes and about 4000 tasks (with the LIMIT clause):

%spark2
// Takes also minutes now, when I add the cache call!
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10").cache()
df.show()
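
As a workaround I'm experimenting with collecting the 10 rows once and rebuilding a small local DataFrame from them, which should be cheap to cache (a sketch, assuming Zeppelin's %spark2 interpreter exposes the spark session and the usual createDataFrame API; mydb.mytable and valuelist are from my table above):

%spark2
// Collect the 10 rows to the driver, then build a tiny DataFrame from them,
// so that cache() materializes 10 rows instead of re-scanning the big table.
val limited = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
val localRows = limited.collect()
val small = spark.createDataFrame(spark.sparkContext.parallelize(localRows), limited.schema)
small.cache()
small.show()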

Why does Spark use ~4000 tasks (it looks like a full table scan) when I reuse the DataFrame or when I want to cache it?

Thanks for your helpful explanations!


@Daniel Müller, have you tried the same with Spark 2.2? A number of ORC improvements have been introduced in Spark 2.2. You can read more here:

https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
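
For example (a sketch based on the settings described in that article; I'm assuming these flag names from the HDP Spark 2.2 build, they may differ in other distributions):

%spark2
// Assumed flags from the linked article for the new ORC support in HDP Spark 2.2:
spark.conf.set("spark.sql.orc.enabled", "true")        // switch to the newer ORC reader
spark.conf.set("spark.sql.orc.filterPushdown", "true") // push filters down into ORC reads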

HTH



Yes, I'm already using Spark 2.2 (I changed the configuration of my %spark interpreter).