SparkSQL: Hive sub-query leads to full table scan

Expert Contributor

I have a huge Hive table that I want to process in Spark. To avoid a full table scan during my tests, I first run this code:

%spark
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
df1.show()

That runs very fast (about 1 second). Then I added another query to "transform" the result a little bit, and now Hive seems to do a full table scan: this execution takes more than 10 minutes!

%spark
val df2 = sqlContext.sql("""
  SELECT
    concat_ws(":",
              lpad(cast((hashvalue % 200) as String), 3, "0"),
              concat_ws("_", name, number, test, step, cast(starttime as STRING))) as key,
    hashvalue,
    valuelist
  FROM (select * from mydb.mytable limit 10) a
""")
df2.show()
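
To see where the time goes, I guess the query plans of both DataFrames could be compared, e.g. whether the LIMIT is applied before or after the projection (just a diagnostic idea, I have not analyzed the plans in detail):

%spark
// Print the extended (logical + physical) plans for comparison.
// If the plan of df2 scans mydb.mytable before the limit is applied,
// that would explain the long runtime.
df1.explain(true)
df2.explain(true)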

1. Is this normal behavior?

2. Is there a way to avoid full table scans when using sub-queries?

Thank you!

2 Replies

Super Collaborator

I am not completely sure, but I think I came across some information that the LIMIT statement causes a repartitioning, so you get a significant performance impact by using it. Instead you should use TABLESAMPLE, or rewrite the query if it is important which rows you get (and not only the limit). See the sketch below.
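
A rough, untested sketch of what I mean, using the table name from your question (exact TABLESAMPLE support depends on your Hive/Spark versions, and I have not verified the caching behavior on your setup; the key expression is a simplified version of yours):

%spark
// Hive block sampling instead of LIMIT - ideally only a fraction of the
// underlying files is read instead of the whole table being scanned.
val sampled = sqlContext.sql("select * from mydb.mytable TABLESAMPLE(0.1 PERCENT) s")
sampled.show()

// Alternative "rewrite": materialize the 10-row LIMIT result once
// (cache + count) and run the transformation on that temp table, so the
// outer query no longer has to touch the original table.
val small = sqlContext.sql("select * from mydb.mytable limit 10")
small.cache()
small.count()                        // forces the LIMIT to be evaluated
small.registerTempTable("mytable_sample")
val transformed = sqlContext.sql("select concat_ws('_', name, number) as key, hashvalue, valuelist from mytable_sample")
transformed.show()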

Expert Contributor

Thank you! Can you give me some details about this or do you have some helpful links?