Support Questions

Find answers, ask questions, and share your expertise

SparkSQL: Hive sub-query leads to full table scan

Expert Contributor

I have a huge Hive table that I want to process in Spark. To avoid a full table scan during testing, I first run this code:

val df1 = sqlContext.sql("select * from mydb.mytable limit 10")

That runs very fast (about 1 second). Then I added another query to "transform" the result a little bit, and now Hive seems to do a full table scan: this execution takes more than 10 minutes!

val df2 = sqlContext.sql("""
  SELECT concat_ws(":",
           lpad(cast((hashvalue % 200) as STRING), 3, "0"),
           concat_ws("_", name, number, test, step, cast(starttime as STRING))) as key,
         hashvalue, valuelist
  FROM (select * from mydb.mytable limit 10) a""")
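One workaround I am considering (just a sketch; it assumes that caching the first result and querying the cached temp table keeps Spark from planning the transform against the Hive table again):

```scala
// Cache the small result once and register it as a temp table,
// then run the transform against the cached copy instead of
// the original Hive table (Spark 1.x sqlContext API, as above).
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
df1.cache()
df1.count()  // force materialization of the cache
df1.registerTempTable("mytable_sample")

val df2 = sqlContext.sql("""
  SELECT concat_ws(":",
           lpad(cast((hashvalue % 200) as STRING), 3, "0"),
           concat_ws("_", name, number, test, step, cast(starttime as STRING))) as key,
         hashvalue, valuelist
  FROM mytable_sample""")
```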

1. Is this normal behavior?

2. Is there a way to avoid full table scans, when using sub-queries?

Thank you!


Super Collaborator

I am not completely sure, but I think I came across some information that the LIMIT clause causes a repartitioning, so you get a significant performance impact by using it. Instead you could use TABLESAMPLE, or rewrite the query if it matters which rows you get (and not only how many).
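For illustration, a minimal sketch of the TABLESAMPLE approach (this assumes Hive's block-sampling syntax is supported for your table's file format; with percent sampling, Hive can read only a fraction of the input splits instead of scanning everything):

```scala
// Block sampling: Hive picks roughly 0.01 percent of the input
// data at the split level, so most of the table is never read.
// Note the sampled rows are arbitrary, not a deterministic "first 10".
val sampled = sqlContext.sql(
  "SELECT * FROM mydb.mytable TABLESAMPLE(0.01 PERCENT) s")
```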

Expert Contributor

Thank you! Can you give me some details about this or do you have some helpful links?