
SparkSQL: Hive sub-query leads to full table scan


Expert Contributor

I have a huge Hive table that I want to process in Spark. To avoid a full table scan during testing, I first run this code:

%spark
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
df1.show()

That runs very fast (about 1 second). Now I added another query to "transform" the result a little bit, and now Hive seems to perform a full table scan: this execution takes more than 10 minutes!

%spark
val df2 = sqlContext.sql("""
  SELECT concat_ws(":",
           lpad(cast((hashvalue % 200) as STRING), 3, "0"),
           concat_ws("_", name, number, test, step, cast(starttime as STRING))) AS key,
         hashvalue,
         valuelist
  FROM (SELECT * FROM mydb.mytable LIMIT 10) a""")
df2.show()
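One workaround I have tried in similar situations (a sketch only, not verified against this table; `sample10` is a hypothetical temp-table name) is to force the limited result to materialize first, then run the transformation against the cached sample, so the outer query never touches the base table:

```scala
// Sketch: cache and materialize the 10-row sample first, then transform it.
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
df1.cache()
df1.count()                          // forces the LIMIT to execute once
df1.registerTempTable("sample10")    // Spark 1.x DataFrame API

// The transformation now reads from the cached sample, not the Hive table.
val df2 = sqlContext.sql("""
  SELECT concat_ws(":",
           lpad(cast((hashvalue % 200) as STRING), 3, "0"),
           concat_ws("_", name, number, test, step, cast(starttime as STRING))) AS key,
         hashvalue,
         valuelist
  FROM sample10""")
df2.show()
```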

1. Is this normal behavior?

2. Is there a way to avoid full table scans, when using sub-queries?

Thank you!

2 Replies

Re: SparkSQL: Hive sub-query leads to full table scan

Super Collaborator

I am not completely sure, but I think I came across a note saying that the LIMIT clause causes a repartitioning, so using it has a significant performance impact. Instead, you could use TABLESAMPLE, or rewrite the query if it matters which rows you get (and not only how many).
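For example, Hive's block sampling reads only a subset of HDFS blocks rather than the whole table (a sketch; the exact behavior depends on the table's storage format, bucketing, and Hive version, and the 0.01 percent figure is just an illustrative value):

```scala
// Sketch: sample a fraction of the table's data blocks instead of using LIMIT.
// Block sampling can avoid a full scan because only some blocks are read.
val sampled = sqlContext.sql(
  "SELECT * FROM mydb.mytable TABLESAMPLE(0.01 PERCENT) s")
sampled.show()
```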

Re: SparkSQL: Hive sub-query leads to full table scan

Expert Contributor

Thank you! Can you give me some more details about this, or do you have some helpful links?