Created 08-15-2018 07:35 AM
I have a huge Hive table that I want to process in Spark. To avoid a full table scan for my test purposes, I first run this code:
%spark
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
df1.show()
That runs very fast (about 1 second). Now I have added another query to "transform" the result a little bit, and now Hive seems to do a full table scan, as this execution takes more than 10 minutes!
%spark
val df2 = sqlContext.sql("SELECT concat_ws(\":\", lpad(cast((hashvalue % 200) as String), 3, \"0\"), concat_ws(\"_\", name, number, test, step, cast(starttime as STRING))) as key, hashvalue, valuelist from (select * from mydb.mytable limit 10) a")
df2.show()
1. Is this normal behavior?
2. Is there a way to avoid full table scans when using sub-queries?
Thank you!
Created 08-15-2018 10:00 AM
I am not completely sure, but I think I came across information that the LIMIT statement causes a repartitioning, so you get a significant performance impact by using it. Instead you could use TABLESAMPLE, or rewrite the query if it is important which rows you get (and not only how many). A rough sketch is below.
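For example, something like this (untested sketch against the mydb.mytable from your post; the 1 percent fraction is just a placeholder, and whether the TABLESAMPLE syntax is accepted depends on your Hive/Spark version; sample() on the DataFrame side should give a similar effect):

%spark
// Hive-style block sampling instead of LIMIT (sketch; exact syntax depends on your Hive/Spark version)
val dfSample = sqlContext.sql("SELECT * FROM mydb.mytable TABLESAMPLE(1 PERCENT) s")
dfSample.show()

// DataFrame-side alternative: read the table and sample a small fraction without replacement
val dfSample2 = sqlContext.table("mydb.mytable").sample(false, 0.01)
dfSample2.show()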
Created 09-03-2018 07:15 AM
Thank you! Can you give me some details about this or do you have some helpful links?