I have a huge Hive table that I want to process in Spark. To avoid a full table scan for my test purposes, I first run this code:
%spark
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
df1.show()
That runs very fast (about 1 second). Then I added another query to "transform" the result a little bit, and now Hive seems to perform a full table scan, because this execution takes more than 10 minutes:
%spark
val df2 = sqlContext.sql("""
  SELECT
    concat_ws(":",
              lpad(cast((hashvalue % 200) as string), 3, "0"),
              concat_ws("_", name, number, test, step, cast(starttime as string))) as key,
    hashvalue,
    valuelist
  FROM (select * from mydb.mytable limit 10) a
""")
df2.show()
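To see where the time goes, one thing I can compare is the physical plan of both DataFrames (explain(true) is the standard DataFrame method for this; I have not included the output here):
%spark
// Prints the parsed, analyzed, optimized and physical plans to stdout,
// which should show whether the second query plans a full scan of mydb.mytable.
df1.explain(true)
df2.explain(true)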
1. Is this normal behavior?
2. Is there a way to avoid full table scans when using sub-queries (see the sketch below for what I have in mind)?
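For reference, this is roughly what I would like to end up with instead: applying the transformation to the already-limited df1 via the DataFrame API rather than nesting the limit in a sub-query. This is only a sketch (selectExpr is the standard DataFrame method, and the column names are the ones from my table), and I don't know whether it actually avoids the scan, which is exactly my question:
%spark
// Sketch: build on the already-limited DataFrame instead of a nested sub-query.
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
val df2 = df1.selectExpr(
  """concat_ws(":",
               lpad(cast((hashvalue % 200) as string), 3, "0"),
               concat_ws("_", name, number, test, step, cast(starttime as string))) as key""",
  "hashvalue",
  "valuelist")
df2.show()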
Thank you!