I have a huge Hive table that I want to process in Spark. To avoid a full table scan during my tests, I first run this code:
%spark
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
df1.show()
That runs very fast (about 1 second). Then I added another query to "transform" the result a little bit. Now Hive seems to do a full table scan, because this execution takes more than 10 minutes!
%spark
val df2 = sqlContext.sql("""
  SELECT concat_ws(":",
                   lpad(cast((hashvalue % 200) as STRING), 3, "0"),
                   concat_ws("_", name, number, test, step, cast(starttime as STRING))) as key,
         hashvalue,
         valuelist
  FROM (select * from mydb.mytable limit 10) a""")
df2.show()
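One workaround worth trying (a sketch only, assuming Spark 1.x with `sqlContext` and the column names from the query above) is to materialize the limited result first and apply the transformation with the DataFrame API, so the LIMIT is resolved before the expression is evaluated:

```scala
// Sketch: limit first, then transform the small result with selectExpr.
// Assumes Spark 1.x (sqlContext) and the columns used in the question
// (hashvalue, name, number, test, step, starttime, valuelist).
val limited = sqlContext.sql("select * from mydb.mytable limit 10")

val df2 = limited.selectExpr(
  """concat_ws(":",
               lpad(cast((hashvalue % 200) as STRING), 3, "0"),
               concat_ws("_", name, number, test, step, cast(starttime as STRING))) as key""",
  "hashvalue",
  "valuelist")
df2.show()
```

Whether this actually avoids the scan depends on how the optimizer plans the limit in your Spark version, so check the physical plan with `df2.explain()`.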
1. Is this normal behavior?
2. Is there a way to avoid full table scans, when using sub-queries?
I am not completely sure, but I think I came across information that the LIMIT clause causes a repartitioning, so using it can have a significant performance impact. Instead you should use TABLESAMPLE, or rewrite the query if it matters which rows you get (and not only how many).
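For example, with Hive's sampling syntax (a sketch only; the bucket count, the row count, and the sampling column are illustrative, and TABLESAMPLE support depends on your Hive/Spark version):

```sql
-- Sample one bucket out of 200, bucketing on hashvalue (illustrative values);
-- this is most efficient when the table is already bucketed on that column.
SELECT * FROM mydb.mytable TABLESAMPLE(BUCKET 1 OUT OF 200 ON hashvalue) s;

-- Or take a small fixed number of rows per input split:
SELECT * FROM mydb.mytable TABLESAMPLE(10 ROWS) s;
```

Note that sampling gives you *some* rows cheaply, not a deterministic set, which is usually fine for test purposes.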