Created 08-15-2018 07:35 AM
I have a huge Hive table that I want to process in Spark. To avoid a full table scan for my test purposes, I first run this code:
%spark
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
df1.show()
That runs very fast (about 1 second). Now I have added another query to "transform" the result a little bit, and now Hive seems to do a full table scan, as this execution takes more than 10 minutes!
%spark
val df2 = sqlContext.sql("SELECT concat_ws(\":\", lpad(cast((hashvalue % 200) as String), 3, \"0\"), concat_ws(\"_\", name, number, test, step, cast(starttime as STRING))) as key, hashvalue, valuelist from (select * from mydb.mytable limit 10) a")
df2.show()
1. Is this normal behavior?
2. Is there a way to avoid full table scans when using sub-queries?
Thank you!
Created 08-15-2018 10:00 AM
I am not completely sure, but I think I came across information that the LIMIT statement causes a repartitioning, so you get a significant performance impact by using it. Instead you could use TABLESAMPLE, or rewrite the query if it is important which rows you get (and not only how many). A rough sketch is below.
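For example, something like this (untested sketch against the mydb.mytable from your post; the 1 percent fraction is just a placeholder, and whether the TABLESAMPLE syntax is accepted depends on your Hive/Spark version; sample() on the DataFrame side should give a similar effect):

%spark
// Hive-style block sampling instead of LIMIT (sketch; exact syntax depends on your Hive/Spark version)
val dfSample = sqlContext.sql("SELECT * FROM mydb.mytable TABLESAMPLE(1 PERCENT) s")
dfSample.show()

// DataFrame-side alternative: read the table and sample a small fraction without replacement
val dfSample2 = sqlContext.table("mydb.mytable").sample(false, 0.01)
dfSample2.show()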
Created 09-03-2018 07:15 AM
Thank you! Can you give me some details about this or do you have some helpful links?