Hi,
I was wondering how to tune or configure Tez so that Hive gets performance comparable to what I see with Spark / Spark SQL.
Currently, I have a table that I need to scan to grab all rows matching a certain column value. The table is partitioned daily, with ~100,000,000 rows per day. In Spark SQL, a simple spark.sql("select * from table where col=12345 limit 10000").show(false) finishes in 5-10 minutes, while the same query in Hive (Hive on Tez) runs for more than 20-30 minutes, at which point I kill it. It is also worth noting that the Hive query occupies pretty much 100% of the cluster, while Spark SQL only goes up to 50%.
The cluster is currently running with ~4 TB on YARN. I can provide more details; I just don't know exactly what would be useful to share at the moment.
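If it helps, below is the kind of output I could collect from a Hive session and post here. This is only a sketch: my_table stands in for the real table name, and the settings listed are standard Hive/Tez properties that I am guessing might be relevant, not ones I have confirmed to be the cause.

```sql
-- my_table is a placeholder for the real table name (hypothetical).

-- Print the current values of a few execution settings.
-- "SET <key>;" without a value only displays the setting, it changes nothing.
SET hive.execution.engine;
SET hive.vectorized.execution.enabled;
SET hive.fetch.task.conversion;
SET hive.tez.container.size;
SET tez.grouping.min-size;
SET tez.grouping.max-size;

-- Storage format, location and partition columns of the table.
DESCRIBE FORMATTED my_table;

-- Query plan for the slow statement, without actually running it.
EXPLAIN SELECT * FROM my_table WHERE col = 12345 LIMIT 10000;
```

Let me know if any of this would be useful, or if something else would be more informative.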
BR,
Dan