I have a Hive ORC table with 30 million records: a flat table with reg_id (bigint), product_family (string), product_score (double), and product_sales (int). I want to compute a dense rank for the top 5 products per reg_id. Each reg_id has anywhere from a minimum of 5 product lines up to the low thousands. My original query was the one below.
SELECT * FROM (
  SELECT *,
         DENSE_RANK() OVER (PARTITION BY reg_id ORDER BY product_score DESC, product_sales DESC) AS rank
  FROM SampleTable
) AS st
WHERE rank <= 5;
Running time was 10+ hours with reasonable container resources, and it never completed. The reason is that a single mapper was reading the whole dataset. I tried both MR and Tez with tuned parameters, and it still ran forever. (I have run rank(), dense_rank(), and row_number() over 150-million-record sets, partitioned by timestamp and other columns, on Cloudera Hive with the MR engine and Parquet files; run time there was 10 to 15 minutes depending on the selected and ordered-by columns.)
I then created an intermediate table bucketed on reg_id, so the data is split across multiple files and the dense rank query can read them somewhat in parallel.
CREATE TABLE SampleTempTable (
  reg_id BIGINT,
  product_family STRING,
  product_score DOUBLE,
  product_sales INT
)
CLUSTERED BY (reg_id) INTO 20 BUCKETS;
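Roughly, the intermediate table is loaded and then queried like this (a sketch only; the column names match the DDL above, and the exact settings in my runs may have differed):

-- Make sure the insert actually writes one file per bucket
set hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE SampleTempTable
SELECT reg_id, product_family, product_score, product_sales
FROM SampleTable;

-- Same ranking query, now reading the bucketed files in parallel
SELECT * FROM (
  SELECT *,
         DENSE_RANK() OVER (PARTITION BY reg_id ORDER BY product_score DESC, product_sales DESC) AS rank
  FROM SampleTempTable
) AS st
WHERE rank <= 5;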
With this intermediate table I can read in parallel: at least 5 mappers with the MR engine, or at least 16 containers launched for 16 mappers with Tez. That brought the run time down to a maximum of 4 hours. The total dataset size is only about 100 MB, so I am still not convinced by this run time; it should be 15 to 30 minutes.
I tried optimization parameters in MR/Tez, for example:
set hive.optimize.sort.dynamic.partition = false;
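Other knobs that typically control mapper/reducer parallelism for a query like this are along the lines of the following; the values here are illustrative assumptions, not the exact numbers from my runs:

-- Smaller input splits => more mappers (example values only)
set mapreduce.input.fileinputformat.split.maxsize = 16000000;
set tez.grouping.max-size = 16000000;
set tez.grouping.min-size = 8000000;
-- Lower bytes-per-reducer => more reducers for the PARTITION BY stage
set hive.exec.reducers.bytes.per.reducer = 16000000;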
Note: each reg_id has from 5 up to around 1000 product lines, so reg_id is not evenly distributed, but the dataset size per bucket is fairly even. I need to rank products per reg_id. The IO is where most of the time goes.
Any suggestion to bring the running time down to under 1 hour is appreciated.
Thank you for your reply. I tried all of those settings in testing, and the run time did not improve below 4 hours. After I increased the source table from 20 buckets to 30 buckets, the run time dropped by 30 minutes, for a total query run time of 3 hours 30 minutes. The input data size is not large, and I don't want to write many files that each hold too little data.