I have a HIVE ORC table with 30 millions records flat table with reg_id(bigint), product_family (string), product_score(double), product_sales(int) , and wanted to do dense rank for top 5 products associated with reg_id. each reg_id may have more product lines minimum 5 to maximum 1000's. So my old query was the below
SELECT * from (SELECT *,DENSE_RANK() OVER ( PARTITION BY reg_id ORDER BY product_score DESC, product_sales DESC) AS rank FROM SampleTable) AS st WHERE rank<=5;
Running time was 10+ hours with reasonable container resources. Not completed it. Reason being that, one mapper was reading the dataset. Tried in MR and Tez with optimized key parameters, still it was running forever. (I have run rank(), dense_rank(), row_number() functions in 150 millions record sets based on timestamp and other columns in Cloudera Hive MR Engine with Parquet file format, it was 10 to 15 minutes running time based on Selected and Ordered by columns.)
Then I chose reg_id as bucketed column to create a intermediate table. So that i can split the datasets to create multiple files and somewhat i can read in parallel for dense rank query.
CREATE TABLE SampleTempTable ( reg_id bigint, product_family string, product_score double, product_sales int) CLUSTERED BY (reg_id) INTO 20 BUCKETS;<br>
With this intermediate table option, i can manage to read parallel, MR( atleast 5 mappers) engine or Tez (atleast 16 containers launched for 16 Mappers ). Managed to bring the run time to Max 4 hours. The total data set size is 100 MB only. Still I am not convinced by this run time. It supposed to be 15mins to 30mins.
I tried the below optimization parameters in MR/Tez:
SET hive.vectorized.execution.enabled=true; set hive.vectorized.execution.reduce.enabled=true; set hive.execution.engine=tez; set hive.exec.parallel=true; set hive.exec.parallel.thread.number=8; set mapreduce.input.fileinputformat.split.maxsize=128000000; set hive.optimize.sort.dynamic.partition = false;
Note: reg_id has minimum 5 product lines to 1000 lines, so user_id is not evenly distributed, but the size of the dataset is evenly distributed. I need to rank products per reg_id. IO operation is taking more time.
Any suggestion to bring down the running time 1 hour is appreciated.
Try using the following properties to control the mapper count for your job.
set tez.grouping.min-size=16777216; --16 MB min splitset tez.grouping.max-size=1073741824; --1 GB max split
These parameters will control the number of mappers for splittable formats with Tez. Please update your results after using these properties for your execution.
How many nodes do you have? how much RAM? How much disk space used by those 300 million rows? please post the table DDL. How is the data formatted? How big is each row? can you post a row? https://community.hortonworks.com/articles/68631/optimizing-hive-queries-for-orc-formatted-tables.ht...
Thank you for your reply. All of those attributes I tried in testing and was not improved less than 4 hours. After I increased 20 buckets to 30 buckets in source table, 30 minutes reduced in run time. Total run time of the query 3 hour 30 minute. Input data size is not more and i dont want to write multiple files with too less data sets.
# col_name data_type comment
# Detailed Table Information
CreateTime: Tue Mar 27 10:58:41 EDT 2018
Protect Mode: None
Table Type: MANAGED_TABLE
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
Num Buckets: 30
Bucket Columns: [reg_id]
Sort Columns: 
Storage Desc Params:
Time taken: 1.772 seconds, Fetched: 3
Each Container - Resource: Resource:8192 Memory, 1 VCores Application Master: 4096 Memory, 1 VCores
Available Resources: <memory:1851392, vCores:275> Used Resources: <memory:102400, vCores:13>
Source Data Sample:
2234 TOMMY HILFIGER HOME 0.176216039651281134 9171
4222 CHARTER CLUB 0.165046936631880472259 610
2234 AUTHENTIC COMFORT 0.17621603965128113 4901
4222 PHILOSOPHY 0.8252346831594022254 575
2234 WEATHERPROOF VINTAGE 0.1762160396512811317 3671
4 columns with few MB's ORC data table, in more than required resources availability, query runs 3 hour 30 minutes.