Created 03-30-2018 12:21 PM
I have a flat Hive ORC table with 30 million records and the columns reg_id (bigint), product_family (string), product_score (double), and product_sales (int), and I want to compute a dense rank of the top 5 products per reg_id. Each reg_id has anywhere from 5 to 1000s of product lines. My original query was:
SELECT *
FROM (
  SELECT *,
         DENSE_RANK() OVER (PARTITION BY reg_id ORDER BY product_score DESC, product_sales DESC) AS rank
  FROM SampleTable
) AS st
WHERE rank <= 5;
It ran for 10+ hours with reasonable container resources and never completed. The reason: a single mapper was reading the entire dataset. I tried both MR and Tez with tuned parameters, and it still ran forever. (For comparison, I have run rank(), dense_rank(), and row_number() over 150-million-record sets based on timestamp and other columns in Cloudera Hive on the MR engine with the Parquet file format, and run times were 10 to 15 minutes depending on the selected and ORDER BY columns.)
New Approach:
I then created an intermediate table bucketed on reg_id, so that the data is split into multiple files and the dense rank query can read them in parallel:
CREATE TABLE SampleTempTable (
  reg_id bigint,
  product_family string,
  product_score double,
  product_sales int
)
CLUSTERED BY (reg_id) INTO 20 BUCKETS;
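Populating it would need an insert along these lines (a sketch, not from the original post; on Hive versions before 2.0, hive.enforce.bucketing must be enabled so the insert actually writes 20 bucket files):

set hive.enforce.bucketing=true;  -- pre-Hive 2.0: make inserts honor the bucket spec
INSERT OVERWRITE TABLE SampleTempTable
SELECT reg_id, product_family, product_score, product_sales
FROM SampleTable;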
With this intermediate table I can read in parallel: at least 5 mappers on the MR engine, or at least 16 containers (16 mappers) on Tez. That brought the run time down to at most 4 hours. But the total dataset is only ~100 MB, so I am still not satisfied with this run time; it should be 15 to 30 minutes.
I tried the following optimization parameters on MR/Tez:
SET hive.vectorized.execution.enabled=true;         -- vectorized map-side execution
SET hive.vectorized.execution.reduce.enabled=true;  -- vectorized reduce-side execution
SET hive.execution.engine=tez;
SET hive.exec.parallel=true;                        -- run independent stages in parallel
SET hive.exec.parallel.thread.number=8;
SET mapreduce.input.fileinputformat.split.maxsize=128000000;  -- cap split size at ~128 MB
SET hive.optimize.sort.dynamic.partition=false;
Note: each reg_id has between 5 and 1000 product lines, so reg_id is not evenly distributed, but the dataset size is evenly distributed across buckets. I need to rank products per reg_id. The I/O is what is taking most of the time.
Any suggestion to bring the run time down to under 1 hour is appreciated.
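For anyone digging into this, one diagnostic (a sketch, not something run in this thread) is EXPLAIN, which shows the Tez vertices and whether the DENSE_RANK ends up in a single reducer:

EXPLAIN
SELECT *
FROM (
  SELECT *,
         DENSE_RANK() OVER (PARTITION BY reg_id ORDER BY product_score DESC, product_sales DESC) AS rank
  FROM SampleTempTable
) AS st
WHERE rank <= 5;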
Created 04-01-2018 03:05 PM
Try using the following properties to control the mapper count for your job.
set tez.grouping.min-size=16777216;   -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split
These parameters control the number of mappers for splittable formats on Tez. Please post your results after running with these properties.
Created 04-02-2018 08:22 AM
Thanks for your input. I added those two properties and reran my dense rank query. It took the same time (~4 hours); no improvement.
Created 04-01-2018 03:21 PM
How many nodes do you have? How much RAM? How much disk space is used by those 30 million rows? Please post the table DDL. How is the data formatted? How big is each row? Can you post a row? https://community.hortonworks.com/articles/68631/optimizing-hive-queries-for-orc-formatted-tables.ht...
Created 04-02-2018 08:48 AM
Thank you for your reply. I tried all of those settings in testing, and the run time did not improve below 4 hours. After I increased the source table from 20 to 30 buckets, the run time dropped by 30 minutes, to a total of 3 hours 30 minutes. The input data size is not large, and I don't want to write many files each holding too little data.
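For reference, the DESCRIBE FORMATTED output below reconstructs to roughly this CREATE statement for the 30-bucket intermediate table (a sketch; the exact statement was not posted):

CREATE TABLE SampleTempTable (
  reg_id bigint,
  product_family string,
  product_score double,
  product_sales bigint
)
CLUSTERED BY (reg_id) INTO 30 BUCKETS
STORED AS ORC;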
DDL:
# col_name data_type comment
reg_id bigint
product_family string
product_score double
product_sales bigint
# Detailed Table Information
Database: default
Owner:
CreateTime: Tue Mar 27 10:58:41 EDT 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://xxx.xx.xxxx
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 30
numRows 32788685
rawDataSize 3901853515
totalSize 204127850
transient_lastDdlTime 1522162963
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 30
Bucket Columns: [reg_id]
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 1.772 seconds, Fetched: 3
Each container: 8192 MB memory, 1 vCore. Application Master: 4096 MB memory, 1 vCore.
Available Resources: <memory:1851392, vCores:275> Used Resources: <memory:102400, vCores:13>
Source Data Sample:
2234 TOMMY HILFIGER HOME 0.176216039651281134 9171
4222 CHARTER CLUB 0.165046936631880472259 610
2234 AUTHENTIC COMFORT 0.17621603965128113 4901
4222 PHILOSOPHY 0.8252346831594022254 575
2234 WEATHERPROOF VINTAGE 0.1762160396512811317 3671
A 4-column ORC table holding only a couple hundred MB of data, with far more resources available than required, and the query still runs for 3 hours 30 minutes.
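One avenue not tried in this thread (a sketch offered as an assumption, not a confirmed fix): a PARTITION BY ... ORDER BY window function is evaluated on the reduce side, so mapper-side split tuning alone may not shorten it. Lowering the per-reducer data target is one way to raise reducer parallelism:

set hive.exec.reducers.bytes.per.reducer=16000000;  -- ~16 MB per reducer => more reducers for the ranking stage
-- then rerun the DENSE_RANK query from above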