Member since: 01-24-2014
Posts: 5
Kudos Received: 0
Solutions: 0
04-02-2018
08:48 AM
@Timothy Spann Thank you for your reply. I tried all of those attributes in testing, but the run time did not improve below 4 hours. After I increased the source table from 20 buckets to 30 buckets, the run time dropped by 30 minutes, so the total run time of the query is now 3 hours 30 minutes. The input data size is not large, and I don't want to write many files with too few records each.

DDL (DESCRIBE FORMATTED output):

# col_name              data_type    comment
reg_id                  bigint
product_family          string
product_score           double
product_sales           bigint

# Detailed Table Information
Database:               default
Owner:
CreateTime:             Tue Mar 27 10:58:41 EDT 2018
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://xxx.xx.xxxx
Table Type:             MANAGED_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE    true
    numFiles                 30
    numRows                  32788685
    rawDataSize              3901853515
    totalSize                204127850
    transient_lastDdlTime    1522162963

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            30
Bucket Columns:         [reg_id]
Sort Columns:           []
Storage Desc Params:
    serialization.format    1
Time taken: 1.772 seconds, Fetched: 3

Cluster resources:
Each Container Resource:    8192 memory, 1 VCores
Application Master:         4096 memory, 1 VCores
Available Resources:        <memory:1851392, vCores:275>
Used Resources:             <memory:102400, vCores:13>

Source Data Sample (reg_id, product_family, product_score, product_sales):
2234    TOMMY HILFIGER HOME     0.176216039651281134       9171
4222    CHARTER CLUB            0.165046936631880472259    610
2234    AUTHENTIC COMFORT       0.17621603965128113        4901
4222    PHILOSOPHY              0.8252346831594022254      575
2234    WEATHERPROOF VINTAGE    0.1762160396512811317      3671

So this is a four-column ORC table holding only a small amount of data, and even with far more resources available than required the query still runs for 3 hours 30 minutes.
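For reference, a minimal sketch of how the 30-bucket source table could be rebuilt (the table name SampleTable_30b is hypothetical; column names follow the DDL above, the original data is assumed to live in SampleTable as in the query further down, and hive.enforce.bucketing may be needed on older Hive releases so the insert honors the bucket definition):

-- Hypothetical rebuild of the source table with 30 buckets
SET hive.enforce.bucketing=true;
CREATE TABLE SampleTable_30b (
  reg_id bigint,
  product_family string,
  product_score double,
  product_sales bigint)
CLUSTERED BY (reg_id) INTO 30 BUCKETS
STORED AS ORC;
-- Repopulate from the original table so rows are hashed into 30 files
INSERT OVERWRITE TABLE SampleTable_30b
SELECT reg_id, product_family, product_score, product_sales
FROM SampleTable;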
04-02-2018
08:22 AM
@Rahul Soni Thanks for your input. I added those two properties and reran my dense rank query. It took the same time (~4 hours), with no improvement.
03-30-2018
12:21 PM
I have a Hive ORC flat table with 30 million records and columns reg_id (bigint), product_family (string), product_score (double), and product_sales (int), and I want to compute a dense rank of the top 5 products associated with each reg_id. Each reg_id can have from a minimum of 5 to thousands of product lines. My original query was:

SELECT *
FROM (
  SELECT *,
         DENSE_RANK() OVER (PARTITION BY reg_id
                            ORDER BY product_score DESC, product_sales DESC) AS rank
  FROM SampleTable) AS st
WHERE rank <= 5;
The running time was 10+ hours with reasonable container resources, and the query never completed. The reason was that a single mapper was reading the whole dataset. I tried both MR and Tez with tuned key parameters, and it still ran forever. (I have run rank(), dense_rank(), and row_number() over 150-million-record sets, partitioned by timestamp and other columns, in Cloudera Hive on the MR engine with the Parquet file format, and the running time was 10 to 15 minutes depending on the selected and ordered-by columns.)

New approach: I chose reg_id as the bucketing column for an intermediate table, so that the dataset is split into multiple files and the dense rank query can read them somewhat in parallel:

CREATE TABLE SampleTempTable (
reg_id bigint,
product_family string,
product_score double,
product_sales int)
CLUSTERED BY (reg_id) INTO 20 BUCKETS;

With this intermediate table I can read in parallel: at least 5 mappers on the MR engine, or at least 16 containers launched for 16 mappers on Tez. That brought the run time down to a maximum of 4 hours. The total dataset size is only 100 MB, so I am still not convinced by this run time; it should be 15 to 30 minutes. I tried the following optimization parameters in MR/Tez:

SET hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.execution.engine=tez;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
set mapreduce.input.fileinputformat.split.maxsize=128000000;
set hive.optimize.sort.dynamic.partition = false;
Note: each reg_id has from 5 up to 1000 product lines, so reg_id is not evenly distributed, although the size of the dataset is evenly distributed across files. I need to rank products per reg_id, and the I/O is where most of the time goes. Any suggestion to bring the running time down to 1 hour is appreciated.
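For completeness, a sketch of how the intermediate bucketed table would be populated and then queried (assuming the original data lives in SampleTable, as in the query above; hive.enforce.bucketing may be required on older Hive releases so the insert honors the bucket definition):

SET hive.enforce.bucketing=true;
-- Populate the bucketed intermediate table from the original flat table
INSERT OVERWRITE TABLE SampleTempTable
SELECT reg_id, product_family, product_score, product_sales
FROM SampleTable;

-- Run the ranking against the bucketed copy so multiple files can be read in parallel
SELECT *
FROM (
  SELECT *,
         DENSE_RANK() OVER (PARTITION BY reg_id
                            ORDER BY product_score DESC, product_sales DESC) AS rank
  FROM SampleTempTable) AS st
WHERE rank <= 5;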
01-24-2014
06:25 AM
Thank you for your prompt reply. Just to confirm once more: if I take the exam on my scheduled date of Jan 30th, will it be based on the new exam format with 4 sections?
01-24-2014
03:57 AM
I am scheduled to take the CCDH 410 certification exam on Jan 30th, 2014, and have been preparing from the old study guide for the last 3 months. Recently a new study guide was published here, and it seems the exam pattern is also going to change accordingly. Could you please clarify whether the exam pattern has been updated to match the new study guide?
Labels:
- Certification