Created 01-09-2016 05:52 PM
I have orc.stripe.size=67108864
What stripe size should I expect for a small ORC file (for example 2 MB, 350K records)?
I thought that in this case the file would contain one stripe of about 2 MB. However, in orcfiledump I see 426 stripes! Why?
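For reference, a minimal sketch of how orc.stripe.size is typically set as a table property (the table and column names here are just an example; the value is the same 64 MB default mentioned above, and the size can also be set globally via hive.exec.orc.default.stripe.size):
CREATE TABLE example_orc (col1 string, col2 string, col3 string)
STORED AS ORC
TBLPROPERTIES ("orc.stripe.size"="67108864", "orc.compress"="ZLIB");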
Created 01-11-2016 05:22 PM
Thanks for the help, but I still don't understand.
My ORC file has all default settings, including orc.stripe.size=67108864 and orc.compress=ZLIB.
HDFS block size = 128 MB
Why does a 2 MB ORC file have 426 stripes?
Created 01-11-2016 08:29 PM
Can you please provide the beginning of the output of 'hive --orcfiledump $pathtoOrcFile'?
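For example (the path is just a placeholder; substitute the actual file under the table's warehouse directory, and piping through head is enough to show the header and the first stripes):
hive --orcfiledump /apps/hive/warehouse/<db>/<table>/000000_0 | head -n 50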
Created 01-13-2016 08:58 AM
@Joseph Niemiec Here is output of orcfiledump
Structure for /apps/hive/warehouse/xxxx/000006_0
File Version: 0.12 with HIVE_8732
Rows: 357239
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string,_col2:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 2176
    Column 1: count: 2176 min: 005056BF6CA51ED5A0FA3CF37E9803E0 max: 3640B59A1A7F1ED3AFCA4A98D9929B1F sum: 69632
    Column 2: count: 2176 min: 0000001287 max: 0800133136 sum: 21760
    Column 3: count: 2176 min: ., A013 max: TG_TRIG_APPLE_OCT2015_W1 sum: 15320
  Stripe 2:
    Column 0: count: 687
    Column 1: count: 687 min: 3640B59A1A7F1ED3AFCA42E0BAE75B19 max: 3640B59A1A7F1ED3AFCA4A98D9929B1F sum: 21984
    Column 2: count: 687 min: 0800134310 max: 0800303673 sum: 6870
    Column 3: count: 687 min: ., A013 max: ., A691 sum: 4809
  Stripe 3:
    Column 0: count: 288
    Column 1: count: 288 min: 3640B59A1A7F1ED3AFCA42E0BAE75B19 max: 3640B59A1A7F1ED3AFCA4A98D9929B1F sum: 9216
    Column 2: count: 288 min: 0800304109 max: 0800372594 sum: 2880
    Column 3: count: 288 min: ., A013 max: ., A691 sum: 2016
  ........
  Stripe 426:
    Column 0: count: 46106
    Column 1: count: 46106 min: 005056BF6CA51ED5A0FA3CF37E97C3E0 max: 3640B59A1A7F1ED3AFCA4A98D9929B1F sum: 1475392
    Column 2: count: 46106 min: 0839185621 max: 0839464656 sum: 461060
    Column 3: count: 46106 min: ., A013 max: TG_TRIG_APPLE_OCT2015_W1 sum: 323199

File Statistics:
  Column 0: count: 357239
  Column 1: count: 357239 min: 005056BF6CA51ED5A0FA3CF37E97C3E0 max: 3640B59A1A7F1ED3AFCA4A98D9929B1F sum: 11431648
  Column 2: count: 357239 min: 0000001287 max: 0839464656 sum: 3572390
  Column 3: count: 357239 min: ., A013 max: TG_TRIG_APPLE_OCT2015_W2 sum: 2502199

Stripes:
  Stripe: offset: 3 data: 15609 rows: 2176 tail: 105 index: 196
    Stream: column 0 section ROW_INDEX start: 3 length 10
    Stream: column 1 section ROW_INDEX start: 13 length 87
    Stream: column 2 section ROW_INDEX start: 100 length 43
    Stream: column 3 section ROW_INDEX start: 143 length 56
    Stream: column 1 section DATA start: 199 length 3075
    Stream: column 1 section LENGTH start: 3274 length 7
    Stream: column 1 section DICTIONARY_DATA start: 3281 length 1381
    Stream: column 2 section DATA start: 4662 length 7276
    Stream: column 2 section LENGTH start: 11938 length 26
    Stream: column 3 section DATA start: 11964 length 3063
    Stream: column 3 section LENGTH start: 15027 length 11
    Stream: column 3 section DICTIONARY_DATA start: 15038 length 770
    Encoding column 0: DIRECT
    Encoding column 1: DICTIONARY_V2[351]
    Encoding column 2: DIRECT_V2
    Encoding column 3: DICTIONARY_V2[351]
  .......
  Stripe: offset: 2486960 data: 253432 rows: 46106 tail: 113 index: 427
    Stream: column 0 section ROW_INDEX start: 2486960 length 17
    Stream: column 1 section ROW_INDEX start: 2486977 length 141
    Stream: column 2 section ROW_INDEX start: 2487118 length 159
    Stream: column 3 section ROW_INDEX start: 2487277 length 110
    Stream: column 1 section DATA start: 2487387 length 59834
    Stream: column 1 section LENGTH start: 2547221 length 7
    Stream: column 1 section DICTIONARY_DATA start: 2547228 length 1374
    Stream: column 2 section DATA start: 2548602 length 131452
    Stream: column 2 section LENGTH start: 2680054 length 131
    Stream: column 3 section DATA start: 2680185 length 59841
    Stream: column 3 section LENGTH start: 2740026 length 11
    Stream: column 3 section DICTIONARY_DATA start: 2740037 length 782
    Encoding column 0: DIRECT
    Encoding column 1: DICTIONARY_V2[355]
    Encoding column 2: DIRECT_V2
    Encoding column 3: DICTIONARY_V2[355]

File length: 2761506 bytes
Padding length: 0 bytes
Padding ratio: 0%
Created 01-13-2016 12:25 PM
That is very curious. I have seen lots of stripes being created because of memory problems, but normally the writer only gets down to about 5000 rows per stripe before running out of memory.
Which version of Hive are you using? What are your memory settings for the Hive tasks? And if the file is small, is it possible that the table is partitioned and the task is writing into a large number of partitions at the same time?
Can you share the LOAD command and the table layout?
Created 01-14-2016 06:15 AM
Hive 0.14
hive.execution.engine=tez
tez.am.resource.memory.mb=5120
tez.task.resource.memory.mb=5120
mapreduce.map.memory.mb=5120
mapreduce.reduce.memory.mb=8192
mapreduce.map.java.opts=-Xmx4096mm
apreduce.reduce.java.opts=-Xmx4096m
yarn.scheduler.minimum-allocation-mb=5120
hive.merge.orcfile.stripe.level=true
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
upd:
hive.tez.container.size=1536
hive.tez.java.opts=-Xmx1229m
tez.runtime.io.sort.mb=2047
tez.runtime.unordered.output.buffer.size-mb=384
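(For anyone reproducing this: the Hive-side settings above can be applied per session with SET, as in the sketch below, with the values copied from the list; the YARN/Tez memory settings generally come from the cluster config, so the list just shows what was in effect.)
SET hive.execution.engine=tez;
SET hive.tez.container.size=1536;
SET hive.tez.java.opts=-Xmx1229m;
SET hive.merge.orcfile.stripe.level=true;
SET hive.merge.smallfiles.avgsize=16000000;
SET hive.merge.size.per.task=256000000;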
The table is partitioned daily. Each partition folder contains one file. The average size of an ORC file is 30 MB. Some files contain 400 stripes, some 15 stripes.
SAP Data Services loads data to my table by hiveql file:
CREATE TABLE bods.mytable_sapds_4b87 (
  tggrp_id string,
  bpartner string,
  tggrp_txt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '1';
LOAD DATA LOCAL INPATH '/hdfs/gate/xxx/hiveLoad*.dat' INTO TABLE bods.mytable_sapds_4b87;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
USE bods;
FROM bods.mytable_sapds_4b87
INSERT INTO TABLE bods.my_orc_table PARTITION(calday)
SELECT tggrp_id, bpartner, tggrp_txt, calday;
DROP TABLE bods.mytable_sapds_4b87;
create table my_orc_table (
  tggrp_id string,
  bpartner string,
  tggrp_txt string)
PARTITIONED BY (calday string)
STORED AS ORC;
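If it helps, a quick way to double-check how the target table is laid out (standard Hive commands, using the table name from the DDL above):
DESCRIBE FORMATTED bods.my_orc_table;   -- storage format, location, table properties
SHOW PARTITIONS bods.my_orc_table;      -- one calday partition per daily load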
Created 01-14-2016 06:23 PM
Apart from apreduce.reduce.java.opts=-Xmx4096m missing an m (which I don't think is the problem):
How many days are you loading? You are essentially doing a dynamic partition insert, so the task needs to keep an open ORC writer buffer in memory for every day it loads into. If you load into a lot of days at once, this might be the reason.
Possible solutions:
a) Try to load one day and see if that makes it better.
b) use dynamic sorted partitioning (slide 16); this should in theory fix the problem if that is the reason (see the sketch after the link below)
c) use manual distribution (slide 19)
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
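For b), a rough sketch using the tables from your post (the setting is hive.optimize.sort.dynamic.partition; the slides have the details):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.optimize.sort.dynamic.partition=true;   -- sorted dynamic partitioning
FROM bods.mytable_sapds_4b87
INSERT INTO TABLE bods.my_orc_table PARTITION(calday)
SELECT tggrp_id, bpartner, tggrp_txt, calday;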
Created 01-19-2016 05:50 AM
Thanks a lot for the slides! It's very useful.
I load one day of data at a time; it is a daily process.
What I tried:
1) Increased tez container size and orc.memory.pool as on slide 12
2) Increased mapreduce.reduce.java.opts=-Xmx6554m (instead of -Xmx4096m)
3) set hive.optimize.sort.dynamic.partition=true
4) used global sorting (order by; see the sketch below)
All attempts except 4) changed nothing. I got the desired result only with 4) order by: one stripe for small files and a big reduction in the number of stripes for large files. The large files also became significantly smaller and query speed increased. I am happy with the results 🙂
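Roughly what 4) looks like in the insert (the sort keys here are only an example to show where the ORDER BY goes):
INSERT INTO TABLE bods.my_orc_table PARTITION(calday)
SELECT tggrp_id, bpartner, tggrp_txt, calday
FROM bods.mytable_sapds_4b87
ORDER BY calday, tggrp_id;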
Thanks!
Created 01-25-2016 10:00 AM
Hmmmm, weird, the order shouldn't really make a difference. I assume it added a reducer step; that's the only explanation I have. Adding a distribute by would most likely also have helped. But sorting is good for predicate pushdown, so as long as everything works ... 🙂
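For completeness, the distribute by variant would look something like this (same tables and columns as in the earlier post); it routes each calday to a single reducer so every partition is written by one ORC writer, without the cost of a full global sort:
INSERT INTO TABLE bods.my_orc_table PARTITION(calday)
SELECT tggrp_id, bpartner, tggrp_txt, calday
FROM bods.mytable_sapds_4b87
DISTRIBUTE BY calday;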