Created 09-06-2017 09:51 PM
Currently, we are running a Hive job that inserts around 2 billion rows into an ACID table which is partitioned and clustered.
I see a huge number of SPILLED_RECORDS and I'm not sure how to fix or improve this. My understanding is that the more records get spilled, the higher the I/O and processing times. Any inputs are appreciated. Some Tez stats of the job are inline:
org.apache.tez.common.counters.TaskCounter | REDUCE_INPUT_GROUPS | 1653127325 |
org.apache.tez.common.counters.TaskCounter | SPILLED_RECORDS | 11490401485 |
org.apache.tez.common.counters.TaskCounter | PHYSICAL_MEMORY_BYTES | 6732617089024 |
org.apache.tez.common.counters.TaskCounter | VIRTUAL_MEMORY_BYTES | 13973924044800 |
org.apache.tez.common.counters.TaskCounter | COMMITTED_HEAP_BYTES | 6732617089024 |
org.apache.tez.common.counters.TaskCounter | ADDITIONAL_SPILLS_BYTES_WRITTEN | 572880403808 |
org.apache.tez.common.counters.TaskCounter | ADDITIONAL_SPILLS_BYTES_READ | 1540736899809 |
org.apache.tez.common.counters.TaskCounter | ADDITIONAL_SPILL_COUNT | 6965 |
HIVE | RECORDS_IN_Map_1 | 1941777885 |
TaskCounter_Map_1_OUTPUT_Reducer_2 | SPILLED_RECORDS | 3739831692 |
TaskCounter_Reducer_2_INPUT_Map_1 | SPILLED_RECORDS | 1941777885 |
TaskCounter_Reducer_2_OUTPUT_Reducer_3 | SPILLED_RECORDS | 3867014023 |
TaskCounter_Reducer_2_OUTPUT_Reducer_3 | ADDITIONAL_SPILLS_BYTES_READ | 387364743478 |
TaskCounter_Reducer_2_OUTPUT_Reducer_3 | ADDITIONAL_SPILLS_BYTES_WRITTEN | 320256756650 |
TaskCounter_Reducer_3_INPUT_Reducer_2 | ADDITIONAL_SPILLS_BYTES_WRITTEN | 11229906959 |
TaskCounter_Reducer_3_INPUT_Reducer_2 | SPILLED_RECORDS | 1941777885 |
Created 09-06-2017 11:07 PM
ACID tables require a system-determined sort order, so you should not specify Sort By. Also, since ACID tables have to be bucketed, the system determines which rows go to which writer based on the "Clustered By (...) into N buckets" clause of the DDL, so you should not need Distribute By either.
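For reference, a minimal sketch of the pattern described above. The table, column, and bucket-count choices are hypothetical and the syntax assumes a Hive 1.x/2.x-style transactional table; the point is that the bucketing lives in the DDL and the INSERT carries no Distribute By or Sort By:

-- Hypothetical partitioned, bucketed ACID table
CREATE TABLE sales_acid (
  id BIGINT,
  amount DECIMAL(12,2)
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Session settings typically needed for ACID writes and dynamic partitions
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Plain insert: Hive routes rows to bucket writers itself,
-- so no DISTRIBUTE BY / SORT BY clause is appended
INSERT INTO TABLE sales_acid PARTITION (sale_date)
SELECT id, amount, sale_date
FROM staging_sales;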
Created 09-07-2017 08:32 PM
@Eugene Koifman That helped reduce the spilled records from 11 billion to 5 billion. I was under the impression that inserting data into a partition is faster with a Distribute By; this was useful.
Also, I heard that compressing the intermediate files helps reduce spilled records. Is that correct?
set mapreduce.map.output.compress=true;
set mapreduce.output.fileoutputformat.compress=true;
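(Since the counters above come from a Tez job, the intermediate-compression knobs would presumably be the Hive/Tez ones rather than the mapreduce.* properties; a sketch, assuming Hive on Tez, with exact property names and defaults depending on the Hive and Tez versions in use:)

-- Compress Hive intermediate data and Tez shuffle output
set hive.exec.compress.intermediate=true;
set tez.runtime.compress=true;
set tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;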
Or anything else we can do to optimize the query?