Created 09-06-2017 09:51 PM
Currently, we are running a Hive job that inserts around 2 billion rows into an ACID table which is partitioned and clustered.
I see a huge number of SPILLED_RECORDS and I'm not sure how to fix or improve this. My understanding is that the more records get spilled, the higher the I/O and processing times. Any inputs are appreciated. Some Tez stats of the job are inline:
org.apache.tez.common.counters.TaskCounter | REDUCE_INPUT_GROUPS | 1653127325 |
org.apache.tez.common.counters.TaskCounter | SPILLED_RECORDS | 11490401485 |
org.apache.tez.common.counters.TaskCounter | PHYSICAL_MEMORY_BYTES | 6732617089024 |
org.apache.tez.common.counters.TaskCounter | VIRTUAL_MEMORY_BYTES | 13973924044800 |
org.apache.tez.common.counters.TaskCounter | COMMITTED_HEAP_BYTES | 6732617089024 |
org.apache.tez.common.counters.TaskCounter | ADDITIONAL_SPILLS_BYTES_WRITTEN | 572880403808 |
org.apache.tez.common.counters.TaskCounter | ADDITIONAL_SPILLS_BYTES_READ | 1540736899809 |
org.apache.tez.common.counters.TaskCounter | ADDITIONAL_SPILL_COUNT | 6965 |
HIVE | RECORDS_IN_Map_1 | 1941777885 |
TaskCounter_Map_1_OUTPUT_Reducer_2 | SPILLED_RECORDS | 3739831692 |
TaskCounter_Reducer_2_INPUT_Map_1 | SPILLED_RECORDS | 1941777885 |
TaskCounter_Reducer_2_OUTPUT_Reducer_3 | SPILLED_RECORDS | 3867014023 |
TaskCounter_Reducer_2_OUTPUT_Reducer_3 | ADDITIONAL_SPILLS_BYTES_READ | 387364743478 |
TaskCounter_Reducer_2_OUTPUT_Reducer_3 | ADDITIONAL_SPILLS_BYTES_WRITTEN | 320256756650 |
TaskCounter_Reducer_3_INPUT_Reducer_2 | ADDITIONAL_SPILLS_BYTES_WRITTEN | 11229906959 |
TaskCounter_Reducer_3_INPUT_Reducer_2 | SPILLED_RECORDS | 1941777885 |
Created 09-06-2017 11:07 PM
ACID tables require a system-determined sort order, so you should not specify Sort By. Also, since ACID tables have to be bucketed, the system determines which rows go to which writer based on the "Clustered By (...) into N buckets" clause of the DDL, so you should not need Distribute By either.
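For reference, a minimal sketch of the pattern described above. The table, column, and bucket-count choices are hypothetical and the syntax assumes a Hive 1.x/2.x-style transactional table; the point is that the bucketing lives in the DDL and the INSERT carries no Distribute By or Sort By:

-- Hypothetical partitioned, bucketed ACID table
CREATE TABLE sales_acid (
  id BIGINT,
  amount DECIMAL(12,2)
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Session settings typically needed for ACID writes and dynamic partitions
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Plain insert: Hive routes rows to bucket writers itself,
-- so no DISTRIBUTE BY / SORT BY clause is appended
INSERT INTO TABLE sales_acid PARTITION (sale_date)
SELECT id, amount, sale_date
FROM staging_sales;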
Created 09-07-2017 08:32 PM
@Eugene Koifman That helped reduce the spilled records from 11 billion to 5 billion. I was under the impression that inserting data into a partition is faster with a Distribute By; this was useful.
Also, I heard that compressing the intermediate files helps reduce spilled records. Is that correct?
set mapreduce.map.output.compress=true;
set mapreduce.output.fileoutputformat.compress=true;
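(Since the counters above come from a Tez job, the intermediate-compression knobs would presumably be the Hive/Tez ones rather than the mapreduce.* properties; a sketch, assuming Hive on Tez, with exact property names and defaults depending on the Hive and Tez versions in use:)

-- Compress Hive intermediate data and Tez shuffle output
set hive.exec.compress.intermediate=true;
set tez.runtime.compress=true;
set tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;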
Or anything else we can do to optimize the query?