How to reduce 'SPILLED_RECORDS' in Hive with Tez? Running an INSERT ... SELECT (with DISTRIBUTE BY and SORT BY in the SELECT statement) into a partitioned, clustered ACID table.

Contributor

We are currently running a Hive job which inserts around 2 billion rows into an ACID table that is partitioned and clustered.

I see a huge number of 'SPILLED_RECORDS' and I'm not exactly sure how to fix or improve this. My understanding is that the more records are spilled, the higher the I/O and processing times. Any inputs are appreciated. Some Tez counters from the job are below, followed by a sketch of the statement shape:

org.apache.tez.common.counters.TaskCounter
    REDUCE_INPUT_GROUPS                1,653,127,325
    SPILLED_RECORDS                   11,490,401,485
    PHYSICAL_MEMORY_BYTES              6,732,617,089,024
    VIRTUAL_MEMORY_BYTES              13,973,924,044,800
    COMMITTED_HEAP_BYTES               6,732,617,089,024
    ADDITIONAL_SPILLS_BYTES_WRITTEN      572,880,403,808
    ADDITIONAL_SPILLS_BYTES_READ       1,540,736,899,809
    ADDITIONAL_SPILL_COUNT             6,965

HIVE
    RECORDS_IN_Map_1                   1,941,777,885

TaskCounter_Map_1_OUTPUT_Reducer_2
    SPILLED_RECORDS                    3,739,831,692

TaskCounter_Reducer_2_INPUT_Map_1
    SPILLED_RECORDS                    1,941,777,885

TaskCounter_Reducer_2_OUTPUT_Reducer_3
    SPILLED_RECORDS                    3,867,014,023
    ADDITIONAL_SPILLS_BYTES_READ       387,364,743,478
    ADDITIONAL_SPILLS_BYTES_WRITTEN    320,256,756,650

TaskCounter_Reducer_3_INPUT_Reducer_2
    ADDITIONAL_SPILLS_BYTES_WRITTEN    11,229,906,959
    SPILLED_RECORDS                    1,941,777,885
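
For context, a minimal sketch of the statement shape we run (table and column names are made up; the real query selects many more columns):

-- dynamic-partition insert with explicit redistribution and ordering
INSERT INTO TABLE acid_target PARTITION (ds)
SELECT id, val, ds
FROM staging_source
DISTRIBUTE BY ds
SORT BY id;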

2 Replies

Super Collaborator

ACID tables require a system-determined sort order, so you should not specify SORT BY. Also, since ACID tables have to be bucketed, the system determines which rows go to which writer based on the "CLUSTERED BY (...) INTO N BUCKETS" clause of the DDL, so the insert should not need DISTRIBUTE BY either.
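
For example, a minimal sketch with made-up table and column names (the bucket count is illustrative too): the routing and ordering come from the table definition, so the insert itself stays plain.

-- bucketing and ACID properties are declared once in the DDL
CREATE TABLE acid_target (
  id  BIGINT,
  val STRING
)
PARTITIONED BY (ds STRING)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- no DISTRIBUTE BY or SORT BY: the ACID writer decides which bucket file each row goes to
INSERT INTO TABLE acid_target PARTITION (ds)
SELECT id, val, ds
FROM staging_source;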

Contributor

@Eugene Koifman That helped reduce the spilled records from 11 billion to 5 billion. I was under the impression that inserting data into a partition is faster with a DISTRIBUTE BY. This was useful.

Also, I heard that compressing the intermediate files helps reduce the cost of the spilled records. Is that correct?

set mapreduce.map.output.compress=true;
set mapreduce.output.fileoutputformat.compress=true;
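
Since the job runs on the Tez engine rather than MapReduce, I assume the equivalent knobs would be something like the following, though I'm not certain these are the right properties for our version:

-- compress Hive intermediate data and the Tez shuffle (please correct me if these are wrong)
set hive.exec.compress.intermediate=true;
set tez.runtime.compress=true;
set tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;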

Or is there anything else we can do to optimize the query?