How to reduce 'SPILLED_RECORDS' in Hive with Tez? Running an INSERT ... SELECT (with DISTRIBUTE BY and SORT BY in the SELECT statement) into a partitioned, clustered ACID table.
Labels: Apache Hive, Apache Tez
Created 09-06-2017 09:51 PM
Currently we are running a Hive job that inserts around 2 billion rows into an ACID table which is partitioned and clustered.
I see a huge number of SPILLED_RECORDS and I'm not exactly sure how to fix or improve this. My understanding is that the more records spill, the higher the I/O and processing times. Any inputs are appreciated. Some Tez counters from the job are inline below, followed by one tuning idea I have been considering:
| Counter group | Counter | Value |
| --- | --- | --- |
| org.apache.tez.common.counters.TaskCounter | REDUCE_INPUT_GROUPS | 1653127325 |
| org.apache.tez.common.counters.TaskCounter | SPILLED_RECORDS | 11490401485 |
| org.apache.tez.common.counters.TaskCounter | PHYSICAL_MEMORY_BYTES | 6732617089024 |
| org.apache.tez.common.counters.TaskCounter | VIRTUAL_MEMORY_BYTES | 13973924044800 |
| org.apache.tez.common.counters.TaskCounter | COMMITTED_HEAP_BYTES | 6732617089024 |
| org.apache.tez.common.counters.TaskCounter | ADDITIONAL_SPILLS_BYTES_WRITTEN | 572880403808 |
| org.apache.tez.common.counters.TaskCounter | ADDITIONAL_SPILLS_BYTES_READ | 1540736899809 |
| org.apache.tez.common.counters.TaskCounter | ADDITIONAL_SPILL_COUNT | 6965 |
| HIVE | RECORDS_IN_Map_1 | 1941777885 |
| TaskCounter_Map_1_OUTPUT_Reducer_2 | SPILLED_RECORDS | 3739831692 |
| TaskCounter_Reducer_2_INPUT_Map_1 | SPILLED_RECORDS | 1941777885 |
| TaskCounter_Reducer_2_OUTPUT_Reducer_3 | SPILLED_RECORDS | 3867014023 |
| TaskCounter_Reducer_2_OUTPUT_Reducer_3 | ADDITIONAL_SPILLS_BYTES_READ | 387364743478 |
| TaskCounter_Reducer_2_OUTPUT_Reducer_3 | ADDITIONAL_SPILLS_BYTES_WRITTEN | 320256756650 |
| TaskCounter_Reducer_3_INPUT_Reducer_2 | ADDITIONAL_SPILLS_BYTES_WRITTEN | 11229906959 |
| TaskCounter_Reducer_3_INPUT_Reducer_2 | SPILLED_RECORDS | 1941777885 |
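One idea so far (the values are guesses, not tuned for this workload): enlarge the Tez in-memory sort buffer so fewer records overflow to disk, and raise the merge fan-in so spilled segments are re-read in fewer passes.

-- Guesses, not tuned: a larger sort buffer means fewer overflow spills.
-- tez.runtime.io.sort.mb must fit inside the Tez container heap
-- (hive.tez.container.size), so size the two together.
set tez.runtime.io.sort.mb=1024;
-- A higher merge fan-in merges more spill segments per pass.
set tez.runtime.io.sort.factor=100;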
Created 09-06-2017 11:07 PM
ACID tables require a system-determined sort order, so you should not specify SORT BY. Also, since ACID tables must be bucketed, the system decides which rows go to which writer from the CLUSTERED BY (...) INTO N BUCKETS clause of the DDL, so the insert does not need DISTRIBUTE BY either.
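For example, a minimal sketch (table, columns, and bucket count are hypothetical):

-- The CLUSTERED BY clause already tells Hive which rows go to which
-- bucket writer, and ACID inserts enforce their own internal sort order.
CREATE TABLE events_acid (id BIGINT, payload STRING)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- No DISTRIBUTE BY or SORT BY in the insert; Hive routes the rows itself.
INSERT INTO TABLE events_acid PARTITION (event_date)
SELECT id, payload, event_date FROM staging_events;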
Created 09-07-2017 08:32 PM
@Eugene Koifman That helped reduce SPILLED_RECORDS from 11 billion to 5 billion. I was under the impression that inserting data into a partition is faster with a DISTRIBUTE BY. This was useful.
Also, I have heard that compressing the intermediate files helps reduce spilled records. Is that correct?
set mapreduce.map.output.compress=true;
set mapreduce.output.fileoutputformat.compress=true;
Is there anything else we can do to optimize the query?
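If it matters, we run on Tez, so I assume the intermediate-compression equivalents would be the settings below rather than the mapreduce.* ones (the Snappy codec is an assumption about what our cluster provides):

-- Assumed Tez-engine equivalents; the codec depends on what is installed.
set hive.exec.compress.intermediate=true;
set tez.runtime.compress=true;
set tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;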
