Hi, I have a SQL query:
SELECT *
FROM
(
SELECT a, b, ROW_NUMBER() OVER (PARTITION BY x, y ORDER BY create_time DESC) AS rn
FROM huge_table h
LEFT JOIN small_table s ON h.c = s.id
WHERE s.dt='2020-02-02'
) t
WHERE rn=1
- huge_table has 8 billion rows; small_table has 1.5 million rows after the dt filter
But in the Tez counters I frequently see:
- RECORDS_OUT_INTERMEDIATE_Map_1 and RS_22 RECORDS_OUT_OPERATOR_RS_27 go to 300+ billion
Why is this happening?
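For scale, 300+ billion records out of 8 billion input rows is roughly 37 or more output rows per huge_table row. One check I can run on my side, assuming (not confirmed) that the blow-up comes from repeated join keys in small_table, is to look for id values that occur more than once within the filtered partition. This is only a diagnostic sketch; the cnt alias and the LIMIT of 20 are arbitrary choices for illustration:

```sql
-- Diagnostic sketch: find join keys that appear more than once in
-- small_table for the filtered date. Each duplicate of s.id multiplies
-- the matching huge_table rows in the join on h.c = s.id.
SELECT id, COUNT(*) AS cnt
FROM small_table
WHERE dt = '2020-02-02'
GROUP BY id
HAVING COUNT(*) > 1
ORDER BY cnt DESC
LIMIT 20;
```

If this returns no rows, duplicate keys are not the explanation and the cause is something else.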
I also see ADDITIONAL_SPILLS_BYTES_WRITTEN at 500422346859 (~500 GB, or about 466 GiB). Considering that the ORC files in huge_table total only ~500GB, is this weird? There are 4345 files and 5472 mappers; why does it require so much additional spilling?
Thanks!