I have a Hive query running on Spark that never completes once it aggregates more than a certain number of records per key column in a table stored as Parquet.
I tried it with a few datasets:

1 317 474 rows -> 1 400 seconds (OVER (PARTITION BY key), #10000)
2 627 466 rows -> 1 460 seconds (OVER (PARTITION BY key), #20000)
14 548 806 rows -> never finishes (OVER (PARTITION BY key), #30000)
That is: SELECT SUM(col1) OVER (PARTITION BY key ORDER BY num_col2) FROM table1;
The executor logs show no activity after about an hour, and the YARN application keeps running indefinitely.
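If it helps, here is a sketch of how I think the per-key row counts (i.e. the size of each window partition) can be checked; table1 and key are the same names as in the query above, and rows_per_key is just an alias I made up:

SELECT key, COUNT(*) AS rows_per_key
FROM table1
GROUP BY key
ORDER BY rows_per_key DESC
LIMIT 10;
-- lists the largest keys, i.e. how many rows the biggest
-- OVER (PARTITION BY key ...) partition has to hold and sort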
How can I make sure that the query still performs well?