Member since: 10-05-2016 · Posts: 4 · Kudos Received: 0 · Solutions: 0
02-23-2019
06:00 PM
For example, in one of my DAGs, all those tasks do is sortWithinPartitions (so no shuffle), yet it still spills data to disk because the partition size is huge and Spark resorts to an external merge sort. As a result, I see a high Shuffle Spill (Memory) and also some Shuffle Spill (Disk), even though there is no shuffle here. On the other hand, you can argue that the sorting process moves data around in order to sort, so it's a kind of internal shuffle 🙂
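To make the spill mechanism concrete, here is a minimal pure-Python sketch of the external merge sort idea (not Spark's actual implementation): when a chunk exceeds a memory budget, it is sorted and spilled to a temporary file, and the sorted runs are k-way merged at the end. The `chunk_size` and the data are made up for the example.

```python
import heapq
import tempfile

def external_sort(values, chunk_size=4):
    """Sort an iterable that 'doesn't fit in memory' by spilling
    sorted chunks to temporary files, then k-way merging the runs."""
    spill_files = []
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            # "Memory" is full: sort this chunk and spill it to disk.
            chunk.sort()
            f = tempfile.TemporaryFile(mode="w+")
            f.writelines(f"{x}\n" for x in chunk)
            f.seek(0)
            spill_files.append(f)
            chunk = []
    chunk.sort()  # last chunk stays in memory
    # Merge the in-memory chunk with every on-disk sorted run.
    runs = [(int(line) for line in f) for f in spill_files]
    return list(heapq.merge(chunk, *runs))

print(external_sort([9, 1, 7, 3, 8, 2, 6, 5, 4]))
```

The same shape explains why a shuffle-free sort can still show disk spill: the spill is driven by partition size versus available memory, not by data movement between executors.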
02-23-2019
05:57 PM
I agree with 1 — it shouldn't be called just Shuffle Spill; the title should be more generic than that. For 2, I think it's the task's peak deserialized data in memory: the maximum it used at any point, even if the task has since finished. It could be GC'd from that executor by then.
02-02-2018
08:11 PM
With HiveServer2 you can also submit jobs on Spark if Spark is configured as the execution engine in Hive, right?
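For reference, a sketch of what that looks like from a Beeline session against HiveServer2 (this assumes Hive on Spark is already installed and configured on the cluster; `hive.execution.engine` is a real Hive property, while `my_table` is a hypothetical table name):

```sql
-- Switch this session's execution engine from the default (mr/tez) to Spark.
SET hive.execution.engine=spark;

-- This query is now compiled and submitted as a Spark job by HiveServer2.
SELECT count(*) FROM my_table;
```

The engine can also be set cluster-wide in hive-site.xml instead of per session.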
10-05-2016
08:51 PM
If you think real-time stream processing can be done without a streaming framework like Kafka in the middle, why are all your use cases above based on Kafka? 🙂 Kafka is not just a persistent store; it's a highly scalable messaging queue that can feed data from multiple data sources to your target framework (Spark Streaming, Storm, etc.). I wonder how you would directly feed RDBMS data in real time to Storm or Spark without any middleware messaging system like Kafka.
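A toy illustration of the decoupling argument (a plain-Python queue standing in for a broker topic — this is not Kafka's API, and the source/consumer names and sample rows are invented): several independent producers write to one buffer, and the stream processor consumes without knowing where the data came from.

```python
from queue import Queue
from threading import Thread

broker = Queue()  # stand-in for a Kafka topic

def rdbms_source():
    # Hypothetical producer polling a database table for new rows.
    for row in [("users", 1), ("users", 2)]:
        broker.put(row)

def log_source():
    # A second, independent producer tailing a web log.
    for line in ["GET /", "POST /login"]:
        broker.put(("log", line))

def stream_processor(n):
    # Stand-in for a Spark Streaming / Storm consumer.
    return [broker.get() for _ in range(n)]

for t in (Thread(target=rdbms_source), Thread(target=log_source)):
    t.start()
    t.join()

out = stream_processor(4)
print(out)
```

Remove the broker and each processor would need bespoke, tightly coupled connectors to every source — which is exactly the gap Kafka fills.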