Member since: 10-05-2016 · Posts: 4 · Kudos Received: 0 · Solutions: 0
02-23-2019
06:00 PM
For example, in one of my DAGs, all those tasks do is sortWithinPartitions (so no shuffle), yet it still spills data to disk because the partition size is huge and Spark resorts to an external merge sort. As a result, I see a high Shuffle Spill (Memory) and also some Shuffle Spill (Disk), even though there is no shuffle here. On the other hand, you can argue that the sorting process moves data around in order to sort, so it's a kind of internal shuffle 🙂
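To make the spill mechanism concrete, here is a minimal pure-Python sketch of the external merge sort idea (not Spark's actual implementation): when a chunk exceeds a memory budget, it is sorted and spilled to a temporary file, and the sorted runs are k-way merged at the end. The `chunk_size` and the data are made up for the example.

```python
import heapq
import tempfile

def external_sort(values, chunk_size=4):
    """Sort an iterable that 'doesn't fit in memory' by spilling
    sorted chunks to temporary files, then k-way merging the runs."""
    spill_files = []
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            # "Memory" is full: sort this chunk and spill it to disk.
            chunk.sort()
            f = tempfile.TemporaryFile(mode="w+")
            f.writelines(f"{x}\n" for x in chunk)
            f.seek(0)
            spill_files.append(f)
            chunk = []
    chunk.sort()  # last chunk stays in memory
    # Merge the in-memory chunk with every on-disk sorted run.
    runs = [(int(line) for line in f) for f in spill_files]
    return list(heapq.merge(chunk, *runs))

print(external_sort([9, 1, 7, 3, 8, 2, 6, 5, 4]))
```

The same shape explains why a shuffle-free sort can still show disk spill: the spill is driven by partition size versus available memory, not by data movement between executors.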
02-23-2019
05:57 PM
I agree with 1 — it shouldn't be called just Shuffle Spill; the title should be more generic than that. For 2, I think it's the task's peak deserialized data in memory: the maximum it used at any point, even if the task has since finished. It could be GC'd from that executor by then.
02-02-2018
08:11 PM
With HiveServer2 you can also submit jobs on Spark if Spark is configured as the execution engine in Hive, right?
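For reference, a sketch of what that looks like from a Beeline session against HiveServer2 (this assumes Hive on Spark is already installed and configured on the cluster; `hive.execution.engine` is a real Hive property, while `my_table` is a hypothetical table name):

```sql
-- Switch this session's execution engine from the default (mr/tez) to Spark.
SET hive.execution.engine=spark;

-- This query is now compiled and submitted as a Spark job by HiveServer2.
SELECT count(*) FROM my_table;
```

The engine can also be set cluster-wide in hive-site.xml instead of per session.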
10-05-2016
08:51 PM
If you think real-time stream processing can be done without a streaming framework like Kafka in the middle, why are all your use cases above based on Kafka? 🙂 Kafka is not just a persistent store; it's a highly scalable messaging queue that can feed data from multiple data sources to your target framework (Spark Streaming, Storm, etc.). I wonder how you would directly feed RDBMS data in real time to Storm or Spark without any middleware messaging system like Kafka.
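A toy illustration of the decoupling argument (a plain-Python queue standing in for a broker topic — this is not Kafka's API, and the source/consumer names and sample rows are invented): several independent producers write to one buffer, and the stream processor consumes without knowing where the data came from.

```python
from queue import Queue
from threading import Thread

broker = Queue()  # stand-in for a Kafka topic

def rdbms_source():
    # Hypothetical producer polling a database table for new rows.
    for row in [("users", 1), ("users", 2)]:
        broker.put(row)

def log_source():
    # A second, independent producer tailing a web log.
    for line in ["GET /", "POST /login"]:
        broker.put(("log", line))

def stream_processor(n):
    # Stand-in for a Spark Streaming / Storm consumer.
    return [broker.get() for _ in range(n)]

for t in (Thread(target=rdbms_source), Thread(target=log_source)):
    t.start()
    t.join()

out = stream_processor(4)
print(out)
```

Remove the broker and each processor would need bespoke, tightly coupled connectors to every source — which is exactly the gap Kafka fills.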