@Felix thanks for your input.

"Shuffle data is serialized over the network, so when deserialized it is spilled to memory" ==> From my understanding, operators spill data to disk when it does not fit in memory, so spilling is not directly tied to the shuffle process itself. Furthermore, I have plenty of jobs with shuffles where no data spills at all.

"Shuffle spill (memory) is the size of the deserialized form of the data in memory at the time when we spill it, whereas shuffle spill (disk) is the size of the serialized form of the data on disk after we spill it. This is why the latter tends to be much smaller than the former." ==> In the present case the shuffle spill (disk) is zero, so I am still unsure what happened to the "shuffle spill (memory)" data.
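The size gap between the two spill metrics can be made concrete outside of Spark. This is a minimal illustration in plain Python (not Spark code; the record shape and sizes are made up for the example): the same records occupy far more space as deserialized in-memory objects than as serialized bytes, which is why "shuffle spill (memory)" is typically much larger than "shuffle spill (disk)" for the same spilled records.

```python
import pickle
import sys

# Hypothetical records standing in for shuffle data: (key, value) pairs.
records = [(i, "value_%d" % i) for i in range(10_000)]

# Rough in-memory footprint of the deserialized form: the list container
# plus each tuple and its fields. This mirrors "shuffle spill (memory)".
in_memory_bytes = sys.getsizeof(records) + sum(
    sys.getsizeof(t) + sys.getsizeof(t[0]) + sys.getsizeof(t[1])
    for t in records
)

# Serialized footprint, comparable to what would be written to disk.
# This mirrors "shuffle spill (disk)".
serialized_bytes = len(pickle.dumps(records))

print("deserialized (in memory):", in_memory_bytes, "bytes")
print("serialized (on disk):", serialized_bytes, "bytes")
```

On a typical CPython build the serialized form comes out several times smaller than the in-memory object graph, which matches the explanation quoted above.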
Running jobs with Spark 2.2, I noticed in the Spark webUI that spill occurs for some tasks.

I understand that on the reduce side, the reducer fetches the needed partitions (shuffle read), then performs the reduce computation using the executor's execution memory. As there was not enough execution memory, some data was spilled.

My questions: Am I correct? Where is the data spilled? The Spark webUI states that some data is spilled to memory ("shuffle spill (memory)"), but nothing is spilled to disk ("shuffle spill (disk)").

Thanks in advance for your help.
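For what it's worth, spilling pressure on the reduce side can usually be reduced by giving executors more execution memory. A hedged sketch using standard Spark configuration properties (the values and the job file name `my_job.py` are purely illustrative, not a recommendation):

```shell
# Illustrative values only: a larger executor heap and a higher
# spark.memory.fraction leave more room for execution memory,
# which reduces how often tasks need to spill.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.memory.fraction=0.6 \
  my_job.py
```

Whether this helps depends on the actual data skew and partition sizes; repartitioning the shuffled data more finely is another common way to keep each task's working set within execution memory.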