I upgraded my Cloudera cluster to CDH 5.7.0 (which ships Spark 1.6.0).
As per the Spark 1.6.0 configuration documentation, "spark.memory.useLegacyMode" controls the memory-management behavior, and its default value is false.
But I have seen that the memory behavior of Spark 1.6.0 is the same as Spark 1.5.0. Only when I explicitly set "spark.memory.useLegacyMode" to false does Spark 1.6.0's memory management appear to take effect.
Is the default value of "spark.memory.useLegacyMode" in CDH 5.7.0 true?
You can look in the Environment tab to see all the settings and confirm whether they're what you want. No, I do not see that legacy mode is true by default. There is something else at work in your configuration. Look at the actual runtime values first to see where they don't match your expectation.
Thanks for the information.
I have checked my Environment tab and also fetched the configuration in the application at runtime, but I was not able to find any information related to spark.memory.useLegacyMode [when "spark.memory.useLegacyMode" is not set to false in the application].
Below is the configuration I found in my Environment tab and fetched in the application:
If I run my application with --executor-memory 2g and do not set "spark.memory.useLegacyMode" to false, then the executor shows the following:
INFO storage.MemoryStore: MemoryStore started with capacity 1060.3 MB
And if I set "spark.memory.useLegacyMode" to false, then the executor shows the following:
INFO storage.MemoryStore: MemoryStore started with capacity 1247.6 MB
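For what it's worth, the two numbers line up with the formulas used by the two memory managers. Here is a back-of-the-envelope check, assuming the default fractions in Spark 1.5/1.6 and a JVM max heap of roughly 1963 MB for -Xmx2g (Runtime.maxMemory excludes one survivor space, so it is smaller than 2048 MB):

```python
# Rough check of the two MemoryStore capacities against the legacy and
# unified memory-manager formulas (Spark 1.5/1.6 defaults; the heap size
# below is an approximation, not taken from the thread).
max_heap_mb = 1963.5  # ~Runtime.maxMemory for -Xmx2g

# Legacy (StaticMemoryManager):
#   maxMemory * spark.storage.memoryFraction (0.6) * spark.storage.safetyFraction (0.9)
legacy_mb = max_heap_mb * 0.6 * 0.9

# Unified (UnifiedMemoryManager, Spark 1.6):
#   (maxMemory - 300 MB reserved) * spark.memory.fraction (0.75)
unified_mb = (max_heap_mb - 300) * 0.75

print(round(legacy_mb, 1))   # ~1060.3, matching the first log line
print(round(unified_mb, 1))  # ~1247.6, matching the second log line
```

So the 1060.3 MB figure matches the legacy storage pool and 1247.6 MB matches the unified pool, which is consistent with legacy mode being the effective default here.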
Hm, I'm not aware that the default is different in CDH, and I don't see it set in spark-defaults.conf. Is it perhaps set elsewhere in your config? Maybe I'm missing it, and it does somehow default to true for backwards compatibility within CDH minor releases (it's actually a behavior change upstream).
In any event, I would simply set this parameter explicitly to the value you want, if you do care about its value.
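For example, at submit time (a sketch; the class and jar names below are placeholders, not from this thread):

```shell
# Pin the memory manager explicitly so the distribution's default doesn't
# matter. Class and jar names are placeholders.
spark-submit \
  --executor-memory 2g \
  --conf spark.memory.useLegacyMode=false \
  --class com.example.MyApp \
  myapp.jar
```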
Hope the information below helps in getting clarity on the issue.
"SPARK-10000 - Spark 1.6.0 includes a new unified memory manager. The new memory manager is turned off by default (unlike Apache Spark 1.6.0), to make it easier for users to migrate existing workloads, but it is supported."
Ah, thank you; even I missed that. Yes, that explains it then. It is indeed for backwards compatibility.
Thanks a lot for your support in clarifying my doubt.
I am facing one more issue, related to Spark configuration.
In CDH 5.7.0 [Spark 1.6], I need to explicitly set "spark.shuffle.reduceLocality.enabled=false" to distribute RDD partitions evenly across executors.
Below is the complete scenario to explain my problem:
My Spark Streaming application receives data from one Kafka topic (one partition), and the RDD has 30 partitions.
But the scheduler schedules the tasks between executors running on the same host (where the Kafka topic partition was created) with NODE_LOCAL locality level.
Below are the logs :
16/05/06 11:21:38 INFO YarnScheduler: Adding task set 1.0 with 30 tasks
16/05/06 11:21:38 DEBUG TaskSetManager: Epoch for TaskSet 1.0: 1
16/05/06 11:21:38 DEBUG TaskSetManager: Valid locality levels for TaskSet 1.0: NODE_LOCAL, RACK_LOCAL, ANY
16/05/06 11:21:38 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ivcp-m04.novalocal, partition 0,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, ivcp-m04.novalocal, partition 1,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, ivcp-m04.novalocal, partition 2,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4, ivcp-m04.novalocal, partition 3,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5, ivcp-m04.novalocal, partition 4,NODE_LOCAL, 2248 bytes)
I have seen this scenario after upgrading from CDH 5.6.0 [Spark 1.5] to 5.7.0 [Spark 1.6]; the same application distributed RDD partitions evenly across executors in Spark 1.5.
As mentioned on some Spark developer blogs, I tried spark.shuffle.reduceLocality.enabled=false (not mentioned in the Spark documentation), and after that my RDD partitions were distributed between executors on all hosts with PROCESS_LOCAL locality level.
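For reference, this is roughly how the flag can be passed at submit time (a sketch; the class and jar names are placeholders, and since the property is undocumented it may change between releases):

```shell
# Disable the reduce-task locality preference so tasks spread across
# executors. spark.shuffle.reduceLocality.enabled is an internal,
# undocumented property in Spark 1.6; class/jar names are placeholders.
spark-submit \
  --conf spark.shuffle.reduceLocality.enabled=false \
  --class com.example.MyStreamingApp \
  myapp.jar
```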
Below are the logs :
16/05/06 11:24:46 INFO YarnScheduler: Adding task set 1.0 with 30 tasks
16/05/06 11:24:46 DEBUG TaskSetManager: Valid locality levels for TaskSet 1.0: NO_PREF, ANY
16/05/06 11:24:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ivcp-m02.novalocal, partition 0,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, ivcp-m01.novalocal, partition 1,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, ivcp-m06.novalocal, partition 2,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4, ivcp-m04.novalocal, partition 3,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5, ivcp-m04.novalocal, partition 4,PROCESS_LOCAL, 2248 bytes)
Is the above configuration the correct solution for this problem?
Thank you for sharing this information. It took me a while to figure out why I'm getting the legacy behaviour even though I'm on Spark 1.6.0, which is supposed to have unified memory management (CDH 5.10.1).
Now my question is: does setting spark.memory.useLegacyMode to false have any unintended consequences? Or is it safe to simply set it to false? And when does Cloudera plan to make it the default?