Contributor
Posts: 43
Registered: ‎03-04-2015

spark 1.6 [cloudera 5.7.0]: Need to explicitly set "spark.memory.useLegacyMode" to false

Hi

 

I upgraded my Cloudera cluster to 5.7.0 (which includes Spark 1.6.0).

As per the Spark 1.6.0 configuration documentation, "spark.memory.useLegacyMode" controls the memory management behaviour, and its default value is false.

However, the memory behaviour I see in Spark 1.6.0 is the same as in Spark 1.5.0. Only when I explicitly set "spark.memory.useLegacyMode" to false does the Spark 1.6.0 memory management appear to take effect.
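
For reference, this is roughly how I enable it in my application when I want the 1.6 behaviour (just a minimal sketch, with the actual job logic omitted):

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: explicitly turning the legacy (1.5-style) memory manager off.
// Without this .set(...) my executors behave exactly as they did on Spark 1.5.
val conf = new SparkConf()
  .setAppName("test_kafka_stream")
  .set("spark.memory.useLegacyMode", "false")   // ask for the unified memory manager
val sc = new SparkContext(conf)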

 

Is the default value of "spark.memory.useLegacyMode" in CDH 5.7.0 true?

 

Regards

Prateek

 

 

Cloudera Employee
Posts: 464
Registered: ‎08-11-2014

Re: spark 1.6 [cloudera 5.7.0 ] : Need to explicitly set "spark.memory.useLegacyMode " to

You can look in the Environment tab to see all the settings and confirm whether they're what you want. No, I do not see that legacy mode is true by default. There is something else at work in your configuration. Look at the actual runtime values first to see where they don't match your expectation.
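
For example, from inside the application you can print what the driver actually ended up with, along these lines (just a sketch, where sc is your live SparkContext):

// Dump every setting the driver is actually running with.
sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }

// Or query the one key directly; None means it is not set anywhere
// and the compiled-in default applies.
println(sc.getConf.getOption("spark.memory.useLegacyMode"))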

Contributor
Posts: 43
Registered: ‎03-04-2015

Re: spark 1.6 [cloudera 5.7.0 ] : Need to explicitly set "spark.memory.useLegacyMode " to false

Hi,

Thanks for the information.

I have checked the Environment tab and also fetched the configuration from inside the application at run time, but I cannot find any entry related to spark.memory.useLegacyMode [when I do not explicitly set "spark.memory.useLegacyMode" to false in the application].

Below is the configuration I see in the Environment tab and from fetching the configuration in the application:

 

spark.driver.appUIAddress:http://192.168.44.101:4040
spark.dynamicAllocation.executorIdleTimeout:60
spark.serializer:org.apache.spark.serializer.KryoSerializer
spark.app.name:test_kafka_stream
spark.authenticate:false
spark.yarn.jar:local:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/lib/spark-assembly.jar
spark.executor.memory:1g
spark.externalBlockStore.folderName:spark-8ccc2730-664b-4d35-bb80-1234ee86cb47
spark.submit.deployMode:client
spark.driver.port:44600
spark.executor.extraJavaOptions:-verbose:gc
spark.ui.filters:org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.yarn.am.extraLibraryPath:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/native:/opt/cloudera/parcels/GPLEXTRAS-5.7.0-1.cdh5.7.0.p0.40/lib/hadoop/lib/native
spark.shuffle.service.enabled:true
spark.executor.instances:42
spark.master:yarn-client
spark.app.id:application_1462489392454_0043
spark.executor.id:driver
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES:http://ivcp-m01.novalocal:8088/proxy/application_1462489392454_0043
spark.dynamicAllocation.schedulerBacklogTimeout:1
spark.shuffle.service.port:7337
spark.executor.extraLibraryPath:/home/nativelibraries/native_lib/
spark.yarn.historyServer.address:http://ivcp-m01.novalocal:18088
spark.yarn.config.gatewayPath:/opt/cloudera/parcels
spark.driver.extraLibraryPath:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/native:/opt/cloudera/parcels/GPLEXTRAS-5.7.0-1.cdh5.7.0.p0.40/lib/hadoop/lib/native
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS:ivcp-m01.novalocal
spark.driver.host:192.168.44.101
spark.yarn.config.replacementPath:{{HADOOP_COMMON_HOME}}/../../..
spark.dynamicAllocation.minExecutors:0
spark.eventLog.dir:hdfs://ivcp-m01.novalocal:8020/user/spark/applicationHistory
spark.jars:file:/home/ubuntu/prateek/merge_image_kafka_stream/test_merge_image_stream-assembly-3.0.jar
spark.dynamicAllocation.enabled:true
spark.driver.extraJavaOptions:-Dlog4j.configuration=file:/home/ubuntu/prateek/merge_image_kafka_stream/mylog4j.properties

 

 

If I run my application with --executor-memory 2g and do not set "spark.memory.useLegacyMode" at all, the executor shows the following:

 

INFO storage.MemoryStore: MemoryStore started with capacity 1060.3 MB

 

And if I set "spark.memory.useLegacyMode" to false, the executor shows the following:

 

INFO storage.MemoryStore: MemoryStore started with capacity 1247.6 MB
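
If I understand the defaults correctly, both numbers look consistent with legacy vs. unified memory management for a 2g executor (a rough back-of-the-envelope sketch; the ~1963 MB usable heap figure is my assumption, since Runtime.getRuntime.maxMemory reports a bit less than -Xmx):

// Rough sketch of the expected MemoryStore capacity for a 2g executor.
val maxHeapMb = 1963.5   // assumed usable heap reported by the JVM for a 2g executor

// Legacy (Spark 1.5-style) static memory manager defaults:
//   spark.storage.memoryFraction = 0.6, spark.storage.safetyFraction = 0.9
val legacyMb  = maxHeapMb * 0.6 * 0.9            // ~1060.3 MB

// Unified (Spark 1.6) memory manager defaults:
//   300 MB reserved system memory, spark.memory.fraction = 0.75
val unifiedMb = (maxHeapMb - 300) * 0.75         // ~1247.6 MB

println(f"legacy: $legacyMb%.1f MB, unified: $unifiedMb%.1f MB")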

 

Regards

Prateek

Cloudera Employee
Posts: 464
Registered: ‎08-11-2014

Re: spark 1.6 [cloudera 5.7.0 ] : Need to explicitly set "spark.memory.useLegacyMode " to

Hm, I'm not aware that the default is different in CDH, and I don't see it set in spark-defaults.conf. Is it perhaps set elsewhere in your config? Maybe I'm missing it, and it does somehow default to true for backwards compatibility within CDH minor releases (it's actually a behavior change upstream).

 

In any event, if you do care about its value, I would simply set this parameter explicitly to the value you want.

Contributor
Posts: 43
Registered: ‎03-04-2015

Re: spark 1.6 [cloudera 5.7.0 ] : Need to explicitly set "spark.memory.useLegacyMode " to false

Hi

 

I hope the information below helps clarify the issue.

 

As per the official page: http://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_new_in_cdh_57.html#conc...

 

"SPARK-10000 - Spark 1.6.0 includes a new unified memory manager. The new memory manager is turned off by default (unlike Apache Spark 1.6.0), to make it easier for users to migrate existing workloads, but it is supported."

 

 

Regards

Prateek

  

Cloudera Employee
Posts: 464
Registered: ‎08-11-2014

Re: spark 1.6 [cloudera 5.7.0 ] : Need to explicitly set "spark.memory.useLegacyMode " to

Ah, thank you, I missed that too. Yes, that explains it then. It is indeed for backwards compatibility.

Contributor
Posts: 43
Registered: ‎03-04-2015

Re: spark 1.6 [cloudera 5.7.0 ] : Need to explicitly set "spark.memory.useLegacyMode " to

Hi

 

Thanks a lot for your support in clarifying my doubt.

 

I am facing one more issue related to Spark configuration.

In CDH 5.7.0 [Spark 1.6] I need to explicitly set "spark.shuffle.reduceLocality.enabled=false" to distribute RDD partitions evenly across executors.

 

Below is the complete scenario explaining my problem:

 

My Spark Streaming application receives data from one Kafka topic (with a single partition), and the resulting RDD has 30 partitions.

However, the scheduler schedules the tasks on executors running on the same host (where the Kafka topic partition lives), with the NODE_LOCAL locality level.
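
For context, the receiving side of my application looks roughly like this (a simplified sketch; the broker address, topic name and batch interval are placeholders, not my real values):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Simplified sketch: one Kafka topic with a single partition, repartitioned
// to 30 so the work can be spread across the executors.
val conf = new SparkConf().setAppName("test_kafka_stream")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker-host:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my_topic"))

stream.repartition(30).foreachRDD { rdd =>
  // the real processing happens here; count() is just a stand-in action
  println(s"batch records: ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()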

Below are the logs : 

16/05/06 11:21:38 INFO YarnScheduler: Adding task set 1.0 with 30 tasks 
16/05/06 11:21:38 DEBUG TaskSetManager: Epoch for TaskSet 1.0: 1 
16/05/06 11:21:38 DEBUG TaskSetManager: Valid locality levels for TaskSet 1.0: NODE_LOCAL, RACK_LOCAL, ANY 
16/05/06 11:21:38 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ivcp-m04.novalocal, partition 0,NODE_LOCAL, 2248 bytes) 
16/05/06 11:21:38 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, ivcp-m04.novalocal, partition 1,NODE_LOCAL, 2248 bytes) 
16/05/06 11:21:38 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, ivcp-m04.novalocal, partition 2,NODE_LOCAL, 2248 bytes) 
16/05/06 11:21:38 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4, ivcp-m04.novalocal, partition 3,NODE_LOCAL, 2248 bytes) 
16/05/06 11:21:38 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5, ivcp-m04.novalocal, partition 4,NODE_LOCAL, 2248 bytes) 



I started seeing this behaviour after upgrading from CDH 5.6.0 [Spark 1.5] to CDH 5.7.0 [Spark 1.6]. The same application distributed the RDD partitions evenly across executors in Spark 1.5.

As mentioned on some Spark developer blogs, I tried spark.shuffle.reduceLocality.enabled=false (not mentioned in the Spark documentation), and after that the RDD partitions are distributed across executors on all hosts with the PROCESS_LOCAL locality level.
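
For completeness, this is how I pass the flag (same SparkConf as in the sketch above):

// Sketch: disabling the reduce-side locality preference via the undocumented flag.
val conf = new SparkConf()
  .setAppName("test_kafka_stream")
  .set("spark.shuffle.reduceLocality.enabled", "false")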

Below are the logs : 

16/05/06 11:24:46 INFO YarnScheduler: Adding task set 1.0 with 30 tasks 
16/05/06 11:24:46 DEBUG TaskSetManager: Valid locality levels for TaskSet 1.0: NO_PREF, ANY 

16/05/06 11:24:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ivcp-m02.novalocal, partition 0,PROCESS_LOCAL, 2248 bytes) 
16/05/06 11:24:46 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, ivcp-m01.novalocal, partition 1,PROCESS_LOCAL, 2248 bytes) 
16/05/06 11:24:46 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, ivcp-m06.novalocal, partition 2,PROCESS_LOCAL, 2248 bytes) 
16/05/06 11:24:46 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4, ivcp-m04.novalocal, partition 3,PROCESS_LOCAL, 2248 bytes) 
16/05/06 11:24:46 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5, ivcp-m04.novalocal, partition 4,PROCESS_LOCAL, 2248 bytes) 
-------- 
-------- 
-------- 


Is the above configuration the correct solution for this problem?


Regards 
Prateek

New Contributor
Posts: 1
Registered: ‎06-26-2017

Re: spark 1.6 [cloudera 5.7.0 ] : Need to explicitly set "spark.memory.useLegacyMode " to

Thank you for sharing this information. It took me a while to figure out why I was getting the legacy behaviour even though I'm on Spark 1.6.0, which is supposed to have unified memory management (CDH 5.10.1).

Now my question is: does setting spark.memory.useLegacyMode to false have any unintended consequences, or is it safe to simply set it to false? And when does Cloudera plan to make it the default?
