
CDH 5.7.0 spark 1.6 : RDD Partitions not distributed evenly to executors

Hi

My Spark Streaming application receives data from one Kafka topic (one partition), and the RDD has 30 partitions.

However, the scheduler assigns all tasks to executors running on the same host (where the Kafka topic partition lives), with NODE_LOCAL locality level.
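For context, a minimal sketch of the kind of setup described above (the exact stream creation is an assumption; the topic name, ZooKeeper quorum, and group id are placeholders):

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// Assumed setup: a Kafka topic with a single partition, consumed with the
// Spark 1.6 receiver-based API, then repartitioned to 30 RDD partitions.
val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
val repartitioned = stream.repartition(30)
// Despite the repartition, all 30 tasks are scheduled NODE_LOCAL on the
// receiver's host, as the logs below show.
```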

Below are the logs :

16/05/06 11:21:38 INFO YarnScheduler: Adding task set 1.0 with 30 tasks
16/05/06 11:21:38 DEBUG TaskSetManager: Epoch for TaskSet 1.0: 1
16/05/06 11:21:38 DEBUG TaskSetManager: Valid locality levels for TaskSet 1.0: NODE_LOCAL, RACK_LOCAL, ANY
16/05/06 11:21:38 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ivcp-m04.novalocal, partition 0,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, ivcp-m04.novalocal, partition 1,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, ivcp-m04.novalocal, partition 2,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4, ivcp-m04.novalocal, partition 3,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5, ivcp-m04.novalocal, partition 4,NODE_LOCAL, 2248 bytes)



I have seen this behavior since upgrading Cloudera from 5.6.0 (Spark 1.5) to 5.7.0 (Spark 1.6). The same application distributed RDD partitions evenly across executors in Spark 1.5.

As mentioned on some Spark developer blogs, I tried setting spark.shuffle.reduceLocality.enabled=false, and after that the RDD partitions are distributed across executors on all hosts, with PROCESS_LOCAL locality level.

Below are the logs :

16/05/06 11:24:46 INFO YarnScheduler: Adding task set 1.0 with 30 tasks
16/05/06 11:24:46 DEBUG TaskSetManager: Valid locality levels for TaskSet 1.0: NO_PREF, ANY

16/05/06 11:24:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ivcp-m02.novalocal, partition 0,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, ivcp-m01.novalocal, partition 1,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, ivcp-m06.novalocal, partition 2,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4, ivcp-m04.novalocal, partition 3,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5, ivcp-m04.novalocal, partition 4,PROCESS_LOCAL, 2248 bytes)
--------
--------
--------
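For reference, the workaround above can be applied in the application's SparkConf (a sketch; the same property can equally be passed via `--conf` on spark-submit, and the application name is a placeholder):

```scala
import org.apache.spark.SparkConf

// Disable reduce-side locality preference (the property that restored
// even distribution in the logs above; undocumented in Spark 1.6).
val conf = new SparkConf()
  .setAppName("MyStreamingApp")
  .set("spark.shuffle.reduceLocality.enabled", "false")
```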

Is this a bug or intended functionality? Is the above configuration the correct solution to the problem? And why is spark.shuffle.reduceLocality.enabled not mentioned in the Spark configuration documentation?




Regards
Prateek

1 REPLY 1

Re: CDH 5.7.0 spark 1.6 : RDD Partitions not distributed evenly to executors

Hi

 

Please help me solve this problem.

 

Regards

Prateek