
pyspark (python 2.6 WORKS vs python 2.7 ERROR)

Rising Star

I am running a simple Spark job (via the pyspark shell) in Python 2.6 and in Python 2.7.

When the Python 2.6 environment is set, the job completes successfully (although at the end it logs a weird executor error):

/////////////////////////////////////////////////////// Python 2.6 ////////////////////////////////////////////////////////////

>>> rdd = sc.parallelize(range(1000)).map(lambda x: (x % 100, 1)).reduceByKey(lambda a, b: a + b)
>>> for i in rdd.collect():
...     print(i)
...
15/05/26 16:21:18 INFO SparkContext: Starting job: collect at <stdin>:1
15/05/26 16:21:18 INFO DAGScheduler: Registering RDD 3 (reduceByKey at <stdin>:1)
15/05/26 16:21:18 INFO DAGScheduler: Got job 0 (collect at <stdin>:1) with 160 output partitions (allowLocal=false)
15/05/26 16:21:18 INFO DAGScheduler: Final stage: Stage 1(collect at <stdin>:1)
15/05/26 16:21:18 INFO DAGScheduler: Parents of final stage: List(Stage 0)
15/05/26 16:21:18 INFO DAGScheduler: Missing parents: List(Stage 0)
15/05/26 16:21:18 INFO DAGScheduler: Submitting Stage 0 (PairwiseRDD[3] at reduceByKey at <stdin>:1), which has no missing parents
15/05/26 16:21:18 INFO MemoryStore: ensureFreeSpace(6640) called with curMem=0, maxMem=277842493
15/05/26 16:21:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 6.5 KB, free 265.0 MB)
15/05/26 16:21:18 INFO MemoryStore: ensureFreeSpace(4234) called with curMem=6640, maxMem=277842493
15/05/26 16:21:18 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.1 KB, free 265.0 MB)

[... log output truncated ...]

(94, 10)
(95, 10)
(96, 10)
(97, 10)
(98, 10)
(99, 10)
>>> 15/05/26 16:21:30 INFO AppClient$ClientActor: Executor updated: app-20150526161954-0002/0 is now EXITED (Command exited with code 52)
15/05/26 16:21:30 INFO SparkDeploySchedulerBackend: Executor app-20150526161954-0002/0 removed: Command exited with code 52
15/05/26 16:21:30 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/05/26 16:21:30 INFO AppClient$ClientActor: Executor added: app-20150526161954-0002/7 on worker-20150522151303-node3.-7078 (node3:7078) with 40 cores
15/05/26 16:21:30 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150526161954-0002/7 on hostPort node3:7078 with 40 cores, 512.0 MB RAM
15/05/26 16:21:30 INFO AppClient$ClientActor: Executor updated: app-20150526161954-0002/7 is now RUNNING
15/05/26 16:21:30 INFO AppClient$ClientActor: Executor updated: app-20150526161954-0002/7 is now LOADING
15/05/26 16:21:32 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@node3:50790/user/Executor#738338483] with ID 7
15/05/26 16:21:33 INFO BlockManagerMasterActor: Registering block manager node3:58910 with 265.4 MB RAM, BlockManagerId(7, node3, 58910)

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

When I run the same code in the Python 2.7 environment, the job never gets resources and keeps repeating this warning:

/////////////////////////////////////////////////////// Python 2.7 ////////////////////////////////////////////////////////////

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Python version 2.7.5 (default, Dec  3 2013 08:35:16)
SparkContext available as sc, HiveContext available as sqlCtx.
>>> rdd = sc.parallelize(range(1000)).map(lambda x: (x % 100, 1)).reduceByKey(lambda a, b: a + b)
>>> for i in rdd.collect():
...     print(i)
...
15/05/26 16:30:57 INFO SparkContext: Starting job: collect at <stdin>:1
15/05/26 16:30:57 INFO DAGScheduler: Registering RDD 3 (reduceByKey at <stdin>:1)
15/05/26 16:30:57 INFO DAGScheduler: Got job 0 (collect at <stdin>:1) with 2 output partitions (allowLocal=false)
15/05/26 16:30:57 INFO DAGScheduler: Final stage: Stage 1(collect at <stdin>:1)
15/05/26 16:30:57 INFO DAGScheduler: Parents of final stage: List(Stage 0)
15/05/26 16:30:57 INFO DAGScheduler: Missing parents: List(Stage 0)
15/05/26 16:30:57 INFO DAGScheduler: Submitting Stage 0 (PairwiseRDD[3] at reduceByKey at <stdin>:1), which has no missing parents
15/05/26 16:30:57 INFO MemoryStore: ensureFreeSpace(6632) called with curMem=0, maxMem=277842493
15/05/26 16:30:57 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 6.5 KB, free 265.0 MB)
15/05/26 16:30:57 INFO MemoryStore: ensureFreeSpace(4227) called with curMem=6632, maxMem=277842493
15/05/26 16:30:57 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.1 KB, free 265.0 MB)
15/05/26 16:30:57 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on master:55462 (size: 4.1 KB, free: 265.0 MB)
15/05/26 16:30:57 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/05/26 16:30:57 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:839
15/05/26 16:30:57 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PairwiseRDD[3] at reduceByKey at <stdin>:1)
15/05/26 16:30:57 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/05/26 16:31:12 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/05/26 16:31:27 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/05/26 16:31:42 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/05/26 16:31:57 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

[... the same warning keeps repeating ...]

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

 

3 Replies

Re: pyspark (python 2.6 WORKS vs python 2.7 ERROR)

Super Collaborator

You need to provide a little more detail: standalone or on YARN? The command line and the environment settings would be a good start. The error points to something other than a Python version issue.
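
For reference, those details can be collected from inside the pyspark shell with something like the following (a minimal sketch: sc is the shell's SparkContext, and which of these variables actually matters depends on the deploy mode):

from __future__ import print_function
import os
import sys

print(sys.version)       # driver Python version
print(sys.executable)    # driver Python binary
print(sc.master)         # e.g. spark://host:7077 or yarn-client
for var in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "SPARK_HOME", "PATH"):
    print(var, "=", os.environ.get(var))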

 

Wilfred

Re: pyspark (python 2.6 WORKS vs python 2.7 ERROR)

Rising Star

I am getting the same error both when calling Spark interactively (standalone, via the pyspark shell) and when executing:

spark-submit --master yarn-client spark_test.py

where spark_test.py contains the code:

import pyspark

# Build a context, then count how many numbers in 0..999 fall into each
# residue class mod 100 (every count should come out to 10).
sc = pyspark.SparkContext()
rdd = sc.parallelize(range(1000)).map(lambda x: (x % 100, 1)).reduceByKey(lambda a, b: a + b)

print(rdd.collect())
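
One variation worth trying (the interpreter path below is an assumption, and in standalone mode the workers' own spark-env.sh can still take precedence): pin the executor Python explicitly before the context is created, so the 2.7 run points at the same interpreter on every node:

import os

# Assumed location of Python 2.7; it must exist on every worker node.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"

import pyspark

sc = pyspark.SparkContext()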

 

Re: pyspark (python 2.6 WORKS vs python 2.7 ERROR)

Super Collaborator

This works for me in both 2.6 and 2.7. It looks like you have another problem causing this, not related to the Python version.

Based on the message about the resources, I would look into the environment and make sure that all variables and paths are set to the same values after you switch Python versions.
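
A quick way to verify that is to compare the driver's interpreter with whatever the executors actually launch (a sketch only; if no executors ever register, the collect() will simply hang, which by itself points at a resource problem rather than a Python one):

from __future__ import print_function
import sys

def executor_python(_):
    import sys
    return sys.executable

print("driver  :", sys.executable)
print("executor:", sc.parallelize([0], 1).map(executor_python).collect())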

You can also look at the Spark master UI and check that executors are registered for your application.
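
The same check is possible from inside the shell, although only through the py4j internals (sc._jsc), so treat it as a debugging hack rather than a stable API:

# Number of registered block managers; the driver itself counts as one.
print(sc._jsc.sc().getExecutorMemoryStatus().size())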

 

Wilfred
