Created 01-04-2017 02:16 PM
I am using this Python word count code that I got from the online lessons I am following, but I am not able to run it in my environment. I am getting the errors shown below:
[sami@hadoop1 ~]$ more wordcount.py
from operator import add
from pyspark import SparkConf, SparkContext
import collections
from pyspark import SparkConf, SparkContext
import collections

conf = SparkConf().setMaster("local").setAppName("MyWordCount")
sc = SparkContext(conf = conf)

lines = sc.textFile("/user/sami/WC.txt").map(lambda x: x.split(' ')).reduceByKey(add)
output = lines.collect()
for (word, count) in output:
    print str(word) +": "+ str(count)
When I submit it, I get the following error:
[sami@hadoop1 ~]$ spark-submit wordcount.py SPARK_MAJOR_VERSION is set to 2, using Spark2 /usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/context.py:477: DeprecationWarning: HiveContext is deprecated in Spark 2.0.0. Please use SparkSession.builder.enableHiveSupport().getOrCreate() instead. 17/01/03 17:35:02 INFO SparkContext: Running Spark version 2.0.0.2.5.0.0-1245 17/01/03 17:35:02 INFO SecurityManager: Changing view acls to: sami 17/01/03 17:35:02 INFO SecurityManager: Changing modify acls to: sami 17/01/03 17:35:02 INFO SecurityManager: Changing view acls groups to: 17/01/03 17:35:02 INFO SecurityManager: Changing modify acls groups to: 17/01/03 17:35:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sami); groups with view permissions: Set(); users with modify permissions: Set(sami); groups with modify permissions: Set() 17/01/03 17:35:03 INFO Utils: Successfully started service 'sparkDriver' on port 39810. 17/01/03 17:35:03 INFO SparkEnv: Registering MapOutputTracker 17/01/03 17:35:03 INFO SparkEnv: Registering BlockManagerMaster 17/01/03 17:35:03 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-4b68fd3b-659c-4813-830d-52bd9854464b 17/01/03 17:35:03 INFO MemoryStore: MemoryStore started with capacity 366.3 MB 17/01/03 17:35:03 INFO SparkEnv: Registering OutputCommitCoordinator 17/01/03 17:35:03 INFO log: Logging initialized @1879ms 17/01/03 17:35:03 INFO Server: jetty-9.2.z-SNAPSHOT 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7c1cf887{/jobs,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@49372d5a{/jobs/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@525012da{/jobs/job,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6c08bf0e{/jobs/job/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@774baac8{/stages,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7e61486{/stages/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@55f21062{/stages/stage,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7c2b351d{/stages/stage/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@569cddba{/stages/pool,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@61e9b1cb{/stages/pool/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@24ddcc94{/storage,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1fb96cf{/storage/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@313afa47{/storage/rdd,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1d717242{/storage/rdd/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5702a414{/environment,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@a121dd5{/environment/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6c1c2612{/executors,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started 
o.s.j.s.ServletContextHandler@7a69aab1{/executors/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@388993ec{/executors/threadDump,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@226fdd9d{/executors/threadDump/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2fce4374{/static,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1845481e{/,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6c27ae09{/api,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30f81525{/stages/stage/kill,null,AVAILABLE} 17/01/03 17:35:03 INFO ServerConnector: Started ServerConnector@4f1ab37a{HTTP/1.1}{0.0.0.0:4040} 17/01/03 17:35:03 INFO Server: Started @1982ms 17/01/03 17:35:03 INFO Utils: Successfully started service 'SparkUI' on port 4040. 17/01/03 17:35:03 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.100.44.17:4040 17/01/03 17:35:03 INFO Utils: Copying /home/sami/wordcount.py to /tmp/spark-0ed1979f-65e3-4de1-a1ab-08b812041a78/userFiles-0018c417-c1d5-4734-aeb9-b020d0dfb936/wordcount.py 17/01/03 17:35:03 INFO SparkContext: Added file file:/home/sami/wordcount.py at file:/home/sami/wordcount.py with timestamp 1483482903703 17/01/03 17:35:03 INFO Executor: Starting executor ID driver on host localhost 17/01/03 17:35:03 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38633. 17/01/03 17:35:03 INFO NettyBlockTransferService: Server created on 10.100.44.17:38633 17/01/03 17:35:03 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.100.44.17, 38633) 17/01/03 17:35:03 INFO BlockManagerMasterEndpoint: Registering block manager 10.100.44.17:38633 with 366.3 MB RAM, BlockManagerId(driver, 10.100.44.17, 38633) 17/01/03 17:35:03 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.100.44.17, 38633) 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2fbfc77f{/metrics/json,null,AVAILABLE} 17/01/03 17:35:04 INFO EventLoggingListener: Logging events to hdfs:///spark2-history/local-1483482903743 17/01/03 17:35:04 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 344.2 KB, free 366.0 MB) 17/01/03 17:35:04 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 31.2 KB, free 365.9 MB) 17/01/03 17:35:04 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.100.44.17:38633 (size: 31.2 KB, free: 366.3 MB) 17/01/03 17:35:04 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2 17/01/03 17:35:05 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 69 for sami on 10.100.44.17:8020 17/01/03 17:35:05 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 10.100.44.17:8020, Ident: (HDFS_DELEGATION_TOKEN token 69 for sami) 17/01/03 17:35:05 WARN Token: Cannot find class for token kind kms-dt 17/01/03 17:35:05 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: kms-dt, Service: 10.100.44.17:9292, Ident: 00 04 73 61 6d 69 04 79 61 72 6e 00 8a 01 59 66 78 b9 e0 8a 01 59 8a 85 3d e0 05 02 17/01/03 17:35:05 INFO FileInputFormat: Total input paths to process : 1 17/01/03 17:35:05 INFO SparkContext: Starting job: collect at /home/sami/wordcount.py:12 17/01/03 
17:35:05 INFO DAGScheduler: Registering RDD 3 (reduceByKey at /home/sami/wordcount.py:11) 17/01/03 17:35:05 INFO DAGScheduler: Got job 0 (collect at /home/sami/wordcount.py:12) with 1 output partitions 17/01/03 17:35:05 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /home/sami/wordcount.py:12) 17/01/03 17:35:05 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 17/01/03 17:35:05 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0) 17/01/03 17:35:05 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11), which has no missing parents 17/01/03 17:35:05 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.8 KB, free 365.9 MB) 17/01/03 17:35:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.5 KB, free 365.9 MB) 17/01/03 17:35:05 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.100.44.17:38633 (size: 5.5 KB, free: 366.3 MB) 17/01/03 17:35:05 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1012 17/01/03 17:35:05 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11) 17/01/03 17:35:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 17/01/03 17:35:05 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, ANY, 5480 bytes) 17/01/03 17:35:05 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 17/01/03 17:35:05 INFO Executor: Fetching file:/home/sami/wordcount.py with timestamp 1483482903703 17/01/03 17:35:05 INFO Utils: /home/sami/wordcount.py has been previously copied to /tmp/spark-0ed1979f-65e3-4de1-a1ab-08b812041a78/userFiles-0018c417-c1d5-4734-aeb9-b020d0dfb936/wordcount.py 17/01/03 17:35:05 INFO HadoopRDD: Input split: hdfs://hadoop1.tolls.dot.state.fl.us:8020/user/sami/WC.txt:0+45 17/01/03 17:35:05 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 17/01/03 17:35:05 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 17/01/03 17:35:05 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 17/01/03 17:35:05 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 17/01/03 17:35:05 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id /usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/context.py:477: DeprecationWarning: HiveContext is deprecated in Spark 2.0.0. Please use SparkSession.builder.enableHiveSupport().getOrCreate() instead. 
17/01/03 17:35:05 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main process() File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues for k, v in iterator: ValueError: too many values to unpack at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/01/03 17:35:05 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main process() File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues for k, v in iterator: ValueError: too many values to unpack at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at 
org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/01/03 17:35:05 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job 17/01/03 17:35:05 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 17/01/03 17:35:05 INFO TaskSchedulerImpl: Cancelling stage 0 17/01/03 17:35:05 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at /home/sami/wordcount.py:11) failed in 0.426 s 17/01/03 17:35:05 INFO DAGScheduler: Job 0 failed: collect at /home/sami/wordcount.py:12, took 0.491790 s Traceback (most recent call last): File "/home/sami/wordcount.py", line 12, in <module> output = lines.collect() File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 776, in collect File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main process() File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues for k, v in iterator: ValueError: too many values to unpack at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:358) at org.apache.spark.rdd.RDD.collect(RDD.scala:892) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:211) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main process() File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process serializer.dump_stream(func(split_index, iterator), outfile) File 
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues for k, v in iterator: ValueError: too many values to unpack at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more 17/01/03 17:35:05 INFO SparkContext: Invoking stop() from shutdown hook 17/01/03 17:35:05 INFO ServerConnector: Stopped ServerConnector@4f1ab37a{HTTP/1.1}{0.0.0.0:4040} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@30f81525{/stages/stage/kill,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@6c27ae09{/api,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@1845481e{/,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@2fce4374{/static,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@226fdd9d{/executors/threadDump/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@388993ec{/executors/threadDump,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7a69aab1{/executors/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@6c1c2612{/executors,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@a121dd5{/environment/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@5702a414{/environment,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@1d717242{/storage/rdd/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@313afa47{/storage/rdd,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@1fb96cf{/storage/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@24ddcc94{/storage,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped 
o.s.j.s.ServletContextHandler@61e9b1cb{/stages/pool/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@569cddba{/stages/pool,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7c2b351d{/stages/stage/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@55f21062{/stages/stage,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7e61486{/stages/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@774baac8{/stages,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@6c08bf0e{/jobs/job/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@525012da{/jobs/job,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@49372d5a{/jobs/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7c1cf887{/jobs,null,UNAVAILABLE} 17/01/03 17:35:05 INFO SparkUI: Stopped Spark web UI at http://10.100.44.17:4040 17/01/03 17:35:05 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/01/03 17:35:05 INFO MemoryStore: MemoryStore cleared 17/01/03 17:35:05 INFO BlockManager: BlockManager stopped 17/01/03 17:35:05 INFO BlockManagerMaster: BlockManagerMaster stopped 17/01/03 17:35:05 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 17/01/03 17:35:05 INFO SparkContext: Successfully stopped SparkContext 17/01/03 17:35:05 INFO ShutdownHookManager: Shutdown hook called 17/01/03 17:35:05 INFO ShutdownHookManager: Deleting directory /tmp/spark-0ed1979f-65e3-4de1-a1ab-08b812041a78/pyspark-9c5ed27a-af62-44ce-ae38-229c90a05438
Created 01-04-2017 02:16 PM
Run this under Spark 1.6.
Created 01-04-2017 03:13 PM
But I want to use Spark 2. Can I modify this to run under Spark 2?
Created 01-04-2017 03:16 PM
I tried it with Spark 1.6, and it gives the same error:
[sami@hadoop1 ~]$ unset SPARK_MAJOR_VERSION [sami@hadoop1 ~]$ spark-submit wordcount.py Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set Spark1 will be picked by default 17/01/04 10:14:48 INFO SparkContext: Running Spark version 1.6.2 17/01/04 10:14:51 WARN UserGroupInformation: Exception encountered while running the renewal command. Aborting renew thread. ExitCodeException exitCode=1: kinit: Ticket expired while renewing credentials 17/01/04 10:14:51 INFO SecurityManager: Changing view acls to: sami 17/01/04 10:14:51 INFO SecurityManager: Changing modify acls to: sami 17/01/04 10:14:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sami); users with modify permissions: Set(sami) 17/01/04 10:14:53 INFO Utils: Successfully started service 'sparkDriver' on port 44312. 17/01/04 10:14:54 INFO Slf4jLogger: Slf4jLogger started 17/01/04 10:14:55 INFO Remoting: Starting remoting 17/01/04 10:14:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.100.44.17:41165] 17/01/04 10:14:55 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 41165. 17/01/04 10:14:55 INFO SparkEnv: Registering MapOutputTracker 17/01/04 10:14:55 INFO SparkEnv: Registering BlockManagerMaster 17/01/04 10:14:55 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-11455db4-924f-4877-84ee-43a376464966 17/01/04 10:14:55 INFO MemoryStore: MemoryStore started with capacity 511.1 MB 17/01/04 10:14:55 INFO SparkEnv: Registering OutputCommitCoordinator 17/01/04 10:14:56 INFO Server: jetty-8.y.z-SNAPSHOT 17/01/04 10:14:56 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 17/01/04 10:14:56 INFO Utils: Successfully started service 'SparkUI' on port 4040. 17/01/04 10:14:56 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.100.44.17:4040 17/01/04 10:14:57 INFO Utils: Copying /home/sami/wordcount.py to /tmp/spark-d40dae3a-c20f-4d7e-ac77-60c504ff5be8/userFiles-ffa57bc1-db84-424b-a1b4-22c494cf7257/wordcount.py 17/01/04 10:14:57 INFO SparkContext: Added file file:/home/sami/wordcount.py at file:/home/sami/wordcount.py with timestamp 1483542897654 17/01/04 10:14:58 INFO Executor: Starting executor ID driver on host localhost 17/01/04 10:14:58 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45420. 
17/01/04 10:14:58 INFO NettyBlockTransferService: Server created on 45420 17/01/04 10:14:58 INFO BlockManagerMaster: Trying to register BlockManager 17/01/04 10:14:58 INFO BlockManagerMasterEndpoint: Registering block manager localhost:45420 with 511.1 MB RAM, BlockManagerId(driver, localhost, 45420) 17/01/04 10:14:58 INFO BlockManagerMaster: Registered BlockManager 17/01/04 10:15:01 INFO EventLoggingListener: Logging events to hdfs:///spark-history/local-1483542898060 17/01/04 10:15:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 344.0 KB, free 344.0 KB) 17/01/04 10:15:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 30.1 KB, free 374.0 KB) 17/01/04 10:15:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:45420 (size: 30.1 KB, free: 511.1 MB) 17/01/04 10:15:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2 17/01/04 10:15:04 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 71 for sami on 10.100.44.17:8020 17/01/04 10:15:04 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 10.100.44.17:8020, Ident: (HDFS_DELEGATION_TOKEN token 71 for sami) 17/01/04 10:15:04 WARN Token: Cannot find class for token kind kms-dt 17/01/04 10:15:04 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: kms-dt, Service: 10.100.44.17:9292, Ident: 00 04 73 61 6d 69 04 79 61 72 6e 00 8a 01 59 6a 0c 3e c7 8a 01 59 8e 18 c2 c7 07 02 17/01/04 10:15:04 INFO FileInputFormat: Total input paths to process : 1 17/01/04 10:15:05 INFO SparkContext: Starting job: collect at /home/sami/wordcount.py:12 17/01/04 10:15:05 INFO DAGScheduler: Registering RDD 3 (reduceByKey at /home/sami/wordcount.py:11) 17/01/04 10:15:05 INFO DAGScheduler: Got job 0 (collect at /home/sami/wordcount.py:12) with 1 output partitions 17/01/04 10:15:05 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /home/sami/wordcount.py:12) 17/01/04 10:15:05 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 17/01/04 10:15:05 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0) 17/01/04 10:15:05 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11), which has no missing parents 17/01/04 10:15:05 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.9 KB, free 381.9 KB) 17/01/04 10:15:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.0 KB, free 386.9 KB) 17/01/04 10:15:05 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:45420 (size: 5.0 KB, free: 511.1 MB) 17/01/04 10:15:05 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008 17/01/04 10:15:05 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11) 17/01/04 10:15:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 17/01/04 10:15:05 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,ANY, 2187 bytes) 17/01/04 10:15:05 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 17/01/04 10:15:05 INFO Executor: Fetching file:/home/sami/wordcount.py with timestamp 1483542897654 17/01/04 10:15:05 INFO Utils: /home/sami/wordcount.py has been previously copied to /tmp/spark-d40dae3a-c20f-4d7e-ac77-60c504ff5be8/userFiles-ffa57bc1-db84-424b-a1b4-22c494cf7257/wordcount.py 17/01/04 10:15:05 INFO HadoopRDD: 
Input split: hdfs://hadoop1.tolls.dot.state.fl.us:8020/user/sami/WC.txt:0+45 17/01/04 10:15:05 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 17/01/04 10:15:05 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 17/01/04 10:15:05 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 17/01/04 10:15:05 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 17/01/04 10:15:05 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 17/01/04 10:15:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main process() File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1776, in combineLocally File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues for k, v in iterator: ValueError: too many values to unpack at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
Created 01-04-2017 07:39 PM
Below is the simple word count example from the Spark docs; it should work with both Spark 1.6.x and Spark 2.x. The "ValueError: too many values to unpack" in your run happens because reduceByKey expects an RDD of (key, value) pairs, but map(lambda x: x.split(' ')) produces an RDD of word lists; you need to flatMap to individual words and map each word to a (word, 1) pair first.
>>> textFile = sc.textFile("file:///etc/passwd")
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
>>> wordCounts.collect()
Ref: http://spark.apache.org/docs/latest/quick-start.html#more-on-rdd-operations
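If you want to package the same logic as a script for spark-submit under Spark 2 (as in your original run), a minimal sketch along those lines might look like the following. It assumes your original /user/sami/WC.txt input path; adjust the path and app name as needed.

from operator import add
from pyspark.sql import SparkSession

# Input path taken from your original script; replace with your own HDFS file if different
input_path = "/user/sami/WC.txt"

# Spark 2 entry point; getOrCreate() reuses an existing session if one is already running
spark = SparkSession.builder.appName("MyWordCount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile(input_path)
            .flatMap(lambda line: line.split())   # one record per word
            .map(lambda word: (word, 1))          # (key, value) pairs, as reduceByKey expects
            .reduceByKey(add))                    # sum the 1s for each word

for word, count in counts.collect():
    print("%s: %d" % (word, count))

spark.stop()

Save it as wordcount.py and run it the same way as before, e.g. SPARK_MAJOR_VERSION=2 spark-submit wordcount.py.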
Hope this helps.