word count example in hiveserver2

Master Collaborator

I am using this Python code for word count, which I got from the online lessons I am following, but I am not able to run it in my environment. I get the errors shown below:

[sami@hadoop1 ~]$ more wordcount.py
from operator import add
from pyspark import SparkConf, SparkContext
import collections
from pyspark import SparkConf, SparkContext
import collections
conf = SparkConf().setMaster("local").setAppName("MyWordCount")
sc = SparkContext(conf = conf)
lines = sc.textFile("/user/sami/WC.txt").map(lambda x: x.split(' ')).reduceByKey(add)
output = lines.collect()
for (word, count) in output:
        print str(word) +": "+ str(count)

This is the error I get when I run it:

[sami@hadoop1 ~]$ spark-submit wordcount.py
SPARK_MAJOR_VERSION is set to 2, using Spark2
/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/context.py:477: DeprecationWarning: HiveContext is deprecated in Spark 2.0.0. Please use SparkSession.builder.enableHiveSupport().getOrCreate() instead.
17/01/03 17:35:02 INFO SparkContext: Running Spark version 2.0.0.2.5.0.0-1245
17/01/03 17:35:02 INFO SecurityManager: Changing view acls to: sami
17/01/03 17:35:02 INFO SecurityManager: Changing modify acls to: sami
17/01/03 17:35:02 INFO SecurityManager: Changing view acls groups to:
17/01/03 17:35:02 INFO SecurityManager: Changing modify acls groups to:
17/01/03 17:35:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(sami); groups with view permissions: Set(); users  with modify permissions: Set(sami); groups with modify permissions: Set()
17/01/03 17:35:03 INFO Utils: Successfully started service 'sparkDriver' on port 39810.
17/01/03 17:35:03 INFO SparkEnv: Registering MapOutputTracker
17/01/03 17:35:03 INFO SparkEnv: Registering BlockManagerMaster
17/01/03 17:35:03 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-4b68fd3b-659c-4813-830d-52bd9854464b
17/01/03 17:35:03 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/01/03 17:35:03 INFO SparkEnv: Registering OutputCommitCoordinator
17/01/03 17:35:03 INFO log: Logging initialized @1879ms
17/01/03 17:35:03 INFO Server: jetty-9.2.z-SNAPSHOT
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7c1cf887{/jobs,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@49372d5a{/jobs/json,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@525012da{/jobs/job,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6c08bf0e{/jobs/job/json,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@774baac8{/stages,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7e61486{/stages/json,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@55f21062{/stages/stage,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7c2b351d{/stages/stage/json,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@569cddba{/stages/pool,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@61e9b1cb{/stages/pool/json,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@24ddcc94{/storage,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1fb96cf{/storage/json,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@313afa47{/storage/rdd,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1d717242{/storage/rdd/json,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5702a414{/environment,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@a121dd5{/environment/json,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6c1c2612{/executors,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7a69aab1{/executors/json,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@388993ec{/executors/threadDump,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@226fdd9d{/executors/threadDump/json,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2fce4374{/static,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1845481e{/,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6c27ae09{/api,null,AVAILABLE}
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30f81525{/stages/stage/kill,null,AVAILABLE}
17/01/03 17:35:03 INFO ServerConnector: Started ServerConnector@4f1ab37a{HTTP/1.1}{0.0.0.0:4040}
17/01/03 17:35:03 INFO Server: Started @1982ms
17/01/03 17:35:03 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/01/03 17:35:03 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.100.44.17:4040
17/01/03 17:35:03 INFO Utils: Copying /home/sami/wordcount.py to /tmp/spark-0ed1979f-65e3-4de1-a1ab-08b812041a78/userFiles-0018c417-c1d5-4734-aeb9-b020d0dfb936/wordcount.py
17/01/03 17:35:03 INFO SparkContext: Added file file:/home/sami/wordcount.py at file:/home/sami/wordcount.py with timestamp 1483482903703
17/01/03 17:35:03 INFO Executor: Starting executor ID driver on host localhost
17/01/03 17:35:03 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38633.
17/01/03 17:35:03 INFO NettyBlockTransferService: Server created on 10.100.44.17:38633
17/01/03 17:35:03 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.100.44.17, 38633)
17/01/03 17:35:03 INFO BlockManagerMasterEndpoint: Registering block manager 10.100.44.17:38633 with 366.3 MB RAM, BlockManagerId(driver, 10.100.44.17, 38633)
17/01/03 17:35:03 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.100.44.17, 38633)
17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2fbfc77f{/metrics/json,null,AVAILABLE}
17/01/03 17:35:04 INFO EventLoggingListener: Logging events to hdfs:///spark2-history/local-1483482903743
17/01/03 17:35:04 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 344.2 KB, free 366.0 MB)
17/01/03 17:35:04 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 31.2 KB, free 365.9 MB)
17/01/03 17:35:04 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.100.44.17:38633 (size: 31.2 KB, free: 366.3 MB)
17/01/03 17:35:04 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
17/01/03 17:35:05 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 69 for sami on 10.100.44.17:8020
17/01/03 17:35:05 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 10.100.44.17:8020, Ident: (HDFS_DELEGATION_TOKEN token 69 for sami)
17/01/03 17:35:05 WARN Token: Cannot find class for token kind kms-dt
17/01/03 17:35:05 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: kms-dt, Service: 10.100.44.17:9292, Ident: 00 04 73 61 6d 69 04 79 61 72 6e 00 8a 01 59 66 78 b9 e0 8a 01 59 8a 85 3d e0 05 02
17/01/03 17:35:05 INFO FileInputFormat: Total input paths to process : 1
17/01/03 17:35:05 INFO SparkContext: Starting job: collect at /home/sami/wordcount.py:12
17/01/03 17:35:05 INFO DAGScheduler: Registering RDD 3 (reduceByKey at /home/sami/wordcount.py:11)
17/01/03 17:35:05 INFO DAGScheduler: Got job 0 (collect at /home/sami/wordcount.py:12) with 1 output partitions
17/01/03 17:35:05 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /home/sami/wordcount.py:12)
17/01/03 17:35:05 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
17/01/03 17:35:05 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
17/01/03 17:35:05 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11), which has no missing parents
17/01/03 17:35:05 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.8 KB, free 365.9 MB)
17/01/03 17:35:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.5 KB, free 365.9 MB)
17/01/03 17:35:05 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.100.44.17:38633 (size: 5.5 KB, free: 366.3 MB)
17/01/03 17:35:05 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1012
17/01/03 17:35:05 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11)
17/01/03 17:35:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/01/03 17:35:05 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, ANY, 5480 bytes)
17/01/03 17:35:05 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/01/03 17:35:05 INFO Executor: Fetching file:/home/sami/wordcount.py with timestamp 1483482903703
17/01/03 17:35:05 INFO Utils: /home/sami/wordcount.py has been previously copied to /tmp/spark-0ed1979f-65e3-4de1-a1ab-08b812041a78/userFiles-0018c417-c1d5-4734-aeb9-b020d0dfb936/wordcount.py
17/01/03 17:35:05 INFO HadoopRDD: Input split: hdfs://hadoop1.tolls.dot.state.fl.us:8020/user/sami/WC.txt:0+45
17/01/03 17:35:05 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/01/03 17:35:05 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/01/03 17:35:05 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/01/03 17:35:05 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/01/03 17:35:05 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/context.py:477: DeprecationWarning: HiveContext is deprecated in Spark 2.0.0. Please use SparkSession.builder.enableHiveSupport().getOrCreate() instead.
17/01/03 17:35:05 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
    for k, v in iterator:
ValueError: too many values to unpack
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
17/01/03 17:35:05 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
    for k, v in iterator:
ValueError: too many values to unpack
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
17/01/03 17:35:05 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
17/01/03 17:35:05 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/01/03 17:35:05 INFO TaskSchedulerImpl: Cancelling stage 0
17/01/03 17:35:05 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at /home/sami/wordcount.py:11) failed in 0.426 s
17/01/03 17:35:05 INFO DAGScheduler: Job 0 failed: collect at /home/sami/wordcount.py:12, took 0.491790 s
Traceback (most recent call last):
  File "/home/sami/wordcount.py", line 12, in <module>
    output = lines.collect()
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 776, in collect
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
    for k, v in iterator:
ValueError: too many values to unpack
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:211)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
    for k, v in iterator:
ValueError: too many values to unpack
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more
17/01/03 17:35:05 INFO SparkContext: Invoking stop() from shutdown hook
17/01/03 17:35:05 INFO ServerConnector: Stopped ServerConnector@4f1ab37a{HTTP/1.1}{0.0.0.0:4040}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@30f81525{/stages/stage/kill,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@6c27ae09{/api,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@1845481e{/,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@2fce4374{/static,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@226fdd9d{/executors/threadDump/json,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@388993ec{/executors/threadDump,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7a69aab1{/executors/json,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@6c1c2612{/executors,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@a121dd5{/environment/json,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@5702a414{/environment,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@1d717242{/storage/rdd/json,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@313afa47{/storage/rdd,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@1fb96cf{/storage/json,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@24ddcc94{/storage,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@61e9b1cb{/stages/pool/json,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@569cddba{/stages/pool,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7c2b351d{/stages/stage/json,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@55f21062{/stages/stage,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7e61486{/stages/json,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@774baac8{/stages,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@6c08bf0e{/jobs/job/json,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@525012da{/jobs/job,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@49372d5a{/jobs/json,null,UNAVAILABLE}
17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7c1cf887{/jobs,null,UNAVAILABLE}
17/01/03 17:35:05 INFO SparkUI: Stopped Spark web UI at http://10.100.44.17:4040
17/01/03 17:35:05 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/01/03 17:35:05 INFO MemoryStore: MemoryStore cleared
17/01/03 17:35:05 INFO BlockManager: BlockManager stopped
17/01/03 17:35:05 INFO BlockManagerMaster: BlockManagerMaster stopped
17/01/03 17:35:05 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/01/03 17:35:05 INFO SparkContext: Successfully stopped SparkContext
17/01/03 17:35:05 INFO ShutdownHookManager: Shutdown hook called
17/01/03 17:35:05 INFO ShutdownHookManager: Deleting directory /tmp/spark-0ed1979f-65e3-4de1-a1ab-08b812041a78/pyspark-9c5ed27a-af62-44ce-ae38-229c90a05438

1 ACCEPTED SOLUTION


@Sami Ahmad

Below is the simple word count example from the Spark docs; it should work with both Spark 1.6.x and Spark 2.x.

>>> textFile = sc.textFile("file:///etc/passwd")
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
>>> wordCounts.collect()

Ref: http://spark.apache.org/docs/latest/quick-start.html#more-on-rdd-operations

Hope this helps.


4 REPLIES

Super Guru

Run this under Spark 1.6.

Master Collaborator

But I want to use Spark 2. Can I modify this to run under Spark 2?

Master Collaborator

I tried it with Spark 1.6 and it gives the same error:

[sami@hadoop1 ~]$ unset SPARK_MAJOR_VERSION
[sami@hadoop1 ~]$ spark-submit wordcount.py
Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set
Spark1 will be picked by default
17/01/04 10:14:48 INFO SparkContext: Running Spark version 1.6.2
17/01/04 10:14:51 WARN UserGroupInformation: Exception encountered while running the renewal command. Aborting renew thread. ExitCodeException exitCode=1: kinit: Ticket expired while renewing credentials
17/01/04 10:14:51 INFO SecurityManager: Changing view acls to: sami
17/01/04 10:14:51 INFO SecurityManager: Changing modify acls to: sami
17/01/04 10:14:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sami); users with modify permissions: Set(sami)
17/01/04 10:14:53 INFO Utils: Successfully started service 'sparkDriver' on port 44312.
17/01/04 10:14:54 INFO Slf4jLogger: Slf4jLogger started
17/01/04 10:14:55 INFO Remoting: Starting remoting
17/01/04 10:14:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.100.44.17:41165]
17/01/04 10:14:55 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 41165.
17/01/04 10:14:55 INFO SparkEnv: Registering MapOutputTracker
17/01/04 10:14:55 INFO SparkEnv: Registering BlockManagerMaster
17/01/04 10:14:55 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-11455db4-924f-4877-84ee-43a376464966
17/01/04 10:14:55 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
17/01/04 10:14:55 INFO SparkEnv: Registering OutputCommitCoordinator
17/01/04 10:14:56 INFO Server: jetty-8.y.z-SNAPSHOT
17/01/04 10:14:56 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
17/01/04 10:14:56 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/01/04 10:14:56 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.100.44.17:4040
17/01/04 10:14:57 INFO Utils: Copying /home/sami/wordcount.py to /tmp/spark-d40dae3a-c20f-4d7e-ac77-60c504ff5be8/userFiles-ffa57bc1-db84-424b-a1b4-22c494cf7257/wordcount.py
17/01/04 10:14:57 INFO SparkContext: Added file file:/home/sami/wordcount.py at file:/home/sami/wordcount.py with timestamp 1483542897654
17/01/04 10:14:58 INFO Executor: Starting executor ID driver on host localhost
17/01/04 10:14:58 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45420.
17/01/04 10:14:58 INFO NettyBlockTransferService: Server created on 45420
17/01/04 10:14:58 INFO BlockManagerMaster: Trying to register BlockManager
17/01/04 10:14:58 INFO BlockManagerMasterEndpoint: Registering block manager localhost:45420 with 511.1 MB RAM, BlockManagerId(driver, localhost, 45420)
17/01/04 10:14:58 INFO BlockManagerMaster: Registered BlockManager
17/01/04 10:15:01 INFO EventLoggingListener: Logging events to hdfs:///spark-history/local-1483542898060
17/01/04 10:15:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 344.0 KB, free 344.0 KB)
17/01/04 10:15:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 30.1 KB, free 374.0 KB)
17/01/04 10:15:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:45420 (size: 30.1 KB, free: 511.1 MB)
17/01/04 10:15:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
17/01/04 10:15:04 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 71 for sami on 10.100.44.17:8020
17/01/04 10:15:04 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 10.100.44.17:8020, Ident: (HDFS_DELEGATION_TOKEN token 71 for sami)
17/01/04 10:15:04 WARN Token: Cannot find class for token kind kms-dt
17/01/04 10:15:04 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: kms-dt, Service: 10.100.44.17:9292, Ident: 00 04 73 61 6d 69 04 79 61 72 6e 00 8a 01 59 6a 0c 3e c7 8a 01 59 8e 18 c2 c7 07 02
17/01/04 10:15:04 INFO FileInputFormat: Total input paths to process : 1
17/01/04 10:15:05 INFO SparkContext: Starting job: collect at /home/sami/wordcount.py:12
17/01/04 10:15:05 INFO DAGScheduler: Registering RDD 3 (reduceByKey at /home/sami/wordcount.py:11)
17/01/04 10:15:05 INFO DAGScheduler: Got job 0 (collect at /home/sami/wordcount.py:12) with 1 output partitions
17/01/04 10:15:05 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /home/sami/wordcount.py:12)
17/01/04 10:15:05 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
17/01/04 10:15:05 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
17/01/04 10:15:05 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11), which has no missing parents
17/01/04 10:15:05 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.9 KB, free 381.9 KB)
17/01/04 10:15:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.0 KB, free 386.9 KB)
17/01/04 10:15:05 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:45420 (size: 5.0 KB, free: 511.1 MB)
17/01/04 10:15:05 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008
17/01/04 10:15:05 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11)
17/01/04 10:15:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/01/04 10:15:05 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,ANY, 2187 bytes)
17/01/04 10:15:05 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/01/04 10:15:05 INFO Executor: Fetching file:/home/sami/wordcount.py with timestamp 1483542897654
17/01/04 10:15:05 INFO Utils: /home/sami/wordcount.py has been previously copied to /tmp/spark-d40dae3a-c20f-4d7e-ac77-60c504ff5be8/userFiles-ffa57bc1-db84-424b-a1b4-22c494cf7257/wordcount.py
17/01/04 10:15:05 INFO HadoopRDD: Input split: hdfs://hadoop1.tolls.dot.state.fl.us:8020/user/sami/WC.txt:0+45
17/01/04 10:15:05 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/01/04 10:15:05 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/01/04 10:15:05 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/01/04 10:15:05 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/01/04 10:15:05 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/01/04 10:15:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1776, in combineLocally
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
    for k, v in iterator:
ValueError: too many values to unpack
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

