Created 01-04-2017 02:16 PM
I am using this Python word count code that I got from the online lessons I am following, but I am not able to run it in my environment. I am getting the errors shown below:
[sami@hadoop1 ~]$ more wordcount.py
from operator import add
from pyspark import SparkConf, SparkContext
import collections
from pyspark import SparkConf, SparkContext
import collections

conf = SparkConf().setMaster("local").setAppName("MyWordCount")
sc = SparkContext(conf = conf)

lines = sc.textFile("/user/sami/WC.txt").map(lambda x: x.split(' ')).reduceByKey(add)
output = lines.collect()
for (word, count) in output:
    print str(word) +": "+ str(count)
When I submit it, I get the following error:
[sami@hadoop1 ~]$ spark-submit wordcount.py SPARK_MAJOR_VERSION is set to 2, using Spark2 /usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/context.py:477: DeprecationWarning: HiveContext is deprecated in Spark 2.0.0. Please use SparkSession.builder.enableHiveSupport().getOrCreate() instead. 17/01/03 17:35:02 INFO SparkContext: Running Spark version 2.0.0.2.5.0.0-1245 17/01/03 17:35:02 INFO SecurityManager: Changing view acls to: sami 17/01/03 17:35:02 INFO SecurityManager: Changing modify acls to: sami 17/01/03 17:35:02 INFO SecurityManager: Changing view acls groups to: 17/01/03 17:35:02 INFO SecurityManager: Changing modify acls groups to: 17/01/03 17:35:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sami); groups with view permissions: Set(); users with modify permissions: Set(sami); groups with modify permissions: Set() 17/01/03 17:35:03 INFO Utils: Successfully started service 'sparkDriver' on port 39810. 17/01/03 17:35:03 INFO SparkEnv: Registering MapOutputTracker 17/01/03 17:35:03 INFO SparkEnv: Registering BlockManagerMaster 17/01/03 17:35:03 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-4b68fd3b-659c-4813-830d-52bd9854464b 17/01/03 17:35:03 INFO MemoryStore: MemoryStore started with capacity 366.3 MB 17/01/03 17:35:03 INFO SparkEnv: Registering OutputCommitCoordinator 17/01/03 17:35:03 INFO log: Logging initialized @1879ms 17/01/03 17:35:03 INFO Server: jetty-9.2.z-SNAPSHOT 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7c1cf887{/jobs,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@49372d5a{/jobs/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@525012da{/jobs/job,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6c08bf0e{/jobs/job/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@774baac8{/stages,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7e61486{/stages/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@55f21062{/stages/stage,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7c2b351d{/stages/stage/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@569cddba{/stages/pool,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@61e9b1cb{/stages/pool/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@24ddcc94{/storage,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1fb96cf{/storage/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@313afa47{/storage/rdd,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1d717242{/storage/rdd/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5702a414{/environment,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@a121dd5{/environment/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6c1c2612{/executors,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started 
o.s.j.s.ServletContextHandler@7a69aab1{/executors/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@388993ec{/executors/threadDump,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@226fdd9d{/executors/threadDump/json,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2fce4374{/static,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1845481e{/,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6c27ae09{/api,null,AVAILABLE} 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30f81525{/stages/stage/kill,null,AVAILABLE} 17/01/03 17:35:03 INFO ServerConnector: Started ServerConnector@4f1ab37a{HTTP/1.1}{0.0.0.0:4040} 17/01/03 17:35:03 INFO Server: Started @1982ms 17/01/03 17:35:03 INFO Utils: Successfully started service 'SparkUI' on port 4040. 17/01/03 17:35:03 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.100.44.17:4040 17/01/03 17:35:03 INFO Utils: Copying /home/sami/wordcount.py to /tmp/spark-0ed1979f-65e3-4de1-a1ab-08b812041a78/userFiles-0018c417-c1d5-4734-aeb9-b020d0dfb936/wordcount.py 17/01/03 17:35:03 INFO SparkContext: Added file file:/home/sami/wordcount.py at file:/home/sami/wordcount.py with timestamp 1483482903703 17/01/03 17:35:03 INFO Executor: Starting executor ID driver on host localhost 17/01/03 17:35:03 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38633. 17/01/03 17:35:03 INFO NettyBlockTransferService: Server created on 10.100.44.17:38633 17/01/03 17:35:03 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.100.44.17, 38633) 17/01/03 17:35:03 INFO BlockManagerMasterEndpoint: Registering block manager 10.100.44.17:38633 with 366.3 MB RAM, BlockManagerId(driver, 10.100.44.17, 38633) 17/01/03 17:35:03 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.100.44.17, 38633) 17/01/03 17:35:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2fbfc77f{/metrics/json,null,AVAILABLE} 17/01/03 17:35:04 INFO EventLoggingListener: Logging events to hdfs:///spark2-history/local-1483482903743 17/01/03 17:35:04 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 344.2 KB, free 366.0 MB) 17/01/03 17:35:04 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 31.2 KB, free 365.9 MB) 17/01/03 17:35:04 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.100.44.17:38633 (size: 31.2 KB, free: 366.3 MB) 17/01/03 17:35:04 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2 17/01/03 17:35:05 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 69 for sami on 10.100.44.17:8020 17/01/03 17:35:05 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 10.100.44.17:8020, Ident: (HDFS_DELEGATION_TOKEN token 69 for sami) 17/01/03 17:35:05 WARN Token: Cannot find class for token kind kms-dt 17/01/03 17:35:05 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: kms-dt, Service: 10.100.44.17:9292, Ident: 00 04 73 61 6d 69 04 79 61 72 6e 00 8a 01 59 66 78 b9 e0 8a 01 59 8a 85 3d e0 05 02 17/01/03 17:35:05 INFO FileInputFormat: Total input paths to process : 1 17/01/03 17:35:05 INFO SparkContext: Starting job: collect at /home/sami/wordcount.py:12 17/01/03 
17:35:05 INFO DAGScheduler: Registering RDD 3 (reduceByKey at /home/sami/wordcount.py:11) 17/01/03 17:35:05 INFO DAGScheduler: Got job 0 (collect at /home/sami/wordcount.py:12) with 1 output partitions 17/01/03 17:35:05 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /home/sami/wordcount.py:12) 17/01/03 17:35:05 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 17/01/03 17:35:05 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0) 17/01/03 17:35:05 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11), which has no missing parents 17/01/03 17:35:05 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.8 KB, free 365.9 MB) 17/01/03 17:35:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.5 KB, free 365.9 MB) 17/01/03 17:35:05 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.100.44.17:38633 (size: 5.5 KB, free: 366.3 MB) 17/01/03 17:35:05 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1012 17/01/03 17:35:05 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11) 17/01/03 17:35:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 17/01/03 17:35:05 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, ANY, 5480 bytes) 17/01/03 17:35:05 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 17/01/03 17:35:05 INFO Executor: Fetching file:/home/sami/wordcount.py with timestamp 1483482903703 17/01/03 17:35:05 INFO Utils: /home/sami/wordcount.py has been previously copied to /tmp/spark-0ed1979f-65e3-4de1-a1ab-08b812041a78/userFiles-0018c417-c1d5-4734-aeb9-b020d0dfb936/wordcount.py 17/01/03 17:35:05 INFO HadoopRDD: Input split: hdfs://hadoop1.tolls.dot.state.fl.us:8020/user/sami/WC.txt:0+45 17/01/03 17:35:05 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 17/01/03 17:35:05 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 17/01/03 17:35:05 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 17/01/03 17:35:05 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 17/01/03 17:35:05 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id /usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/context.py:477: DeprecationWarning: HiveContext is deprecated in Spark 2.0.0. Please use SparkSession.builder.enableHiveSupport().getOrCreate() instead. 
17/01/03 17:35:05 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main process() File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues for k, v in iterator: ValueError: too many values to unpack at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/01/03 17:35:05 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main process() File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues for k, v in iterator: ValueError: too many values to unpack at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at 
org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/01/03 17:35:05 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job 17/01/03 17:35:05 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 17/01/03 17:35:05 INFO TaskSchedulerImpl: Cancelling stage 0 17/01/03 17:35:05 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at /home/sami/wordcount.py:11) failed in 0.426 s 17/01/03 17:35:05 INFO DAGScheduler: Job 0 failed: collect at /home/sami/wordcount.py:12, took 0.491790 s Traceback (most recent call last): File "/home/sami/wordcount.py", line 12, in <module> output = lines.collect() File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 776, in collect File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main process() File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues for k, v in iterator: ValueError: too many values to unpack at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:358) at org.apache.spark.rdd.RDD.collect(RDD.scala:892) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:211) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main process() File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process serializer.dump_stream(func(split_index, iterator), outfile) File 
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues for k, v in iterator: ValueError: too many values to unpack at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more 17/01/03 17:35:05 INFO SparkContext: Invoking stop() from shutdown hook 17/01/03 17:35:05 INFO ServerConnector: Stopped ServerConnector@4f1ab37a{HTTP/1.1}{0.0.0.0:4040} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@30f81525{/stages/stage/kill,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@6c27ae09{/api,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@1845481e{/,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@2fce4374{/static,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@226fdd9d{/executors/threadDump/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@388993ec{/executors/threadDump,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7a69aab1{/executors/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@6c1c2612{/executors,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@a121dd5{/environment/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@5702a414{/environment,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@1d717242{/storage/rdd/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@313afa47{/storage/rdd,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@1fb96cf{/storage/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@24ddcc94{/storage,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped 
o.s.j.s.ServletContextHandler@61e9b1cb{/stages/pool/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@569cddba{/stages/pool,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7c2b351d{/stages/stage/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@55f21062{/stages/stage,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7e61486{/stages/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@774baac8{/stages,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@6c08bf0e{/jobs/job/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@525012da{/jobs/job,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@49372d5a{/jobs/json,null,UNAVAILABLE} 17/01/03 17:35:05 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@7c1cf887{/jobs,null,UNAVAILABLE} 17/01/03 17:35:05 INFO SparkUI: Stopped Spark web UI at http://10.100.44.17:4040 17/01/03 17:35:05 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/01/03 17:35:05 INFO MemoryStore: MemoryStore cleared 17/01/03 17:35:05 INFO BlockManager: BlockManager stopped 17/01/03 17:35:05 INFO BlockManagerMaster: BlockManagerMaster stopped 17/01/03 17:35:05 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 17/01/03 17:35:05 INFO SparkContext: Successfully stopped SparkContext 17/01/03 17:35:05 INFO ShutdownHookManager: Shutdown hook called 17/01/03 17:35:05 INFO ShutdownHookManager: Deleting directory /tmp/spark-0ed1979f-65e3-4de1-a1ab-08b812041a78/pyspark-9c5ed27a-af62-44ce-ae38-229c90a05438
Created 01-04-2017 02:16 PM
Run this under Spark 1.6.
Created 01-04-2017 03:13 PM
But I want to use Spark 2. Can I modify this to run under Spark 2?
Created 01-04-2017 03:16 PM
I tried it with Spark 1.6, and it gives the same error:
[sami@hadoop1 ~]$ unset SPARK_MAJOR_VERSION [sami@hadoop1 ~]$ spark-submit wordcount.py Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set Spark1 will be picked by default 17/01/04 10:14:48 INFO SparkContext: Running Spark version 1.6.2 17/01/04 10:14:51 WARN UserGroupInformation: Exception encountered while running the renewal command. Aborting renew thread. ExitCodeException exitCode=1: kinit: Ticket expired while renewing credentials 17/01/04 10:14:51 INFO SecurityManager: Changing view acls to: sami 17/01/04 10:14:51 INFO SecurityManager: Changing modify acls to: sami 17/01/04 10:14:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sami); users with modify permissions: Set(sami) 17/01/04 10:14:53 INFO Utils: Successfully started service 'sparkDriver' on port 44312. 17/01/04 10:14:54 INFO Slf4jLogger: Slf4jLogger started 17/01/04 10:14:55 INFO Remoting: Starting remoting 17/01/04 10:14:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.100.44.17:41165] 17/01/04 10:14:55 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 41165. 17/01/04 10:14:55 INFO SparkEnv: Registering MapOutputTracker 17/01/04 10:14:55 INFO SparkEnv: Registering BlockManagerMaster 17/01/04 10:14:55 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-11455db4-924f-4877-84ee-43a376464966 17/01/04 10:14:55 INFO MemoryStore: MemoryStore started with capacity 511.1 MB 17/01/04 10:14:55 INFO SparkEnv: Registering OutputCommitCoordinator 17/01/04 10:14:56 INFO Server: jetty-8.y.z-SNAPSHOT 17/01/04 10:14:56 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 17/01/04 10:14:56 INFO Utils: Successfully started service 'SparkUI' on port 4040. 17/01/04 10:14:56 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.100.44.17:4040 17/01/04 10:14:57 INFO Utils: Copying /home/sami/wordcount.py to /tmp/spark-d40dae3a-c20f-4d7e-ac77-60c504ff5be8/userFiles-ffa57bc1-db84-424b-a1b4-22c494cf7257/wordcount.py 17/01/04 10:14:57 INFO SparkContext: Added file file:/home/sami/wordcount.py at file:/home/sami/wordcount.py with timestamp 1483542897654 17/01/04 10:14:58 INFO Executor: Starting executor ID driver on host localhost 17/01/04 10:14:58 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45420. 
17/01/04 10:14:58 INFO NettyBlockTransferService: Server created on 45420 17/01/04 10:14:58 INFO BlockManagerMaster: Trying to register BlockManager 17/01/04 10:14:58 INFO BlockManagerMasterEndpoint: Registering block manager localhost:45420 with 511.1 MB RAM, BlockManagerId(driver, localhost, 45420) 17/01/04 10:14:58 INFO BlockManagerMaster: Registered BlockManager 17/01/04 10:15:01 INFO EventLoggingListener: Logging events to hdfs:///spark-history/local-1483542898060 17/01/04 10:15:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 344.0 KB, free 344.0 KB) 17/01/04 10:15:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 30.1 KB, free 374.0 KB) 17/01/04 10:15:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:45420 (size: 30.1 KB, free: 511.1 MB) 17/01/04 10:15:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2 17/01/04 10:15:04 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 71 for sami on 10.100.44.17:8020 17/01/04 10:15:04 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 10.100.44.17:8020, Ident: (HDFS_DELEGATION_TOKEN token 71 for sami) 17/01/04 10:15:04 WARN Token: Cannot find class for token kind kms-dt 17/01/04 10:15:04 INFO TokenCache: Got dt for hdfs://hadoop1.tolls.dot.state.fl.us:8020; Kind: kms-dt, Service: 10.100.44.17:9292, Ident: 00 04 73 61 6d 69 04 79 61 72 6e 00 8a 01 59 6a 0c 3e c7 8a 01 59 8e 18 c2 c7 07 02 17/01/04 10:15:04 INFO FileInputFormat: Total input paths to process : 1 17/01/04 10:15:05 INFO SparkContext: Starting job: collect at /home/sami/wordcount.py:12 17/01/04 10:15:05 INFO DAGScheduler: Registering RDD 3 (reduceByKey at /home/sami/wordcount.py:11) 17/01/04 10:15:05 INFO DAGScheduler: Got job 0 (collect at /home/sami/wordcount.py:12) with 1 output partitions 17/01/04 10:15:05 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /home/sami/wordcount.py:12) 17/01/04 10:15:05 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 17/01/04 10:15:05 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0) 17/01/04 10:15:05 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11), which has no missing parents 17/01/04 10:15:05 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.9 KB, free 381.9 KB) 17/01/04 10:15:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.0 KB, free 386.9 KB) 17/01/04 10:15:05 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:45420 (size: 5.0 KB, free: 511.1 MB) 17/01/04 10:15:05 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008 17/01/04 10:15:05 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /home/sami/wordcount.py:11) 17/01/04 10:15:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 17/01/04 10:15:05 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,ANY, 2187 bytes) 17/01/04 10:15:05 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 17/01/04 10:15:05 INFO Executor: Fetching file:/home/sami/wordcount.py with timestamp 1483542897654 17/01/04 10:15:05 INFO Utils: /home/sami/wordcount.py has been previously copied to /tmp/spark-d40dae3a-c20f-4d7e-ac77-60c504ff5be8/userFiles-ffa57bc1-db84-424b-a1b4-22c494cf7257/wordcount.py 17/01/04 10:15:05 INFO HadoopRDD: 
Input split: hdfs://hadoop1.tolls.dot.state.fl.us:8020/user/sami/WC.txt:0+45 17/01/04 10:15:05 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 17/01/04 10:15:05 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 17/01/04 10:15:05 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 17/01/04 10:15:05 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 17/01/04 10:15:05 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 17/01/04 10:15:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main process() File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1776, in combineLocally File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues for k, v in iterator: ValueError: too many values to unpack at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
Created 01-04-2017 07:39 PM
Below is the simple word count example from the Spark docs; it should work with both Spark 1.6.x and Spark 2.x. The "ValueError: too many values to unpack" in your run happens because reduceByKey expects an RDD of (key, value) pairs, but map(lambda x: x.split(' ')) produces an RDD of word lists; you need to flatMap to individual words and map each word to a (word, 1) pair first.
>>> textFile = sc.textFile("file:///etc/passwd")
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
>>> wordCounts.collect()
Ref: http://spark.apache.org/docs/latest/quick-start.html#more-on-rdd-operations
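If you want to package the same logic as a script for spark-submit under Spark 2 (as in your original run), a minimal sketch along those lines might look like the following. It assumes your original /user/sami/WC.txt input path; adjust the path and app name as needed.

from operator import add
from pyspark.sql import SparkSession

# Input path taken from your original script; replace with your own HDFS file if different
input_path = "/user/sami/WC.txt"

# Spark 2 entry point; getOrCreate() reuses an existing session if one is already running
spark = SparkSession.builder.appName("MyWordCount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile(input_path)
            .flatMap(lambda line: line.split())   # one record per word
            .map(lambda word: (word, 1))          # (key, value) pairs, as reduceByKey expects
            .reduceByKey(add))                    # sum the 1s for each word

for word, count in counts.collect():
    print("%s: %d" % (word, count))

spark.stop()

Save it as wordcount.py and run it the same way as before, e.g. SPARK_MAJOR_VERSION=2 spark-submit wordcount.py.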
Hope this helps.