Created on 11-20-2018 06:57 AM - edited 09-16-2022 06:54 AM
Hello! I am trying to run a Spark script in Hue:
Name of my script: SparkTest.py
Content of my script:
text_file = sc.textFile("hdfs://...testFile.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...RESULT.txt")
Content of my testFile:
Test Test
Problem:
After running this script, my RESULT.txt file is still empty.
Question:
- Which Spark/Hue configuration do I need to run simple Spark scripts from Hue?
I use the Cloudera 5.13 VM.
Thank you!
Created 11-21-2018 01:54 AM
Thank you!
I have changed it.
I have also changed my paths: the output path must point to a directory, not to a single file, and I added a leading / to it. Now I get the results I expected. I also changed setMaster to "local", because it is just a small Cloudera VM without a cluster.
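To sanity-check the expected result without a cluster, the same word count can be sketched in plain Python, applied to the testFile contents from the question ("Test Test"); this is only an illustrative equivalent, not part of the Spark job:

```python
from collections import Counter

# Plain-Python equivalent of the flatMap/map/reduceByKey pipeline.
lines = ["Test Test"]  # contents of testFile from the question
words = [word for line in lines for word in line.split(" ")]  # flatMap
counts = Counter(words)                                       # reduceByKey

print(dict(counts))  # {'Test': 2}
```

So the RESULT directory should contain the pair ('Test', 2).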
This is a simple Spark script which can be executed in Hue via the Spark editor:
from pyspark import SparkContext, SparkConf

# Define the application name before using it, and run in local mode
# since this is a single-node VM without a cluster.
appNameTEST = "my first working application"
conf = SparkConf().setAppName(appNameTEST).setMaster("local")
sc = SparkContext(conf=conf)

# Read the input, split each line into words, and count each word.
text_file = sc.textFile("hdfs:///user/hive/warehouse/TEST/FilePath")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# saveAsTextFile writes a directory of part files, not a single file.
counts.saveAsTextFile("hdfs:///user/hive/warehouse/TEST/RESULT")
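Note that saveAsTextFile creates a directory (here RESULT) containing part-NNNNN files plus a _SUCCESS marker, rather than a single text file. A minimal local sketch of that layout, simulated with tempfile so it needs neither HDFS nor Spark (the file contents are what a single-partition run of the job above would write):

```python
import os
import tempfile

# Simulate the directory layout saveAsTextFile produces.
out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "part-00000"), "w") as f:
    f.write("('Test', 2)\n")
open(os.path.join(out_dir, "_SUCCESS"), "w").close()  # completion marker

# Reading the result back means concatenating the part files.
lines = []
for name in sorted(os.listdir(out_dir)):
    if name.startswith("part-"):
        with open(os.path.join(out_dir, name)) as f:
            lines += f.read().splitlines()

print(lines)  # ["('Test', 2)"]
```

This is why writing to "RESULT.txt" looked empty: the name was created as a directory, and the actual records were inside its part files.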
Created 11-20-2018 07:52 AM
Created 11-20-2018 08:26 AM
I have added SparkContext to my script:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName(appNameTEST).setMaster(master)
sc = SparkContext(conf=conf)
Most relevant error log in Hue:
Traceback (most recent call last):
File "/yarn/nm/usercache/cloudera/appcache/application_1542723589859_0008/container_1542723589859_0008_01_000002/SparkTest.py", line 3, in <module>
conf = SparkConf().setAppName(appNameTEST).setMaster(master)
NameError: name 'appNameTEST' is not defined
Less relevant error:
2018-11-20 08:07:59,555 [DataStreamer for file /user/cloudera/oozie-oozi/0000003-181120074347071-oozie-oozi-W/spark2-b3ea--spark/action-data.seq] WARN org.apache.hadoop.hdfs.DFSClient - Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
Created 11-20-2018 08:50 AM
Created 11-20-2018 09:05 AM
Thank you, you are right. How can I set a variable in Python for Spark?
I thought that "conf = SparkConf().setAppName(appNameTEST).setMaster(master)" would set this variable?
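The NameError in the traceback above is plain Python behavior, not something Spark-specific: setAppName(appNameTEST) does not define the name, it only passes along whatever appNameTEST already refers to, so the name must be assigned before that line runs. A minimal sketch (no Spark needed):

```python
# Referencing a name that was never assigned raises NameError,
# exactly as in the traceback above.
try:
    conf_name = appNameTEST  # not defined yet -> NameError
except NameError:
    print("appNameTEST is not defined")

# Assign the names first, then they can be passed to SparkConf.
appNameTEST = "my first working application"
master = "local"
print(appNameTEST, master)
```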
Created 11-20-2018 09:11 AM