
First steps with Spark and Hue

Explorer

Hello! I am trying to run a Spark script in Hue:

 

Name of my script: SparkTest.py

 

Content of my script:

text_file = sc.textFile("hdfs://...testFile.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...RESULT.txt")

 

Content of my testFile:

Test Test

 

Problem:

After running this script, my RESULT.txt file is still empty.

 

Question:

- Which Spark/Hue configuration do I need to run simple Spark scripts in Hue?

 

I use the Cloudera VM 5.13.


Thank you!


6 REPLIES

It is hard to tell from this what the problem might be. Can you post the Spark logs? Do you have access to the Spark job UI? Do you get any error messages?

Explorer

I have added SparkContext to my script:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName(appNameTEST).setMaster(master)
sc = SparkContext(conf=conf)


Most relevant error log in Hue:

Traceback (most recent call last):
  File "/yarn/nm/usercache/cloudera/appcache/application_1542723589859_0008/container_1542723589859_0008_01_000002/SparkTest.py", line 3, in <module>
    conf = SparkConf().setAppName(appNameTEST).setMaster(master)
NameError: name 'appNameTEST' is not defined


Less relevant error:

2018-11-20 08:07:59,555 [DataStreamer for file /user/cloudera/oozie-oozi/0000003-181120074347071-oozie-oozi-W/spark2-b3ea--spark/action-data.seq] WARN  org.apache.hadoop.hdfs.DFSClient  - Caught exception
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)


NameError: name 'appNameTEST' is not defined -> that is a NameError; Python does not know any variable with this name.

Explorer

Thank you, you are right. How can I set a variable in Python for Spark?

 

I thought that "conf = SparkConf().setAppName(appNameTEST).setMaster(master)" would set this variable?


It is just normal Python 🙂
Just use myvariable = "value", so:
app_name = "My gorgeous application"
conf = SparkConf().setAppName(app_name).setMaster(master)
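
A minimal complete sketch of this fix (note that master must also be defined as a plain Python variable before use; "local" here is an assumption matching the single-node Cloudera VM, not something from the reply above):

from pyspark import SparkContext, SparkConf

# both are ordinary Python variables, defined before they are used
app_name = "My gorgeous application"
master = "local"  # assumption: single-node VM, no cluster

conf = SparkConf().setAppName(app_name).setMaster(master)
sc = SparkContext(conf=conf)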

Explorer

Thank you!

 

I have changed it.

 

I have also changed my paths, because the output path refers to a directory and not to a file, and I added a / to my path. Now I get the results I expected. I changed setMaster to "local" because it is just a small Cloudera VM without a cluster.

 

This is a simple Spark script which can be executed in Hue via the Spark editor:

 

from pyspark import SparkContext, SparkConf

appNameTEST = "my first working application"

# "local" master, because this is a single small Cloudera VM without a cluster
conf = SparkConf().setAppName(appNameTEST).setMaster("local")
sc = SparkContext(conf=conf)

# word count: split each line into words, map each word to (word, 1), sum the counts
text_file = sc.textFile("hdfs:///user/hive/warehouse/TEST/FilePath")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs:///user/hive/warehouse/TEST/RESULT")
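
For reference, saveAsTextFile writes a directory of part files (part-00000, part-00001, ...) rather than a single text file, which is why the output path above points to a directory. A minimal sketch of how the result could be read back for checking, using the same paths as above:

# sc.textFile accepts a directory and reads every part file inside it
result = sc.textFile("hdfs:///user/hive/warehouse/TEST/RESULT")
for line in result.collect():  # the word-count output here is tiny, so collect() is safe
    print(line)

# for a testFile containing "Test Test", this prints: ('Test', 2)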