Explorer
Posts: 7
Registered: ‎09-07-2018

first steps with Spark and Hue


Hello! I am trying to run a Spark script in Hue:


Name of my script: SparkTest.py


Content of my script:

text_file = sc.textFile("hdfs://...testFile.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
  .map(lambda word: (word, 1)) \
  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...RESULT.txt")


Content of my testFile:

Test Test


Problem:

After running this script, my RESULT.txt file is still empty.


Question:

- Which Spark/Hue configuration do I need in order to run simple Spark scripts from Hue?


I use the Cloudera 5.13 VM.


Thank you!

Master
Posts: 402
Registered: ‎07-01-2015

Re: first steps with Spark and Hue

It is hard to tell from this what the problem might be. Can you post the Spark logs? Do you have access to the Spark job UI? Do you get any error messages?
Explorer
Posts: 7
Registered: ‎09-07-2018

Re: first steps with Spark and Hue

I have added a SparkContext to my script:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName(appNameTEST).setMaster(master)
sc = SparkContext(conf=conf)


Most relevant error log in Hue:

Traceback (most recent call last):
  File "/yarn/nm/usercache/cloudera/appcache/application_1542723589859_0008/container_1542723589859_0008_01_000002/SparkTest.py", line 3, in <module>
    conf = SparkConf().setAppName(appNameTEST).setMaster(master)
NameError: name 'appNameTEST' is not defined


Less relevant error:

2018-11-20 08:07:59,555 [DataStreamer for file /user/cloudera/oozie-oozi/0000003-181120074347071-oozie-oozi-W/spark2-b3ea--spark/action-data.seq] WARN  org.apache.hadoop.hdfs.DFSClient  - Caught exception
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)


Master
Posts: 402
Registered: ‎07-01-2015

Re: first steps with Spark and Hue

NameError: name 'appNameTEST' is not defined -> that is a plain Python NameError: Python does not know any variable with this name, because it was never assigned a value.
Explorer
Posts: 7
Registered: ‎09-07-2018

Re: first steps with Spark and Hue

Thank you, you are right. How can I set a variable in Python for Spark?


I thought that "conf = SparkConf().setAppName(appNameTEST).setMaster(master)" would set this variable?


Master
Posts: 402
Registered: ‎07-01-2015

Re: first steps with Spark and Hue

It is just normal Python :-)
Just use myvariable = "value", so:
app_name = "My gorgeous application"
conf = SparkConf().setAppName(app_name).setMaster(master)
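
In the traceback above, master is also undefined, so it needs a value as well. A minimal sketch with both names defined before building the configuration (taking "local" as the master value, which assumes a single-node VM rather than a real cluster):

from pyspark import SparkContext, SparkConf

# Both are plain Python variables; "local" is assumed here for a single-node VM
app_name = "My gorgeous application"
master = "local"

conf = SparkConf().setAppName(app_name).setMaster(master)
sc = SparkContext(conf=conf)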
Explorer
Posts: 7
Registered: ‎09-07-2018

Re: first steps with Spark and Hue

Thank you!


I have changed it.


I have also changed my paths, because the path refers to a directory and not to a file, and I added a / to my path. Now I get the results I expected. I also changed setMaster to "local" because it is just a small Cloudera VM without a cluster.


This is a simple Spark script which can be executed in Hue via the Spark editor:


from pyspark import SparkContext, SparkConf

# Application name and a local master, since this is a single-node Cloudera VM
appNameTEST = "my first working application"
conf = SparkConf().setAppName(appNameTEST).setMaster("local")
sc = SparkContext(conf=conf)

# Word count: split each line on spaces, map every word to (word, 1),
# then sum the counts per word
text_file = sc.textFile("hdfs:///user/hive/warehouse/TEST/FilePath")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

# saveAsTextFile writes its output into a directory of part-files
counts.saveAsTextFile("hdfs:///user/hive/warehouse/TEST/RESULT")
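
For a quick check, the counts can also be printed at the end of the script above (a small sketch; this is fine here because the test file is tiny, but collect() should not be used on large data):

# Optional check: bring the (word, count) pairs back to the driver and print them
for word, count in counts.collect():
    print("%s: %d" % (word, count))

# Stop the SparkContext when the script is done
sc.stop()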
