Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

PySpark on Zeppelin in sandbox is not loading data


If I execute this code from Zeppelin:

%pyspark
base_rdd = sc.textFile("/tmp/philadelphia-crime-data-2015-ytd.csv")
base_rdd.take(10)

I am not getting any results back. If I execute the same code from the pyspark CLI, I get valid data. Note: I am running pyspark in local mode in the CLI, not in YARN mode.

Zeppelin is not returning any errors, and there are no errors in the log files. I am using the HDP 2.3.2 sandbox.

The same code works in Scala, both in Zeppelin and in the CLI.

Does this look like a bug in Zeppelin with pyspark?

1 ACCEPTED SOLUTION


@azeltov@hortonworks.com

I think the issues are:

1- you have to use the file:// scheme for local files

2- in pyspark, you have to call print explicitly to see the output

See the example below (working for me):

%pyspark
base_rdd = sc.textFile("file:///usr/hdp/current/spark-client/data/mllib/sample_libsvm_data.txt")
print(base_rdd.count())
print(base_rdd.take(3))
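Why the explicit print matters: a REPL like the pyspark CLI echoes the value of a bare expression, but Zeppelin's %pyspark interpreter runs the paragraph like a script, where an expression statement's value is simply discarded. A minimal pure-Python sketch of that difference (no Zeppelin or Spark needed; the cell source is illustrative):

```python
import contextlib
import io

# Run a "cell" the way a script runs: bare expression values are discarded.
source = "x = [1, 2, 3]\nx"  # trailing bare expression, like base_rdd.take(10)
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    exec(compile(source, "<cell>", "exec"))  # 'exec' mode: no echo of values
print(repr(buf.getvalue()))  # → '' : nothing reached stdout

# With an explicit print, the value does reach stdout.
source_with_print = "x = [1, 2, 3]\nprint(x)"
buf2 = io.StringIO()
with contextlib.redirect_stdout(buf2):
    exec(compile(source_with_print, "<cell>", "exec"))
print(repr(buf2.getvalue()))  # → '[1, 2, 3]\n'
```

The same code pasted into the CLI "works" only because the interactive interpreter echoes expression values for you.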


REPLIES


@Ali Bajwa @Neeraj have you encountered this issue with PySpark and Zeppelin?



Yup, I have done something similar with pyspark in Zeppelin as well, so it should work.


I had to give it a fully qualified hdfs:// URI in pyspark for it to work.
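This is consistent with how Spark resolves paths: a path with no scheme is handed to the configured default filesystem (HDFS on the sandbox), so a file that only exists on the local disk is not found there. A small sketch using Python's standard urllib.parse to show which scheme each form of the path carries (the host and port are illustrative, not taken from the thread):

```python
from urllib.parse import urlparse

for uri in (
    "/tmp/data.csv",                                      # no scheme: default FS decides
    "file:///tmp/data.csv",                               # explicit local filesystem
    "hdfs://sandbox.example.com:8020/tmp/data.csv",       # fully qualified HDFS URI
):
    parts = urlparse(uri)
    # Spark routes the read based on this scheme (or its default FS when absent).
    print(parts.scheme or "(default)", parts.path)
```

So /tmp/... quietly pointed at HDFS in Zeppelin, while the local-mode CLI happened to find the local copy.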