PySpark on Zeppelin in sandbox is not loading data


If I execute this code from Zeppelin:

%pyspark
base_rdd = sc.textFile("/tmp/philadelphia-crime-data-2015-ytd.csv")
base_rdd.take(10)

I am not getting any results back. If I execute the same code from the pyspark CLI, I get valid data. Note: I am running pyspark in local mode in the CLI, not in YARN mode.

Zeppelin is not returning any errors, and there are no errors in the log files. I am using the HDP 2.3.2 sandbox.

The same code works in Scala, both in Zeppelin and in the CLI.

Looks like a bug in Zeppelin with pyspark?

1 ACCEPTED SOLUTION


@azeltov@hortonworks.com

I think the issues are:

1- you have to use the file:// prefix for local files

2- with pyspark, you have to print the result explicitly

See the example below (working for me):

%pyspark 
base_rdd = sc.textFile("file:///usr/hdp/current/spark-client/data/mllib/sample_libsvm_data.txt")
print base_rdd.count()
print base_rdd.take(3)
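
Applying the same two fixes to the original CSV example would look roughly like the sketch below; the file:///tmp path is an assumption that the CSV sits on the sandbox's local filesystem (if it is in HDFS instead, use an hdfs:// URI):

%pyspark
# Sketch only: assumes the CSV is on the sandbox's local filesystem under /tmp
crime_rdd = sc.textFile("file:///tmp/philadelphia-crime-data-2015-ytd.csv")
print crime_rdd.count()
print crime_rdd.take(10)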


4 REPLIES


@Ali Bajwa @Neeraj have you encountered this issue with PySpark and Zeppelin?



Yup, I have done something similar with PySpark in Zeppelin as well, so it should work.


I had to give it a fully qualified hdfs:// URI in pyspark for it to work.
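
For reference, a fully qualified URI would look something like the sketch below; the NameNode host and port are the HDP sandbox defaults and an assumption here (adjust them for your cluster), and it assumes the CSV was uploaded to /tmp in HDFS:

%pyspark
# Sketch: sandbox.hortonworks.com:8020 is the default sandbox NameNode address (assumption)
base_rdd = sc.textFile("hdfs://sandbox.hortonworks.com:8020/tmp/philadelphia-crime-data-2015-ytd.csv")
print base_rdd.take(10)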