PySpark on Zeppelin in sandbox is not loading data


If I execute this code from Zeppelin:

%pyspark base_rdd = sc.textFile("/tmp/philadelphia-crime-data-2015-ytd.csv")
base_rdd.take(10)

I am not getting any results back. If I execute the same code from the PySpark CLI, I get valid data back. Note: I am running PySpark in local mode in the CLI, not in YARN mode.

Zeppelin is not returning any errors, and there are no errors in the log files. I am using the HDP 2.3.2 sandbox.

The same code works in Scala, both in Zeppelin and in the CLI.

Looks like a bug in Zeppelin with PySpark?
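
For anyone comparing the two environments, a quick check is to print which master and default filesystem the running SparkContext is actually using (a minimal sketch; sc._jsc is an internal handle, but it works for a quick look in sandbox-era PySpark):

%pyspark
# Which master is this context on, and where do bare paths like /tmp/... resolve?
print sc.master                                          # e.g. local[*] in the CLI vs yarn-client in Zeppelin
print sc._jsc.hadoopConfiguration().get("fs.defaultFS")  # filesystem used for paths without a scheme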

1 ACCEPTED SOLUTION


@azeltov@hortonworks.com

I think the issues are:

1- You have to use the file:// prefix for local files.

2- With PySpark, you have to use print to see the output.

See the example below (working for me):

%pyspark 
base_rdd = sc.textFile("file:///usr/hdp/current/spark-client/data/mllib/sample_libsvm_data.txt")
print base_rdd.count()
print base_rdd.take(3)
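
Applying the same pattern to the CSV from the question would look like this (a sketch, assuming the file sits in the sandbox's local /tmp, as in the CLI run):

%pyspark
# file:// forces the local filesystem; a bare /tmp/... path may resolve against HDFS instead
crime_rdd = sc.textFile("file:///tmp/philadelphia-crime-data-2015-ytd.csv")
print crime_rdd.take(10)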


4 REPLIES


@Ali Bajwa @Neeraj have you encountered this issue with PySpark and Zeppelin?



Yup, I have done similar with PySpark in Zeppelin as well, so it should work.


I had to give it a fully qualified hdfs:// URI in PySpark for it to work.
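
For reference, a fully qualified URI on the HDP sandbox would look roughly like the sketch below (the sandbox.hortonworks.com:8020 NameNode address is the usual sandbox default and is assumed here; the file must already be in HDFS at that path):

%pyspark
# A fully qualified hdfs:// URI bypasses fs.defaultFS resolution entirely
# sandbox.hortonworks.com:8020 is assumed; adjust to your NameNode host and port
base_rdd = sc.textFile("hdfs://sandbox.hortonworks.com:8020/tmp/philadelphia-crime-data-2015-ytd.csv")
print base_rdd.take(10)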