Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

PySpark on Zeppelin in sandbox is not loading data


If I execute this code from Zeppelin:

%pyspark
base_rdd = sc.textFile("/tmp/philadelphia-crime-data-2015-ytd.csv")
base_rdd.take(10)

I am not getting any results back. If I execute the same code from the pyspark CLI, I get valid data. Note: I am running pyspark in local mode in the CLI, not in YARN mode.

Zeppelin is not returning any errors, and there are no errors in the log files. I am using the HDP 2.3.2 sandbox.

The same code works in Scala, both in Zeppelin and in the CLI.

Does this look like a bug in Zeppelin with pyspark?

1 ACCEPTED SOLUTION


@azeltov@hortonworks.com

I think the issues are:

1- you have to use the file:// scheme for local files

2- in pyspark, you have to call print explicitly to see the output

See the example below (working for me):

%pyspark
base_rdd = sc.textFile("file:///usr/hdp/current/spark-client/data/mllib/sample_libsvm_data.txt")
print(base_rdd.count())
print(base_rdd.take(3))
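Why the explicit print matters: a REPL like the pyspark CLI echoes the value of a bare expression, but Zeppelin's %pyspark interpreter runs the paragraph like a script, where an expression statement's value is simply discarded. A minimal pure-Python sketch of that difference (no Zeppelin or Spark needed; the cell source is illustrative):

```python
import contextlib
import io

# Run a "cell" the way a script runs: bare expression values are discarded.
source = "x = [1, 2, 3]\nx"  # trailing bare expression, like base_rdd.take(10)
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    exec(compile(source, "<cell>", "exec"))  # 'exec' mode: no echo of values
print(repr(buf.getvalue()))  # → '' : nothing reached stdout

# With an explicit print, the value does reach stdout.
source_with_print = "x = [1, 2, 3]\nprint(x)"
buf2 = io.StringIO()
with contextlib.redirect_stdout(buf2):
    exec(compile(source_with_print, "<cell>", "exec"))
print(repr(buf2.getvalue()))  # → '[1, 2, 3]\n'
```

The same code pasted into the CLI "works" only because the interactive interpreter echoes expression values for you.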


REPLIES


@Ali Bajwa @Neeraj have you encountered this issue with PySpark and Zeppelin?



Yup, I have done something similar with pyspark in Zeppelin as well, so it should work.


I had to give it a fully qualified hdfs:// URI in pyspark for it to work.
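This is consistent with how Spark resolves paths: a path with no scheme is handed to the configured default filesystem (HDFS on the sandbox), so a file that only exists on the local disk is not found there. A small sketch using Python's standard urllib.parse to show which scheme each form of the path carries (the host and port are illustrative, not taken from the thread):

```python
from urllib.parse import urlparse

for uri in (
    "/tmp/data.csv",                                      # no scheme: default FS decides
    "file:///tmp/data.csv",                               # explicit local filesystem
    "hdfs://sandbox.example.com:8020/tmp/data.csv",       # fully qualified HDFS URI
):
    parts = urlparse(uri)
    # Spark routes the read based on this scheme (or its default FS when absent).
    print(parts.scheme or "(default)", parts.path)
```

So /tmp/... quietly pointed at HDFS in Zeppelin, while the local-mode CLI happened to find the local copy.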