Created 11-11-2015 03:46 AM
If I execute this code from Zeppelin:
%pyspark
base_rdd = sc.textFile("/tmp/philadelphia-crime-data-2015-ytd.csv")
base_rdd.take(10)
I do not get any results back. If I execute the same code from the pyspark CLI, I get valid data. Note: I am running pyspark in local mode in the CLI, not in YARN mode.
Zeppelin does not return any errors, and there are no errors in the log files. I am using the HDP 2.3.2 sandbox.
The same code in Scala works in both Zeppelin and the CLI.
Looks like a bug in Zeppelin with pyspark?
Created 11-11-2015 06:08 PM
I think the issues are:
1- you have to use file:// for local files
2- with pyspark, you have to put print in front of an expression to see its result
See the example below (working for me):
%pyspark
base_rdd = sc.textFile("file:///usr/hdp/current/spark-client/data/mllib/sample_libsvm_data.txt")
print base_rdd.count()
print base_rdd.take(3)
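As a minimal sketch of point 2, the read-and-print step can be wrapped in a small function. The name preview is hypothetical (it is not part of Spark or Zeppelin), and the function assumes it is handed the sc SparkContext that a %pyspark paragraph provides. It uses print(...) with a single argument so that it should behave the same under Python 2 and Python 3.

```python
def preview(sc, uri, n=3):
    """Print and return the first n lines of the text file at uri.

    sc is expected to be a SparkContext, such as the one a Zeppelin
    %pyspark paragraph provides. preview is a hypothetical helper.
    """
    rows = sc.textFile(uri).take(n)  # take() returns a plain Python list
    for row in rows:
        print(row)  # explicit print, so the paragraph has output to show
    return rows


# In a Zeppelin paragraph that starts with %pyspark, it could be called as:
# preview(sc, "file:///usr/hdp/current/spark-client/data/mllib/sample_libsvm_data.txt")
```

Returning the rows as well as printing them keeps the helper usable from the pyspark CLI, where the return value is echoed anyway.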
Created 11-11-2015 01:42 PM
@Ali Bajwa @Neeraj have you encountered this issue with pyspark and Zeppelin?
Created 11-11-2015 06:17 PM
Yup, I have done something similar with pyspark in Zeppelin as well, so it should work.
Created 11-11-2015 06:36 PM
I had to give it a fully qualified hdfs:// URI in pyspark for it to work for me.
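Both the file:// and hdfs:// suggestions in this thread come down to giving sc.textFile a path with an explicit scheme, since a path without one is resolved against the default filesystem configured for that environment, which may differ between Zeppelin and the CLI. A rough sketch of building such URIs is below. The helper name qualify and the NameNode address namenode.example.com:8020 are placeholders for illustration, not values taken from this thread.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def qualify(path, scheme="file", authority=""):
    """Return path as a fully qualified URI.

    A path that already carries a scheme is returned unchanged; an absolute
    path gets scheme://authority prepended. qualify is a hypothetical helper.
    """
    if urlparse(path).scheme:
        return path
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got %r" % (path,))
    return "%s://%s%s" % (scheme, authority, path)


# Local file: the empty authority gives the file:/// form.
local_uri = qualify("/tmp/philadelphia-crime-data-2015-ytd.csv")

# HDFS file: namenode.example.com:8020 is a placeholder NameNode address.
hdfs_uri = qualify("/tmp/philadelphia-crime-data-2015-ytd.csv",
                   scheme="hdfs", authority="namenode.example.com:8020")

print(local_uri)
print(hdfs_uri)
```

Either string can then be passed to sc.textFile in a %pyspark paragraph; which one applies depends on where the file actually lives.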