Created 10-17-2018 02:30 AM
Can we read a Unix file using a PySpark script in Zeppelin?
Created 10-17-2018 04:08 AM
Yes, you can read it like below:

%pyspark
content = sc.textFile("file:///path/example.txt")

If the file scheme is not given, it defaults to HDFS.
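For readers without a Spark shell handy, here is a plain-Python sketch of what sc.textFile does with a local file: it splits the file into one element per line. The file path here is created on the fly purely for illustration; it stands in for the thread's /path/example.txt.

```python
import os
import tempfile

# Create a small local file to stand in for the thread's example.txt.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# sc.textFile("file://" + path) yields one RDD element per line;
# the equivalent line-splitting in plain Python is:
with open(path) as f:
    content = f.read().splitlines()

print(content)  # ['first line', 'second line']
```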
Created 10-17-2018 02:32 AM
I would like to read the contents of a Unix file, /path/example.txt.
Created 10-17-2018 01:31 PM
Thank you. How do I read the contents?
Created 10-17-2018 01:42 PM
After you run the above snippet, content is created as an RDD. You can perform operations on that RDD to do whatever you want.
For example:

%pyspark
content = sc.textFile("file:///path/example.txt")
content.collect()                      # prints all lines
content.take(1)                        # prints the first line
lines = content.map(lambda x: len(x))  # character count of each line
lines.take(5)                          # prints character counts of the first 5 lines
Similarly, you can perform any other operations you want.
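The RDD operations above have direct plain-Python counterparts, which can help when checking expected results without a cluster. This sketch uses an in-memory list of lines in place of the RDD (the sample lines are made up for illustration):

```python
# In-memory stand-in for the RDD of lines from sc.textFile.
content = ["first line", "second line", "third"]

print(content)                     # like content.collect(): all lines
print(content[:1])                 # like content.take(1): the first line
lines = [len(x) for x in content]  # like content.map(lambda x: len(x))
print(lines[:5])                   # like lines.take(5): counts of the first 5 lines
```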
Created 10-17-2018 01:49 PM
Thank you so much.
content = sc.textFile("file:///home/userid/test.txt")
Is it the right syntax or does it need to be pointed to HDFS only?
I am getting the following error message.
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/userid/test.txt
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
Created 10-17-2018 01:57 PM
Can you try:
content = sc.textFile("file:/home/userid/test.txt")
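Both the file:/ and file:/// spellings denote the same local path, so the change above is mostly a sanity check; the "Input path does not exist" error usually means the file is not present (or not readable) on the host where the code actually runs. A quick, Spark-independent way to confirm the two URI forms are equivalent, using Python's standard urllib.parse:

```python
from urllib.parse import urlparse

# Both URI spellings parse to the same scheme and filesystem path
# (the path itself is the one from the thread).
for uri in ("file:/home/userid/test.txt", "file:///home/userid/test.txt"):
    print(urlparse(uri).scheme, urlparse(uri).path)
# file /home/userid/test.txt  (printed twice)
```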
Created 10-17-2018 02:00 PM
Thanks a lot. It works through the command-line shell, but not through Zeppelin:

content = sc.textFile("file:///path/example.txt")
Created 10-17-2018 03:45 PM
Looks like a Zeppelin issue; your code is working great. Thanks a lot.