pyspark read file

ananthan_kathir — Wed, 17 Oct 2018 09:30:40 GMT

Can we read a Unix file with a PySpark script in Zeppelin?

Re: pyspark read file

ananthan_kathir — Wed, 17 Oct 2018 09:32:09 GMT

I would like to read the contents of a Unix file, /path/example.txt.

Re: pyspark read file

asirna — Wed, 17 Oct 2018 11:08:38 GMT

@Anpan K,

Yes, you can read it like this:

%pyspark
content = sc.textFile("file:///path/example.txt")

If no URI scheme is given, the path is resolved against the default filesystem, which is typically HDFS on a Hadoop cluster.
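
To make the scheme part concrete, here is a small sketch that splits a few path strings into scheme and path. It uses Python's own urllib parser purely as an illustration (Hadoop does its own parsing), and split_scheme is a hypothetical helper name, not a Spark function.

```python
from urllib.parse import urlparse


def split_scheme(uri):
    """Return (scheme, path) for a URI string; scheme is '' when absent."""
    parts = urlparse(uri)
    return parts.scheme, parts.path


# Illustrative inputs: two spellings of a local file URI and a bare path.
for uri in ["file:///path/example.txt", "file:/path/example.txt", "/path/example.txt"]:
    scheme, path = split_scheme(uri)
    print(repr(scheme), path)
```

The first two spellings give the same scheme and path, while the bare path has no scheme at all, which is the case where the default filesystem is used.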

Re: pyspark read file

ananthan_kathir — Wed, 17 Oct 2018 20:31:27 GMT

Thank you. How do I read the contents?

Re: pyspark read file

asirna — Wed, 17 Oct 2018 20:42:01 GMT

@Anpan K,

After you run the above snippet, content is created as an RDD. You can perform whatever operations you want on that RDD.

For example:

%pyspark
content = sc.textFile("file:///path/example.txt")
print(content.collect())                 # all lines as a list (avoid on very large files)
print(content.take(1))                   # the first line, as a one-element list
lines = content.map(lambda x: len(x))    # number of characters in each line
print(lines.take(5))                     # character counts of the first 5 lines

You can perform other operations in the same way.

Re: pyspark read file

ananthan_kathir — Wed, 17 Oct 2018 20:49:18 GMT

Thank you so much.

content = sc.textFile("file:///home/userid/test.txt")

Is this the right syntax, or does the path have to point to HDFS?

I am getting the following error message.

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/userid/test.txt
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)

Re: pyspark read file

asirna — Wed, 17 Oct 2018 20:57:37 GMT

Can you try this?

content = sc.textFile("file:/home/userid/test.txt")
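
Since the error says the input path does not exist, it may also help to confirm that the file is visible from the machine where the interpreter actually runs, which can differ between the command-line shell and a notebook. Below is a minimal sketch using only the Python standard library. describe_local_path is a hypothetical helper name, not part of Spark, and the path is the one from the question.

```python
import os
import socket


def describe_local_path(path):
    """Report whether `path` is visible from the host running this code."""
    return {
        "host": socket.gethostname(),          # machine executing this snippet
        "path": path,
        "exists": os.path.exists(path),        # True if the path is present here
        "readable": os.access(path, os.R_OK),  # True if this user can read it
    }


# Path taken from the question; adjust as needed.
print(describe_local_path("/home/userid/test.txt"))
```

This only checks the host running the snippet. With a file: path, Spark expects the file to be accessible at the same path on the worker nodes as well.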

Re: pyspark read file

ananthan_kathir — Wed, 17 Oct 2018 21:00:27 GMT

Thanks a lot. It works through the command-line shell, but it is not working through Zeppelin.

content = sc.textFile("file:///path/example.txt")

Re: pyspark read file

ananthan_kathir — Wed, 17 Oct 2018 22:45:25 GMT

It looks like a Zeppelin issue; your code works great. Thanks a lot.