
pyspark read file


Can we read a Unix (local filesystem) file from a PySpark script in Zeppelin?


8 REPLIES


I would like to read the contents of a Unix file, /path/example.txt.

Super Guru

@Anpan K,

Yes, you can read it like this:

%pyspark
content = sc.textFile("file:///path/example.txt")

If no file scheme is given, the path defaults to HDFS.
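
For instance (the paths below are just placeholders):

%pyspark
local_rdd = sc.textFile("file:///path/example.txt")   # explicit file:// scheme: local filesystem
hdfs_rdd = sc.textFile("/path/example.txt")           # no scheme: resolved against the default filesystem, typically HDFS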


Thank you. How do I read the contents?

Super Guru

@Anpan K,

After you run the snippet above, content is an RDD. You can then apply whatever operations you need to that RDD.

For example:

%pyspark
content = sc.textFile("file:///path/example.txt")
content.collect()                       # returns all lines
content.take(1)                         # returns the first line
lines = content.map(lambda x: len(x))   # number of characters in each line
lines.take(5)                           # character counts of the first 5 lines

Similarly, you can apply any other operations you need.
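
For instance, a quick filter-and-count on the same RDD (the search string "ERROR" is just an illustrative choice):

%pyspark
errors = content.filter(lambda line: "ERROR" in line)   # keep only lines containing "ERROR"
errors.count()                                          # number of matching lines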


Thank you so much.

content = sc.textFile("file:///home/userid/test.txt")

Is this the right syntax, or does it need to point to HDFS only?

I am getting the following error message.

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/userid/test.txt
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)

Super Guru

Can you try:

content = sc.textFile("file:/home/userid/test.txt")
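
Note that with a file:// path, each Spark executor reads from its own local filesystem, so the file must exist at the same path on the host running the Zeppelin interpreter and on any worker node that gets a partition. If the file lives on only one machine, one workaround is to read it with plain Python on the driver and distribute the lines. A minimal sketch, assuming the file is small enough to fit in driver memory:

%pyspark
# Read the file on the driver with plain Python, then distribute the lines as an RDD.
with open("/home/userid/test.txt") as f:
    lines = f.read().splitlines()

content = sc.parallelize(lines)
content.take(5)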


Thanks a lot. It works through the command-line shell, but not through Zeppelin:

content = sc.textFile("file:///path/example.txt")


It looks like a Zeppelin issue; your code works great. Thanks a lot.