pyspark read file
Labels: Apache Spark
Created 10-17-2018 02:30 AM
Can we read a Unix file using a PySpark script in Zeppelin?
Created 10-17-2018 04:08 AM
Yes, you can read it like below:

%pyspark
content = sc.textFile("file:///path/example.txt")

If the file scheme is not given, the path defaults to HDFS.
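As a minimal sketch contrasting the two schemes (the paths are placeholders, assuming the Zeppelin %pyspark interpreter provides sc):

%pyspark
# Explicit file:// scheme: read from the local filesystem of the node running the driver.
local_rdd = sc.textFile("file:///path/example.txt")

# No scheme: the path is resolved against the default filesystem, typically HDFS.
hdfs_rdd = sc.textFile("/path/example.txt")

print(local_rdd.count())
print(hdfs_rdd.count())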
Created 10-17-2018 02:32 AM
I would like to read the contents of a Unix file, /path/example.txt.
Created 10-17-2018 01:31 PM
Thank you. How do I read the contents?
Created 10-17-2018 01:42 PM
After you run the above snippet, content is created as an RDD. You can then perform whatever operations you want on that RDD.

For example:

%pyspark
content = sc.textFile("file:///path/example.txt")
content.collect()                       # all lines, as a list
content.take(1)                         # the first line
lines = content.map(lambda x: len(x))   # length of each line in characters
lines.take(5)                           # lengths of the first 5 lines

Similarly, you can perform any other operations you want.
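A couple more common operations on the same RDD, as a sketch (same placeholder path):

%pyspark
content = sc.textFile("file:///path/example.txt")
non_empty = content.filter(lambda line: line.strip())   # keep only non-blank lines
print(non_empty.count())                                # how many non-blank lines
words = content.flatMap(lambda line: line.split())      # split every line into words
print(words.take(10))                                   # first 10 words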
Created 10-17-2018 01:49 PM
Thank you so much.

content = sc.textFile("file:///home/userid/test.txt")

Is this the right syntax, or does it need to point to HDFS only? I am getting the following error message:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/userid/test.txt
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
Created 10-17-2018 01:57 PM
Can you try:
content = sc.textFile("file:/home/userid/test.txt")
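Note that a file: path is read by whichever process executes the task, so if the Zeppelin Spark interpreter (or its executors) runs on a different host than the one holding /home/userid/test.txt, the same InvalidInputException can appear even with correct syntax. One workaround sketch for a small file, assuming the interpreter host can read the path with plain Python:

%pyspark
# Read the file on the driver with ordinary Python, then distribute the lines.
with open("/home/userid/test.txt") as f:
    lines = f.read().splitlines()
content = sc.parallelize(lines)   # now an RDD, independent of executor filesystems
print(content.take(5))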
Created 10-17-2018 02:00 PM
Thanks a lot. It works through the command-line shell, but not through Zeppelin:
content = sc.textFile("file:///path/example.txt")
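That symptom usually means the Zeppelin interpreter (or its executors) runs on a host, or under a user, that cannot see the local path, while your shell session can. A common way around it, assuming you have HDFS access, is to copy the file into HDFS and read it with no scheme (the HDFS target path below is a placeholder):

# From a shell on the cluster:
# hdfs dfs -put /home/userid/test.txt /user/userid/test.txt

%pyspark
content = sc.textFile("/user/userid/test.txt")   # no scheme, so resolved against HDFS
print(content.take(5))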
Created 10-17-2018 03:45 PM
Looks like a Zeppelin issue; your code is working great. Thanks a lot.
