
pyspark read file


Can we read a Unix (local filesystem) file from a PySpark script in Zeppelin?


8 REPLIES


I would like to read the contents of a Unix file, /path/example.txt.

Super Guru

@Anpan K,

Yes, you can read it like this:

%pyspark
content = sc.textFile("file:///path/example.txt")

If no file scheme is given, the path defaults to HDFS.
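
For instance (the paths below are just placeholders):

%pyspark
local_rdd = sc.textFile("file:///path/example.txt")   # explicit file:// scheme: local filesystem
hdfs_rdd = sc.textFile("/path/example.txt")           # no scheme: resolved against the default filesystem, typically HDFS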


Thank you. How do I read the contents?

Super Guru

@Anpan K,

After you run the snippet above, content is an RDD. You can then apply whatever operations you need to that RDD.

For example:

%pyspark
content = sc.textFile("file:///path/example.txt")
content.collect()                       # returns all lines
content.take(1)                         # returns the first line
lines = content.map(lambda x: len(x))   # number of characters in each line
lines.take(5)                           # character counts of the first 5 lines

Similarly, you can apply any other operations you need.
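
For instance, a quick filter-and-count on the same RDD (the search string "ERROR" is just an illustrative choice):

%pyspark
errors = content.filter(lambda line: "ERROR" in line)   # keep only lines containing "ERROR"
errors.count()                                          # number of matching lines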


Thank you so much.

content = sc.textFile("file:///home/userid/test.txt")

Is this the right syntax, or does it need to point to HDFS only?

I am getting the following error message.

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/userid/test.txt
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)

Super Guru

Can you try:

content = sc.textFile("file:/home/userid/test.txt")
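
Note that with a file:// path, each Spark executor reads from its own local filesystem, so the file must exist at the same path on the host running the Zeppelin interpreter and on any worker node that gets a partition. If the file lives on only one machine, one workaround is to read it with plain Python on the driver and distribute the lines. A minimal sketch, assuming the file is small enough to fit in driver memory:

%pyspark
# Read the file on the driver with plain Python, then distribute the lines as an RDD.
with open("/home/userid/test.txt") as f:
    lines = f.read().splitlines()

content = sc.parallelize(lines)
content.take(5)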


Thanks a lot. It works through the command-line shell, but not through Zeppelin:

content = sc.textFile("file:///path/example.txt")


It looks like a Zeppelin issue; your code works great. Thanks a lot.