How to access file in HDFS from Spark-shell or app with Avro libs?

Explorer

Running HDP-2.4.2, Spark 1.6.1, Scala 2.10.5

I am trying to read Avro files on HDFS from the Spark shell or from code. First, I am trying to pull in the schema file.

If I use:

val schema = sc.textFile("/user/test/ciws.avsc")

This loads the file, and I can run schema.take(100).foreach(println) to see its contents.

If I do (using Avro Schema parser):

val schema = new Schema.Parser().parse(new File("/home/test/ciws.avsc"))

or

val schema = scala.io.Source.fromFile("/user/test/ciws.avsc").mkString

I get:

java.io.FileNotFoundException: /user/test/ciws.avsc (No such file or directory)

My core-site.xml specifies fs.defaultFS as our namenode.

I have tried adding "hdfs:/" to the file path, as well as "hdfs://<defaultFS>/...", and still no dice. Any ideas how to reference a file in HDFS with the Schema.Parser class or the scala.io.Source class?

Mike

5 Replies

scala.io.Source.fromFile expects a file on the local filesystem (the same goes for java.io.File). If you want to read from HDFS, use the HDFS API instead, like this:

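// uri: java.net.URI of the filesystem, conf: org.apache.hadoop.conf.Configuration
val fs = org.apache.hadoop.fs.FileSystem.get(uri, conf)
// path: org.apache.hadoop.fs.Path of the file; open() returns an InputStream
val in = fs.open(path)
....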
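To tie this back to the original question, here is a minimal sketch of how the HDFS API could feed Schema.Parser directly. The object name HdfsSchemaReader and the readSchema helper are made up for illustration; the sketch assumes the Hadoop and Avro jars are on the classpath (as they normally are in spark-shell) and that the Configuration you pass in already carries your cluster's core-site.xml settings.

```scala
import java.net.URI
import org.apache.avro.Schema
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper, not part of any library.
object HdfsSchemaReader {
  // Opens `uri` through the Hadoop FileSystem API and parses it as an Avro schema.
  // A bare path such as /user/test/ciws.avsc has no scheme, so FileSystem.get
  // falls back to the default filesystem configured in `conf` (fs.defaultFS).
  def readSchema(uri: String, conf: Configuration): Schema = {
    val fs = FileSystem.get(URI.create(uri), conf)
    val in = fs.open(new Path(uri))
    try {
      // Schema.Parser accepts an InputStream, so no local File is needed.
      new Schema.Parser().parse(in)
    } finally {
      in.close()
    }
  }
}
```

In spark-shell this would probably be called as HdfsSchemaReader.readSchema("/user/test/ciws.avsc", sc.hadoopConfiguration), since sc.hadoopConfiguration is the Configuration Spark itself uses.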

Contributor

Hi Mike, follow these steps:

1- In the CLI where Spark is installed, first export the Hadoop conf directory:

export HADOOP_CONF_DIR=/etc/hadoop/conf

(you may want to put it in your spark conf file: export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf})

2- Launch spark-shell:

val input = sc.textFile("hdfs:///....insert/your/hdfs/file/path...")

input.count()

// prints the number of lines read

...
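Since sc.textFile already resolves HDFS paths, one possible way to get an Avro Schema object out of it is to collect the lines to the driver and hand the resulting string to Schema.Parser. This is only a sketch: the names SchemaFromTextFile and readSchema are invented here, and it assumes the .avsc file is small enough to collect.

```scala
import org.apache.avro.Schema
import org.apache.spark.SparkContext

// Hypothetical helper, not part of Spark or Avro.
object SchemaFromTextFile {
  // Reads the schema file through Spark (so HDFS paths work), joins the
  // lines back into one JSON string on the driver, and parses it.
  def readSchema(sc: SparkContext, path: String): Schema = {
    val json = sc.textFile(path).collect().mkString("\n")
    new Schema.Parser().parse(json)
  }
}
```

In spark-shell, something like SchemaFromTextFile.readSchema(sc, "/user/test/ciws.avsc") should then return the parsed schema.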

Super Guru

Make sure you have the Avro package loaded for the shell. The HDFS path must be correct and have the right permissions. Also, if you are using Kerberos, there are other steps. Zeppelin gives a better experience than the REPL; give it a try.

Explorer

Thanks, I will give these answers a try and report back.

Explorer

Almost forgot about this...

I access my Avro files like so:

First, as Tim said, include the proper Avro lib, in my case the Databricks one:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class MyMain MyMain.jar

Then, in the code:

val df = sqlContext.read.format("com.databricks.spark.avro").option("header", "true").load("/user/user1/writer_test.avro")
df.select("time").show()

...
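For anyone landing here later, this is roughly what a complete MyMain could look like on Spark 1.6 with the spark-avro package above. It is a sketch, not the exact application: the loadAvro helper is made up for illustration, and the "header" option is left out because it looks like a CSV-reader setting that the Avro source most likely does not use.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object MyMain {
  // Hypothetical helper: loads an Avro file or directory as a DataFrame
  // using the Databricks spark-avro data source.
  def loadAvro(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.read.format("com.databricks.spark.avro").load(path)

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("MyMain"))
    val sqlContext = new SQLContext(sc)

    // A bare path is resolved against fs.defaultFS, i.e. HDFS on the cluster.
    val df = loadAvro(sqlContext, "/user/user1/writer_test.avro")
    df.printSchema()
    df.select("time").show()

    sc.stop()
  }
}
```

The same read should also work interactively if spark-shell is started with the same --packages argument.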

Thanks all
