
How to access file in HDFS from Spark-shell or app with Avro libs?

Contributor

Running HDP-2.4.2, Spark 1.6.1, Scala 2.10.5

I am trying to read Avro files on HDFS from the Spark shell or from code. The first step is pulling in the schema file.

If I use:

val schema = sc.textFile("/user/test/ciws.avsc")

This loads the file, and I can run schema.take(100).foreach(println) and see the contents.

If I do (using Avro Schema parser):

val schema = new Schema.Parser().parse(new File("/home/test/ciws.avsc"))

or

val schema = scala.io.Source.fromFile("/user/test/ciws.avsc").mkString

I get:

java.io.FileNotFoundException: /user/test/ciws.avsc (No such file or directory)

My core-site.xml specifies defaultFS as our namenode.

I have tried prefixing the file path with "hdfs:/" and "hdfs://<defaultFS>/..." and still no dice. Any ideas how to reference a file in HDFS with the Schema.Parser class or with scala.io.Source?

Mike

5 REPLIES

Super Guru

scala.io.Source.fromFile expects a file on the local filesystem. If you want to read from HDFS, use the HDFS API to read it, along these lines:

val conf = new org.apache.hadoop.conf.Configuration()
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
val in = fs.open(new org.apache.hadoop.fs.Path("/user/test/ciws.avsc"))
....
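Putting that together with Avro's parser, which can read from an InputStream as well as a File, a fuller sketch (assuming the Hadoop and Avro jars are already on the spark-shell classpath, as they are on HDP, and using the schema path from the question):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.Schema

val conf = new Configuration()                        // picks up core-site.xml, so defaultFS -> the namenode
val fs   = FileSystem.get(conf)                       // HDFS, not the local filesystem
val in   = fs.open(new Path("/user/test/ciws.avsc"))  // FSDataInputStream
val schema =
  try new Schema.Parser().parse(in)                   // Schema.Parser also accepts an InputStream
  finally in.close()
println(schema.getFullName)
```

This avoids new File(...), which only ever resolves against the local filesystem regardless of any hdfs:// prefix.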

Rising Star

Hi Mike, follow these steps:

1- On the host where Spark is installed, first export the Hadoop configuration directory:

export HADOOP_CONF_DIR=/etc/hadoop/conf

(you may want to put it in your spark conf file: export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf})

2- launch spark-shell

val input = sc.textFile("hdfs:///....insert/your/hdfs/file/path...")

input.count()

// prints the number of lines read

...
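For the original question, once HADOOP_CONF_DIR is set there is also a route that skips the filesystem API entirely: sc.textFile already reads the .avsc from HDFS, and Avro's Schema.Parser accepts a String as well as a File. A sketch, inside spark-shell, using the schema path from the question:

```scala
// Collect the schema JSON as one string and hand it to Avro's parser.
// Fine for a small .avsc; the whole file is pulled to the driver.
val schemaJson = sc.textFile("/user/test/ciws.avsc").collect().mkString("\n")
val schema = new org.apache.avro.Schema.Parser().parse(schemaJson)
```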

Master Guru

Make sure you have the Avro package loaded for the shell, and that HDFS has the correct path and permissions. If you are using Kerberos there are additional steps. For a better experience than the REPL, give Zeppelin a try.
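Loading the Avro package into the shell can be done at launch, e.g. with the Databricks spark-avro package (the Scala 2.10 build, to match the versions in the question; a sketch, your version may differ):

```shell
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
```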

Contributor

Thanks, I will give these answers a try... I will report back...

Contributor

Almost forgot about this...

I access my avro files like so:

First, as Tim said, include the proper Avro library, in my case the Databricks one.

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class MyMain MyMain.jar

val df = sqlContext.read.format("com.databricks.spark.avro")
  .option("header", "true")
  .load("/user/user1/writer_test.avro")
df.select("time").show()

...

Thanks all