Created 09-18-2016 01:07 AM
Running HDP-2.4.2, Spark 1.6.1, Scala 2.10.5
I am trying to read avro files on HDFS from spark shell or code. First trying to pull in the schema file.
If I use:
val schema = sc.textFile("/user/test/ciws.avsc")
This loads the file, and I can do schema.take(100).foreach(println) and see the contents.
If I do (using Avro Schema parser):
val schema = new Schema.Parser().parse(new File("/home/test/ciws.avsc"))
or
val schema = scala.io.Source.fromFile("/user/test/ciws.avsc").mkString
I get:
java.io.FileNotFoundException: /user/test/ciws.avsc (No such file or directory)
My core-site.xml specifies defaultFS as our namenode.
I have tried adding "hdfs:/" to the file path and "hdfs://<defaultFS>/..." and still no dice. Any ideas how to reference a file in HDFS with the Schema.Parser class or the scala.io.Source class?
Mike
Created 09-18-2016 01:20 AM
scala.io.Source.fromFile expects a file on the local filesystem. If you want to read from HDFS, use the HDFS API to read it, like this:
val fs = org.apache.hadoop.fs.FileSystem.get(uri, conf)
val in = fs.open(path)
...
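Spelled out from spark-shell (where sc is available), a minimal sketch of that approach might look like the following; it assumes the Avro jars are on the shell classpath and that core-site.xml/hdfs-site.xml are picked up so fs.defaultFS points at your namenode, and it uses the schema path from the question:

import org.apache.avro.Schema
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration Spark already carries, so fs.defaultFS resolves to the namenode
val fs = FileSystem.get(sc.hadoopConfiguration)

// Open the schema file on HDFS and feed the stream straight to the Avro parser
val in = fs.open(new Path("/user/test/ciws.avsc"))
val schema = try { new Schema.Parser().parse(in) } finally { in.close() }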
Created 09-18-2016 07:52 PM
Hi Mike, follow these steps:
1- On the node where Spark is installed, first export the Hadoop conf directory:
export HADOOP_CONF_DIR=/etc/hadoop/conf
(you may want to put it in your spark-env.sh: export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf})
2- launch spark-shell
val input = sc.textFile("hdfs:///....insert/your/hdfs/file/path...")
input.count()
// prints the number of lines read
...
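Once the schema file is readable as text, you can also hand its contents to the Avro parser directly. A small sketch, assuming the Avro classes are on the classpath and using the schema path from the question:

import org.apache.avro.Schema

// Pull the schema JSON out of HDFS via the RDD API, then parse it
val schemaJson = sc.textFile("hdfs:///user/test/ciws.avsc").collect().mkString("\n")
val schema = new Schema.Parser().parse(schemaJson)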
Created 09-18-2016 09:11 PM
Make sure you have the Avro package loaded for the shell, and that HDFS has the correct path and permissions. Also, if you are using Kerberos there are additional steps. For a better experience than the REPL, give Zeppelin a try.
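For example, one way to get an Avro package onto the shell classpath is to pass it at launch (using the same Databricks coordinates Mike ends up with below; adjust the versions to your Spark/Scala build):

spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
# if the cluster is kerberized, kinit as a user with access to the HDFS path first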
Created 09-18-2016 09:37 PM
Thanks, I will give these answers a try... I will report back...
Created 09-20-2016 02:20 AM
Almost forgot about this...
I access my avro files like so:
First, as Tim said, include the proper Avro library, in my case the Databricks spark-avro package.
spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class MyMain MyMain.jar
val df = sqlContext.read.format("com.databricks.spark.avro").option("header", "true").load("/user/user1/writer_test.avro")
df.select("time").show()
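To double-check what was loaded, the usual Spark 1.6 calls work; the temp table name below is just an example:

// Inspect the schema spark-avro inferred from the file
df.printSchema()

// Optionally query it with SQL ("writer_test" is an arbitrary name)
df.registerTempTable("writer_test")
sqlContext.sql("SELECT time FROM writer_test LIMIT 10").show()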
...
Thanks all