Created 09-18-2016 01:07 AM
Running HDP-2.4.2, Spark 1.6.1, Scala 2.10.5
I am trying to read avro files on HDFS from spark shell or code. First trying to pull in the schema file.
If I use:
val schema = sc.textFile("/user/test/ciws.avsc")
This loads the file, and I can do schema.take(100).foreach(println) and see the contents.
If I do (using Avro Schema parser):
val schema = new Schema.Parser().parse(new File("/home/test/ciws.avsc"))
or
val schema = scala.io.Source.fromFile("/user/test/ciws.avsc").mkString
I get:
java.io.FileNotFoundException: /user/test/ciws.avsc (No such file or directory)
My core-site.xml specifies defaultFS as our namenode.
I have tried adding "hdfs:/" to the file path and "hdfs://<defaultFS>/..." and still no dice. Any ideas how to reference a file in HDFS with the Schema.Parser class or the scala.io.Source class?
Mike
Created 09-18-2016 01:20 AM
scala.io.Source.fromFile expects a file on the local filesystem. If you want to read from HDFS, use the HDFS API to read it, like this:
val fs = org.apache.hadoop.fs.FileSystem.get(uri, conf)
val in = fs.open(path)
...
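Spelled out from spark-shell (where sc is available), a minimal sketch of that approach might look like the following; it assumes the Avro jars are on the shell classpath and that core-site.xml/hdfs-site.xml are picked up so fs.defaultFS points at your namenode, and it uses the schema path from the question:

import org.apache.avro.Schema
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration Spark already carries, so fs.defaultFS resolves to the namenode
val fs = FileSystem.get(sc.hadoopConfiguration)

// Open the schema file on HDFS and feed the stream straight to the Avro parser
val in = fs.open(new Path("/user/test/ciws.avsc"))
val schema = try { new Schema.Parser().parse(in) } finally { in.close() }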
Created 09-18-2016 07:52 PM
Hi Mike, follow these steps:
1- On the node where Spark is installed, first export the Hadoop conf directory:
export HADOOP_CONF_DIR=/etc/hadoop/conf
(you may want to put it in your spark-env.sh: export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf})
2- launch spark-shell
val input = sc.textFile("hdfs:///....insert/your/hdfs/file/path...")
input.count()
// prints the number of lines read
...
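Once the schema file is readable as text, you can also hand its contents to the Avro parser directly. A small sketch, assuming the Avro classes are on the classpath and using the schema path from the question:

import org.apache.avro.Schema

// Pull the schema JSON out of HDFS via the RDD API, then parse it
val schemaJson = sc.textFile("hdfs:///user/test/ciws.avsc").collect().mkString("\n")
val schema = new Schema.Parser().parse(schemaJson)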
Created 09-18-2016 09:11 PM
Make sure you have the Avro package loaded for the shell, and that HDFS has the correct path and permissions. Also, if you are using Kerberos there are additional steps. For a better experience than the REPL, give Zeppelin a try.
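For example, one way to get an Avro package onto the shell classpath is to pass it at launch (using the same Databricks coordinates Mike ends up with below; adjust the versions to your Spark/Scala build):

spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
# if the cluster is kerberized, kinit as a user with access to the HDFS path first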
Created 09-18-2016 09:37 PM
Thanks, I will give these answers a try... I will report back...
Created 09-20-2016 02:20 AM
Almost forgot about this...
I access my avro files like so:
First, as Tim said, include the proper Avro library, in my case the Databricks spark-avro package.
spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class MyMain MyMain.jar
val df = sqlContext.read.format("com.databricks.spark.avro").option("header", "true").load("/user/user1/writer_test.avro")
df.select("time").show()
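To double-check what was loaded, the usual Spark 1.6 calls work; the temp table name below is just an example:

// Inspect the schema spark-avro inferred from the file
df.printSchema()

// Optionally query it with SQL ("writer_test" is an arbitrary name)
df.registerTempTable("writer_test")
sqlContext.sql("SELECT time FROM writer_test LIMIT 10").show()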
...
Thanks all