Created 02-05-2016 11:36 AM
Hi,
I want to read metadata from an Avro file stored in HDFS using the Avro API ( https://avro.apache.org/docs/1.4.1/api/java/org/apache/avro/file/DataFileReader.html ).
The Avro DataFileReader accepts only File objects. Is it somehow possible to read data from a file stored on HDFS instead of data stored on the local filesystem?
Thank you
Created 02-05-2016 12:04 PM
@John Smith can you clarify, are you trying to do this programmatically in Java or in a Pig script? You can look up the schema with avro-tools by passing the getschema flag (Link). I once kept the schema in HDFS as XML, but it can be any format, even the JSON that avro-tools emits, and then processed new records against it. Maybe what you suggest is better, getting the schema from the file itself. You can probably try reading it and passing the hdfs:// scheme rather than file:///.
Created 02-05-2016 01:05 PM
Hi,
I'm trying to do this as part of a Java program.
Created 02-05-2016 01:08 PM
Can you call avro-tools-1.7.4.jar within a Pig script? And is it also possible to access files stored on HDFS using avro-tools?
Created 02-05-2016 01:36 PM
@John Smith those are all valid questions :). I haven't tried, as there was never a need; try it out and post an article! As for calling it from Pig, I'm not sure that's possible; again, try it out. You might be able to look at the source code and write a UDF that does what avro-tools does (rough sketch below), I don't know. By the way, avro-tools versions track Avro releases, so I'd suggest downloading the latest avro-tools available, which at this moment is 1.8.0.
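Untested sketch of what such a UDF might look like; the class name is hypothetical and it assumes the pig, avro, and avro-mapred jars are on the classpath. FsInput wraps an HDFS path as a SeekableInput, so Avro's DataFileReader can read the file header directly from HDFS:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: given an HDFS path to an .avro file, returns its schema as a JSON string.
public class AvroSchemaUDF extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        String pathStr = (String) input.get(0);                  // e.g. "hdfs://.../file.avro"
        Configuration conf = new Configuration();                 // picks up the cluster config when run on the cluster
        FsInput fsInput = new FsInput(new Path(pathStr), conf);   // SeekableInput over an HDFS file
        DataFileReader<GenericRecord> reader =
                new DataFileReader<GenericRecord>(fsInput, new GenericDatumReader<GenericRecord>());
        try {
            return reader.getSchema().toString();                 // the schema is stored in the file header
        } finally {
            reader.close();
        }
    }
}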
Created 02-05-2016 01:37 PM
@John Smith then look at how to get the schema via the Avro Java API. You don't need avro-tools in that case.
Created 02-05-2016 02:06 PM
I'm trying to write sample Java code... but
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html
[root@sandbox deploy-4]# find / -name core-default.xml
[root@sandbox deploy-4]# find / -name core-site.xml
There are no such files in the sandbox. How can I get past this step?
Thanks
Created 02-05-2016 03:12 PM
Look in the /etc/hadoop/conf directory @John Smith
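Also, core-default.xml won't be on disk at all, it is bundled inside the hadoop-common jar. If your client still doesn't pick up the site files from the classpath, you can add them as resources explicitly, roughly like this (paths assume the default sandbox layout):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Load the sandbox's client configuration explicitly instead of relying on the classpath.
Configuration conf = new Configuration();
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));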
Created 02-05-2016 02:17 PM
I created sample code and it works fine.
import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.URI;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

String inputF = "hdfs://CustomerData-20160128-1501807.avro";
Path inPath = new Path(inputF);

try {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://sandbox.hortonworks.com:8020");
    FileSystem fs = FileSystem.get(URI.create(inputF), conf);

    // Open the file on HDFS and hand the stream to Avro
    BufferedInputStream inStream = new BufferedInputStream(fs.open(inPath));
    DataFileStream<GenericRecord> reader =
            new DataFileStream<GenericRecord>(inStream, new GenericDatumReader<GenericRecord>());

    // The schema is stored in the file header
    Schema schema = reader.getSchema();
    System.out.println(schema.toString());
} catch (IOException e) {
    e.printStackTrace();
}
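For anyone who also needs the records and not just the schema: DataFileStream is Iterable, so right after the getSchema() call above something like this should work (untested continuation of the snippet):

// Iterate the records using the schema embedded in the file
for (GenericRecord record : reader) {
    System.out.println(record);
}
reader.close();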
Created 02-05-2016 02:18 PM
You should fix that forum website, it's a pain to format text, paste code, etc.