Read an Avro file stored in HDFS

Expert Contributor

Hi,

I want to read metadata from an Avro file stored in HDFS using the Avro API ( https://avro.apache.org/docs/1.4.1/api/java/org/apache/avro/file/DataFileReader.html ).

The Avro DataFileReader accepts only File objects. Is it somehow possible to read data from a file stored on HDFS instead of one stored on the local filesystem?

Thank you

1 ACCEPTED SOLUTION

Expert Contributor

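I created sample code, and it works fine.

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://sandbox.hortonworks.com:8020");
// Path as posted; a full HDFS URI has the form hdfs://<namenode-host>:<port>/<path>.
String inputF = "hdfs://CustomerData-20160128-1501807.avro";
Path inPath = new Path(inputF);
// try-with-resources closes the stream and the reader, and the reader is never built on a null stream.
try (InputStream inStream = new BufferedInputStream(FileSystem.get(URI.create(inputF), conf).open(inPath));
     DataFileStream<GenericRecord> reader =
         new DataFileStream<>(inStream, new GenericDatumReader<GenericRecord>())) {
    Schema schema = reader.getSchema();
    System.out.println(schema.toString());
} catch (IOException e) {
    e.printStackTrace();
}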
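
For anyone wondering why a plain InputStream is enough to get the schema: an Avro container file starts with a small header that holds the metadata, including the schema as JSON under the key avro.schema. The following is only a rough, dependency-free sketch of reading that header, assuming the header layout from the Avro specification (magic bytes, then a map of metadata). The class and method names are made up for illustration; in real code DataFileStream does this work.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Avro library.
public class AvroHeaderSketch {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Reads one Avro long: variable-length, zigzag encoded. */
    static long readLong(InputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) throw new EOFException("stream ended inside a varint");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
        }
        return (raw >>> 1) ^ -(raw & 1); // undo zigzag
    }

    /** Reads a length-prefixed byte sequence (Avro "bytes" and "string" share this layout). */
    static byte[] readBytes(DataInputStream in) throws IOException {
        int len = Math.toIntExact(readLong(in));
        if (len < 0) throw new IOException("negative length");
        byte[] buf = new byte[len];
        in.readFully(buf);
        return buf;
    }

    /** Returns the header metadata map; values are decoded as UTF-8 for display only. */
    public static Map<String, String> readMetadata(InputStream stream) throws IOException {
        DataInputStream in = new DataInputStream(stream);
        byte[] magic = new byte[4];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) throw new IOException("not an Avro container file");
        Map<String, String> meta = new LinkedHashMap<>();
        // The map is written as blocks of key/value pairs, ended by a block count of 0.
        for (long count = readLong(in); count != 0; count = readLong(in)) {
            if (count < 0) {       // a negative count is followed by the block size in bytes
                readLong(in);
                count = -count;
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                String value = new String(readBytes(in), StandardCharsets.UTF_8);
                meta.put(key, value);
            }
        }
        return meta;
    }
}
```

With HDFS, the stream passed to readMetadata would be the one returned by fs.open(inPath), and the schema text would be the value stored under avro.schema.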

11 REPLIES

Master Mentor

@John Smith can you clarify: are you trying to do this programmatically in Java or in a Pig script? You can look up the schema with avro-tools by passing the getschema command. I once kept the schema in HDFS as XML, but it can be any format, even the JSON output of avro-tools, and then processed new records against it. Maybe what you suggest, getting the schema from the file itself, is better. You can probably try reading it by passing the hdfs:// scheme rather than file:///.
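
To illustrate the point about the scheme, here is a small sketch using only java.net.URI that shows how a location string splits into scheme, authority (host and port), and path. The class name and the local /tmp path are made up for the example; the host and file name are the ones used in this thread. Note that when nothing follows the double slash except the file name, the file name ends up in the authority part and the path is empty.

```java
import java.net.URI;

// Hypothetical helper for inspecting location strings.
public class HdfsUriParts {
    /** Returns "scheme|authority|path", with missing parts shown as empty strings. */
    public static String describe(String location) {
        URI u = URI.create(location);
        return nz(u.getScheme()) + "|" + nz(u.getAuthority()) + "|" + nz(u.getPath());
    }

    private static String nz(String v) {
        return v == null ? "" : v;
    }

    public static void main(String[] args) {
        // Local file: no authority, the whole location is the path.
        System.out.println(describe("file:///tmp/CustomerData-20160128-1501807.avro"));
        // HDFS with NameNode host and port, then the file path.
        System.out.println(describe("hdfs://sandbox.hortonworks.com:8020/CustomerData-20160128-1501807.avro"));
        // No host given: the file name is parsed as the authority.
        System.out.println(describe("hdfs://CustomerData-20160128-1501807.avro"));
    }
}
```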

Expert Contributor

Hi,

I'm trying to do this as part of a Java program.

Expert Contributor

Can you call

avro-tools-1.7.4.jar

within the Pig script? Also, is it possible to access files stored on HDFS using avro-tools?

Master Mentor

@John Smith those are all valid questions :). I haven't tried, as there was never a need. Try it out and post an article! As for accessing it from Pig, I'm not sure that's possible. Again, try it out. You might be able to look at the source code and write a UDF that does what avro-tools does, I don't know. By the way, the avro-tools version matches the Avro version, so I'd suggest downloading the latest avro-tools available, which at this moment is 1.8.0.

Master Mentor

@John Smith then look at how to get the schema with the Java API. You don't need avro-tools in that case.

Expert Contributor

I'm trying to write sample Java code, but

https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html

[root@sandbox deploy-4]# find / -name core-default.xml

[root@sandbox deploy-4]# find / -name core-site..xml

There are no such files in the sandbox. How can I get through this step?

Thanks

Master Mentor

Look in the /etc/hadoop/conf directory @John Smith
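
As a rough sketch of what to look for in that directory: Hadoop site files are XML documents made of property elements with a name and a value, and the NameNode address is normally stored under fs.defaultFS. The snippet below pulls one property out of such a file with the JDK's XML parser. The class and method names are made up for the example, and it assumes the file is /etc/hadoop/conf/core-site.xml; Hadoop's own Configuration class can also load the file directly with addResource.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Hadoop.
public class CoreSiteLookup {
    /** Returns the value of the named property in a Hadoop-style XML config, or null if absent. */
    public static String findProperty(InputStream xml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            if (name.equals(text(p, "name"))) {
                return text(p, "value");
            }
        }
        return null;
    }

    /** Text of the first child element with the given tag, trimmed, or null if there is none. */
    private static String text(Element parent, String tag) {
        NodeList nl = parent.getElementsByTagName(tag);
        return nl.getLength() == 0 ? null : nl.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed location, based on the directory mentioned above.
        try (InputStream in = Files.newInputStream(Paths.get("/etc/hadoop/conf/core-site.xml"))) {
            System.out.println(findProperty(in, "fs.defaultFS"));
        }
    }
}
```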

Expert Contributor

You should fix this forum website; it's a pain to format text, paste code, etc.