
Access HDFS extended attributes (xattrs) in HiveQL

Contributor

Is there a way to retrieve, and use in HiveQL, extended attributes (xattrs) defined on the underlying HDFS files?

If raw source data files stored in HDFS are tagged with metadata (e.g. original source, import time) at the file level, data within these files could automatically inherit that metadata.
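For reference, such tags can be attached and inspected with the standard HDFS shell; the paths and attribute names below are only examples:

  hdfs dfs -setfattr -n user.source -v "vendor-x" /data/raw/file1.csv
  hdfs dfs -setfattr -n user.import_time -v "2016-01-15T10:00" /data/raw/file1.csv
  hdfs dfs -getfattr -d /data/raw/file1.csv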

Is this something a custom SerDe could be built for?

Is there an alternate way to achieve this?

Thanks

1 ACCEPTED SOLUTION

Contributor

There may be other ways to implement this, but based on the previous answers, and after discovering the Hive virtual column INPUT__FILE__NAME (which contains the URL of the source HDFS file), I created a user-defined function in Java to read its extended attributes. The function can be used in a Hive query as:

  XAttrSimpleUDF(INPUT__FILE__NAME,'user.my_key')

The (quick and dirty) Java source code of the UDF looks like:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class XAttrSimpleUDF extends UDF {

  public Text evaluate(Text uri, Text attr) {
    if (uri == null || attr == null) return null;

    Text xAttrTxt = null;
    try {
        Configuration myConf = new Configuration();

        // Create a FileSystem handle from the file URI
        // (typically the INPUT__FILE__NAME virtual column)
        URI myURI = URI.create(uri.toString());
        FileSystem fs = FileSystem.get(myURI, myConf);

        // Retrieve the value of the requested extended attribute
        xAttrTxt = new Text(fs.getXAttr(new Path(myURI), attr.toString()));
    } catch (Exception e) {
        e.printStackTrace();
    }
    return xAttrTxt;
  }
}

I haven't tested the performance of this when querying very large data sets. I wish extended attributes could be retrieved directly as a virtual column, in a way similar to the virtual column INPUT__FILE__NAME.
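For anyone trying this, here is a sketch of how the UDF might be registered and invoked from Hive (the jar path, function name, and table name are assumptions, not from my actual setup):

  ADD JAR /tmp/xattr-udf.jar;
  CREATE TEMPORARY FUNCTION xattr AS 'XAttrSimpleUDF';

  -- pull a file-level attribute alongside the regular columns
  SELECT xattr(INPUT__FILE__NAME, 'user.my_key') AS my_key, t.*
  FROM my_table t;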


16 REPLIES

Master Guru

I think your only choice would be to build a custom InputFormat that reads these properties and then adds them as columns to the data set. I don't know of any other ways to propagate metadata with the data set.

Contributor

Thanks for your answer. Good to know there is a way to do it, even if it requires custom (complex?) development.

I am still a novice with this, but to avoid duplicating an existing standard InputFormat (since the content of the data files could be based on an existing InputFormat), is there a way to combine multiple InputFormats, so that most columns come from the file contents and the other columns from the file attributes? In general, I want to be able to extract file attributes for files containing data formatted in various ways, but all having the same file attributes.

Master Guru

In my opinion you would need to extend TextInputFormat.

Then change the LineRecordReader class to read the xattrs when reading the file split, prefixing each line with the attributes you want before it is handed to the SerDe (you don't want to change the SerDe if you can avoid it). You could also filter out files completely here, and then do more advanced filtering in SQL. A rough sketch follows after the link below.

http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/...
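To make the idea concrete, here is a rough, untested sketch. It assumes the old mapred API that Hive uses, and it hard-codes a single xattr name (user.source) purely as an example:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: a TextInputFormat that prefixes every line with the value of
// one xattr read from the split's file (fails if the attribute is missing).
public class XAttrTextInputFormat extends TextInputFormat {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {

    // Read the xattr once per split, before any lines are produced
    Path file = ((FileSplit) split).getPath();
    FileSystem fs = file.getFileSystem(job);
    byte[] raw = fs.getXAttr(file, "user.source");
    final String prefix = new String(raw, StandardCharsets.UTF_8) + "\t";

    final RecordReader<LongWritable, Text> inner =
        super.getRecordReader(split, job, reporter);

    // Delegate to the normal line reader, rewriting each value on the way out
    return new RecordReader<LongWritable, Text>() {
      public boolean next(LongWritable key, Text value) throws IOException {
        Text line = inner.createValue();
        if (!inner.next(key, line)) return false;
        value.set(prefix + line.toString());
        return true;
      }
      public LongWritable createKey() { return inner.createKey(); }
      public Text createValue() { return new Text(); }
      public long getPos() throws IOException { return inner.getPos(); }
      public float getProgress() throws IOException { return inner.getProgress(); }
      public void close() throws IOException { inner.close(); }
    };
  }
}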

Master Guru

Regarding combining: you would have to extend the existing class, modify the LineRecordReader, etc. This would only work for text files (not sequence or ORC files), because those need their own InputFormats. But you could have any kind of content in the text file, since the deserialization is handled by the SerDe (which we would not touch). Doing the same for sequence files should be possible; doing the same for ORC files is most likely very difficult, because in that case the InputFormat and SerDe are much more interconnected.

Contributor

Thanks. That doesn't look easy or very generic. Since we don't necessarily know in advance which extra attributes will be available, prefixing lines with all available xattrs at read time would require something like key=value pairs that could then be decoded into pseudo-columns.

Anyway, thanks again, as this points to a possible way forward.

Contributor

I "unaccepted" this answer to use mine instead, based on using Hive virtual column INPUT__FILE__NAME and a simple User Defined function. Feel free to comment it.

Thanks again anyway for your valuable answers. Without them, I wouldn't have dug further into the technical feasibility of implementing this, and it helped me learn some of the internals of Hive.

Master Mentor

You can execute dfs and shell commands in the Hive CLI; would that be something you are looking for? Not sure if it works in the Ambari Hive View. @Claude Villermain
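For example, in the Hive CLI (the path is just an illustration, and getfattr assumes a Hadoop version with xattr support):

  hive> dfs -ls /data/raw;
  hive> dfs -getfattr -d /data/raw/file1.csv;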

If you do write a SerDe, please post it here as a cool proof of tech.

Contributor

Thanks for your answer. I actually want to use file metadata as columns in my data set, so being able to execute dfs commands in the CLI doesn't seem to answer that.

Master Mentor

@Claude Villermain there's also the option of using a UDF in Groovy, Python, or Java: grab the metadata and return it as part of the result set. But if you need to store it, then yes, Benjamin's approach is probably the way to go.