Access HDFS extended attributes (xattrs) in HiveQL
Labels: Apache Hive
Created 01-28-2016 10:45 AM
Is there a way to retrieve extended attributes (xattrs) defined on the underlying HDFS files and use them in HiveQL?
If the source raw data files stored in HDFS are tagged with metadata at the file level (e.g. original source, import time), data within these files could automatically inherit that metadata.
Is this something a custom SerDe could be built for?
Is there an alternate way to achieve this?
Thanks
Created 02-02-2016 10:30 AM
There may be other ways to implement this, but based on the previous answers, and after discovering the Hive virtual column INPUT__FILE__NAME, which contains the URL of the source HDFS file, I created a user-defined function in Java to read the file's extended attributes. The function can be used in a Hive query as:
XAttrSimpleUDF(INPUT__FILE__NAME,'user.my_key')
The (quick and dirty) Java source code of the UDF looks like:
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class XAttrSimpleUDF extends UDF {
    public Text evaluate(Text uri, Text attr) {
        if (uri == null || attr == null) return null;
        Text xAttrTxt = null;
        try {
            Configuration myConf = new Configuration();
            // Create a FileSystem handle from the file's URI
            URI myURI = URI.create(uri.toString());
            FileSystem fs = FileSystem.get(myURI, myConf);
            // Retrieve the value of the extended attribute
            xAttrTxt = new Text(fs.getXAttr(new Path(myURI), attr.toString()));
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return xAttrTxt;
    }
}
I haven't tested the performance of this when querying very large data sets. I wish extended attributes could be retrieved directly as a virtual column, in a way similar to INPUT__FILE__NAME.
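Before the UDF can be called, the compiled class has to be registered with Hive. A minimal sketch, assuming the class is packaged into a jar (the jar path and package name below are hypothetical):

```sql
-- Jar path and package are assumptions; adjust to your build.
ADD JAR /tmp/xattr-udf.jar;
CREATE TEMPORARY FUNCTION XAttrSimpleUDF AS 'com.example.XAttrSimpleUDF';

-- Pull the 'user.src' extended attribute alongside regular columns.
SELECT col1, col2, XAttrSimpleUDF(INPUT__FILE__NAME, 'user.src') AS src
FROM Testdata;
```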
Created 01-28-2016 03:19 PM
I actually don't need to store it. I want to be able to refer to HDFS file metadata as a kind of virtual column in a Hive query. For instance, suppose an existing HDFS file testdata.csv contains my data and has an extended attribute defined:
hdfs dfs -setfattr -n user.src -v my_src testdata.csv
I then want to query a Hive external table whose location points to this HDFS file (or multiple similar files), retrieving columns both from the file content and from the file's extended attributes (via xattrs or something similar):
select col1, col2, xattrs.user.src from Testdata;
Created 02-02-2016 12:15 PM
Awesome @Claude Villermain
Created 02-02-2016 12:25 PM
And thanks again for your replies, which helped me build this!
Created 02-02-2016 03:03 PM
The same UDF as Groovy code, for more direct use in Hive (no separate build step):

compile `import org.apache.hadoop.hive.ql.exec.UDF \;
import org.apache.hadoop.io.Text \;
import org.apache.hadoop.conf.Configuration \;
import org.apache.hadoop.fs.FileSystem \;
import org.apache.hadoop.fs.Path \;
import java.net.URI \;
public class XAttr extends UDF {
  public Text evaluate(Text uri, Text attr) {
    if (uri == null || attr == null) return null \;
    URI myURI = URI.create(uri.toString()) \;
    Configuration myConf = new Configuration() \;
    FileSystem fs = FileSystem.get(myURI, myConf) \;
    return new Text(fs.getXAttr(new Path(myURI), attr.toString())) \;
  }
} ` AS GROOVY NAMED XAttr.groovy;
To be used similarly as:
XAttr(INPUT__FILE__NAME,'user.src')
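Put together with the earlier external-table example, the end-to-end query might look like the sketch below (the table and attribute names are taken from the question above; adjust to your schema):

```sql
-- Retrieve regular columns plus the 'user.src' xattr of each row's source file.
SELECT col1, col2, XAttr(INPUT__FILE__NAME, 'user.src') AS src
FROM Testdata;
```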
Created 02-02-2016 03:05 PM
@Claude Villermain, you should create an article with this; this is really great.
Created 02-02-2016 03:26 PM
I hadn't done that before, but here is my contribution to the community: Access HDFS file extended attributes in Hive with Groovy UDF