02-02-2016
08:12 PM
2 Kudos
I was looking for a way to access HDFS extended attribute metadata in a Hive query. Extended attributes can be used to tag HDFS files with metadata such as the file source or import time. They can then be used in queries over all records contained in those files, whatever the file format. After asking the community and digging further into it, I came up with the following Groovy UDF, based on the Hive virtual column INPUT__FILE__NAME:

compile `import org.apache.hadoop.hive.ql.exec.UDF \;
import org.apache.hadoop.io.Text \;
import org.apache.hadoop.conf.Configuration \;
import org.apache.hadoop.fs.FileSystem \;
import org.apache.hadoop.fs.Path \;
import java.net.URI \;
public class XAttr extends UDF {
  // uri: full file URI (e.g. INPUT__FILE__NAME), attr: extended attribute name
  public Text evaluate(Text uri, Text attr){
    if (uri == null || attr == null) return null \;
    URI myURI = URI.create(uri.toString()) \;
    Configuration myConf = new Configuration() \;
    FileSystem fs = FileSystem.get(myURI, myConf) \;
    // getXAttr returns the raw attribute bytes, wrapped here as Text
    return new Text(fs.getXAttr(new Path(myURI), attr.toString())) \;
  }
} ` AS GROOVY NAMED XAttr.groovy;

Then create a User Defined Function (e.g. a temporary one) based on it:

CREATE TEMPORARY FUNCTION xattr as 'XAttr';

Use it in a Hive query as:

xattr(INPUT__FILE__NAME,'user.src')

where 'user.src' is the extended attribute to retrieve. I wish extended attributes were exposed as an actual virtual column, but this is a way to access them. I didn't test the performance impact and I didn't add data checks, so validate and enrich this first if needed.
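For illustration, here is a rough sketch of how the function might be used in complete queries. The table name web_logs, the file path, and the attribute value feedA are hypothetical and only there to make the example concrete; it assumes the xattr function has already been created in the current session and that the files under the table location carry a user.src attribute.

```sql
-- Assumption: a file under the table location was tagged beforehand, e.g. with
--   hdfs dfs -setfattr -n user.src -v feedA /data/web_logs/part-00000
-- (the path, the value feedA, and the table web_logs are hypothetical).

-- Return each record together with the source tag of the file it came from.
SELECT xattr(INPUT__FILE__NAME, 'user.src') AS src, t.*
FROM web_logs t
LIMIT 10;

-- Count records per source tag.
SELECT xattr(INPUT__FILE__NAME, 'user.src') AS src, COUNT(*) AS records
FROM web_logs
GROUP BY xattr(INPUT__FILE__NAME, 'user.src');
```

Since the UDF is evaluated once per record and opens a FileSystem handle each time, the second query is probably a reasonable first place to look at the performance impact.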
02-02-2016
03:26 PM
I was not used to it, but here is my contribution to the community: Access HDFS file extended attributes in Hive with Groovy UDF