Member since 01-28-2016
12 Posts
5 Kudos Received
1 Solution
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5327 | 02-02-2016 10:30 AM |
05-25-2016
02:52 PM
Thanks. This helped me purge metrics too, as the Ambari Metrics services were stuck because of a disk full error. I also updated the metrics retention parameters.
02-02-2016
08:12 PM
2 Kudos
I was looking for a way to access HDFS extended attributes metadata in a Hive query. Extended attributes can be used to tag HDFS files with metadata such as file source, import time... They can then be used in queries over all records contained in such files, whatever their format is. After asking the community and digging further into it, I came up with the following Groovy UDF, based on the Hive virtual column INPUT__FILE__NAME:

```sql
compile `import org.apache.hadoop.hive.ql.exec.UDF \;
import org.apache.hadoop.io.Text \;
import org.apache.hadoop.conf.Configuration \;
import org.apache.hadoop.fs.FileSystem \;
import org.apache.hadoop.fs.Path \;
import java.net.URI \;
public class XAttr extends UDF {
  public Text evaluate(Text uri, Text attr){
    if (uri == null || attr == null) return null \;
    URI myURI = URI.create(uri.toString()) \;
    Configuration myConf = new Configuration() \;
    FileSystem fs = FileSystem.get(myURI, myConf) \;
    return new Text(fs.getXAttr(new Path(myURI), attr.toString())) \;
  }
}` AS GROOVY NAMED XAttr.groovy;
```

Create a user-defined function (e.g. a temporary one) based on it:

```sql
CREATE TEMPORARY FUNCTION xattr AS 'XAttr';
```

Use it in a Hive query as `xattr(INPUT__FILE__NAME, 'user.src')`, where 'user.src' is the extended attribute to retrieve. I wish extended attributes were exposed as an actual virtual column, but this is a way to access them. I didn't test the performance impact and I didn't add data checks, so validate and enrich this first if needed.
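As a fuller illustration, here is a minimal sketch, assuming the files backing a hypothetical table named events were tagged beforehand with hdfs dfs -setfattr (the table, column, and attribute value are assumptions, not part of the original post):

```sql
-- Assumed setup, done outside Hive:
--   hdfs dfs -setfattr -n user.src -v my_src /data/events/part-00000
CREATE TEMPORARY FUNCTION xattr AS 'XAttr';

-- Surface the file-level tag as a per-row value ('events' is hypothetical)
SELECT col1, col2, xattr(INPUT__FILE__NAME, 'user.src') AS src
FROM events;
```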
02-02-2016
03:26 PM
I was not used to doing this, but here is my contribution to the community: Access HDFS file extended attributes in Hive with Groovy UDF
02-02-2016
03:03 PM
1 Kudo
A similar UDF as inline Groovy code, for more direct use in Hive:

```sql
compile `import org.apache.hadoop.hive.ql.exec.UDF \;
import org.apache.hadoop.io.Text \;
import org.apache.hadoop.conf.Configuration \;
import org.apache.hadoop.fs.FileSystem \;
import org.apache.hadoop.fs.Path \;
import java.net.URI \;
public class XAttr extends UDF {
  public Text evaluate(Text uri, Text attr){
    if (uri == null || attr == null) return null \;
    URI myURI = URI.create(uri.toString()) \;
    Configuration myConf = new Configuration() \;
    FileSystem fs = FileSystem.get(myURI, myConf) \;
    return new Text(fs.getXAttr(new Path(myURI), attr.toString())) \;
  }
}` AS GROOVY NAMED XAttr.groovy;
```

To be used similarly, as `XAttr(INPUT__FILE__NAME, 'user.src')`.
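For instance, a minimal sketch of filtering rows on the attribute value (the table name and attribute value are assumptions):

```sql
-- Keep only rows coming from files tagged with user.src = 'my_src'
SELECT col1, col2
FROM my_table
WHERE XAttr(INPUT__FILE__NAME, 'user.src') = 'my_src';
```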
02-02-2016
12:25 PM
And thanks again for your replies, which helped me build this!
02-02-2016
12:22 PM
I "unaccepted" this answer to use mine instead, based on using Hive virtual column INPUT__FILE__NAME and a simple User Defined function. Feel free to comment it. Thanks anyway again for your valuable answers. Without them, I wouldn't have digged further into the technical feasibility of implementing this. This helped me learned some of the internals of Hive.
02-02-2016
10:30 AM
2 Kudos
There may be other ways to implement it, but based on the previous answers, and after discovering the Hive virtual column INPUT__FILE__NAME (which contains the URL of the source HDFS file), I created a user-defined function in Java to read the file's extended attributes. The function can be used in a Hive query as `XAttrSimpleUDF(INPUT__FILE__NAME, 'user.my_key')`. The (quick and dirty) Java source code of the UDF looks like:

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class XAttrSimpleUDF extends UDF {
    public Text evaluate(Text uri, Text attr) {
        if (uri == null || attr == null) return null;
        Text xAttrTxt = null;
        try {
            Configuration myConf = new Configuration();
            // Creating filesystem using uri
            URI myURI = URI.create(uri.toString());
            FileSystem fs = FileSystem.get(myURI, myConf);
            // Retrieve value of extended attribute
            xAttrTxt = new Text(fs.getXAttr(new Path(myURI), attr.toString()));
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return xAttrTxt;
    }
}
```

I didn't test the performance of this when querying very large data sets. I wish extended attributes could be retrieved directly as a virtual column, in a way similar to using the virtual column INPUT__FILE__NAME.
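For reference, a minimal sketch of how a jar-based UDF like this is typically registered and called; the jar path and table name below are hypothetical:

```sql
-- Hypothetical path to the jar containing the compiled XAttrSimpleUDF class
ADD JAR /tmp/xattr-simple-udf.jar;
CREATE TEMPORARY FUNCTION XAttrSimpleUDF AS 'XAttrSimpleUDF';

-- 'my_table' is a hypothetical table whose files carry the extended attribute
SELECT col1, col2, XAttrSimpleUDF(INPUT__FILE__NAME, 'user.my_key') AS my_key
FROM my_table;
```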
01-28-2016
03:58 PM
Thanks. That doesn't look that easy or generic. Since we don't necessarily know in advance which extended attributes will be available, prefixing lines with all available xattrs at read time would require something like key=value pairs that could be decoded into pseudo-columns. Anyway, thanks again, as this gives a possible way.
01-28-2016
03:19 PM
I actually don't need to store it. I want to be able to refer to HDFS file metadata as some kind of virtual column in a Hive query. For instance, suppose an existing HDFS file testdata.csv contains my data and has an extended attribute defined: `hdfs dfs -setfattr -n user.src -v my_src testdata.csv`. I then want to query a Hive external table that uses this HDFS file (or multiple similar files) as its location, retrieving columns both from the file content and from the file's extended attributes (using xattrs or something similar): `select col1, col2, xattrs.user.src from Testdata;`
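To make the intent concrete, a minimal sketch with assumed column names and location; note that the xattrs.user.src syntax is the wished-for feature, not something Hive currently supports:

```sql
-- Hypothetical external table whose location holds the tagged file(s)
CREATE EXTERNAL TABLE Testdata (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/testdata';

-- Desired query (not supported today):
-- SELECT col1, col2, xattrs.user.src FROM Testdata;
```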
01-28-2016
01:15 PM
Thanks for your answer. Good to see there is a way to do it, even if it requires custom (complex?) development. I am still a novice with this, but to avoid duplicating existing standard InputFormats (since the data files could be based on an existing InputFormat), is there a way to combine multiple InputFormats, so that most columns come from the file contents and the other columns from the file attributes? In general, I basically want to be able to extract file attributes for files containing data formatted in various ways, but all having the same file attributes.