Created 01-17-2017 03:48 AM
When reading an HBase table into a DataFrame, is there a way to specify which cell version to get? Or will this always be the most recent?
Created 01-17-2017 08:36 PM
Yes, you can specify which cell version to get. SHC users can select a single timestamp, or a time range with a minimum and a maximum timestamp (i.e., retrieve multiple versions simultaneously). Please refer to the test case here for how to do it.
Created 01-17-2017 05:38 AM
In your configuration, set the following (the value is the maximum number of versions to return per column), then use getColumnCells to get the version you want. Familiarize yourself with the HBase client API, which is probably what you are using.
conf.set("hbase.mapreduce.scan.maxversions", "VERSION_YOU_WANT")
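To make the semantics a bit more concrete, here is a small plain-Java model of how one column's versions get narrowed down by a max-versions limit and a time range. This is only a sketch with no HBase dependency: CellVersionModel and select are hypothetical names, not part of the HBase client or SHC. It assumes the behavior as I understand it, namely that the time range is half-open (minimum inclusive, maximum exclusive) and that the newest matching versions are returned first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical plain-Java model of one HBase column's versions; not the HBase client API.
final class CellVersionModel {

    // Returns the values whose timestamp lies in [minStamp, maxStamp),
    // newest first, capped at maxVersions entries.
    static List<String> select(NavigableMap<Long, String> versions,
                               long minStamp, long maxStamp, int maxVersions) {
        List<String> picked = new ArrayList<>();
        if (minStamp >= maxStamp || maxVersions <= 0) {
            return picked; // empty range or no versions requested
        }
        // subMap gives the half-open range; descendingMap iterates newest first
        for (Map.Entry<Long, String> e :
                versions.subMap(minStamp, true, maxStamp, false).descendingMap().entrySet()) {
            if (picked.size() == maxVersions) {
                break;
            }
            picked.add(e.getValue());
        }
        return picked;
    }

    public static void main(String[] args) {
        // timestamp -> value for a single column of a single row
        NavigableMap<Long, String> versions = new TreeMap<>();
        versions.put(100L, "v1");
        versions.put(200L, "v2");
        versions.put(300L, "v3");

        System.out.println(select(versions, 0L, Long.MAX_VALUE, 1)); // [v3]
        System.out.println(select(versions, 0L, Long.MAX_VALUE, 3)); // [v3, v2, v1]
        System.out.println(select(versions, 100L, 300L, 3));         // [v2, v1]
    }
}
```

With a limit of 1 only the most recent value comes back, which matches what you see when no version setting is given. Raising the limit is what makes older versions visible to getColumnCells, and the column family also has to be configured to keep that many versions in the first place.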
Created 01-17-2017 05:47 AM
Does this approach work with ? I am hoping to retrieve multiple versions simultaneously.
Created 01-17-2017 09:10 PM
Yes. I think it should. I have not done it specifically but I have used class so it should work as it is the same class. Here is how I have done it.
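import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import scala.Tuple2;

// javaSparkContext, hbaseTableName, namearr, typesarr and cfarr are defined elsewhere

// create hbase configuration
Configuration configuration = HBaseConfiguration.create();
configuration.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
configuration.set(TableInputFormat.INPUT_TABLE, hbaseTableName);

// create java hbase context
JavaHBaseContext javaHBaseContext = new JavaHBaseContext(javaSparkContext, configuration);

// read the table as an RDD of (row key, Result) pairs
JavaPairRDD<ImmutableBytesWritable, Result> hbaseRDD =
    javaSparkContext.newAPIHadoopRDD(configuration, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);

// convert each Result into a Spark SQL Row
JavaRDD<Row> rowJavaRDD = hbaseRDD.map(
    new Function<Tuple2<ImmutableBytesWritable, Result>, Row>() {
        private static final long serialVersionUID = -2021713021648730786L;

        public Row call(Tuple2<ImmutableBytesWritable, Result> tuple) throws Exception {
            Object[] rowObject = new Object[namearr.length];
            for (int i = 0; i < namearr.length; i++) {
                Result result = tuple._2;
                // handle each data type we support
                if (typesarr[i].equals("string")) {
                    String str = Bytes.toString(result.getValue(Bytes.toBytes(cfarr[i]), Bytes.toBytes(namearr[i])));
                    rowObject[i] = str;
                }
            }
            // the posted snippet is cut off after the loop; the closing lines are an assumed completion
            return RowFactory.create(rowObject);
        }
    });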
Created 01-17-2017 08:23 PM
If this does not work for you, please open a feature request by creating an issue on the GitHub project for SHC. /cc @wyang