
HDP Spark Hbase Connector Cell Versions?

Solved

New Contributor

When reading an HBase table into a DataFrame, is there a way to specify which cell version to get? Or will this always be the most recent?

1 ACCEPTED SOLUTION

Re: HDP Spark Hbase Connector Cell Versions?

New Contributor

Yes, you can specify which cell version to get. SHC users can select a single timestamp, and they can also select a time range with a minimum and maximum timestamp (i.e., retrieve multiple versions simultaneously). Please refer to the test cases in the SHC repository for how to do it.
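For example, a read pinned to one timestamp (or widened to a range) might look like the following. This is only a minimal sketch: the table name, catalog, and timestamp values are made up, and it assumes the option keys ("catalog", "timestamp", "minStamp", "maxStamp", "maxVersions") defined by SHC's HBaseTableCatalog and HBaseRelation.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ShcVersionedRead {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("shc-versions").getOrCreate();

            // hypothetical catalog: one string column col1 in family cf
            String catalog = "{\"table\":{\"namespace\":\"default\", \"name\":\"my_table\"},"
                    + "\"rowkey\":\"key\","
                    + "\"columns\":{"
                    + "\"key\":{\"cf\":\"rowkey\", \"col\":\"key\", \"type\":\"string\"},"
                    + "\"col1\":{\"cf\":\"cf\", \"col\":\"col1\", \"type\":\"string\"}}}";

            Map<String, String> options = new HashMap<>();
            options.put("catalog", catalog);
            // either pin a single cell version by its timestamp...
            options.put("timestamp", "1500000000000");
            // ...or drop the line above and select a time range instead:
            // options.put("minStamp", "1400000000000");
            // options.put("maxStamp", "1500000000000");
            // options.put("maxVersions", "3");

            Dataset<Row> df = spark.read()
                    .options(options)
                    .format("org.apache.spark.sql.execution.datasources.hbase")
                    .load();
            df.show();
        }
    }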

5 REPLIES

Re: HDP Spark Hbase Connector Cell Versions?

Super Guru

@Todd Niven

In your configuration, set the following and then use getColumnCells to get the version you want. Familiarize yourself with the Result class from the HBase client API, which is probably what you are using.

conf.set("hbase.mapreduce.scan.maxversions", "VERSION_YOU_WANT")

Re: HDP Spark Hbase Connector Cell Versions?

New Contributor

Does this approach work with https://github.com/hortonworks-spark/shc ? I am hoping to retrieve multiple versions simultaneously.

Re: HDP Spark Hbase Connector Cell Versions?

Super Guru

Yes, I think it should. I have not done it with SHC specifically, but I have used the Result class, and SHC works with the same class, so it should work. Here is how I have done it:

    // create HBase configuration
    Configuration configuration = HBaseConfiguration.create();
    configuration.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
    configuration.set(TableInputFormat.INPUT_TABLE, hbaseTableName);

    // create Java HBase context
    JavaHBaseContext javaHBaseContext = new JavaHBaseContext(javaSparkContext, configuration);

    // read the table as (row key, Result) pairs
    JavaPairRDD<ImmutableBytesWritable, Result> hbaseRDD =
            javaSparkContext.newAPIHadoopRDD(configuration, TableInputFormat.class,
                    ImmutableBytesWritable.class, Result.class);

    // map each Result to a Spark SQL Row; cfarr, namearr, and typesarr hold the
    // column families, qualifiers, and types of the columns to extract
    JavaRDD<Row> rowJavaRDD = hbaseRDD.map(new Function<Tuple2<ImmutableBytesWritable, Result>, Row>() {
        private static final long serialVersionUID = -2021713021648730786L;

        public Row call(Tuple2<ImmutableBytesWritable, Result> tuple) throws Exception {
            Result result = tuple._2;
            Object[] rowObject = new Object[namearr.length];
            for (int i = 0; i < namearr.length; i++) {
                // handle each data type we support
                if (typesarr[i].equals("string")) {
                    rowObject[i] = Bytes.toString(
                            result.getValue(Bytes.toBytes(cfarr[i]), Bytes.toBytes(namearr[i])));
                }
            }
            // build the Row from the extracted values
            return RowFactory.create(rowObject);
        }
    });
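Note that this pipeline as written only returns the newest cell, because getValue reads a single value. If you need older versions through it, the adaptation should be (untested sketch): set the max-versions property on configuration before building hbaseRDD, then iterate the cells inside call():

    // before newAPIHadoopRDD(...): ask the scan to keep several versions per cell
    configuration.set(TableInputFormat.SCAN_MAXVERSIONS, "3");

    // inside call(...): replace getValue with getColumnCells to see every version
    List<Cell> versions = result.getColumnCells(Bytes.toBytes(cfarr[i]), Bytes.toBytes(namearr[i]));
    for (Cell cell : versions) {
        // cell.getTimestamp() identifies the version; CellUtil.cloneValue(cell) is its value
    }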

Re: HDP Spark Hbase Connector Cell Versions?

Expert Contributor

If this does not work for you, please open a feature request by creating an issue on the GitHub project for SHC. /cc @wyang

