
Spark HBase Connector - how to get/scan all versions of columns and read/load to Spark DF?

New Contributor

Dear All,

At Nokia Technologies we are evaluating the SHC connector to seamlessly read/write Spark DataFrames in HBase. So far, the writes and modifications work perfectly, but the version-based reads are failing - they always return only the latest version.

I have an HBase table whose column family cf1 keeps multiple versions:

hbase(main):008:0* describe 'gtest'

Table gtest is ENABLED

gtest COLUMN FAMILIES DESCRIPTION {NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

1 row(s) in 0.0350 seconds

hbase(main):038:0> scan 'gtest',{VERSIONS=>5}

ROW COLUMN+CELL

0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399503, value=138

0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399425, value=1

0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399345, value=59

0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148376205, value=138

0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148376095, value=1

1 row(s) in 0.0290 seconds
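
Side note on how a single cell ends up with several versions: every write to the same row key and column under cf1 creates a new version, and the family keeps up to VERSIONS => '5' of them. A purely illustrative sketch with the plain HBase 1.x client API (in our case the actual writes went through SHC):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("gtest"))

val rowKey = Bytes.toBytes("0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577")
Seq("1", "59", "138").foreach { value =>
  val put = new Put(rowKey)
  // Each put on the same row/column gets a fresh timestamp, i.e. becomes a new version.
  put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("event_result"), Bytes.toBytes(value))
  table.put(put)
}

table.close()
connection.close()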

I am trying to read all the versions of this column family using the SHC connector.

Scanning the table from the HBase shell fetches all the versions, as displayed above. However, with the SHC connector the DataFrame contains only the latest version.

def catalog = s"""{
  "table":{"namespace":"default", "name":"gtest"},
  "rowkey":"account_id",
  "columns":{
    "account_id":{"cf":"rowkey", "col":"account_id", "type":"string"},
    "event_result":{"cf":"cf1", "col":"event_result", "type":"string"}
  }
}""".stripMargin

val df = sqlContext.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.MIN_STAMP -> "0",
    HBaseRelation.MAX_STAMP -> "1487148399504",
    HBaseRelation.MAX_VERSIONS -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

df.show()

+--------------------+------------+

| account_id|event_result|

+--------------------+------------+

|0000adb15e1d04181...| 138|

+--------------------+------------+
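
For reference, the full version history can be pulled without SHC by going through the HBase client API directly. This is only a rough sketch, assuming a spark-shell session (sc and sqlContext available), HBase 1.x client classes on the classpath and an hbase-site.xml that points at the cluster; it is a workaround, not the SHC way of doing it:

import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration}
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.{Base64, Bytes}
import scala.collection.JavaConverters._
import sqlContext.implicits._

// Ask HBase for every stored version of cf1:event_result within the time range.
val scan = new Scan()
scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("event_result"))
scan.setMaxVersions(5)
scan.setTimeRange(0L, 1487148399504L)

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "gtest")
conf.set(TableInputFormat.SCAN, Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))

val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// One output row per cell (i.e. per version) instead of one row per row key.
val versionsDF = rdd.flatMap { case (_, result) =>
  result.listCells().asScala.map { cell =>
    (Bytes.toString(result.getRow), cell.getTimestamp, Bytes.toString(CellUtil.cloneValue(cell)))
  }
}.toDF("account_id", "version_ts", "event_result")

versionsDF.show(false)

With the data above this should give five rows for the single row key, one per timestamp.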

4 REPLIES

Re: Spark HBase Connector - how to get/scan all versions of columns and read/load to Spark DF?

New Contributor

Could you please try this: remove HBaseRelation.MAX_VERSIONS -> "5" and use only HBaseRelation.MIN_STAMP -> "0", HBaseRelation.MAX_STAMP -> "1487148399504"?
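
For clarity, with the same catalog value defined in the question, the suggested read would look like this:

import org.apache.spark.sql.execution.datasources.hbase.{HBaseRelation, HBaseTableCatalog}

// Only the time-range options, no MAX_VERSIONS.
val df = sqlContext.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.MIN_STAMP -> "0",
    HBaseRelation.MAX_STAMP -> "1487148399504"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

df.show()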

Re: Spark HBase Connector - how to get/scan all versions of columns and read/load to Spark DF?

New Contributor

I did, but again only the latest version is returned.

Re: Spark HBase Connector - how to get/scan all versions of columns and read/load to Spark DF?

New Contributor

That's weird. It's supposed to work (refer to here). Which SHC version are you using?

Re: Spark HBase Connector - how to get/scan all versions of columns and read/load to Spark DF?

New Contributor

This issue has been resolved (https://github.com/hortonworks-spark/shc/pull/193). Until SHC publishes a new release, you can rebuild the current SHC code, which already includes the fix.