
Spark HBase Connector - how to get/scan all versions of columns and read/load them into a Spark DataFrame?


Dear All,

At Nokia Technologies we are evaluating the SHC connector to seamlessly read/write Spark DataFrames to and from HBase. So far, writes and modifications work perfectly, but version-based reads are failing - they always return only the latest version.

I have an HBase table whose column family cf1 keeps multiple versions:

hbase(main):008:0* describe 'gtest'

Table gtest is ENABLED

gtest COLUMN FAMILIES DESCRIPTION {NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

1 row(s) in 0.0350 seconds
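
For reference, a family like this - keeping up to five versions - can also be created programmatically. A minimal sketch with the HBase 1.x admin API, matching the describe output above:

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val admin = connection.getAdmin
try {
  val desc = new HTableDescriptor(TableName.valueOf("gtest"))
  val family = new HColumnDescriptor("cf1")
  family.setMaxVersions(5) // corresponds to VERSIONS => '5' above
  desc.addFamily(family)
  admin.createTable(desc)
} finally {
  admin.close()
  connection.close()
}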

hbase(main):038:0> scan 'gtest',{VERSIONS=>5}

ROW COLUMN+CELL

0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399503, value=138

0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399425, value=1

0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399345, value=59

0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148376205, value=138

0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148376095, value=1

1 row(s) in 0.0290 seconds
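
The same multi-version read also works from the plain HBase client, which confirms the versions are really stored. A minimal sketch, assuming the HBase 1.x API and the row key from the scan above:

import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("gtest"))

val rowKey = "0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577"
val get = new Get(Bytes.toBytes(rowKey)).setMaxVersions(5) // up to 5 versions, like the shell scan
val cells = table.get(get).getColumnCells(Bytes.toBytes("cf1"), Bytes.toBytes("event_result"))
cells.asScala.foreach { cell =>
  println(s"ts=${cell.getTimestamp} value=${Bytes.toString(CellUtil.cloneValue(cell))}")
}
table.close()
connection.close()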

I am trying to read all the versions of this column family using the SHC connector.

Scanning the table from HBase shell fetches all the versions as displayed above. However, using the SHC connector, the dataframe contains only the latest version.

def catalog = s"""{
  "table":{"namespace":"default", "name":"gtest"},
  "rowkey":"account_id",
  "columns":{
    "account_id":{"cf":"rowkey", "col":"account_id", "type":"string"},
    "event_result":{"cf":"cf1", "col":"event_result", "type":"string"}
  }
}""".stripMargin

val df = sqlContext.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.MIN_STAMP -> "0",
    HBaseRelation.MAX_STAMP -> "1487148399504",
    HBaseRelation.MAX_VERSIONS -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

df.show()

+--------------------+------------+
|          account_id|event_result|
+--------------------+------------+
|0000adb15e1d04181...|         138|
+--------------------+------------+
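
Until the version read works through SHC, one workaround for getting every cell version into a DataFrame is to bypass SHC and read through the stock HBase TableInputFormat. A rough sketch, assuming a spark-shell with sc/sqlContext as above (not SHC's API):

import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration}
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import sqlContext.implicits._

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "gtest")
hbaseConf.set(TableInputFormat.SCAN_MAXVERSIONS, "5") // request all 5 versions per cell

val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// one output record per cell version, not per row
val versions = rdd.flatMap { case (_, result) =>
  result.rawCells().map { cell =>
    (Bytes.toString(CellUtil.cloneRow(cell)),
     cell.getTimestamp,
     Bytes.toString(CellUtil.cloneValue(cell)))
  }
}
versions.toDF("account_id", "timestamp", "event_result").show()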

4 REPLIES

Explorer

Could you please try this: remove HBaseRelation.MAX_VERSIONS -> "5" and use only HBaseRelation.MIN_STAMP -> "0" and HBaseRelation.MAX_STAMP -> "1487148399504"?
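
Spelled out, the suggested read would look something like this (same catalog as above):

import org.apache.spark.sql.execution.datasources.hbase.{HBaseRelation, HBaseTableCatalog}

// same read as before, with MAX_VERSIONS dropped and only the timestamp range kept
val df = sqlContext.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.MIN_STAMP -> "0",
    HBaseRelation.MAX_STAMP -> "1487148399504"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
df.show()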


I did, but again only the latest row is returned.

Explorer

It's weird. It's supposed to work (refer to here). Which SHC version are you using?

Explorer

This issue has been resolved (https://github.com/hortonworks-spark/shc/pull/193). Until SHC publishes a new release, you can rebuild the current SHC code, which includes the fix.
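
After rebuilding SHC with that fix, the original read including MAX_VERSIONS should be the way to go. A sketch of the expected usage (my assumption based on the PR; check the rebuilt version's examples to confirm):

import org.apache.spark.sql.execution.datasources.hbase.{HBaseRelation, HBaseTableCatalog}

// with the fix from PR #193 in place, MAX_VERSIONS should yield one
// DataFrame row per cell version within the timestamp range
val df = sqlContext.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.MIN_STAMP -> "0",
    HBaseRelation.MAX_STAMP -> "1487148399504",
    HBaseRelation.MAX_VERSIONS -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
df.show()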