Created 02-16-2017 04:30 AM
Dear All,
At Nokia Technologies we are evaluating the SHC connector to seamlessly read/write Spark DataFrames in HBase. So far, the writes and updates work perfectly, but version-based reads are failing: only the latest version is ever returned.
I have an HBase table whose column family cf1 keeps multiple versions:
hbase(main):008:0* describe 'gtest'
Table gtest is ENABLED
gtest COLUMN FAMILIES DESCRIPTION {NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0350 seconds
hbase(main):038:0> scan 'gtest',{VERSIONS=>5}
ROW COLUMN+CELL
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399503, value=138
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399425, value=1
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399345, value=59
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148376205, value=138
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148376095, value=1
1 row(s) in 0.0290 seconds
I am trying to read all the versions of this column family using the SHC connector.
Scanning the table from the HBase shell fetches all the versions, as displayed above; with SHC, however, the resulting DataFrame contains only the latest version.
def catalog = s"""{
    |"table":{"namespace":"default", "name":"gtest"},
    |"rowkey":"account_id",
    |"columns":{
      |"account_id":{"cf":"rowkey", "col":"account_id", "type":"string"},
      |"event_result":{"cf":"cf1", "col":"event_result", "type":"string"}
    |}
  |}""".stripMargin
val df = sqlContext.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.MIN_STAMP -> "0",
    HBaseRelation.MAX_STAMP -> "1487148399504",
    HBaseRelation.MAX_VERSIONS -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
df.show()
+--------------------+------------+
| account_id|event_result|
+--------------------+------------+
|0000adb15e1d04181...| 138|
+--------------------+------------+
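For reference, the same versions are retrievable programmatically. A minimal sketch using the plain HBase client API (an assumption on my side: an HBase 1.x client on the classpath, with the cluster configuration available) that fetches all five versions of the cell shown in the scan:

import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("gtest"))
// Row key taken from the scan above; the catalog maps it as a plain string
val get = new Get(Bytes.toBytes("0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577"))
get.setMaxVersions(5) // request up to 5 versions per cell
val result = table.get(get)
result.getColumnCells(Bytes.toBytes("cf1"), Bytes.toBytes("event_result")).asScala.foreach { cell =>
  println(s"ts=${cell.getTimestamp} value=${Bytes.toString(CellUtil.cloneValue(cell))}")
}
conn.close()

This prints all five timestamp/value pairs, so the versions are stored and readable; the limitation appears to be in the SHC read path.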
Created 02-17-2017 11:40 PM
Could you please try this: remove HBaseRelation.MAX_VERSIONS -> "5" and use only HBaseRelation.MIN_STAMP -> "0", HBaseRelation.MAX_STAMP -> "1487148399504"? Something like the sketch below.
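That is, keeping the same catalog and passing only the timestamp-range options:

val df = sqlContext.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.MIN_STAMP -> "0",
    HBaseRelation.MAX_STAMP -> "1487148399504")) // MAX_VERSIONS removed
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()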
Created 02-18-2017 08:44 AM
I did, but again only the last row is returned.
Created 02-19-2017 04:38 AM
That's weird. It's supposed to work (refer to here). Which SHC version are you using?
Created 11-02-2017 09:17 PM
This issue has been resolved (https://github.com/hortonworks-spark/shc/pull/193). Until SHC publishes a new release, you can rebuild the current SHC code, which already includes the fix.
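If it helps, SHC is a Maven project, so building the patched jar from a checkout of the repository should amount to something like mvn clean package -DskipTests from the repository root, then adding the resulting jar to your Spark job's classpath (exact build flags may vary by branch).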