Spark HBase Connector - how to get/scan all versions of columns and read/load them into a Spark DataFrame?
Labels: Apache HBase, Apache Spark
Created 02-16-2017 04:30 AM
Dear All,
At Nokia Technologies we are evaluating the SHC connector to read and write Spark DataFrames in HBase seamlessly. So far, writes and modifications work perfectly, but version-based reads are failing: they always return only the latest version.
I have an HBase table with multiple versions in column family cf1:
hbase(main):008:0* describe 'gtest'
Table gtest is ENABLED
gtest COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false',
KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER',
COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true',
BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0350 seconds
hbase(main):038:0> scan 'gtest',{VERSIONS=>5}
ROW COLUMN+CELL
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399503, value=138
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399425, value=1
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148399345, value=59
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148376205, value=138
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577 column=cf1:event_result, timestamp=1487148376095, value=1
1 row(s) in 0.0290 seconds
I am trying to read all versions of this column family using the SHC connector.
Scanning the table from the HBase shell fetches all versions, as displayed above. However, the DataFrame loaded through SHC contains only the latest version.
import org.apache.spark.sql.execution.datasources.hbase.{HBaseRelation, HBaseTableCatalog}

def catalog = s"""{
  "table":{"namespace":"default", "name":"gtest"},
  "rowkey":"account_id",
  "columns":{
    "account_id":{"cf":"rowkey", "col":"account_id", "type":"string"},
    "event_result":{"cf":"cf1", "col":"event_result", "type":"string"}
  }
}""".stripMargin

val df = sqlContext.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.MIN_STAMP -> "0",
    HBaseRelation.MAX_STAMP -> "1487148399504",
    HBaseRelation.MAX_VERSIONS -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
df.show()
+--------------------+------------+
| account_id|event_result|
+--------------------+------------+
|0000adb15e1d04181...| 138|
+--------------------+------------+
Created 02-17-2017 11:40 PM
Could you please try this: remove HBaseRelation.MAX_VERSIONS -> "5" and use only HBaseRelation.MIN_STAMP -> "0" and HBaseRelation.MAX_STAMP -> "1487148399504"?
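A minimal sketch of that read, assuming the same sqlContext and catalog as in the original post:

import org.apache.spark.sql.execution.datasources.hbase.{HBaseRelation, HBaseTableCatalog}

// Time-range read only: MIN_STAMP/MAX_STAMP set, MAX_VERSIONS omitted.
val df = sqlContext.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.MIN_STAMP -> "0",
    HBaseRelation.MAX_STAMP -> "1487148399504"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
df.show()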
Created 02-18-2017 08:44 AM
I did, but again only the latest version is returned.
Created 02-19-2017 04:38 AM
That's weird. It's supposed to work (refer to here). Which SHC version are you using?
Created 11-02-2017 09:17 PM
This issue has been resolved (https://github.com/hortonworks-spark/shc/pull/193). Until SHC publishes a new release, you can rebuild the current SHC code, which already includes the fix.
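Once rebuilt with the fix, a versioned read like the sketch below (using the same catalog and option values as in this thread) should return all stored versions within the time range rather than only the latest one:

import org.apache.spark.sql.execution.datasources.hbase.{HBaseRelation, HBaseTableCatalog}

// With the fix from the PR above, combining the time range with
// MAX_VERSIONS should yield one row per stored cell version.
val df = sqlContext.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.MIN_STAMP -> "0",
    HBaseRelation.MAX_STAMP -> "1487148399504",
    HBaseRelation.MAX_VERSIONS -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
df.show()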
