Created on 07-25-2016 01:16 PM - edited 09-16-2022 03:31 AM
Thanks for including hbase-spark in CDH since v5.7.0. Unfortunately, it does not include the latest changes to hbase-spark (see: https://issues.apache.org/jira/browse/HBASE-14789), for example the HBase-Spark DataFrame integration. That means that Python users currently cannot use hbase-spark at all.
Created 07-26-2016 10:30 AM
Hi DefOS,
This JIRA is still a work in progress; furthermore, it is targeted for HBase 2.0, which hasn't been released yet. So it's no surprise that it's not in CDH 5.7.
Created 07-26-2016 10:54 AM
Created 07-26-2016 04:25 PM
Yes, of course, you're right: backports are often done, but only for features that are production-ready.
Created 07-27-2016 12:12 AM
You should already be able to read HBase data into DataFrames in PySpark today, via the sqlContext and the hbase-spark connector:
~> hbase shell
> create 't', 'c'
> put 't', '1', 'c:a', 'a column data'
> put 't', '1', 'c:b', 'b column data'
> exit
~> export SPARK_CLASSPATH=$(hbase classpath)
~> pyspark
> hTbl = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
      .option('hbase.table', 't') \
      .option('hbase.columns.mapping',
              'KEY_FIELD STRING :key, A STRING c:a, B STRING c:b') \
      .option('hbase.use.hbase.context', False) \
      .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
      .load()
> hTbl.show()
+---------+-------------+-------------+
|KEY_FIELD|            A|            B|
+---------+-------------+-------------+
|        1|a column data|b column data|
+---------+-------------+-------------+
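If you are running this outside the pyspark shell (e.g. via spark-submit), the only extra step is creating the contexts yourself. A minimal sketch using the Spark 1.x API to match the shell session above; the app name is illustrative, and the same SPARK_CLASSPATH export is assumed so the connector jar and HBase configs are visible:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Spark 1.x entry points, matching the CDH 5.7-era shell example above
sc = SparkContext(appName='hbase-dataframe-read')
sqlContext = SQLContext(sc)

# Same read as in the shell session
hTbl = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 't') \
    .option('hbase.columns.mapping',
            'KEY_FIELD STRING :key, A STRING c:a, B STRING c:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()

hTbl.show()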
There are some limitations, as the JIRA notes, of course. Which specific missing feature are you looking for, just so we know the scope of the request?
Created 07-27-2016 09:49 AM
That's awesome, Harsh J; I'd never seen an example of that in action.
I'm specifically looking for write support, though (https://issues.apache.org/jira/browse/HBASE-15336).
The other issue that would benefit me greatly is the improved scan functionality, as implemented in https://issues.apache.org/jira/browse/HBASE-14795.
Also, it would be nice to be able to make use of the new JSON format for defining the table catalog (https://issues.apache.org/jira/browse/HBASE-14801).
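For what it's worth, combining write support (HBASE-15336) with the JSON table catalog (HBASE-14801), the eventual API is expected to look roughly like the sketch below. This is pieced together from the JIRA discussions rather than a shipped interface, so the 'catalog' option key and the exact JSON field names are assumptions:

import json

# Hypothetical JSON catalog describing the 't' table from the example above
catalog = json.dumps({
    "table": {"namespace": "default", "name": "t"},
    "rowkey": "key",
    "columns": {
        "KEY_FIELD": {"cf": "rowkey", "col": "key", "type": "string"},
        "A": {"cf": "c", "col": "a", "type": "string"},
        "B": {"cf": "c", "col": "b", "type": "string"}
    }
})

# Read via the JSON catalog instead of hbase.columns.mapping (HBASE-14801)
df = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('catalog', catalog) \
    .load()

# Write the DataFrame back to HBase (HBASE-15336)
df.write.format('org.apache.hadoop.hbase.spark') \
    .option('catalog', catalog) \
    .save()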
Created on 09-29-2016 08:35 AM - edited 09-29-2016 08:58 AM
Is there an example of fetching data for a set of row keys using this? Basically, I am trying to find the correct option parameters for querying with HBase table row properties (row key, row_start, row_stop, reverse, limit, etc.).
Created on 10-04-2016 10:56 AM - edited 10-04-2016 11:27 AM
The happybase package supports some of these functions, as sketched below.
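For example, happybase exposes single-row lookups, multi-row gets, and bounded scans directly. A minimal sketch against the 't' table from earlier in the thread; the Thrift server hostname is illustrative, and note that happybase talks to HBase through the Thrift gateway rather than the Spark connector:

import happybase

# Connect via the HBase Thrift server (it must be running)
connection = happybase.Connection('thrift-server-host')
table = connection.table('t')

# Fetch one row, or a specific set of row keys
row = table.row(b'1')
rows = table.rows([b'1', b'2'])

# Bounded scan: row_start/row_stop, limit, and reverse map to the
# row properties asked about above
for key, data in table.scan(row_start=b'1', row_stop=b'9',
                             limit=10, reverse=False):
    print(key, data)

connection.close()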
Created 06-29-2017 11:39 AM
I have the same problem. First I created an HBase table with two column families, like below:
hbase(main):009:0> create 'pyspark', 'cf1', 'cf2'
hbase(main):011:0> desc 'pyspark'
Table pyspark is ENABLED
pyspark
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0460 seconds
hbase(main):012:0> put 'pyspark', '1', 'cf1:a', 'spark'
hbase(main):013:0> put 'pyspark', '1', 'cf2:b', 'pyspark'
hbase(main):015:0> put 'pyspark', '2', 'cf1:a', 'df'
0 row(s) in 0.0070 seconds
hbase(main):016:0> put 'pyspark', '2', 'cf2:b', 'python'
0 row(s) in 0.0080 seconds
hbase(main):017:0> scan 'pyspark'
ROW    COLUMN+CELL
 1     column=cf1:a, timestamp=1498758639265, value=spark
 1     column=cf2:b, timestamp=1498758656282, value=pyspark
 2     column=cf1:a, timestamp=1498758678501, value=df
 2     column=cf2:b, timestamp=1498758690263, value=python
Then, in the PySpark shell, I did the following:
pyspark = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping',
            'KEY_FIELD STRING :key, A STRING cf1:a, B STRING cf1:b, A STRING cf2:a, B STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()
Then I ran:
pyspark.show()
This gave me the following result:
+---------+----+-------+
|KEY_FIELD|   A|      B|
+---------+----+-------+
|        1|null|pyspark|
|        2|null| python|
+---------+----+-------+
Now my questions:
1) Why am I getting null values in column A of the DataFrame? (A possible cause is sketched after this list.)
2) Should we manually pass the column family and column names in the hbase.columns.mapping option when creating the DataFrame?
3) Or is there a generic way of doing this?
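One observation on question 1: the mapping above binds each alias twice, so A points first at cf1:a and then at cf2:a (which was never written), and B points first at cf1:b (never written) and then at cf2:b. Since DataFrame column names must be unique, the last binding appears to win, which would explain the null A column and the populated B column. A sketch with unique aliases, mapping only the cells actually written, should avoid this:

df = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping',
            'KEY_FIELD STRING :key, A STRING cf1:a, B STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()

df.show()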
Created 06-29-2017 11:42 AM
Is there any documentation available on connecting to HBase from PySpark?
I would like to know how we can create DataFrames and read from and write to HBase from PySpark.
Any documentation links would be good enough for me to look through and explore HBase-PySpark integration.