
Include latest hbase-spark in CDH

Explorer

Thanks for including hbase-spark in CDH since v5.7.0. Unfortunately, it does not include the latest changes to hbase-spark (see https://issues.apache.org/jira/browse/HBASE-14789), for example the HBase-Spark DataFrame integration. That means that Python users currently cannot use hbase-spark at all.

12 REPLIES

Master Collaborator

Hi DefOS,


This JIRA is still a work in progress; furthermore, it is targeted for HBase 2.0, which hasn't been released yet. So it's no surprise that it's not in CDH 5.7.

Explorer
Cloudera's original work on hbase-spark (https://issues.apache.org/jira/browse/HBASE-13992) is also targeted for HBase 2.0, but I'm grateful that it showed up in 5.7.0 already.

I just wanted to ask about including the latest additions, for the benefit of Python users. Since the original work on hbase-spark is already included in CDH, you might as well include the latest additions to it. (Of course, I understand you'd want to wait until that JIRA closes.)

Master Collaborator

Yes, of course, you're right: backports are often done, but only for features that are production-ready.

Mentor

You should already be able to read HBase-Spark connector data via DataFrames in PySpark today, using the sqlContext:

~> hbase shell
> create 't', 'c'
> put 't', '1', 'c:a', 'a column data'
> put 't', '1', 'c:b', 'b column data'
> exit

~> export SPARK_CLASSPATH=$(hbase classpath)
~> pyspark
> hTbl = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
       .option('hbase.table', 't') \
       .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING c:a, B STRING c:b') \
       .option('hbase.use.hbase.context', False) \
       .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
       .load()
> hTbl.show()
+---------+-------------+-------------+
|KEY_FIELD|            A|            B|
+---------+-------------+-------------+
|        1|a column data|b column data|
+---------+-------------+-------------+

There are some limitations, as the JIRA notes, of course. Which specific missing feature are you looking for, just so we know the scope of the request?

Explorer

That's awesome, Harsh J; I'd never seen an example of that in action.

I'm specifically looking for write support, though (https://issues.apache.org/jira/browse/HBASE-15336).
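To illustrate, the write path added there would presumably mirror the read path above; roughly something like the following sketch (based on the JIRA, not on anything that works in CDH 5.7 today):

> hTbl.write.format('org.apache.hadoop.hbase.spark') \
       .option('hbase.table', 't') \
       .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING c:a, B STRING c:b') \
       .save()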


The other issue that would benefit me greatly is the improved scan support implemented in https://issues.apache.org/jira/browse/HBASE-14795.


Also, it would be nice to be able to make use of the new JSON format for defining the table catalog (https://issues.apache.org/jira/browse/HBASE-14801).
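For reference, the JSON catalog from that JIRA describes the table, rowkey, and column mappings in a single document. A sketch of what it could look like for the example table above (the format is taken from the JIRA and may still change before it lands):

catalog = '''{
    "table": {"namespace": "default", "name": "t"},
    "rowkey": "key",
    "columns": {
        "KEY_FIELD": {"cf": "rowkey", "col": "key", "type": "string"},
        "A": {"cf": "c", "col": "a", "type": "string"},
        "B": {"cf": "c", "col": "b", "type": "string"}
    }
}'''

df = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
        .options(catalog=catalog) \
        .load()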

New Contributor

Is there an example of fetching data for a set of rowkeys using this? Basically, I am trying to find the correct option parameters to query using HBase table row properties (rowkey, row_start, row_stop, reverse, limit, etc.).
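For instance, would filtering on the key column of the DataFrame be the right approach, something like this (untested, using the hTbl DataFrame from the earlier example; I am not sure which of these the connector actually pushes down into an HBase scan)?

hTbl.where("KEY_FIELD >= '1' AND KEY_FIELD < '5'").show()   # hopefully a bounded scan (row_start/row_stop)
hTbl.where(hTbl.KEY_FIELD.isin('1', '3')).show()            # a specific set of rowkeys
hTbl.orderBy(hTbl.KEY_FIELD.desc()).limit(5).show()         # reverse + limit, likely applied in Spark, not HBase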

New Contributor


The happybase package supports some of these operations.
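For example, a scan bounded by start and stop rows with a limit looks like this (happybase goes through the HBase Thrift gateway, which must be running; the host name below is a placeholder):

import happybase

# Connect to the HBase Thrift server (default port 9090)
connection = happybase.Connection('thrift-gateway-host')
table = connection.table('pyspark')

# Bounded scan: row_start is inclusive, row_stop is exclusive
for key, data in table.scan(row_start=b'1', row_stop=b'3', limit=10):
    print(key, data)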

Contributor

@Harsh J

I have the same problem. First, I created an HBase table with two column families, as shown below:

 

hbase(main):009:0> create 'pyspark', 'cf1', 'cf2'

hbase(main):011:0> desc 'pyspark'
Table pyspark is ENABLED
pyspark
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0460 seconds


hbase(main):012:0> put 'pyspark', '1', 'cf1:a','spark'

hbase(main):013:0> put 'pyspark', '1', 'cf2:b','pyspark'

hbase(main):015:0> put 'pyspark', '2', 'cf1:a','df'
0 row(s) in 0.0070 seconds

hbase(main):016:0> put 'pyspark', '2', 'cf2:b','python'
0 row(s) in 0.0080 seconds

hbase(main):017:0> scan 'pyspark'
ROW                                           COLUMN+CELL
 1                                            column=cf1:a, timestamp=1498758639265, value=spark
 1                                            column=cf2:b, timestamp=1498758656282, value=pyspark
 2                                            column=cf1:a, timestamp=1498758678501, value=df
 2                                            column=cf2:b, timestamp=1498758690263, value=python

Then, in the pyspark shell, I did the following:

pyspark = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING cf1:a, B STRING cf1:b, A STRING cf2:a, B STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()

and then ran:

pyspark.show()

This gave me the following result:

+---------+----+-------+
|KEY_FIELD|   A|      B|
+---------+----+-------+
|        1|null|pyspark|
|        2|null| python|
+---------+----+-------+

Now my questions:

1) Why am I getting null values in column A of the DataFrame?

2) Do we have to manually pass the column family and column names in the hbase.columns.mapping option when creating the DataFrame?

3) Or is there a generic way of doing this?
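My current guess on 1): the mapping assigns two different HBase columns to the same field name (A maps to both cf1:a and cf2:a, and B to both cf1:b and cf2:b), so one mapping overwrites the other, and the surviving cf2:a column holds no data. A mapping with unique field names per HBase column might avoid that; a sketch (the CF1_A / CF2_B names are made up, and only the columns that actually hold data are mapped):

pyspark_df = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, CF1_A STRING cf1:a, CF2_B STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()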

Contributor

@Harsh J


Is there any documentation available on connecting to HBase from PySpark?

I would like to know how we can create DataFrames and read from and write to HBase from PySpark.

Any documentation links would be good enough for me to look through and explore HBase-PySpark integration.