New Contributor
Posts: 4
Registered: ‎05-10-2016

Include latest hbase-spark in CDH

Thanks for including hbase-spark in CDH since v5.7.0. Unfortunately, it does not include the latest changes to hbase-spark (see: https://issues.apache.org/jira/browse/HBASE-14789), for example the HBase-Spark DataFrame integration. That means that Python users currently cannot use hbase-spark at all.

Posts: 354
Topics: 162
Kudos: 61
Solutions: 27
Registered: ‎06-26-2013

Re: Include latest hbase-spark in CDH

Hi DefOS,


This JIRA is still a work-in-progress; furthermore, it is targeted for HBase 2.0, which hasn't been released yet. So, no surprise that it's not in CDH 5.7.

New Contributor
Posts: 4
Registered: ‎05-10-2016

Re: Include latest hbase-spark in CDH

Cloudera's original work on hbase-spark (https://issues.apache.org/jira/browse/HBASE-13992) is also targeted for HBase 2.0, but I'm grateful that it showed up in 5.7.0 already.

I just wanted to ask about including the latest additions for Python users. If the original work on hbase-spark is included in CDH already, you might as well include the latest additions to it. (Of course I understand you'd want to wait until that JIRA closes.)
Posts: 354
Topics: 162
Kudos: 61
Solutions: 27
Registered: ‎06-26-2013

Re: Include latest hbase-spark in CDH

Yes, of course, you're right; backports are often done, but only for features that are production-ready.

Posts: 1,524
Kudos: 265
Solutions: 232
Registered: ‎07-31-2013

Re: Include latest hbase-spark in CDH

You should already be able to read HBase-Spark connector data as DataFrames in PySpark today, via the sqlContext:

 

~> hbase shell
> create 't', 'c'
> put 't', '1', 'c:a', 'a column data'
> put 't', '1', 'c:b', 'b column data'
> exit

~> export SPARK_CLASSPATH=$(hbase classpath)
~> pyspark
> hTbl = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
       .option('hbase.table', 't') \
       .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING c:a, B STRING c:b') \
       .option('hbase.use.hbase.context', False) \
       .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
       .load()
> hTbl.show()
+---------+-------------+-------------+
|KEY_FIELD|            A|            B|
+---------+-------------+-------------+
|        1|a column data|b column data|
+---------+-------------+-------------+
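
Once loaded, hTbl is a regular DataFrame, so standard filters apply on the mapped columns. A quick sketch (whether these get pushed down as server-side row-key scans depends on the connector version, so treat it as illustrative):

> hTbl.filter(hTbl.KEY_FIELD == '1').show()                               # single rowkey
> hTbl.filter((hTbl.KEY_FIELD >= '1') & (hTbl.KEY_FIELD < '2')).show()   # rowkey range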

There are some limitations, as the JIRA notes, of course. Which specific missing feature are you looking for, just so we know the scope of the request?

Backline Customer Operations Engineer
New Contributor
Posts: 4
Registered: ‎05-10-2016

Re: Include latest hbase-spark in CDH

That's awesome, Harsh J; I'd never seen an example of that in action.

 

I'm specifically looking for write support, though (https://issues.apache.org/jira/browse/HBASE-15336).

 

The other issue that benefits me greatly is the improved scan ability, as implemented in https://issues.apache.org/jira/browse/HBASE-14795.

 

Also, it would be nice to be able to make use of the new JSON format for defining the table catalog (https://issues.apache.org/jira/browse/HBASE-14801).
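
For reference, the catalog from HBASE-14801 is a JSON document along these lines (sketched from the JIRA, so the exact field names may differ slightly):

{
  "table":{"namespace":"default", "name":"t"},
  "rowkey":"key",
  "columns":{
    "KEY_FIELD":{"cf":"rowkey", "col":"key", "type":"string"},
    "A":{"cf":"c", "col":"a", "type":"string"},
    "B":{"cf":"c", "col":"b", "type":"string"}
  }
}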

New Contributor
Posts: 2
Registered: ‎09-29-2016

Re: Include latest hbase-spark in CDH


Is there an example of fetching data for a set of rowkeys using this? Basically, I am trying to find the correct option parameters for querying by HBase table row properties, such as rowkey, row_start, row_stop, reverse, limit, etc.
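
For instance, what I am after is something along these lines, reusing the hTbl DataFrame from the example above (just a hypothetical sketch; I don't know whether the connector pushes this down to HBase):

> hTbl.filter(hTbl.KEY_FIELD.isin(['1', '2', '5'])).show()   # fetch a specific set of rowkeys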

New Contributor
Posts: 2
Registered: ‎09-29-2016

Re: Include latest hbase-spark in CDH


 

The happybase package supports some of these operations.
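
For example, with an HBase Thrift server running (the host and port below are just assumptions), fetching specific rowkeys and scanning a bounded key range with a limit looks roughly like this:

import happybase

# Assumes the HBase Thrift server is reachable on this host/port
connection = happybase.Connection(host='localhost', port=9090)
table = connection.table('t')

# Fetch a specific set of rowkeys
rows = table.rows([b'1', b'2'])

# Scan a key range with a limit
for key, data in table.scan(row_start=b'1', row_stop=b'3', limit=10):
    print(key, data)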

Explorer
Posts: 33
Registered: ‎01-30-2017

Re: Include latest hbase-spark in CDH

@Harsh J

 

I have the same problem. First I created an HBase table with two column families, as shown below:

 

hbase(main):009:0> create 'pyspark', 'cf1', 'cf2'

hbase(main):011:0> desc 'pyspark'
Table pyspark is ENABLED
pyspark
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0460 seconds


hbase(main):012:0> put 'pyspark', '1', 'cf1:a','spark'

hbase(main):013:0> put 'pyspark', '1', 'cf2:b','pyspark'

hbase(main):015:0> put 'pyspark', '2', 'cf1:a','df'
0 row(s) in 0.0070 seconds

hbase(main):016:0> put 'pyspark', '2', 'cf2:b','python'
0 row(s) in 0.0080 seconds

hbase(main):017:0> scan 'pyspark'
ROW                                           COLUMN+CELL
 1                                            column=cf1:a, timestamp=1498758639265, value=spark
 1                                            column=cf2:b, timestamp=1498758656282, value=pyspark
 2                                            column=cf1:a, timestamp=1498758678501, value=df
 2                                            column=cf2:b, timestamp=1498758690263, value=python

Then, in the pyspark shell, I did the following:

pyspark = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping',
            'KEY_FIELD STRING :key, A STRING cf1:a, B STRING cf1:b, A STRING cf2:a, B STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()

Then I ran:

pyspark.show()

This gave me the result:

 

+---------+----+-------+
|KEY_FIELD|   A|      B|
+---------+----+-------+
|        1|null|pyspark|
|        2|null| python|
+---------+----+-------+

Now my questions:

 

1) Why am I getting null values in column A of the DataFrame?

 

2) Should we manually pass the column family and column names in the hbase.columns.mapping option of the statement that creates the DataFrame?

 

3) Or is there a generic way of doing this?
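
My own guess, for what it's worth: the mapping above reuses the aliases A and B, and cf1:b and cf2:a don't actually exist in the table, so the later entries seem to override the earlier ones (which would explain why A comes back null while B shows the cf2:b values). A mapping with one unique alias per existing column might behave differently; just a sketch, untested:

pyspark = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping',
            'KEY_FIELD STRING :key, A STRING cf1:a, B STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()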

Explorer
Posts: 33
Registered: ‎01-30-2017

Re: Include latest hbase-spark in CDH

@Harsh J

 

Is there any documentation available on connecting to HBase from PySpark?

 

I would like to know how we can create DataFrames and read from and write to HBase from PySpark.

 

Any documentation links would be good enough for me to look through and explore HBase-PySpark integration.
