
Include latest hbase-spark in CDH

Explorer

Thanks for including hbase-spark in CDH as of v5.7.0. Unfortunately, it does not include the latest changes to hbase-spark (see https://issues.apache.org/jira/browse/HBASE-14789), for example the HBase-Spark DataFrame integration. That means that Python users currently cannot use hbase-spark at all.


Master Collaborator

Hi DefOS,


This JIRA is still a work in progress; furthermore, it is targeted for HBase 2.0, which hasn't been released yet. So it's no surprise that it's not in CDH 5.7.

Explorer
Cloudera's original work on hbase-spark (https://issues.apache.org/jira/browse/HBASE-13992) is also targeted for HBase 2.0, but I'm grateful that it showed up in 5.7.0 already.

I just wanted to ask about including the latest additions for Python users. If the original work on hbase-spark is already included in CDH, you might as well include the latest additions to it. (Of course, I understand you'd want to wait until that JIRA closes.)

Master Collaborator

Yes, of course, you're right: backports are often done, but only for features that are production-ready.

Mentor

You should already be able to read HBase data as DataFrames in PySpark today, via the sqlContext and the HBase-Spark connector:

 

~> hbase shell
> create 't', 'c'
> put 't', '1', 'c:a', 'a column data'
> put 't', '1', 'c:b', 'b column data'
> exit

~> export SPARK_CLASSPATH=$(hbase classpath)
~> pyspark
> hTbl = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
       .option('hbase.table', 't') \
       .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING c:a, B STRING c:b') \
       .option('hbase.use.hbase.context', False) \
       .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
       .load()
> hTbl.show()
+---------+-------------+-------------+
|KEY_FIELD|            A|            B|
+---------+-------------+-------------+
|        1|a column data|b column data|
+---------+-------------+-------------+
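
Once loaded, the DataFrame can also be registered as a temporary table and queried with plain SQL, if that is more convenient (a quick sketch against the Spark 1.6 / sqlContext API shipped in CDH 5.7; hbase_t is just an arbitrary alias):

> hTbl.registerTempTable('hbase_t')
> sqlContext.sql("SELECT KEY_FIELD, A FROM hbase_t WHERE B = 'b column data'").show()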

There are some limitations, as the JIRA notes, of course. Which specific missing feature are you looking for, just so we know the scope of the request?

Explorer

That's awesome, Harsh J; I'd never seen an example of that in action.

 

I'm specifically looking for write support, though (https://issues.apache.org/jira/browse/HBASE-15336).
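
Right now, as far as I know, the only way to write from PySpark is the plain MapReduce TableOutputFormat path via saveAsNewAPIHadoopDataset, which is what I'd like the connector to replace. A rough sketch (it assumes the spark-examples jar that provides the Python converters is on the classpath, and reuses the 't'/'c' table from the example above):

# Each record is [row key, column family, qualifier, value] as strings.
rows = sc.parallelize([['3', 'c', 'a', 'written from pyspark'],
                       ['4', 'c', 'a', 'another row']])

conf = {"hbase.zookeeper.quorum": "localhost",
        "hbase.mapred.outputtable": "t",
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}

# The two converters ship with the Spark examples jar.
rows.map(lambda x: (x[0], x)).saveAsNewAPIHadoopDataset(
    conf=conf,
    keyConverter="org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.StringListToPutConverter")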

 

The other issue that would benefit me greatly is the improved scan capability, as implemented in https://issues.apache.org/jira/browse/HBASE-14795.

 

Also, it would be nice to be able to make use of the new JSON format for defining the table catalog (https://issues.apache.org/jira/browse/HBASE-14801).

New Contributor

Is there an example of fetching data for a set of row keys using this? Basically, I am trying to find the correct option parameters for querying by HBase table row properties (row key, row_start, row_stop, reverse, limit, etc.).
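
For example, is the intended way just ordinary DataFrame filters on the column mapped to :key? A sketch reusing the hTbl DataFrame from above (I'm not sure whether these get pushed down as key-range scans or applied after a full scan):

# hTbl is the DataFrame loaded via org.apache.hadoop.hbase.spark as shown earlier.
hTbl.filter((hTbl.KEY_FIELD >= '1') & (hTbl.KEY_FIELD < '3')).show()   # key range (row_start/row_stop)
hTbl.filter(hTbl.KEY_FIELD.isin('1', '2')).show()                      # a specific set of row keys
hTbl.limit(10).show()                                                   # limit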

New Contributor

 

The happybase package supports some of these functions, as sketched below.
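
For example, a scan with those row properties through happybase looks roughly like this (a sketch; the Thrift gateway host is a placeholder, and the HBase Thrift server must be running):

import happybase

connection = happybase.Connection('thrift-gateway-host')  # placeholder host
table = connection.table('pyspark')

# Key-range scan with a row limit; reverse scans need a recent happybase/HBase.
for key, data in table.scan(row_start=b'1', row_stop=b'3', limit=10):
    print(key, data)

# Fetch a specific set of row keys in one call.
rows = table.rows([b'1', b'2'])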

Contributor

@Harsh J

 

I have the same problem. First, I created an HBase table with two column families, like below:

 

hbase(main):009:0> create 'pyspark', 'cf1', 'cf2'

hbase(main):011:0> desc 'pyspark'
Table pyspark is ENABLED
pyspark
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0460 seconds


hbase(main):012:0> put 'pyspark', '1', 'cf1:a','spark'

hbase(main):013:0> put 'pyspark', '1', 'cf2:b','pyspark'

hbase(main):015:0> put 'pyspark', '2', 'cf1:a','df'
0 row(s) in 0.0070 seconds

hbase(main):016:0> put 'pyspark', '2', 'cf2:b','python'
0 row(s) in 0.0080 seconds

hbase(main):017:0> scan 'pyspark'
ROW                                           COLUMN+CELL
 1                                            column=cf1:a, timestamp=1498758639265, value=spark
 1                                            column=cf2:b, timestamp=1498758656282, value=pyspark
 2                                            column=cf1:a, timestamp=1498758678501, value=df
 2                                            column=cf2:b, timestamp=1498758690263, value=python

Then in the pyspark shell I did the following:

pyspark = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING cf1:a, B STRING cf1:b, A STRING cf2:a, B STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()

Then I ran:

pyspark.show()

This gave me the following result:

 

+---------+----+-------+
|KEY_FIELD|   A|      B|
+---------+----+-------+
|        1|null|pyspark|
|        2|null| python|
+---------+----+-------+

Now my questions:

 

1) Why am I getting null values in column A of the DataFrame?

 

2) Do we have to manually pass the column family and column names in the hbase.columns.mapping option in the statement that creates the DataFrame?

 

3) Or is there a generic way of doing this?
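
One thing I notice is that my mapping uses the aliases A and B twice (once for cf1 and once for cf2); maybe the cf2 mappings shadow the cf1 ones, and since cf2:a was never populated, that would explain the nulls? If so, would a mapping with unique aliases, like this sketch, be the right approach?

pyspark = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, CF1_A STRING cf1:a, CF2_B STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()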

Contributor

@Harsh J

 

Is there any documentation available on connecting to HBase from PySpark?

 

I would like to know how we can create DataFrames and read from and write to HBase from PySpark.

 

Any documentation links would be good enough for me to look through and explore HBase-PySpark integration.