Include latest hbase-spark in CDH

New Contributor

Thanks for including hbase-spark in CDH since v5.7.0. Unfortunately, it does not include the latest changes to hbase-spark (see https://issues.apache.org/jira/browse/HBASE-14789), for example the HBase-Spark DataFrame integration. That means that Python users currently cannot use hbase-spark at all.

12 REPLIES

Re: Include latest hbase-spark in CDH

Master Collaborator

Hi DefOS,


This JIRA is still a work-in-progress; furthermore, it is targeted for HBase 2.0, which hasn't been released yet. So, no surprise that it's not in CDH 5.7.

Re: Include latest hbase-spark in CDH

New Contributor
Cloudera's original work on hbase-spark (https://issues.apache.org/jira/browse/HBASE-13992) is also targeted for HBase 2.0, but I'm grateful that it showed up in 5.7.0 already.

I just wanted to ask that the latest additions be included for Python users. If the original work on hbase-spark is included in CDH already, you might as well include the latest additions to it. (Of course, I understand you'd want to wait until that JIRA closes.)

Re: Include latest hbase-spark in CDH

Master Collaborator

Yes, of course, you're right; backports are often done, but only for features that are production-ready.

Re: Include latest hbase-spark in CDH

Master Guru

You should already be able to read HBase data through the Spark connector as DataFrames in PySpark, via the sqlContext:

 

~> hbase shell
> create 't', 'c'
> put 't', '1', 'c:a', 'a column data'
> put 't', '1', 'c:b', 'b column data'
> exit

~> export SPARK_CLASSPATH=$(hbase classpath)
~> pyspark
> hTbl = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
       .option('hbase.table', 't') \
       .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING c:a, B STRING c:b') \
       .option('hbase.use.hbase.context', False) \
       .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
       .load()
> hTbl.show()
+---------+-------------+-------------+
|KEY_FIELD|            A|            B|
+---------+-------------+-------------+
|        1|a column data|b column data|
+---------+-------------+-------------+

There are some limitations, as the JIRA notes, of course. Which specific missing feature are you looking for, just so we know the scope of the request?

Re: Include latest hbase-spark in CDH

New Contributor

That's awesome, Harsh J; I'd never seen an example of that in action.

 

I'm specifically looking for write support, though (https://issues.apache.org/jira/browse/HBASE-15336).

 

The other issue that would benefit me greatly is the improved scan ability, as implemented in https://issues.apache.org/jira/browse/HBASE-14795.

 

Also, it would be nice to be able to make use of the new JSON format for defining the table catalog (https://issues.apache.org/jira/browse/HBASE-14801).
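
For illustration, a DataFrame write using the JSON table catalog from those JIRAs might look roughly like the sketch below in PySpark. This is not something available in CDH 5.7; the catalog layout and the 'catalog'/'newtable' option keys come from the upstream hbase-spark module and should be treated as assumptions.

# Sketch only: assumes the upstream hbase-spark write path from
# HBASE-15336 / HBASE-14801, which is not shipped in CDH 5.7.
catalog = '''{
  "table": {"namespace": "default", "name": "t"},
  "rowkey": "key",
  "columns": {
    "KEY_FIELD": {"cf": "rowkey", "col": "key", "type": "string"},
    "A": {"cf": "c", "col": "a", "type": "string"},
    "B": {"cf": "c", "col": "b", "type": "string"}
  }
}'''

df = sqlContext.createDataFrame(
    [('2', 'a value', 'b value')],
    ['KEY_FIELD', 'A', 'B'])

# 'catalog' and 'newtable' are the option keys used by the upstream module
# (HBaseTableCatalog.tableCatalog / newTable); treat them as assumptions here.
df.write \
    .format('org.apache.hadoop.hbase.spark') \
    .option('catalog', catalog) \
    .option('newtable', '5') \
    .save()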

Re: Include latest hbase-spark in CDH

New Contributor

Is there an example of fetching data for a set of row keys using this? Basically, I am trying to find the correct option parameters for querying by HBase table row properties, such as rowkey, row_start, row_stop, reverse, limit, etc.
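
For what it's worth, with the mapping from Harsh J's example above, row-key restrictions can at least be expressed as ordinary DataFrame filters on the KEY_FIELD column. This is only a sketch; whether the connector pushes these predicates down as HBase scans or falls back to a full table scan depends on the connector version (that is what HBASE-14795 improves).

# Reuses the hTbl DataFrame from the earlier example.
# Fetch a specific set of row keys.
subset = hTbl.filter(hTbl.KEY_FIELD.isin('1', '2', '7'))

# Emulate row_start/row_stop with a range predicate on the key, and 'limit'
# with DataFrame.limit(); there is no direct 'reverse' option, but ordering
# by the key descending gives the same result at the cost of a sort.
ranged = hTbl.filter((hTbl.KEY_FIELD >= '1') & (hTbl.KEY_FIELD < '5')) \
    .orderBy(hTbl.KEY_FIELD.desc()) \
    .limit(10)

ranged.show()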

Re: Include latest hbase-spark in CDH

New Contributor

 

The happybase package supports some of these functions.
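
A minimal happybase sketch of those scan parameters is below; the host name and table are placeholders, and it assumes an HBase Thrift server is running for happybase to connect to.

import happybase

# Placeholder host; happybase talks to HBase via the Thrift gateway.
connection = happybase.Connection('thrift-server-host')
table = connection.table('t')

# Fetch a single row by key.
row = table.row(b'1')

# Scan a key range with a limit; a 'reverse' keyword also exists in recent
# happybase/HBase versions.
for key, data in table.scan(row_start=b'1', row_stop=b'5', limit=10):
    print(key, data)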

Re: Include latest hbase-spark in CDH

Explorer

@Harsh J

 

I have the same problem. First I created an HBase table with two column families, as below:

 

hbase(main):009:0> create 'pyspark', 'cf1', 'cf2'

hbase(main):011:0> desc 'pyspark'
Table pyspark is ENABLED
pyspark
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0460 seconds


hbase(main):012:0> put 'pyspark', '1', 'cf1:a','spark'

hbase(main):013:0> put 'pyspark', '1', 'cf2:b','pyspark'

hbase(main):015:0> put 'pyspark', '2', 'cf1:a','df'
0 row(s) in 0.0070 seconds

hbase(main):016:0> put 'pyspark', '2', 'cf2:b','python'
0 row(s) in 0.0080 seconds

hbase(main):017:0> scan 'pyspark'
ROW                                           COLUMN+CELL
 1                                            column=cf1:a, timestamp=1498758639265, value=spark
 1                                            column=cf2:b, timestamp=1498758656282, value=pyspark
 2                                            column=cf1:a, timestamp=1498758678501, value=df
 2                                            column=cf2:b, timestamp=1498758690263, value=python

Then in the PySpark shell I did the following:

pyspark = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING cf1:a, B STRING cf1:b, A STRING cf2:a, B STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()

and then ran

pyspark.show()

This gave me this result:

 

+---------+----+-------+
|KEY_FIELD|   A|      B|
+---------+----+-------+
|        1|null|pyspark|
|        2|null| python|
+---------+----+-------+

Now my questions:

 

1) Why am I getting null values in column A of the DataFrame?

 

2) Should we manually pass the column family and column names in the hbase.columns.mapping option of the statement that creates the DataFrame?

 

3) Or is there a generic way of doing this?
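
For comparison, here is a sketch of the same read with one distinct field name per qualifier that actually exists in the table (cf1:a and cf2:b); whether the duplicate A/B aliases above are the cause of the nulls is an assumption, not something confirmed in this thread.

# Each HBase qualifier gets its own, unique DataFrame column name.
df = sqlContext.read.format('org.apache.hadoop.hbase.spark') \
    .option('hbase.table', 'pyspark') \
    .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, CF1_A STRING cf1:a, CF2_B STRING cf2:b') \
    .option('hbase.use.hbase.context', False) \
    .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml') \
    .load()

df.show()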

Re: Include latest hbase-spark in CDH

Explorer

@Harsh J

 

Is there any documentation available on connecting to HBase from PySpark?

 

I would like to know how we can create DataFrames and read from and write to HBase from PySpark.

 

Any documentation links would be good enough for me to look through and explore HBase-PySpark integration.
