Include latest hbase-spark in CDH

New Contributor

Thanks for including hbase-spark in CDH since v5.7.0. Unfortunately, it does not include the latest changes to hbase-spark (see: https://issues.apache.org/jira/browse/HBASE-14789), for example the HBase-Spark DataFrame integration. That means that Python users currently cannot use hbase-spark at all.

12 REPLIES

Master Collaborator

Hi DefOS,


This JIRA is still a work-in-progress; furthermore, it is targeted for HBase 2.0, which hasn't been released yet. So, no surprise that it's not in CDH 5.7.

New Contributor
Cloudera's original work on hbase-spark (https://issues.apache.org/jira/browse/HBASE-13992) is also targeted for HBase 2.0, but I'm grateful that it showed up in 5.7.0 already.

I just wanted to ask that the latest additions be included for Python users. If the original work on hbase-spark is already in CDH, you might as well include the latest additions to it. (Of course, I understand you'd want to wait until that JIRA closes.)

Master Collaborator

Yes, of course, you're right; backports are often done, but only for features that are production-ready.

Master Guru

You should already be able to read HBase data as DataFrames in PySpark today, through the sqlContext and the HBase-Spark connector:

 

~> hbase shell
> create 't', 'c'
> put 't', '1', 'c:a', 'a column data'
> put 't', '1', 'c:b', 'b column data'
> exit

~> export SPARK_CLASSPATH=$(hbase classpath)
~> pyspark
> hTbl = (sqlContext.read.format('org.apache.hadoop.hbase.spark')
       .option('hbase.table', 't')
       .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING c:a, B STRING c:b')
       .option('hbase.use.hbase.context', False)
       .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml')
       .load())
> hTbl.show()
+---------+-------------+-------------+
|KEY_FIELD|            A|            B|
+---------+-------------+-------------+
|        1|a column data|b column data|
+---------+-------------+-------------+

There are some limitations, as the JIRA notes, of course. Which specific missing feature are you looking for, just so we know the scope of the request?

New Contributor

That's awesome, Harsh J; I'd never seen an example of that in action.

 

I'm specifically looking for write support, though (https://issues.apache.org/jira/browse/HBASE-15336).

 

The other issue that benefits me greatly is the improved scan ability, as implemented in https://issues.apache.org/jira/browse/HBASE-14795.

 

Also, it would be nice to be able to make use of the new JSON format for defining the table catalog (https://issues.apache.org/jira/browse/HBASE-14801).
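
For reference, the JSON catalog from HBASE-14801 looks roughly like the sketch below. The table and column names are just illustrative, and the option key for passing the catalog is an assumption on my part, not something confirmed for any CDH build:

# Hedged sketch of an HBASE-14801-style JSON table catalog (names are illustrative).
catalog = '''{
  "table":   {"namespace": "default", "name": "t"},
  "rowkey":  "key",
  "columns": {
    "KEY_FIELD": {"cf": "rowkey", "col": "key", "type": "string"},
    "A":         {"cf": "c",      "col": "a",   "type": "string"},
    "B":         {"cf": "c",      "col": "b",   "type": "string"}
  }
}'''

# With a connector build that includes HBASE-14801, the catalog would be passed
# in place of 'hbase.columns.mapping'; the exact option key is an assumption here.
# df = sqlContext.read.format('org.apache.hadoop.hbase.spark').option('catalog', catalog).load()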

New Contributor

Is there an example of fetching data for a set of rowkeys using this? Basically, I am trying to find the correct option parameters for querying by HBase row properties, such as rowkey, row_start, row_stop, reverse, limit, etc.
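
The closest I can get today is probably plain DataFrame filters on the mapped key column, like the hedged sketch below (assumes hTbl was loaded as in the example above; whether these become server-side HBase scans or a full scan depends on the connector's push-down support):

# Hedged sketch; hTbl is the DataFrame from the earlier example.
rows = hTbl.filter(hTbl.KEY_FIELD.isin(['1', '2']))           # a specific set of rowkeys
rng  = hTbl.filter("KEY_FIELD >= '1' AND KEY_FIELD < '3'")    # a start/stop key range
rng.limit(10).show()                                          # cap the number of rows returned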

New Contributor

 

The happybase package supports some of these operations.
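
For example (a rough sketch; the host and table names are placeholders, and happybase goes through the HBase Thrift server rather than Spark):

# Hedged happybase sketch, assuming an HBase Thrift server is running.
import happybase

connection = happybase.Connection('thrift-server-host')
table = connection.table('t')

# fetch a specific set of rowkeys
for key, data in table.rows([b'1', b'2']):
    print(key, data)

# scan a key range with a limit, optionally reversed (reverse needs a recent happybase/Thrift)
for key, data in table.scan(row_start=b'1', row_stop=b'3', limit=10, reverse=True):
    print(key, data)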

Explorer

@Harsh J

 

I have the same problem. First, I created an HBase table with two column families, as shown below:

 

hbase(main):009:0> create 'pyspark', 'cf1', 'cf2'

hbase(main):011:0> desc 'pyspark'
Table pyspark is ENABLED
pyspark
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0460 seconds


hbase(main):012:0> put 'pyspark', '1', 'cf1:a','spark'

hbase(main):013:0> put 'pyspark', '1', 'cf2:b','pyspark'

hbase(main):015:0> put 'pyspark', '2', 'cf1:a','df'
0 row(s) in 0.0070 seconds

hbase(main):016:0> put 'pyspark', '2', 'cf2:b','python'
0 row(s) in 0.0080 seconds

hbase(main):017:0> scan 'pyspark'
ROW                                           COLUMN+CELL
 1                                            column=cf1:a, timestamp=1498758639265, value=spark
 1                                            column=cf2:b, timestamp=1498758656282, value=pyspark
 2                                            column=cf1:a, timestamp=1498758678501, value=df
 2                                            column=cf2:b, timestamp=1498758690263, value=python

Then, in the pyspark shell, I did the following:

pyspark = (sqlContext.read.format('org.apache.hadoop.hbase.spark')
           .option('hbase.table', 'pyspark')
           .option('hbase.columns.mapping',
                   'KEY_FIELD STRING :key, A STRING cf1:a, B STRING cf1:b, A STRING cf2:a, B STRING cf2:b')
           .option('hbase.use.hbase.context', False)
           .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml')
           .load())

Then I ran:

pyspark.show()

This gave me the result:

 

+---------+----+-------+
|KEY_FIELD|   A|      B|
+---------+----+-------+
|        1|null|pyspark|
|        2|null| python|
+---------+----+-------+

Now my questions:

1) Why am I getting null values in column A of the DataFrame?

2) Do we have to manually pass the column family and column names in the hbase.columns.mapping option when creating the DataFrame?

3) Or is there a more generic way to do this?
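
My current guess on 1), which I haven't verified, is that each field name on the left of hbase.columns.mapping has to be unique, and my mapping reuses A and B across cf1 and cf2. A sketch that only maps the columns that were actually written (cf1:a and cf2:b) would look like this:

# Hedged, untested sketch; same options as above, just a mapping without duplicate field names.
df = (sqlContext.read.format('org.apache.hadoop.hbase.spark')
      .option('hbase.table', 'pyspark')
      .option('hbase.columns.mapping', 'KEY_FIELD STRING :key, A STRING cf1:a, B STRING cf2:b')
      .option('hbase.use.hbase.context', False)
      .option('hbase.config.resources', 'file:///etc/hbase/conf/hbase-site.xml')
      .load())
df.show()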

Explorer

@Harsh J

 

Is there any documentation available on connecting to HBase from PySpark?

I would like to know how we can create DataFrames and read from and write to HBase from PySpark.

Any documentation links would be good enough for me to look through and explore HBase-PySpark integration.

New Contributor

Thanks for sharing this walkthrough!

@Harsh J  Can you help me?

I can't get this HBase-PySpark connection to work with Cloudera CDH 6.1.1. I get the message: "An error occurred while calling o70.load.: java.lang.ClassNotFoundException: Failed to find data source: org.apache.hadoop.hbase.spark. Please find packages at http://spark.apache.org/third-party-projects.html"

 

Thank you so much!

 

New Contributor

Hi,

Can anyone provide the command to run Spark (spark-submit, for example) with the connector?

I get the error "Failed to find data source: org.apache.hadoop.hbase.spark".
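
From what I can tell (a hedged guess, not confirmed anywhere in this thread), the error means the hbase-spark connector jar isn't on the Spark classpath, so it has to be handed to spark-submit explicitly. The jar path below is only illustrative and depends on the CDH parcel layout:

# Hedged sketch; adjust the jar path to wherever hbase-spark ships in your install.
spark-submit \
  --jars /opt/cloudera/parcels/CDH/lib/hbase/lib/hbase-spark.jar \
  --files /etc/hbase/conf/hbase-site.xml \
  your_job.py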

 

 

New Contributor

Hello, @amirmam

Did you manage to solve it? I have the same problem with the current version of CDH 6.1.1.

Thanks!
