This article describes the steps to access HBase data on a remote CDP cluster from another CDP cluster using Spark, in a Kerberized environment.

Assume we have two clusters, 'Cluster_A' and 'Cluster_B'. 'Cluster_A' holds the HBase data, and 'Cluster_B' runs Spark. We want to access the HBase data in 'Cluster_A' using Spark from 'Cluster_B'.

Prerequisites:

Both clusters, 'Cluster_A' and 'Cluster_B', must belong to the same Kerberos REALM, and keytabs must be available on both.

Follow these steps: 

  1. Log in to the 'Cluster_A' edge node
  2. Obtain a Kerberos ticket using kinit:
    kinit -kt <key_tab_file> <principal_name>
  3. Log in to the HBase shell and create a 'person' table (a programmatic alternative using the HBase client API is sketched after these steps):
    ranga]# hbase shell
    
    hbase(main):001:0> list
    TABLE
    0 row(s)
    
    hbase(main):013:0> create 'person', 'p'
    Created table person
    Took 8.2307 seconds
    => Hbase::Table - person
    
    hbase(main):014:0> put 'person',1,'p:id','1'
    Took 0.0173 seconds
    hbase(main):015:0> put 'person',1,'p:name','Ranga Reddy'
    Took 0.0045 seconds
    hbase(main):016:0> put 'person',1,'p:email','ranga@gmail.com'
    Took 0.0043 seconds
    hbase(main):017:0> put 'person',1,'p:age','25'
    Took 0.0049 seconds
    hbase(main):018:0> scan 'person'
    ROW                                                              COLUMN+CELL
     1                                                               column=p:age, timestamp=1616425683759, value=25
     1                                                               column=p:email, timestamp=1616425681754, value=ranga@gmail.com
     1                                                               column=p:id, timestamp=1616425681717, value=1
     1                                                               column=p:name, timestamp=1616425681736, value=Ranga Reddy
    
    hbase(main):019:0> exit
  4. Copy the hbase-site.xml from 'Cluster_A' to 'Cluster_B':
    scp /etc/hbase/conf/hbase-site.xml root@cluster_b_ipaddress:/tmp
  5. Log in to the 'Cluster_B' edge node
  6. Obtain a Kerberos ticket using kinit:
    kinit -kt <key_tab_file> <principal_name>
  7. Place the hbase-site.xml copied in Step 4 into a temporary configuration directory, for example /tmp/hbase/conf:
    mkdir -p /tmp/hbase/conf
    cp /tmp/hbase-site.xml /tmp/hbase/conf
  8. Launch the Spark shell, providing the HBase Spark connector jars and the HBase configuration directory (/tmp/hbase/conf):
    spark-shell \
    --master yarn \
    --conf spark.driver.extraClassPath=/tmp/hbase/conf \
    --conf spark.executor.extraClassPath=/tmp/hbase/conf \
    --jars /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-1.0.0*.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-protocol-shaded-*.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-mapreduce.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-shaded-miscellaneous.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-shaded-protobuf.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol-shaded.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-shaded-netty.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-zookeeper.jar,\
    /opt/cloudera/parcels/CDH/jars/htrace-core-3.1.0-incubating.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-annotations.jar
  9. Run the following code after launching the spark-shell. Note that each column is mapped as 'column TYPE family:qualifier', with ':key' marking the row key (two further usage sketches, querying with Spark SQL and writing back, follow after these steps):
    val df = spark.read.format("org.apache.hadoop.hbase.spark").option("hbase.columns.mapping", "id STRING :key, name STRING p:name, email STRING p:email, age STRING p:age").option("hbase.table", "person").option("hbase.spark.use.hbasecontext", false).load()
    df.show(truncate=false)
    The output of the above command:
    scala> df.show(truncate=false)
    +---+-----------+---------------+---+
    |age|name       |email          |id |
    +---+-----------+---------------+---+
    |25 |Ranga Reddy|ranga@gmail.com|1  |
    +---+-----------+---------------+---+
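
As an alternative to the HBase shell in Step 3, the same table can be populated programmatically with the HBase client API. This is a minimal sketch, assuming the 'person' table already exists and that it runs on a host whose classpath contains 'Cluster_A''s hbase-site.xml and the hbase-client jars, with a valid Kerberos ticket from Step 2:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // HBaseConfiguration.create() picks up hbase-site.xml from the classpath.
    val conf = HBaseConfiguration.create()
    val conn = ConnectionFactory.createConnection(conf)
    val table = conn.getTable(TableName.valueOf("person"))

    // Same row as the shell example: row key '1', column family 'p'.
    val put = new Put(Bytes.toBytes("1"))
    put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("id"), Bytes.toBytes("1"))
    put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("name"), Bytes.toBytes("Ranga Reddy"))
    put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("email"), Bytes.toBytes("ranga@gmail.com"))
    put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("age"), Bytes.toBytes("25"))
    table.put(put)

    table.close()
    conn.close()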
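
Once the DataFrame from Step 9 is loaded, it behaves like any other Spark DataFrame. The following is a minimal sketch of querying it with Spark SQL, assuming the same spark-shell session and the 'df' created in Step 9:

    // Register the remote HBase table as a temporary view.
    df.createOrReplaceTempView("person")

    // 'age' is mapped as STRING in Step 9, so the comparison value is quoted.
    spark.sql("SELECT id, name, email FROM person WHERE age = '25'").show(truncate=false)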
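
The connector can also write data back to the remote table. This sketch appends a row to the same 'person' table under the same session assumptions, reusing the column mapping from Step 9; the row values here are purely illustrative:

    import spark.implicits._

    // Illustrative row; 'id' becomes the HBase row key via the ':key' mapping.
    val newPerson = Seq(("2", "John Doe", "john@example.com", "30")).toDF("id", "name", "email", "age")

    newPerson.write.format("org.apache.hadoop.hbase.spark")
      .option("hbase.columns.mapping", "id STRING :key, name STRING p:name, email STRING p:email, age STRING p:age")
      .option("hbase.table", "person")
      .option("hbase.spark.use.hbasecontext", false)
      .save()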

Thanks for reading this article. I hope you enjoyed it.
