This article describes the steps to access HBase data in a remote CDP cluster from another CDP cluster using Spark, in a Kerberized environment.
Assume we have two clusters, 'Cluster_A' and 'Cluster_B'. 'Cluster_A' hosts the HBase data, and 'Cluster_B' runs Spark. We want to access the HBase data in 'Cluster_A' using Spark from 'Cluster_B'.
Prerequisite: both clusters, 'Cluster_A' and 'Cluster_B', must be in the same Kerberos REALM, and a keytab is needed on each cluster.
Follow these steps:
Step 1: On 'Cluster_A', authenticate to Kerberos using the keytab:

kinit -kt <key_tab_file> <principal_name>
Step 2: On 'Cluster_A', launch the HBase shell, create a 'person' table, and insert some sample data:

hbase shell
hbase(main):001:0> list
TABLE
0 row(s)
hbase(main):013:0> create 'person', 'p'
Created table person
Took 8.2307 seconds
=> Hbase::Table - person
hbase(main):014:0> put 'person',1,'p:id','1'
Took 0.0173 seconds
hbase(main):015:0> put 'person',1,'p:name','Ranga Reddy'
Took 0.0045 seconds
hbase(main):016:0> put 'person',1,'p:email','ranga@gmail.com'
Took 0.0043 seconds
hbase(main):017:0> put 'person',1,'p:age','25'
Took 0.0049 seconds
hbase(main):018:0> scan 'person'
ROW          COLUMN+CELL
 1           column=p:age, timestamp=1616425683759, value=25
 1           column=p:email, timestamp=1616425681754, value=ranga@gmail.com
 1           column=p:id, timestamp=1616425681717, value=1
 1           column=p:name, timestamp=1616425681736, value=Ranga Reddy
hbase(main):018:0> exit
Step 3: Copy the hbase-site.xml from 'Cluster_A' to 'Cluster_B':

scp /etc/hbase/conf/hbase-site.xml root@cluster_b_ipaddress:/tmp
Step 4: On 'Cluster_B', authenticate to Kerberos using the keytab, and place the copied hbase-site.xml in a configuration directory:

kinit -kt <key_tab_file> <principal_name>
mkdir -p /tmp/hbase/conf
cp /tmp/hbase-site.xml /tmp/hbase/conf
Step 5: On 'Cluster_B', launch spark-shell with the 'Cluster_A' HBase configuration on the driver and executor classpaths, and with the HBase-Spark connector and HBase client jars:

spark-shell \
--master yarn \
--conf spark.driver.extraClassPath=/tmp/hbase/conf \
--conf spark.executor.extraClassPath=/tmp/hbase/conf \
--jars /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-1.0.0*.jar,\
/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-protocol-shaded-*.jar,\
/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar,\
/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar,\
/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar,\
/opt/cloudera/parcels/CDH/lib/hbase/hbase-mapreduce.jar,\
/opt/cloudera/parcels/CDH/lib/hbase/hbase-shaded-miscellaneous.jar,\
/opt/cloudera/parcels/CDH/lib/hbase/hbase-shaded-protobuf.jar,\
/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol-shaded.jar,\
/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar,\
/opt/cloudera/parcels/CDH/lib/hbase/hbase-shaded-netty.jar,\
/opt/cloudera/parcels/CDH/lib/hbase/hbase-zookeeper.jar,\
/opt/cloudera/parcels/CDH/jars/htrace-core-3.1.0-incubating.jar,\
/opt/cloudera/parcels/CDH/lib/hbase/hbase-annotations.jar
Step 6: In the spark-shell, read the HBase table as a DataFrame. Each column is mapped as 'column_name TYPE family:qualifier', and the row key is mapped with ':key':

val df = spark.read.format("org.apache.hadoop.hbase.spark").
  option("hbase.columns.mapping", "id STRING :key, name STRING p:name, email STRING p:email, age STRING p:age").
  option("hbase.table", "person").
  option("hbase.spark.use.hbasecontext", false).
  load()
df.show(truncate=false)
The output of the above command:
+---+-----------+---------------+---+
|age|name       |email          |id |
+---+-----------+---------------+---+
|25 |Ranga Reddy|ranga@gmail.com|1  |
+---+-----------+---------------+---+
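Once loaded, the DataFrame behaves like any other Spark DataFrame. As a minimal sketch (the view name 'person_view' and the new row values below are illustrative, not part of the original example), you can register it as a temporary view and query it with Spark SQL, or write new rows back to the remote table through the same connector:

// Query the remote HBase data with Spark SQL via a temporary view.
// 'person_view' is an illustrative name, not from the original article.
df.createOrReplaceTempView("person_view")
spark.sql("SELECT name, email FROM person_view WHERE age = '25'").show(truncate=false)

// Write a new row back to the remote 'person' table using the same
// column mapping. The sample values here are illustrative.
import spark.implicits._
val newPerson = Seq(("2", "Nani", "nani@gmail.com", "30")).toDF("id", "name", "email", "age")
newPerson.write.format("org.apache.hadoop.hbase.spark").
  option("hbase.columns.mapping", "id STRING :key, name STRING p:name, email STRING p:email, age STRING p:age").
  option("hbase.table", "person").
  option("hbase.spark.use.hbasecontext", false).
  save()

Note that the mapping declares every column as STRING, so the WHERE clause above compares string values rather than numbers.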
Thanks for reading this article. I hope you have enjoyed it.