This article describes the steps to access HBase data on a remote CDP cluster from another CDP cluster using Spark in a Kerberized environment.

Assume we have two clusters, 'Cluster_A' and 'Cluster_B'. 'Cluster_A' holds the HBase data, and 'Cluster_B' runs Spark. The goal is to access the HBase data in 'Cluster_A' using Spark from 'Cluster_B'.

Prerequisites:

Both clusters, 'Cluster_A' and 'Cluster_B', along with their keytabs, must use the same Kerberos REALM.

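A quick way to check this prerequisite is to compare the default realm on both edge nodes (a minimal check, assuming the standard /etc/krb5.conf location):

    # Run on both the 'Cluster_A' and 'Cluster_B' edge nodes; the values must match.
    grep default_realm /etc/krb5.conf
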
Follow these steps: 

  1. Log in to the 'Cluster_A' edge node.
  2. Obtain the Kerberos ticket using kinit (you can verify the ticket with klist, as shown after these steps):
    kinit -kt <key_tab_file> <principal_name>
  3. Log in to the HBase shell and create the 'person' table:
    ranga]# hbase shell
    
    hbase(main):001:0> list
    TABLE
    0 row(s)
    
    hbase(main):013:0> create 'person', 'p'
    Created table person
    Took 8.2307 seconds
    => Hbase::Table - person
    
    hbase(main):014:0> put 'person',1,'p:id','1'
    Took 0.0173 seconds
    hbase(main):015:0> put 'person',1,'p:name','Ranga Reddy'
    Took 0.0045 seconds
    hbase(main):016:0> put 'person',1,'p:email','ranga@gmail.com'
    Took 0.0043 seconds
    hbase(main):017:0> put 'person',1,'p:age','25'
    Took 0.0049 seconds
    hbase(main):018:0> scan 'person'
    ROW   COLUMN+CELL
     1    column=p:age, timestamp=1616425683759, value=25
     1    column=p:email, timestamp=1616425681754, value=ranga@gmail.com
     1    column=p:id, timestamp=1616425681717, value=1
     1    column=p:name, timestamp=1616425681736, value=Ranga Reddy
    
    hbase(main):018:0> exit
  4. Copy the hbase-site.xml from 'Cluster_A' to 'Cluster_B':
    scp /etc/hbase/conf/hbase-site.xml root@cluster_b_ipaddress:/tmp
  5. Log in to the 'Cluster_B' edge node.
  6. Obtain the Kerberos ticket using kinit:
    kinit -kt <key_tab_file> <principal_name>
  7. Place the hbase-site.xml copied in Step 4 into a temporary configuration directory, for example /tmp/hbase/conf (a quick way to confirm the file points at 'Cluster_A' is shown after these steps):
    mkdir -p /tmp/hbase/conf
    cp /tmp/hbase-site.xml /tmp/hbase/conf
  8. Launch the Spark shell, providing the HBase Spark connector jars and the HBase configuration directory (/tmp/hbase/conf):
    spark-shell \
    --master yarn \
    --conf spark.driver.extraClassPath=/tmp/hbase/conf \
    --conf spark.executor.extraClassPath=/tmp/hbase/conf \
    --jars /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-1.0.0*.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-protocol-shaded-*.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-mapreduce.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-shaded-miscellaneous.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-shaded-protobuf.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol-shaded.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-shaded-netty.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-zookeeper.jar,\
    /opt/cloudera/parcels/CDH/jars/htrace-core-3.1.0-incubating.jar,\
    /opt/cloudera/parcels/CDH/lib/hbase/hbase-annotations.jar
  9. Run the following code after launching the spark-shell (a scripted, non-interactive variant is shown after these steps):
    val df = spark.read.format("org.apache.hadoop.hbase.spark").option("hbase.columns.mapping", "id STRING :key, name STRING p:name, email STRING p:email, age STRING p:age").option("hbase.table", "person").option("hbase.spark.use.hbasecontext", false).load()
    df.show(truncate=false)
    Expected output:
    scala> df.show(truncate=false)
    +---+-----------+---------------+---+
    |age|name       |email          |id |
    +---+-----------+---------------+---+
    |25 |Ranga Reddy|ranga@gmail.com|1  |
    +---+-----------+---------------+---+
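
To verify the ticket obtained with kinit in Steps 2 and 6, list the credentials cache with klist (part of the standard Kerberos client tools):

    # Shows the cache location, the principal, and the ticket validity times.
    klist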
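
To confirm that the hbase-site.xml placed in Step 7 actually points at 'Cluster_A', check the ZooKeeper quorum it references:

    # The listed hosts should be the 'Cluster_A' ZooKeeper nodes, not 'Cluster_B'.
    grep -A1 hbase.zookeeper.quorum /tmp/hbase/conf/hbase-site.xml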
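
To run the Step 9 snippet non-interactively, a minimal sketch is to save it to a file, for example /tmp/hbase_read.scala (a hypothetical path), and preload it with the spark-shell -i option, reusing the same --conf and --jars options as in Step 8:

    spark-shell \
    --master yarn \
    --conf spark.driver.extraClassPath=/tmp/hbase/conf \
    --conf spark.executor.extraClassPath=/tmp/hbase/conf \
    --jars <same_jar_list_as_in_step_8> \
    -i /tmp/hbase_read.scala

Note that spark-shell stays in the REPL after the script runs; type :quit to exit.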

Thanks for reading this article. I hope you have enjoyed it.
