Cloudera Employee

COD - CDE Spark-HBase tutorial

In this article, which relates to CDP Public Cloud, we walk through the steps required to read from and write to COD (Cloudera Operational Database) from Spark on CDE (Cloudera Data Engineering) using the spark-hbase connector.
If you are looking to leverage Phoenix instead, please refer to this community article.

Assumptions

  • COD is already provisioned and the database is created. Refer to this link for details.
    For this example, we will assume "amallegni-cod" as the database name.
  • CDE is already provisioned and the virtual cluster is already created. Refer to this link for details.

COD

Download client configuration

For Spark in CDE to talk to COD, it needs the hbase-site.xml configuration of the COD cluster. Follow these steps to retrieve it:

  1. Go to the COD control plane UI and click on "amallegni-cod" database.
  2. Under the Connect tab of the COD database, look for the HBase Client Configuration URL field, shown in the following screenshot.
    amallegni_0-1643821731964.png

    The configuration can be downloaded using the following curl command.

 

curl -f -o "hbase-config.zip" -u "<YOUR WORKLOAD USERNAME>" "https://cod--4wfxojpfxmwg-gateway.XXXXXXXXX.cloudera.site/clouderamanager/api/v41/clusters/cod--4wfxojpfxmwg/services/hbase/clientConfig"

 

 

  • When curl prompts for a password, enter your "workload" password.
  • Explore the downloaded zip file to obtain the hbase-site.xml file.
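
The extraction step above can be scripted. The following is a minimal sketch, not an official procedure: extract_hbase_site and check_hbase_site are hypothetical helper names, the layout of the archive is not assumed (the wildcard matches hbase-site.xml in any folder), and the check for hbase.zookeeper.quorum rests on the assumption that the COD client configuration defines that property.

```shell
# Pull hbase-site.xml out of the downloaded archive, wherever it sits inside.
extract_hbase_site() {
  zip_file="$1"
  dest_dir="$2"
  # -j drops the archive's directory structure, -o overwrites an older copy
  unzip -j -o "$zip_file" '*hbase-site.xml' -d "$dest_dir"
}

# Rough sanity check (assumption): the file exists and names a ZooKeeper quorum.
check_hbase_site() {
  site_file="$1"
  [ -f "$site_file" ] || { echo "not found: $site_file" >&2; return 1; }
  grep -qF 'hbase.zookeeper.quorum' "$site_file" || {
    echo "no hbase.zookeeper.quorum in $site_file" >&2
    return 1
  }
}
```

For example: extract_hbase_site hbase-config.zip . && check_hbase_site ./hbase-site.xml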

Create the HBase table

In this example, we use Hue from the COD database to quickly create a new HBase table inside our COD database. Let's walk through this step by step:

  1. Go to the COD control plane UI and click on "amallegni-cod" database.
  2. Click on the Hue link.

amallegni_4-1643801126730.png


Once in Hue, click on the HBase menu item on the left sidebar, and then click on the New Table button in the top right corner:

amallegni_5-1643801353420.png

 

Choose your table name and column families, and then click on the Submit button.
For the sake of this example, let's call the table 'testtable' and let's create a single column family called 'testcf'.
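
If you prefer a command line over Hue, the same table can in principle be created with an HBase shell statement. The sketch below only builds that statement; hbase_create_stmt is a hypothetical helper, and running the result assumes a host where the HBase client is installed and configured against the COD database, which this article does not cover.

```shell
# Build an HBase shell "create" statement for a table and its column families.
hbase_create_stmt() {
  table="$1"
  shift
  stmt="create '$table'"
  # Append each column family as a quoted, comma-separated argument
  for cf in "$@"; do
    stmt="$stmt, '$cf'"
  done
  printf '%s\n' "$stmt"
}
```

For example, hbase_create_stmt testtable testcf | hbase shell -n would feed the statement to the HBase shell in non-interactive mode.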

 

CDE

For complete Spark code examples, refer to the HbaseRead.scala and HBaseWrite.scala examples.

Configure your Job via CDE CLI


To configure the job via CDE CLI, perform these steps:

  1. Configure CDE CLI to point to the virtual cluster mentioned in the assumptions above. For more details, see Configuring the CLI client.
  2. Create a resource using the following command:
    cde resource create --name cod-spark-resource
  3. Upload hbase-site.xml:
    cde resource upload --name cod-spark-resource --local-path /your/path/to/hbase-site.xml --resource-path conf/hbase-site.xml
  4. Upload the demo app jar built from your Spark project:
    cde resource upload --name cod-spark-resource --local-path /path/to/your/spark-hbase-project.jar --resource-path spark-hbase-project.jar
  5. Create the CDE job using a JSON definition which should look like this:

 

{
  "mounts": [
    {
      "resourceName": "cod-spark-resource"
    }
  ],
  "name": "my-cod-spark-job",
  "spark": {
    "className": "<YOUR MAIN CLASS>",
    "conf": {
      "spark.executor.extraClassPath": "/app/mount/conf",
      "spark.driver.extraClassPath": "/app/mount/conf"
    },
    "args": [ "<YOUR ARGS IF ANY>"],
    "driverCores": 1,
    "driverMemory": "1g",
    "executorCores": 1,
    "executorMemory": "1g",
    "file": "spark-hbase-project.jar",
    "pyFiles": [],
    "files": ["conf/hbase-site.xml"],
    "numExecutors": 4
  }
}

 

Finally, assuming the above JSON was saved into my-job-definition.json, import the job using the following command:

 

cde job import --file my-job-definition.json
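
Before running the import, it can help to confirm that both classpath properties are spelled exactly, since the names are case-sensitive and a misspelled property is typically not reported as an error. A minimal sketch, where check_classpath_keys is a hypothetical helper and not part of the CDE CLI:

```shell
# Check that a job definition contains both classpath property names,
# spelled exactly (grep -F is a case-sensitive fixed-string match).
check_classpath_keys() {
  job_file="$1"
  status=0
  for key in spark.driver.extraClassPath spark.executor.extraClassPath; do
    # Match the key including its surrounding JSON quotes
    if ! grep -qF "\"$key\"" "$job_file"; then
      echo "missing or misspelled key: $key" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example: check_classpath_keys my-job-definition.json && cde job import --file my-job-definition.json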

 

Please note that spark.driver.extraClassPath and spark.executor.extraClassPath inside the job definition point to the directory (conf, mounted under /app/mount) to which we uploaded hbase-site.xml in our CDE resource.
This is important: hbase-site.xml is then loaded automatically from the classpath, so you do not need to refer to it explicitly in your Spark code. Your code only needs the following:

 

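import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
val conf = HBaseConfiguration.create() // reads hbase-site.xml from the classpath
val hbaseContext = new HBaseContext(spark.sparkContext, conf) // spark: your SparkSession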

 

If you prefer CDE UI

If you prefer using the UI instead, you'll have to take the following into account (please find the screenshot below):

  • In your CDE job configuration page, the hbase-site.xml should be uploaded under Advanced options > Other dependencies.
  • At the time of this article, it is not possible to specify a path for a file inside a CDE resource or for files uploaded under Other dependencies. For this reason, in your job definition you should use /app/mount as the value for the spark.driver.extraClassPath and spark.executor.extraClassPath properties.
    In your CDE job configuration page, set these properties inside the Configurations section.
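
The relationship between where hbase-site.xml lands in the resource and the classpath value to use can be sketched as a small helper. This is an illustration only: classpath_for_resource_path is a hypothetical name, and /app/mount is taken from this article as the mount root.

```shell
# Derive the extraClassPath value from the path a file has inside the
# CDE resource (the --resource-path used at upload time).
classpath_for_resource_path() {
  resource_path="$1"
  dir=$(dirname "$resource_path")
  if [ "$dir" = "." ]; then
    # File sits at the resource root, as with uploads through the UI
    printf '%s\n' "/app/mount"
  else
    printf '%s\n' "/app/mount/$dir"
  fi
}
```

With the CLI upload shown earlier, classpath_for_resource_path conf/hbase-site.xml yields /app/mount/conf; for a file uploaded through the UI, classpath_for_resource_path hbase-site.xml yields /app/mount.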

amallegni_7-1643803192348.png

Comments
Cloudera Employee

Please note that the last screenshot contains typos, while the article text does not.
The correct configuration properties' names are the following (please note camel case):

 

spark.driver.extraClassPath 

spark.executor.extraClassPath