Created on 10-03-2021 04:20 PM - edited on 10-04-2021 09:21 PM by subratadas
Accessing Kudu tables from Spark on the Cloudera Data Platform (CDP) is a common integration pattern for real-time analytic workloads that require fast inserts and updates while also enabling efficient columnar scans across a single storage layer.
This article describes Kudu integration for Spark jobs running in the Cloudera Data Engineering (CDE) Service by using the Kudu storage defined as part of a CDP Datahub cluster.
We'll first configure the Kudu cluster. This can be done by creating a Datahub cluster of type Real-time Data Mart:
Once the cluster has been provisioned successfully, you may need to add your IP address to the firewall rules for the cluster hosts in order to access some of the cluster's interfaces. For example, with Amazon Web Services (AWS), a link to the EC2 instance configuration page is provided on the Datahub cluster page (on the EC2 instance page, locate the Security configuration and edit Security Group > Inbound Rules):
Next, we'll set up a small Kudu table for testing. An easy way to do this is from the Hue Web UI (a link to this is provided under the Datahub cluster Services section, shown above). Once logged into the Hue Impala Editor, run the following SQL to create the table and insert one record:
-- Note: the column list was missing here; the names and types other than k
-- are illustrative (inferred from the INSERT below) -- adjust to your schema.
CREATE TABLE IF NOT EXISTS default.cde_kudu_table (
  k BIGINT,
  s STRING,
  v BIGINT,
  PRIMARY KEY (k)
)
PARTITION BY HASH (k) PARTITIONS 2
STORED AS KUDU;
INSERT INTO default.cde_kudu_table VALUES (1, 'aaa', 111);
CDE Spark Job
Next, we'll interact with the Kudu table that was just created from a CDE Spark job.
Copy the three Datahub (Kudu) cluster master node hostnames (FQDNs) under the Hardware tab, as shown in the example below:
Next, edit the PySpark kudu_master variable in the sample code below by replacing the <<hostnames>> with the master node FQDNs noted in the previous step, and then save the file as cde_kudu.py:
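Building the kudu_master value can be sketched as follows. This is a minimal illustration, not the article's full cde_kudu.py: the FQDNs are placeholders for the master node hostnames copied in the previous step, and appending Kudu's default master RPC port 7051 is an assumption (the connector also accepts bare hostnames when the masters listen on the default port).

```python
# Hypothetical master node FQDNs -- replace with the three values
# copied from the Datahub cluster's Hardware tab.
master_fqdns = [
    "kudu-rdm-master0.example.cloudera.site",
    "kudu-rdm-master1.example.cloudera.site",
    "kudu-rdm-master2.example.cloudera.site",
]

# Kudu expects a single comma-separated list of masters; 7051 is the
# default Kudu master RPC port (assumed here for explicitness).
kudu_master = ",".join(h + ":7051" for h in master_fqdns)
print(kudu_master)
```

In the Spark job itself, this string is what the kudu-spark connector consumes, typically via .option("kudu.master", kudu_master) alongside .option("kudu.table", "default.cde_kudu_table") when reading or writing the table.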