Community Articles

Find and share helpful community-sourced technical articles.
avatar
Cloudera Employee

Overview

In this article I will show 3 different ways of connecting to Cloudera Data Lake using CML and R kernel. The first two approaches use Spark as the primary compute engine, while the last one uses JDBC connection directly. 

  1. Connect directly to Data Lake through Hive Warehouse Connector (HWC)
  2. Connect to Data Lake via Cloudera Data Warehouse (CDW) through HWC 
  3. Connect to Data Lake via Cloudera Data Warehouse (CDW) through JDBC driver (Note: this method is not recommended for reads larger than 1GB)

 

Prerequisites

1. Cloudera Public Cloud Environment

2. Data Lake

3. Cloudera Machine Learning

4. Cloudera Data Warehouse (Hive/Impala)

5. Workload Username and Password

6. Permission both on environment and datasets

7. JDBC Driver

 

 

 

Step-by-Step

 

Preparing CML

1. Prepare your workload username and password

2. Create project in CML with below configuration

Initial Setup: Git

Git URL: https://github.com/servaed/cml-r-connection-template.git

Runtime Setup: Basic

Kernel: R (3.x or 4.x)

eduardb_0-1667471428948.png

 

 

There will be 3 template files and 1 working JDBC Driver (per 31 October 2022).

 

1. connect-to-datalake.R

For connecting to attached SDX Data Lake using Spark HWC

 

#CONNECT TO DATA LAKE USING SPARK HWC


#Run this once
#install.packages("sparklyr")


#Set parameters
sparklyr_jars_default <- "/opt/spark/optional-lib/hive-warehouse-connector-assembly.jar"


#Add library
library(sparklyr)


#Add spark_config()
spark_config()
config <- spark_config()


#Add configuration
config$spark.security.credentials.hiveserver2.enabled="false"
config$spark.datasource.hive.warehouse.read.via.llap="false"
config$spark.sql.hive.hwc.execution.mode="spark"
#config$spark.datasource.hive.warehouse.read.jdbc.mode="spark"


# Required setting for HWC-direct reader - the hiveacid sqlextension does the automatic
# switch between reading through HWC (for managed tables) or spark-native (for external)
# depending on table type.
config$spark.sql.extensions="com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension"
config$spark.kryo.registrator="com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator"
config$sparklyr.jars.default <- sparklyr_jars_default

#Connect with spark
sc <- spark_connect(config = config)
src_databases(sc)


#Change database if necessary
spark_session(sc) %>% invoke("sql", "USE <sample_database>")


#Read and show table
intDf1 <- sparklyr::spark_read_table(sc, '<sample_table>')
sparklyr::sdf_collect(intDf1)

 

 

2. connect-to-hive-impala.R

For connecting to CDW/Hive/Impala using Spark HWC

 

#CONNECT TO CDW IMPALA / HIVE /UNIFIED ANALYTICS USING JDBC

#Git for JDBC driver file
#https://github.com/servaed/cml-r-connection-template.git

#Run this once
#install.packages("DBI")
#install.packages("rJava")
#install.packages("RJDBC")

#Set parameters
hive_jdbc_driver <- '/home/cdsw/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar'
hiveserver2_jdbc_url <- '<your_jdbc_url>'
workload_username <- '<your_workload_username>'
workload_password <- '<your_workload_password>'

#Add Library
library("DBI")
library("rJava")
library("RJDBC")


#Initialize Connection
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", hive_jdbc_driver, identifier.quote="`")

conn <- dbConnect(drv,hiveserver2_jdbc_url, workload_username, workload_password)

#working with the connection
show_databases <- dbGetQuery(conn, "show databases")
show_databases

select_query <- dbGetQuery(conn, "<sample_query>")
select_query

 

 

3. connect-to-hive-impala-jdbc.R

For connecting to CDW/Hive/Impala using JDBC

 

#CONNECT TO CDW IMPALA / HIVE /UNIFIED ANALYTICS USING SPARK HWC


#Run this once
#install.packages("sparklyr")


#Set parameters
sparklyr_jars_default <- '/opt/spark/optional-lib/hive-warehouse-connector-assembly.jar'
hiveserver2_jdbc_url <- '<your_jdbc_url>'
workload_username <- '<your_workload_username>'
workload_password <- '<your_workload_password>'


#Combine parameters
hiveserver2_full_jdbc_url <- paste(hiveserver2_jdbc_url,';user=', workload_username, ';password=', workload_password,sep='')
hiveserver2_full_jdbc_url


#Add library
library(sparklyr)


#Add spark_config()
spark_config()
config <- spark_config()


#Add configuration
config$spark.security.credentials.hiveserver2.enabled="false"
config$spark.datasource.hive.warehouse.read.via.llap="false"
config$spark.sql.hive.hwc.execution.mode="spark"
#config$spark.datasource.hive.warehouse.read.jdbc.mode="spark"
config$spark.sql.hive.hiveserver2.jdbc.url= hiveserver2_full_jdbc_url 


# Required setting for HWC-direct reader - the hiveacid sqlextension does the automatic
# switch between reading through HWC (for managed tables) or spark-native (for external)
# depending on table type.
config$spark.sql.extensions="com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension"
config$spark.kryo.registrator="com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator"
config$sparklyr.jars.default <- sparklyr_jars_default

sc <- spark_connect(config = config)

ss <- spark_session(sc)
hive <- invoke_static(sc,"com.hortonworks.hwc.HiveWarehouseSession","session",ss)%>%invoke("build")
df <- invoke(hive,"execute","<sample_query>")
sparklyr::sdf_collect(df)

 

 

4. You might need to download the latest version of JDBC driver from CDW if necessary

eduardb_2-1667471429042.png

 

 

Connect to attached SDX Data Lake using HWC

 

1. Open new session with spark enabled

eduardb_3-1667471428920.png

 

2. Open connect-to-datalake.R

eduardb_4-1667471429002.png

 

3. Install sparklyr (Code Line 5)

eduardb_5-1667471428989.png

 

4. Wait until completed

eduardb_6-1667471428885.png

 

5. Replace <samples> with your Database and Table names

eduardb_7-1667471428956.png

 

6. Run All

eduardb_8-1667471428917.png

 

 

Connect to CDW/Hive/Impala using Spark HWC

 

1. Open new session with spark enabled

eduardb_9-1667471428921.png

 

2. Open connect-to-hive-impala.R

eduardb_11-1667471429028.png

 

3. Get your JDBC URL from an existing CDW. Click on ⠇and choose Copy JDBC URL.

eduardb_12-1667471429006.png

 

4. Install sparklyr (Code Line 5)

eduardb_13-1667471428990.png

 

5. Wait until completed

eduardb_14-1667471428886.png

 

6. Change below parameters to the URL you copied in step 4 as well as your workload username and password. 

eduardb_15-1667471428919.png

 

7. Change sample query

eduardb_16-1667471428917.png

 

8. Run All

eduardb_17-1667471428918.png

 

 

Connect to CDW/Hive/Impala using JDBC

1. Open new session with spark enabled

eduardb_18-1667471428921.png

 

2. Open connect-to-hive-impala-jdbc.R

eduardb_19-1667471429033.png

 

3. Get your JDBC URL from an existing CDW. Click on ⠇and choose Copy JDBC URL.

eduardb_20-1667471429007.png

 

4. Install DBI, rJava, and RJDBC packages for R

eduardb_21-1667471428916.png

 

5. Wait until completed

eduardb_22-1667471428976.png

 

6. Change below parameters. Note: you might need to change the JDBC driver file if you’re downloading newer driver from CDW

eduardb_23-1667471429034.png

 

7. Change sample query

eduardb_24-1667471428897.png

 

8. Run All

eduardb_25-1667471428919.png

 

Conclusion

There are various ways that CML supports when accessing Data Lake with the R kernel. With many Data Scientists continuing to use R as their language of choice, I hope this article helps them to get started with CML quickly.

1,466 Views