This article explains how to use the Snowflake Connector for Spark in Cloudera Machine Learning (CML).
- Save your Snowflake password for your account.
Go to Account Settings > Environment Variables and create a new entry with Name set to "SNOW_PASSWORD" and Value set to your <password>. You can confirm that the variable is visible inside a session, as sketched below.
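This check is a minimal sketch, assuming the SNOW_PASSWORD name configured above; CML injects the variable into sessions started after it is saved.

```python
import os

# Fail early with a clear message if the variable was not picked up;
# it is injected by CML from Account Settings > Environment Variables.
assert os.environ.get("SNOW_PASSWORD"), "SNOW_PASSWORD is not set for this session"
```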
- Create a new session in your CML project.
Use any of the editors with the Python 3.7 kernel. You also need to enable Spark for this session and select the Spark 2.4.7 version.
- Download the required dependencies.
To initiate the connection, you need a Snowflake Connector for Spark and a Snowflake JDBC Driver that are compatible with CML's Spark version. You can get both from the Maven Central Repository: spark-connector and jdbc-driver. Place them in a ./jars folder in your CML project:

```
cdsw@ysezv3fm94edq4pb:~$ ls -l ./jars/
total 28512
-rw-r--r-- 1 cdsw cdsw 28167364 Mar 17 21:03 snowflake-jdbc-3.13.16.jar
-rw-r--r-- 1 cdsw cdsw  1027154 Jan 27 01:18 spark-snowflake_2.11-2.9.3-spark_2.4.jar
```
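If you would rather script the download than upload the jars manually, the sketch below fetches both artifacts. It assumes the standard Maven Central URL layout and the exact versions shown in the listing above; adjust the coordinates if your Spark version differs.

```python
import os
import urllib.request

# Assumed Maven Central URLs for the two jars listed above; verify the
# versions against your CML Spark version before running.
JAR_URLS = [
    "https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.13.16/snowflake-jdbc-3.13.16.jar",
    "https://repo1.maven.org/maven2/net/snowflake/spark-snowflake_2.11/2.9.3-spark_2.4/spark-snowflake_2.11-2.9.3-spark_2.4.jar",
]

os.makedirs("/home/cdsw/jars", exist_ok=True)
for url in JAR_URLS:
    target = os.path.join("/home/cdsw/jars", url.rsplit("/", 1)[-1])
    if not os.path.exists(target):  # skip jars that are already in place
        urllib.request.urlretrieve(url, target)
```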
- Initiate the Snowflake connection.
You need to set your Snowflake account ID and username for the connection. Your Snowflake password is retrieved from the environment variable that you configured in Account Settings.
I'm using the default Snowflake warehouse and a sample database/schema.

```python
import os
from pyspark.sql import SparkSession

# Connection options; the password comes from the SNOW_PASSWORD
# environment variable configured in Account Settings.
sfOptions = {
    "sfURL" : "<Account ID>.snowflakecomputing.com",
    "sfUser" : '<Username>',
    "sfPassword" : os.environ['SNOW_PASSWORD'],
    "sfDatabase" : 'SNOWFLAKE_SAMPLE_DATA',
    "sfSchema" : 'TPCH_SF1',
    "sfWarehouse" : 'COMPUTE_WH'
}

# Put the connector and JDBC driver jars on the Spark classpath.
snowJars = [os.path.join('/home/cdsw/jars', x) for x in os.listdir('/home/cdsw/jars')]

spark = SparkSession.builder \
    .appName("cml-spark-snowflake-test") \
    .config("spark.jars", ",".join(snowJars)) \
    .getOrCreate()

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

query = '''
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER limit 100
'''

# Run the query in Snowflake and load the result as a Spark DataFrame.
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", query) \
    .load()

df.show()
```
- Execute the code. df.show() prints the first rows of the query result from the CUSTOMER table.
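As a follow-up, the connector also accepts a "dbtable" option in place of "query". The sketch below (assuming the session from the previous step is still active) reads the whole CUSTOMER table and filters it in Spark:

```python
# Read a whole table via "dbtable" instead of an explicit query; simple
# filters on the DataFrame are pushed down to Snowflake where possible.
customers = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", "CUSTOMER") \
    .load()

customers.filter("C_NATIONKEY = 7").show()
```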