This article explains how to use the Snowflake Connector for Spark in Cloudera Machine Learning (CML).
- Save your Snowflake password for your account.
Go to Account Settings > Environment Variables and create a new entry with Name set to "SNOW_PASSWORD" and Value set to your <password>. You can confirm that the variable is visible inside a session, as sketched below.
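This check is a minimal sketch, assuming the SNOW_PASSWORD name configured above; CML injects the variable into sessions started after it is saved.

```python
import os

# Fail early with a clear message if the variable was not picked up;
# it is injected by CML from Account Settings > Environment Variables.
assert os.environ.get("SNOW_PASSWORD"), "SNOW_PASSWORD is not set for this session"
```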
- Create a new session in your CML project.
Use any of the editors with the Python 3.7 kernel. You also need to enable Spark for this session and select the Spark 2.4.7 version.
- Download the required dependencies.
To initiate the connection, you need a Snowflake Connector for Spark and a Snowflake JDBC Driver that are compatible with CML's Spark version. You can get both from the Maven Central Repository: spark-connector and jdbc-driver. Place them in a ./jars folder in your CML project:

```
cdsw@ysezv3fm94edq4pb:~$ ls -l ./jars/
total 28512
-rw-r--r-- 1 cdsw cdsw 28167364 Mar 17 21:03 snowflake-jdbc-3.13.16.jar
-rw-r--r-- 1 cdsw cdsw  1027154 Jan 27 01:18 spark-snowflake_2.11-2.9.3-spark_2.4.jar
```
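If you would rather script the download than upload the jars manually, the sketch below fetches both artifacts. It assumes the standard Maven Central URL layout and the exact versions shown in the listing above; adjust the coordinates if your Spark version differs.

```python
import os
import urllib.request

# Assumed Maven Central URLs for the two jars listed above; verify the
# versions against your CML Spark version before running.
JAR_URLS = [
    "https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.13.16/snowflake-jdbc-3.13.16.jar",
    "https://repo1.maven.org/maven2/net/snowflake/spark-snowflake_2.11/2.9.3-spark_2.4/spark-snowflake_2.11-2.9.3-spark_2.4.jar",
]

os.makedirs("/home/cdsw/jars", exist_ok=True)
for url in JAR_URLS:
    target = os.path.join("/home/cdsw/jars", url.rsplit("/", 1)[-1])
    if not os.path.exists(target):  # skip jars that are already in place
        urllib.request.urlretrieve(url, target)
```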
- Initiate the Snowflake connection.
You need to set your Snowflake account ID and username for the connection. Your Snowflake password is retrieved from the environment variable that you configured in Account Settings.
I'm using the default Snowflake warehouse and a sample database/schema.

```python
import os
from pyspark.sql import SparkSession

# Connection options; the password comes from the SNOW_PASSWORD
# environment variable configured in Account Settings.
sfOptions = {
    "sfURL" : "<Account ID>.snowflakecomputing.com",
    "sfUser" : '<Username>',
    "sfPassword" : os.environ['SNOW_PASSWORD'],
    "sfDatabase" : 'SNOWFLAKE_SAMPLE_DATA',
    "sfSchema" : 'TPCH_SF1',
    "sfWarehouse" : 'COMPUTE_WH'
}

# Put the connector and JDBC driver jars on the Spark classpath.
snowJars = [os.path.join('/home/cdsw/jars', x) for x in os.listdir('/home/cdsw/jars')]

spark = SparkSession.builder \
    .appName("cml-spark-snowflake-test") \
    .config("spark.jars", ",".join(snowJars)) \
    .getOrCreate()

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

query = '''
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER limit 100
'''

# Run the query in Snowflake and load the result as a Spark DataFrame.
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", query) \
    .load()

df.show()
```
- Execute the code. df.show() prints the first rows of the query result from the CUSTOMER table.
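As a follow-up, the connector also accepts a "dbtable" option in place of "query". The sketch below (assuming the session from the previous step is still active) reads the whole CUSTOMER table and filters it in Spark:

```python
# Read a whole table via "dbtable" instead of an explicit query; simple
# filters on the DataFrame are pushed down to Snowflake where possible.
customers = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", "CUSTOMER") \
    .load()

customers.filter("C_NATIONKEY = 7").show()
```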