This article explains how to use the Snowflake Connector for Spark in Cloudera Machine Learning.

  1. Save your Snowflake password as an environment variable.
    Go to Account Settings > Environment Variables and create a new entry with Name set to "SNOW_PASSWORD" and Value set to your <password>.
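    To confirm that the variable is visible from a session, a quick check is enough. This is a minimal sketch; it assumes the session was launched after the variable was saved, since environment variables are only injected into newly started sessions.
    import os

    # The variable saved under Account Settings should appear in the session environment
    print('SNOW_PASSWORD' in os.environ)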
  2. Create a new session in your CML project.
    Use any of the editors with the Python 3.7 kernel. You also need to enable Spark for this session and select Spark version 2.4.7.
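    To verify that the session picked up the expected Spark version, you can print the PySpark version. This is a minimal sketch; it assumes the pyspark package is on the Python path once Spark is enabled for the session.
    import pyspark

    # Should print 2.4.7 if the Spark 2.4.7 add-on was selected
    print(pyspark.__version__)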
  3. Download required dependencies.
    To initiate the connection, you need the Snowflake Connector for Spark and the Snowflake JDBC driver, both in versions compatible with CML's Spark version. You can download them from the Maven Central Repository: spark-connector and jdbc-driver. Place the jars in a ./jars folder in your CML project.
    cdsw@ysezv3fm94edq4pb:~$ ls -l ./jars/
    total 28512
    -rw-r--r-- 1 cdsw cdsw 28167364 Mar 17 21:03 snowflake-jdbc-3.13.16.jar
    -rw-r--r-- 1 cdsw cdsw  1027154 Jan 27 01:18 spark-snowflake_2.11-2.9.3-spark_2.4.jar
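    If you prefer to fetch the jars from within a session rather than uploading them manually, a short download script works as well. This is a sketch; the Maven Central URLs below are assumptions derived from the standard repository layout, so verify the artifact names and versions before relying on them.
    import os
    import urllib.request

    # Assumed Maven Central URLs, following the <group>/<artifact>/<version> layout
    jars = [
        "https://repo1.maven.org/maven2/net/snowflake/spark-snowflake_2.11/2.9.3-spark_2.4/spark-snowflake_2.11-2.9.3-spark_2.4.jar",
        "https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.13.16/snowflake-jdbc-3.13.16.jar",
    ]

    os.makedirs('/home/cdsw/jars', exist_ok=True)
    for url in jars:
        target = os.path.join('/home/cdsw/jars', url.rsplit('/', 1)[-1])
        urllib.request.urlretrieve(url, target)
        print('Downloaded', target)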
  4. Initiate the Snowflake connection.
    You need to set your Snowflake Account ID and your username for the connection. Your Snowflake password is retrieved from the environment variable that you configured in the Account Settings. 
    I'm using the default Snowflake Warehouse and a sample database/schema.
    import os
    from pyspark.sql import SparkSession

    # Connection options; the password is read from the SNOW_PASSWORD
    # environment variable configured in step 1
    sfOptions = {
      "sfURL" : "<Account ID>.snowflakecomputing.com",
      "sfUser" : '<Username>',
      "sfPassword" : os.environ['SNOW_PASSWORD'],
      "sfDatabase" : 'SNOWFLAKE_SAMPLE_DATA',
      "sfSchema" : 'TPCH_SF1',
      "sfWarehouse" : 'COMPUTE_WH'
    }

    # Pick up the connector and JDBC driver jars downloaded in step 3
    snowJars = [os.path.join('/home/cdsw/jars', x) for x in os.listdir('/home/cdsw/jars')]

    # Start a Spark session with the Snowflake jars on the classpath
    spark = SparkSession.builder \
        .appName("cml-spark-snowflake-test") \
        .config("spark.jars", ",".join(snowJars)) \
        .getOrCreate()

    # Fully qualified name of the Snowflake Spark data source
    SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

    query = '''
    select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER limit 100
    '''

    # Execute the query in Snowflake and load the result as a Spark DataFrame
    df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
        .options(**sfOptions) \
        .option("query", query) \
        .load()

    df.show()
  5. Execute the code. df.show() prints the first 100 rows of the CUSTOMER table, confirming that the connection works.
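    Writing data back to Snowflake follows the same pattern. Here is a minimal sketch; it assumes df, sfOptions, and SNOWFLAKE_SOURCE_NAME from step 4, and that sfDatabase and sfSchema point to a database where your user has write access (SNOWFLAKE_SAMPLE_DATA itself is read-only). The table name TEST_CUSTOMER is hypothetical.
    # Write the DataFrame to a Snowflake table, replacing it if it exists
    df.write.format(SNOWFLAKE_SOURCE_NAME) \
        .options(**sfOptions) \
        .option("dbtable", "TEST_CUSTOMER") \
        .mode("overwrite") \
        .save()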