About pauldefusco

pauldefusco · ‎01-13-2025

JupyterLab is a powerful, web-based interactive development environment widely used for data science, machine learning, and Spark application development. It extends the classic Jupyter Notebook interface, offering a more flexible and integrated workspace. For Spark application development, JupyterLab provides the following advantages:

1. Interactive Exploration: Run Spark queries and visualize results interactively, making it ideal for data exploration and debugging.
2. Rich Visualization: Seamlessly integrate with Python visualization libraries like Matplotlib, Seaborn, and Plotly to analyze and interpret Spark data.
3. Ease of Integration: Use PySpark or Sparkmagic to connect JupyterLab with Spark clusters, enabling distributed data processing from the notebook environment.

Spark Connect is a feature introduced in Apache Spark 3.4 that provides a standardized, client-server architecture for connecting to Spark clusters. It decouples the client from the Spark runtime, allowing users to interact with Spark through lightweight, language-specific clients without the need to run a full Spark environment on the client side. With Spark Connect, users can:

1. Access Spark Remotely: Connect to a Spark cluster from various environments, including local machines or web applications.
2. Support Multiple Languages: Use Spark with Python, Scala, Java, SQL, and other languages through dedicated APIs.
3. Simplify Development: Develop and test Spark applications without needing a full Spark installation, making it easier for developers and data scientists to work with distributed data processing.

This architecture enhances usability, scalability, and flexibility, making Spark more accessible to a wider range of users and environments.

In this article you will use JupyterLab locally to interactively prototype a PySpark and Iceberg application in a dedicated Spark Virtual Cluster running in Cloudera Data Engineering on AWS.
Prerequisites

- A CDE Service on version 1.23 or above, with a Virtual Cluster running Spark 3.5.1.
- A local installation of the CDE CLI on version 1.23 or above.
- A local installation of JupyterLab. Version 4.0.7 was used for this demonstration, but other versions should work as well.
- A local installation of Python. Version 3.9.12 was used for this demonstration, but other versions should work as well.

1. Launch a CDE Spark Connect Session

Create a CDE Files Resource and upload the CSV files.

    cde resource create \
      --name telcoFiles \
      --type files

    cde resource upload \
      --name telcoFiles \
      --local-path resources/cell_towers_1.csv \
      --local-path resources/cell_towers_2.csv

Start a CDE Session of type Spark Connect. Edit the Session Name parameter so it doesn't collide with other users' sessions.

    cde session create \
      --name spark-connect-session-res \
      --type spark-connect \
      --num-executors 2 \
      --driver-cores 2 \
      --driver-memory "2g" \
      --executor-cores 2 \
      --executor-memory "2g" \
      --mount-1-resource telcoFiles

In the Sessions UI, validate that the Session is Running.

2. Install Spark Connect Prerequisites

Download the cdeconnect and PySpark packages from the CDE Session Configuration tab and place them in your project home folder.

From the terminal, create a new Python Virtual Environment:

    python -m venv spark_connect_jupyter
    source spark_connect_jupyter/bin/activate

Install the following packages. These exact versions were used with Python 3.9; the NumPy, cmake, and PyArrow versions may need to change depending on your Python version.

    pip install numpy==1.26.4
    pip install --upgrade cmake
    pip install pyarrow==14.0.0
    pip install cdeconnect.tar.gz
    pip install pyspark-3.5.1.tar.gz

Install JupyterLab and launch the server:

    pip install jupyterlab
    jupyter lab

3. Run Your First PySpark & Iceberg Application via Spark Connect

You are now ready to connect to the CDE Session from your local JupyterLab instance using Spark Connect.
In the first cell, edit the sessionName option and add your session name from the CLI Create Session command above. In the second cell, edit your username. Now run each cell and observe outputs.
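The notebook itself is not reproduced here, so the following is only a minimal sketch of what the first cells might look like. It assumes that the cdeconnect package installed in step 2 exposes a `CDESparkConnectSession` builder in a module named `cde` (check the package you downloaded), and that the mounted Files Resource is visible under `/app/mount`. The names `get_spark`, `user_table`, `write_and_count`, and the `user001` username are hypothetical.

```python
SESSION_NAME = "spark-connect-session-res"  # the --name used in step 1
USERNAME = "user001"                        # hypothetical; replace with your username


def get_spark(session_name):
    """Return a Spark Connect session bound to a running CDE Session."""
    # Assumption: this is the entry point shipped in cdeconnect.tar.gz.
    # Imported lazily so the rest of the cell loads without the package.
    from cde import CDESparkConnectSession

    return CDESparkConnectSession.builder.sessionName(session_name).get()


def user_table(username, base="cell_towers"):
    """Build a per-user table name so users sharing a database don't collide."""
    return f"{base}_{username}".lower()


def write_and_count(spark, table, csv_path="/app/mount/cell_towers_1.csv"):
    """Load a CSV from the mounted resource into an Iceberg table, return its row count."""
    # Assumption: the Files Resource mounted in step 1 appears under /app/mount.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.writeTo(table).using("iceberg").createOrReplace()
    return spark.sql(f"SELECT COUNT(*) AS n FROM {table}").collect()[0]["n"]
```

In a notebook you would then run `spark = get_spark(SESSION_NAME)` followed by `write_and_count(spark, user_table(USERNAME))`. Only the thin client runs locally; the read and write execute in the CDE Session.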

pauldefusco · ‎01-13-2025

PyCharm is a popular integrated development environment (IDE) for Python, developed by JetBrains. It provides a comprehensive set of tools designed to boost productivity and simplify the coding process. When working with Apache Spark, PyCharm can be an excellent choice for managing Python-based Spark applications. Its features include:

- Code Editing and Debugging: PyCharm offers intelligent code completion, syntax highlighting, and robust debugging tools that simplify Spark application development.
- Virtual Environments and Dependency Management: PyCharm makes it easy to configure Python environments with Spark libraries and manage dependencies.
- Notebook Support: With built-in support for Jupyter Notebooks, PyCharm allows you to work interactively with data, making it easier to visualize and debug Spark pipelines.
- Version Control: PyCharm integrates with Git and other version control systems, simplifying collaboration and project management.

Spark Connect is a feature introduced in Apache Spark 3.4 that provides a standardized, client-server architecture for connecting to Spark clusters. It decouples the client from the Spark runtime, allowing users to interact with Spark through lightweight, language-specific clients without the need to run a full Spark environment on the client side. With Spark Connect, users can:

- Access Spark Remotely: Connect to a Spark cluster from various environments, including local machines or web applications.
- Support Multiple Languages: Use Spark with Python, Scala, Java, SQL, and other languages through dedicated APIs.
- Simplify Development: Develop and test Spark applications without needing a full Spark installation, making it easier for developers and data scientists to work with distributed data processing.

This architecture enhances usability, scalability, and flexibility, making Spark more accessible to a wider range of users and environments.
In this article you will learn how to use PyCharm locally to interactively prototype your code in a dedicated Spark Virtual Cluster running in Cloudera Data Engineering on AWS.

Prerequisites

- A CDE Service on version 1.23 or above, with a Virtual Cluster running Spark 3.5.1.
- A local installation of the CDE CLI on version 1.23 or above.
- A local installation of PyCharm. Version 4.0.7 was used for this demonstration, but other versions should work as well.
- A local installation of Python. Version 3.9.12 was used for this demonstration, but other versions should work as well.

1. Launch a CDE Spark Connect Session

Start a CDE Session of type Spark Connect. Edit the Session Name parameter so it doesn't collide with other users' sessions.

    cde session create \
      --name pycharm-session \
      --type spark-connect \
      --num-executors 2 \
      --driver-cores 2 \
      --driver-memory "2g" \
      --executor-cores 2 \
      --executor-memory "2g"

In the Sessions UI, validate that the Session is Running.

2. Install Spark Connect Prerequisites

Create a new Project and Python Virtual Environment in PyCharm.

In the PyCharm terminal, install the following packages. These exact versions were used with Python 3.9; the NumPy, cmake, and PyArrow versions may need to change depending on your Python version.

    pip install numpy==1.26.4
    pip install --upgrade cmake
    pip install pyarrow==14.0.0
    pip install cdeconnect.tar.gz
    pip install pyspark-3.5.1.tar.gz

3. Run Your First PySpark & Iceberg Application via Spark Connect

You are now ready to connect to the CDE Session from your local IDE using Spark Connect. Open "prototype.py" and make the following changes:

- At line 46, edit the "sessionName" parameter with your Session Name from the above CLI command.
- At line 48, edit the "storageLocation" parameter with your S3 bucket prefix.
- At line 49, edit the "username" parameter with your assigned username.

Now run "prototype.py" and observe the outputs.
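If several users follow this walkthrough against the same Virtual Cluster, the session launch from step 1 can be scripted so that each user gets a distinct session name. The sketch below only assembles the argument list, using the same flags as the CLI command above. The helper names `unique_name` and `session_create_cmd` and the `user001` suffix are hypothetical.

```python
import shlex


def unique_name(base, username):
    """Append the username so two users don't create sessions with the same name."""
    return f"{base}-{username}"


def session_create_cmd(name, executors=2, cores=2, memory="2g", resource=None):
    """Assemble the `cde session create` arguments for a Spark Connect session."""
    cmd = [
        "cde", "session", "create",
        "--name", name,
        "--type", "spark-connect",
        "--num-executors", str(executors),
        "--driver-cores", str(cores),
        "--driver-memory", memory,
        "--executor-cores", str(cores),
        "--executor-memory", memory,
    ]
    if resource:
        # Optional Files Resource to mount into the session.
        cmd += ["--mount-1-resource", resource]
    return cmd


cmd = session_create_cmd(unique_name("pycharm-session", "user001"))
print(shlex.join(cmd))
```

With the CDE CLI installed and configured, the list can be handed to `subprocess.run(cmd, check=True)`. Building a list rather than a single string avoids shell-quoting problems.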

pauldefusco · ‎01-06-2025

You can find the docs for the Python client and the API by going to User Settings -> API Keys.

From the Python CML APIv2 docs, these are the two methods you need:

Step 1: Create Job

    import cmlapi
    from cmlapi.rest import ApiException
    from pprint import pprint

    # Create an instance of the API class
    api_instance = cmlapi.CMLServiceApi()
    body = cmlapi.CreateJobRequest()  # CreateJobRequest
    project_id = 'project_id_example'  # str | ID of the project containing the job.

    try:
        # Create a new job.
        api_response = api_instance.create_job(body, project_id)
        pprint(api_response)
    except ApiException as e:
        print(f"Exception when calling CMLServiceApi->create_job: {e}\n")

Step 2: Run the Job

    import cmlapi
    from cmlapi.rest import ApiException
    from pprint import pprint

    # Create an instance of the API class
    api_instance = cmlapi.CMLServiceApi()
    body = cmlapi.CreateJobRunRequest()  # CreateJobRunRequest
    project_id = 'project_id_example'  # str | ID of the project containing the job.
    job_id = 'job_id_example'  # str | The job ID to create a new job run for.

    try:
        # Create and start a new job run for a job.
        api_response = api_instance.create_job_run(body, project_id, job_id)
        pprint(api_response)
    except ApiException as e:
        print(f"Exception when calling CMLServiceApi->create_job_run: {e}\n")

Filling in the "cmlapi.CreateJobRequest()" and "cmlapi.CreateJobRunRequest()" request bodies can be tricky.
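The generated snippets leave implicit how the two calls chain together: the job returned by the first call supplies the ID needed by the second. Here is a minimal sketch of that flow, assuming it runs inside a CML Session where `cmlapi.default_client()` can pick up credentials from the environment. The function names `create_and_run_job` and `stringify_env` and the default CPU and memory values are hypothetical, and depending on your project you may also need to pass a `runtime_identifier`.

```python
def stringify_env(params):
    """Cast every key and value to str; job environment variables are strings."""
    return {str(key): str(value) for key, value in params.items()}


def create_and_run_job(client, project_id, name, script, params, cpu=2.0, memory=4.0):
    """Create a job, then start one run of it. Returns (job, job_run)."""
    # Imported lazily so stringify_env stays usable without cmlapi installed.
    import cmlapi

    job_body = cmlapi.CreateJobRequest(
        project_id=project_id,
        name=name,
        script=script,
        cpu=cpu,
        memory=memory,
        environment=stringify_env(params),
    )
    job = client.create_job(job_body, project_id)

    # The job ID from step 1 is what ties the run to the job in step 2.
    run_body = cmlapi.CreateJobRunRequest(project_id, job.id)
    job_run = client.create_job_run(run_body, project_id, job.id)
    return job, job_run
```

Inside a session, `client = cmlapi.default_client()` gives you the object to pass as `client`. Passing numeric parameters through `stringify_env` avoids having to wrap each value in `str()` by hand.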
Here's an advanced example: https://github.com/pdefusco/SparkGen/blob/main/autogen/cml_orchestrator.py

In particular:

    sparkgen_1_job_body = cmlapi.CreateJobRequest(
        project_id=project_id,
        name="SPARKGEN_1_" + session_id,
        script="autogen/cml_sparkjob_1.py",
        cpu=4.0,
        memory=8.0,
        runtime_identifier="docker.repository.cloudera.com/cloudera/cdsw/ml-runtime-workbench-python3.7-standard:2023.05.1-b4",
        runtime_addon_identifiers=["spark320-18-hf4"],
        environment={
            "x": str(x),
            "y": str(y),
            "z": str(z),
            "ROW_COUNT_car_installs": str(ROW_COUNT_car_installs),
            "UNIQUE_VALS_car_installs": str(UNIQUE_VALS_car_installs),
            "PARTITIONS_NUM_car_installs": str(PARTITIONS_NUM_car_installs),
            "ROW_COUNT_car_sales": str(ROW_COUNT_car_sales),
            "UNIQUE_VALS_car_sales": str(UNIQUE_VALS_car_sales),
            "PARTITIONS_NUM_car_sales": str(PARTITIONS_NUM_car_sales),
            "ROW_COUNT_customer_data": str(ROW_COUNT_customer_data),
            "UNIQUE_VALS_customer_data": str(UNIQUE_VALS_customer_data),
            "PARTITIONS_NUM_customer_data": str(PARTITIONS_NUM_customer_data),
            "ROW_COUNT_factory_data": str(ROW_COUNT_factory_data),
            "UNIQUE_VALS_factory_data": str(UNIQUE_VALS_factory_data),
            "PARTITIONS_NUM_factory_data": str(PARTITIONS_NUM_factory_data),
            "ROW_COUNT_geo_data": str(ROW_COUNT_geo_data),
            "UNIQUE_VALS_geo_data": str(UNIQUE_VALS_geo_data),
            "PARTITIONS_NUM_geo_data": str(PARTITIONS_NUM_geo_data),
        },
    )
    sparkgen_1_job = client.create_job(sparkgen_1_job_body, project_id)

And

    jobrun_body = cmlapi.CreateJobRunRequest(project_id, sparkgen_1_job.id)
    job_run = client.create_job_run(jobrun_body, project_id, sparkgen_1_job.id)

Hope this helps,

pauldefusco · ‎01-06-2025

Spark GraphFrames is a package for Apache Spark that allows users to perform graph processing operations on data using a DataFrame-based approach. It enables you to perform various graph operations like finding connected components, calculating shortest paths, identifying triangles, and more. Real-world applications include social network analysis, recommendation systems, web link structures, flight connections, and other scenarios where relationships between data points are crucial.

Cloudera AI (CAI) is a cloud-native service within the Cloudera Data Platform (CDP) that enables enterprise data science teams to collaborate across the full data lifecycle. It provides immediate access to enterprise data pipelines, scalable compute resources, and preferred tools, streamlining the process of moving analytic workloads from research to production.

Using Spark GraphFrames in Cloudera AI requires minimal configuration effort. This quickstart provides a basic example so you can get started quickly.

Requirements

You can use Spark GraphFrames in CAI Workbench with Spark Runtime Addon versions 3.2 or above. When you create the SparkSession object, make sure to select, from SparkPackages, the version of the package that matches the Spark Runtime Addon you set during the CAI Session launch.

This example was created with the following platform and system versions, but it will also work in other Workbench versions (below or above 2.0.46).

- CAI Workbench (a.k.a. "CML Workspace") on version 2.0.46.
- Spark Runtime Addon version 3.5.

Steps to Reproduce the Quickstart

Create a CAI Workbench Session with the following settings:

- Resource Profile of 2 vCPU, 8 GiB Mem, 0 GPU
- Editor: PBJ Workbench, Python 3.10 Kernel, Standard Edition, 2024.10 Version.

In the Session, run the script. Notice the following:

Line 46: the SparkSession object is created. The packages option is used to download the library. If you are not using Spark 3.5, change the argument to the package version compatible with your Spark Runtime Addon version, as listed in SparkPackages.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("MyApp") \
        .config("spark.jars.packages", "graphframes:graphframes:0.8.4-spark3.5-s_2.12") \
        .getOrCreate()

Line 71: A GraphFrame object is instantiated using the two PySpark DataFrames.

    from graphframes import GraphFrame

    # Vertex DataFrame
    v = spark.createDataFrame([
        ("a", "Alice", 34),
        ("b", "Bob", 36),
        ("c", "Charlie", 30),
        ("d", "David", 29),
        ("e", "Esther", 32),
        ("f", "Fanny", 36),
        ("g", "Gabby", 60)
    ], ["id", "name", "age"])

    # Edge DataFrame
    e = spark.createDataFrame([
        ("a", "b", "friend"),
        ("b", "c", "follow"),
        ("c", "b", "follow"),
        ("f", "c", "follow"),
        ("e", "f", "follow"),
        ("e", "d", "friend"),
        ("d", "a", "friend"),
        ("a", "e", "friend")
    ], ["src", "dst", "relationship"])

    # Create a GraphFrame
    g = GraphFrame(v, e)

Lines 73 and below: You can now use GraphFrames methods to traverse and filter the graph based on relationships between data instances.

    > g.vertices.show()
    +---+-------+---+
    | id|   name|age|
    +---+-------+---+
    |  a|  Alice| 34|
    |  b|    Bob| 36|
    |  c|Charlie| 30|
    |  d|  David| 29|
    |  e| Esther| 32|
    |  f|  Fanny| 36|
    |  g|  Gabby| 60|
    +---+-------+---+

    > g.edges.show()
    +---+---+------------+
    |src|dst|relationship|
    +---+---+------------+
    |  a|  b|      friend|
    |  b|  c|      follow|
    |  c|  b|      follow|
    |  f|  c|      follow|
    |  e|  f|      follow|
    |  e|  d|      friend|
    |  d|  a|      friend|
    |  a|  e|      friend|
    +---+---+------------+

    # Find the youngest user's age in the graph. This queries the vertex DataFrame.
    > g.vertices.groupBy().min("age").show()
    +--------+
    |min(age)|
    +--------+
    |      29|
    +--------+

Summary & Next Steps

Cloudera AI (CAI) is a cloud-native service within the Cloudera Data Platform (CDP) that enables enterprise data science teams to collaborate across the full data lifecycle. It provides immediate access to enterprise data pipelines, scalable compute resources, and preferred tools, streamlining the process of moving analytic workloads from research to production. In particular:

- A CAI Workbench Session allows you to directly access and analyze large datasets, run code (like Python or R), and build and train machine learning models, using the flexibility of Spark Runtime Add-ons to deploy Spark on Kubernetes clusters with minimal configuration, with your Spark version of choice.
- The CAI Engineering team maintains and certifies Spark Runtime Add-ons so you don't have to worry about installing and configuring Spark on Kubernetes clusters when using CAI Workbench. You can check if your Spark version is supported in the latest compatibility matrix.
- The "packages" option can be used at SparkSession creation to download third-party packages such as GraphFrames. For more information on this and other Spark options, please refer to the Apache Spark Documentation.
- GraphFrames, in particular, allows you to query your Spark DataFrames in terms of relationships, thus empowering use cases ranging from social network analysis to recommendation engines, and more. For more information, please refer to the GraphFrames documentation.

Finally, you can learn more about Cloudera AI Workbench with the following recommended blogs and community articles:

- Cloudera AI - What You Should Know: An insightful community article that provides an overview of CML's features and capabilities, helping teams deploy machine learning workspaces that auto-scale and auto-suspend to save costs.
- Illustrating AI/ML Model Development in Cloudera AI: A Medium tutorial that demonstrates how to create and deploy models using CML on the Cloudera Data Platform Private Cloud, offering practical insights into the model development process.
- Cloudera Accelerators for ML Projects: A catalog of Applied Machine Learning Prototypes (AMPs) that can be deployed with one click directly from CML, designed to jumpstart AI initiatives by providing tailored solutions for specific use cases.
- Cloudera AI Overview: Official documentation that offers a comprehensive overview of Cloudera AI, detailing its features, benefits, and how it facilitates collaborative machine learning at scale.
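To make "querying in terms of relationships" concrete, here is a plain-Python sketch that computes, on the quickstart's edge list, the same quantities that GraphFrames would compute at scale with `g.inDegrees` and with a filter on `g.edges`. It is an illustration of the semantics only, not how GraphFrames is implemented, and it runs without Spark.

```python
from collections import Counter

# Same edges as the quickstart's edge DataFrame: (src, dst, relationship)
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
    ("a", "e", "friend"),
]

# In-degree: how many edges point at each vertex.
# GraphFrames equivalent: g.inDegrees
in_degree = Counter(dst for _src, dst, _rel in edges)

# Keep only the "follow" relationships.
# GraphFrames equivalent: g.edges.filter("relationship = 'follow'")
follows = [(src, dst) for src, dst, rel in edges if rel == "follow"]

print(dict(in_degree))
print(len(follows))
```

Vertex "g" (Gabby) has no incoming edges, so it does not appear in the in-degree result at all; the same is true of the DataFrame returned by `g.inDegrees`.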

pauldefusco · ‎11-12-2024

Followed up via DM

pauldefusco · ‎10-08-2024

Objective

This article provides an introduction to Apache Iceberg using Spark SQL in Cloudera Data Engineering (CDE). CDE provides native Apache Iceberg Table Format support in its Spark Runtimes. This means you can create and interact with Iceberg tables without any additional configuration.

Abstract

CDP Data Engineering (CDE) is the only cloud-native service purpose-built for enterprise data engineering teams. Building on Apache Spark, Data Engineering is an all-inclusive data engineering toolset that enables orchestration automation with Apache Airflow, advanced pipeline monitoring, visual troubleshooting, and comprehensive management tools to streamline ETL processes across enterprise analytics teams.

Apache Iceberg is an open table format for huge analytic datasets. It provides features that, coupled with Spark as the compute engine, allow you to build data processing pipelines with dramatic gains in terms of scalability, performance, and overall developer productivity.

Iceberg is natively supported by CDE. Any time a CDE Spark Job or Session is created, Iceberg dependencies are automatically set in the SparkSession without any need for configuration. As a CDP User, the CDE Data Engineer can thus create, read, modify, and interact with Iceberg tables as allowed by Ranger policies, whether these were created in Cloudera Data Warehouse (CDW), DataHub, or Cloudera AI (CML).

In this tutorial you will create a CDE Session and interact with Apache Iceberg tables using PySpark.

Requirements

- A CDE Virtual Cluster of type "All-Purpose" running in a CDE Service with version 1.22 or above, and Spark version 3.2 or above.
- An installation of the CDE CLI is recommended but optional. In the steps below you will create the CDE Session using the CLI, but you can alternatively launch one using the UI.
Step by Step Guide

Create CDE Files Resource

    cde resource create --name myFiles --type files
    cde resource upload --name myFiles --local-path resources/cell_towers_1.csv --local-path resources/cell_towers_2.csv

Launch CDE Session & Run Spark Commands

    cde session create --name icebergSessionCDE --type pyspark --mount-1-resource myFiles
    cde session interact --name icebergSessionCDE

Create Iceberg Tables from Files Resources

In this code snippet two Iceberg tables are created from PySpark dataframes. The dataframes load CSV data from a CDE Files Resource by specifying the ```/app/mount``` path.

    # PySpark commands:
    df1 = spark.read.csv("/app/mount/cell_towers_1.csv", header=True, inferSchema=True)
    df1.writeTo("CELL_TOWERS_LEFT").using("iceberg").tableProperty("write.format.default", "parquet").createOrReplace()

    df2 = spark.read.csv("/app/mount/cell_towers_2.csv", header=True, inferSchema=True)
    df2.writeTo("CELL_TOWERS_RIGHT").using("iceberg").tableProperty("write.format.default", "parquet").createOrReplace()

Read Iceberg Tables using PySpark

Next, use Spark SQL to access the data from the Iceberg tables:

    # Spark SQL Commands:
    spark.sql("SELECT * FROM CELL_TOWERS_LEFT \
        WHERE manufacturer == 'TelecomWorld' \
        AND cell_tower_failure == 0").show()

    # Expected Output:
    +---+---------------+------------+------------------+----------+---------+------------+------------+------------+------------------+
    | id|      device_id|manufacturer|        event_type| longitude| latitude|iot_signal_1|iot_signal_3|iot_signal_4|cell_tower_failure|
    +---+---------------+------------+------------------+----------+---------+------------+------------+------------+------------------+
    |  1|0x100000000001d|TelecomWorld|       battery 10%| -83.04828|51.610226|           9|          52|         103|                 0|
    |  2|0x1000000000008|TelecomWorld|       battery 10%| -83.60245|51.892113|           6|          54|         103|                 0|
    |  7|0x100000000000b|TelecomWorld|      device error| -83.62492|51.891964|           5|          54|         102|                 0|
    | 12|0x1000000000020|TelecomWorld|system malfunction| -83.36766|51.873108|           8|          53|         106|                 0|
    | 13|0x1000000000017|TelecomWorld|        battery 5%| -83.04949|51.906513|           4|          52|         105|                 0|
    | 24|0x1000000000026|TelecomWorld|      device error| -83.15052|51.605473|           6|          55|         103|                 0|
    | 30|0x1000000000008|TelecomWorld|       battery 10%| -83.44602| 51.60561|           2|          53|         106|                 0|
    | 35|0x1000000000002|TelecomWorld|system malfunction| -83.62555|51.827686|           2|          54|         102|                 0|
    | 37|0x100000000001d|TelecomWorld|       battery 10%| -83.47665|51.670994|           3|          53|         105|                 0|
    | 41|0x1000000000017|TelecomWorld|      device error| -82.89744| 51.92945|           4|          52|         100|                 0|
    +---+---------------+------------+------------------+----------+---------+------------+------------+------------+------------------+

Validate Iceberg Table

Use the ```SHOW TBLPROPERTIES``` command to validate Iceberg Table format:

    # Spark SQL Command:
    spark.sql("SHOW TBLPROPERTIES CELL_TOWERS_LEFT").show()

    # Expected Output:
    +--------------------+-------------------+
    |                 key|              value|
    +--------------------+-------------------+
    | current-snapshot-id|8073060523561382284|
    |              format|    iceberg/parquet|
    |      format-version|                  1|
    |write.format.default|            parquet|
    +--------------------+-------------------+

As an alternative method to validate Iceberg Table format, investigate Iceberg Metadata with any of the following Spark SQL commands:

    # Query Iceberg History Table
    spark.sql("SELECT * FROM SPARK_CATALOG.DEFAULT.CELL_TOWERS_LEFT.history").show()
    +--------------------+-------------------+---------+-------------------+
    |     made_current_at|        snapshot_id|parent_id|is_current_ancestor|
    +--------------------+-------------------+---------+-------------------+
    |2024-10-08 20:30:...|8073060523561382284|     null|               true|
    +--------------------+-------------------+---------+-------------------+

    # Query Iceberg Partitions Table
    spark.sql("SELECT * FROM SPARK_CATALOG.DEFAULT.CELL_TOWERS_LEFT.partitions").show()
    +------------+----------+----------------------------+--------------------------+----------------------------+--------------------------+
    |record_count|file_count|position_delete_record_count|position_delete_file_count|equality_delete_record_count|equality_delete_file_count|
    +------------+----------+----------------------------+--------------------------+----------------------------+--------------------------+
    |        1440|         1|                           0|                         0|                           0|                         0|
    +------------+----------+----------------------------+--------------------------+----------------------------+--------------------------+

    # Query Iceberg Snapshots Table
    spark.sql("SELECT * FROM SPARK_CATALOG.DEFAULT.CELL_TOWERS_LEFT.snapshots").show()
    +--------------------+-------------------+---------+---------+--------------------+--------------------+
    |        committed_at|        snapshot_id|parent_id|operation|       manifest_list|             summary|
    +--------------------+-------------------+---------+---------+--------------------+--------------------+
    |2024-10-08 20:30:...|8073060523561382284|     null|   append|s3a://paul-aug26-...|{spark.app.id -> ...|
    +--------------------+-------------------+---------+---------+--------------------+--------------------+

    # Query Iceberg Refs Table
    spark.sql("SELECT * FROM SPARK_CATALOG.DEFAULT.CELL_TOWERS_LEFT.refs").show()
    +----+------+-------------------+-----------------------+---------------------+----------------------+
    |name|  type|        snapshot_id|max_reference_age_in_ms|min_snapshots_to_keep|max_snapshot_age_in_ms|
    +----+------+-------------------+-----------------------+---------------------+----------------------+
    |main|BRANCH|8073060523561382284|                   null|                 null|                  null|
    +----+------+-------------------+-----------------------+---------------------+----------------------+

Create Empty Iceberg Table using Spark SQL

You can also use Spark SQL to create an Iceberg Table.
Run a ```SHOW CREATE TABLE``` command on an existing table to investigate the table format:

    # Spark SQL Command:
    print(spark.sql("SHOW CREATE TABLE CELL_TOWERS_LEFT").collect()[0][0])

    # Expected Output
    CREATE TABLE spark_catalog.default.cell_towers_left (
      `id` INT,
      `device_id` STRING,
      `manufacturer` STRING,
      `event_type` STRING,
      `longitude` DOUBLE,
      `latitude` DOUBLE,
      `iot_signal_1` INT,
      `iot_signal_3` INT,
      `iot_signal_4` INT,
      `cell_tower_failure` INT)
    USING iceberg
    LOCATION 's3a://paul-aug26-buk-a3c2b50a/data/warehouse/tablespace/external/hive/CELL_TOWERS_LEFT'
    TBLPROPERTIES(
      'current-snapshot-id' = '8073060523561382284',
      'format' = 'iceberg/parquet',
      'format-version' = '1',
      'write.format.default' = 'parquet')

Next, create a new Iceberg table modeled on this table. Notice the ```USING iceberg``` clause:

    # Spark SQL Command:
    spark.sql("""
    CREATE TABLE ICE_TARGET_TABLE (
      `id` INT,
      `device_id` STRING,
      `manufacturer` STRING,
      `event_type` STRING,
      `longitude` DOUBLE,
      `latitude` DOUBLE,
      `iot_signal_1` INT,
      `iot_signal_3` INT,
      `iot_signal_4` INT,
      `cell_tower_failure` INT)
    USING iceberg;
    """)

This table is empty.
Query Table Files to validate this:

    # Spark SQL Command:
    spark.sql("SELECT * FROM SPARK_CATALOG.DEFAULT.ICE_TARGET_TABLE.files;").show()

    # Expected Output:
    +-------+---------+-----------+-------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+----------------+
    |content|file_path|file_format|spec_id|record_count|file_size_in_bytes|column_sizes|value_counts|null_value_counts|nan_value_counts|lower_bounds|upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|readable_metrics|
    +-------+---------+-----------+-------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+----------------+
    +-------+---------+-----------+-------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+----------------+

Append Data Into Empty Iceberg Table

Append data from a PySpark dataframe into an Iceberg table. Notice the use of the ```append()``` method.
    # PySpark command:
    df2.writeTo("SPARK_CATALOG.DEFAULT.ICE_TARGET_TABLE").using("iceberg").tableProperty("write.format.default", "parquet").append()

Query Iceberg Metadata to validate that the append operation completed successfully:

    # Spark SQL Command:
    spark.sql("SELECT * FROM SPARK_CATALOG.DEFAULT.ICE_TARGET_TABLE.files;").show()

    # Expected Output:
    +-------+--------------------+-----------+-------+------------+------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
    |content|           file_path|file_format|spec_id|record_count|file_size_in_bytes|        column_sizes|        value_counts|   null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|    readable_metrics|
    +-------+--------------------+-----------+-------+------------+------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
    |      0|s3a://paul-aug26-...|    PARQUET|      0|        1440|             36103|{1 -> 5796, 2 -> ...|{1 -> 1440, 2 -> ...|{1 -> 0, 2 -> 0, ...|{5 -> 0, 6 -> 0}|    {1 -> , 2 -> ...|    {1 -> , 2 -> ...|        null|          [4]|        null|            0|{{286, 1440, 0, n...|
    +-------+--------------------+-----------+-------+------------+------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+

Create Iceberg Table from Hive Table

There are a few options to convert Hive tables into Iceberg Tables. The easiest approach is an "in-place migration" to Iceberg table format.
Create a Hive Table using a PySpark dataframe:

    # PySpark Command:
    df1.write.mode("overwrite").saveAsTable('HIVE_TO_ICEBERG_TABLE', format="parquet")

Now migrate it to Iceberg table format:

    spark.sql("ALTER TABLE HIVE_TO_ICEBERG_TABLE UNSET TBLPROPERTIES ('TRANSLATED_TO_EXTERNAL')")
    spark.sql("CALL spark_catalog.system.migrate('HIVE_TO_ICEBERG_TABLE')")

Validate Iceberg Table format:

    # Spark SQL Command:
    spark.sql("SHOW TBLPROPERTIES HIVE_TO_ICEBERG_TABLE").show()

    # Expected Output:
    +--------------------+--------------------+
    |                 key|               value|
    +--------------------+--------------------+
    |   bucketing_version|                   2|
    | current-snapshot-id| 1440783321004851162|
    |external.table.purge|                TRUE|
    |              format|     iceberg/parquet|
    |      format-version|                   1|
    |            migrated|                true|
    |numFilesErasureCoded|                   0|
    |schema.name-mappi...|[ {\n  "field-id"...|
    +--------------------+--------------------+

Summary

Cloudera Data Engineering (CDE) and the broader Cloudera Data Platform (CDP) offer a powerful, scalable solution for building, deploying, and managing data workflows in hybrid and multi-cloud environments. CDE simplifies data engineering with serverless architecture, auto-scaling Spark clusters, and built-in Apache Iceberg support.

Unlike competing offerings, each CDE release is certified against one or more Apache Iceberg versions. This ensures full compatibility between the Spark engine and the underlying Open Lakehouse capabilities, such as Apache Ranger for security policies. Whenever you launch a CDE Spark Job or Session, Iceberg dependencies are automatically configured as dictated by the chosen Spark version.

With full native Iceberg support, you can leverage CDE Sessions to create or migrate to Iceberg Table format without any special configurations.
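The metadata queries used in this tutorial all follow one pattern: the metadata table name is appended to the fully qualified table name. A minimal sketch of a helper that builds those query strings is shown below. The function name `metadata_query` and the `METADATA_TABLES` tuple are hypothetical, and the tuple lists only the metadata tables queried in this article, not everything Iceberg exposes.

```python
# Metadata tables queried in this tutorial.
METADATA_TABLES = ("history", "partitions", "snapshots", "refs", "files")


def metadata_query(table, metadata_table, catalog="SPARK_CATALOG", database="DEFAULT"):
    """Build a SELECT over one of a table's Iceberg metadata tables."""
    if metadata_table not in METADATA_TABLES:
        raise ValueError(f"unsupported metadata table: {metadata_table}")
    return f"SELECT * FROM {catalog}.{database}.{table}.{metadata_table}"


for name in METADATA_TABLES:
    print(metadata_query("CELL_TOWERS_LEFT", name))
```

In a CDE Session each string can be passed to `spark.sql(...).show()`, for example `spark.sql(metadata_query("ICE_TARGET_TABLE", "files")).show()`.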
Next Steps

Here is a list of helpful articles and blogs related to Cloudera Data Engineering and Apache Iceberg:

- Cloudera on Public Cloud 5-Day Free Trial: Experience Cloudera Data Engineering through common use cases that also introduce you to the platform's fundamentals and key capabilities with predefined code samples and detailed step-by-step instructions. Try Cloudera on Public Cloud for free.
- Cloudera Blog: Supercharge Your Data Lakehouse with Apache Iceberg: Learn how Apache Iceberg integrates with Cloudera Data Platform (CDP) to enable scalable and performant data lakehouse solutions, covering features like in-place table evolution and time travel. Read more on Cloudera Blog.
- Cloudera Docs: Using Apache Iceberg in Cloudera Data Engineering: This documentation explains how Apache Iceberg is utilized in Cloudera Data Engineering to handle massive datasets, with detailed steps on managing tables and virtual clusters. Read more in Cloudera Documentation.
- Cloudera Blog: Building an Open Data Lakehouse Using Apache Iceberg: This article covers how to build and optimize a data lakehouse architecture using Apache Iceberg in CDP, along with advanced features like partition evolution and time travel queries. Read more on Cloudera Blog.
- Compatibility for Cloudera Data Engineering and Runtime Components: Learn about Cloudera Data Engineering (CDE) and compatibility for Runtime components across different versions. This document also includes component version compatibility information for AWS Graviton. Read more in the Cloudera Documentation.

pauldefusco · ‎09-03-2024

Objective
Cloudera Data Engineering (CDE) is a cloud-native service provided by Cloudera. It is designed to simplify and enhance the development, deployment, and management of data engineering workloads at scale. CDE is part of the Cloudera Data Platform (CDP), which is a comprehensive, enterprise-grade platform for managing and analyzing data across hybrid and multi-cloud environments.
Cloudera Data Engineering offers several advantages. With CDE, you can create a "CDE Spark-Submit" using the same syntax as your regular Spark-Submit. Alternatively, you can specify your Spark-Submit as a "CDE Job of type Spark" using a reusable Job Definition, which enhances observability, troubleshooting, and dependency management.
These capabilities of CDE are especially useful for Spark Data Engineers who develop and deploy Spark Pipelines at scale. This includes working with different Spark-Submit definitions and dynamic, complex dependencies across multiple clusters.
For example, when packaging a JAR for a Spark Submit, you can include various types of dependencies that your Spark application requires to run properly. These can consist of application code (compiled Scala/Java code), third-party libraries (external dependencies), configuration and resource files (for application configuration or runtime data), and custom JARs (any internal or utility libraries your application needs).
In this article, you will learn how to manage JAR dependencies effectively and simplify Spark-Submit definitions in Cloudera Data Engineering across various scenarios.
Example 1: CDE Job with Scala Application Code in Spark Jar
Scala Spark applications are typically developed and deployed in the following manner:
Set Up Project in IDE: Use SBT to set up a Scala project in your IDE.
Write Code: Write your Scala application.
Compile & Package: Use `sbt package` to compile and package your code into a JAR.
Submit to Spark: Use spark-submit to run your JAR on a Spark cluster.
In this example, you will build a CDE Spark Job with a Scala application that has already been compiled into a JAR. To learn how to complete these steps, please visit this tutorial.
cde resource create --name cde_scala_job_files
cde resource upload --name cde_scala_job_files --local-path jars/cdejobjar_2.12-1.0.jar
cde job create \
  --name cde-scala-job \
  --type spark \
  --mount-1-resource cde_scala_job_files \
  --application-file cdejobjar_2.12-1.0.jar \
  --conf spark.sql.shuffle.partitions=10 \
  --executor-cores 2 \
  --executor-memory 2g
cde job run --name cde-scala-job
You can add further JAR dependencies with the ```--jar``` or ```--jars``` options. In this case, you can add the Spark XML library from the same CDE Files Resource:
cde resource upload --name cde_scala_job_files --local-path jars/spark-xml_2.12-0.16.0.jar
cde job create \
  --name cde-scala-job-jar-dependency \
  --type spark \
  --mount-1-resource cde_scala_job_files \
  --application-file cdejobjar_2.12-1.0.jar \
  --conf spark.sql.shuffle.partitions=10 \
  --executor-cores 2 \
  --executor-memory 2g \
  --jar spark-xml_2.12-0.16.0.jar
cde job run --name cde-scala-job-jar-dependency
Notice that you could achieve the same result by using two CDE Files Resources, each containing one of the JARs. You can create as many CDE Files Resources as needed, down to one per JAR file.
In the following example, you will reference the application code JAR located in the "cde_scala_job_files" CDE Files Resource that you previously created, as well as an additional JAR for the Spark-XML package from a new CDE Files Resource that you will create as "cde_spark_xml_jar".
Note the use of the new "--mount-N-prefix" option below. When you use more than one CDE Resource in the same "CDE Job Create" command, you need to assign an alias to each Files Resource so that each command option can correctly reference the files.
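The pairing between numbered prefix and resource options can be sketched in Python. This is an illustrative helper only, not part of the CDE CLI: the function names `mount_args` and `job_create_command` are hypothetical, and the sketch merely assembles the same flags that appear in the commands of this article so that each alias always shares its index with its resource.

```python
import shlex


def mount_args(mounts):
    """Build numbered mount options from ordered (prefix, resource) pairs."""
    args = []
    for index, (prefix, resource) in enumerate(mounts, start=1):
        # The same index N is used for both options of one mounted resource.
        args += [f"--mount-{index}-prefix", prefix,
                 f"--mount-{index}-resource", resource]
    return args


def job_create_command(name, application_file, mounts, extra=()):
    """Assemble a 'cde job create' argument list (hypothetical helper)."""
    command = ["cde", "job", "create", "--name", name, "--type", "spark"]
    command += mount_args(mounts)
    command += ["--application-file", application_file]
    command += list(extra)
    return command


command = job_create_command(
    name="cde-scala-job-multiple-jar-resources",
    application_file="scala_app_code/cdejobjar_2.12-1.0.jar",
    mounts=[("scala_app_code", "cde_scala_job_files"),
            ("spark_xml_jar", "cde_spark_xml_jar")],
    extra=["--jar", "spark_xml_jar/spark-xml_2.12-0.16.0.jar"],
)
print(shlex.join(command))
```

Generating the options from a single ordered list makes it hard to pair a prefix with the wrong resource number by accident.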
cde resource create --name cde_spark_xml_jar cde resource upload --name cde_spark_xml_jar --local-path jars/spark-xml_2.12-0.16.0.jar cde job create \ --name cde-scala-job-multiple-jar-resources \ --type spark \ --mount-1-prefix scala_app_code \ --mount-1-resource cde_scala_job_files \ --mount-2-prefix spark_xml_jar \ --mount-2-resource cde_spark_xml_jar \ --application-file scala_app_code/cdejobjar_2.12-1.0.jar \ --conf spark.sql.shuffle.partitions=10 \ --executor-cores 2 \ --executor-memory 2g \ --jar spark_xml_jar/spark-xml_2.12-0.16.0.jar cde job run --name cde-scala-job-multiple-jar-resources Example 2: CDE Job with PySpark Application Code and Jar Dependency from Maven For Maven dependencies, you can use the `--packages` option to automatically download and include dependencies. This is often more convenient than manually managing JAR files. In the following example, the `--packages` option replaces the `--jars` option. In this example, you will reference the Spark-XML package from Maven so that you can use it to parse the sample "books.xml" file from the CDE Files Resource. cde resource create --name spark_files --type files cde resource upload --name spark_files --local-path read_xml.py --local-path books.xml cde job create --name sparkxml \ --application-file read_xml.py \ --mount-1-resource spark_files \ --type spark \ --packages com.databricks:spark-xml_2.12:0.16.0 cde job run --name sparkxml Like in the previous example, multiple CDE file resources can be used to manage PySpark Application code and the sample XML file. Notice that the application code in ```read_xml_multi_resource.py``` is different. At line 67, the ```sample_xml_file``` Files Resource is referenced directly in the application code with its alias ```xml_data```. 
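How an alias turns into a file path inside the job can be sketched as follows. This is a minimal illustration under two assumptions that are not confirmed by the script itself: that mounted resources appear under `/app/mount` (the location used by the Iceberg session example elsewhere on this page) and that a prefix becomes a subdirectory of that location. The helper name `mounted_path` is hypothetical.

```python
import posixpath

# Assumption: resources are mounted under this directory in the CDE container.
MOUNT_ROOT = "/app/mount"


def mounted_path(prefix, filename, mount_root=MOUNT_ROOT):
    """Return the container path of a file in a resource mounted with a prefix."""
    return posixpath.join(mount_root, prefix, filename)


xml_path = mounted_path("xml_data", "books.xml")
print(xml_path)  # /app/mount/xml_data/books.xml

# Inside the PySpark script, the path could then be passed to the XML reader,
# for example (assumption about the script's contents):
#   df = spark.read.format("xml").option("rowTag", "book").load(xml_path)
```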
cde resource create --name sample_xml_file --type files
cde resource create --name pyspark_script --type files
cde resource upload --name pyspark_script --local-path read_xml_multi_resource.py
cde resource upload --name sample_xml_file --local-path books.xml
cde job create --name sparkxml-multi-deps \
  --application-file code/read_xml_multi_resource.py \
  --mount-1-prefix code \
  --mount-1-resource pyspark_script \
  --mount-2-prefix xml_data \
  --mount-2-resource sample_xml_file \
  --type spark \
  --packages com.databricks:spark-xml_2.12:0.16.0
cde job run --name sparkxml-multi-deps
Example 3: CDE Job with PySpark Application Code and Jar Dependency from CDE Files Resource
Similar to example 1, you can reference JARs uploaded directly into CDE Files Resources instead of using Maven as in example 2. The following commands pick up from example 2 but replace the ```--packages``` option with a JAR option. Notice that the ```--jar``` option is passed to the ```cde job run``` command rather than to ```cde job create```. The ```--jar``` and ```--jars``` options can be set either at CDE Job creation or at run time.
cde resource create --name spark_xml_jar --type files
cde resource upload --name spark_xml_jar --local-path jars/spark-xml_2.12-0.16.0.jar
cde job create --name sparkxml-multi-deps-jar-from-res \
  --application-file code/read_xml_multi_resource.py \
  --mount-1-prefix code \
  --mount-1-resource pyspark_script \
  --mount-2-prefix xml_data \
  --mount-2-resource sample_xml_file \
  --mount-3-prefix deps \
  --mount-3-resource spark_xml_jar \
  --type spark
cde job run --name sparkxml-multi-deps-jar-from-res \
  --jar deps/spark-xml_2.12-0.16.0.jar
Summary
In this article, the CDE CLI was used to simplify Spark JAR management with Cloudera Data Engineering. You can use the CDE CLI to create CDE Job Definitions with Spark JAR dependencies and to create CDE Files Resources that store and reference one or multiple JARs.
Cloudera Data Engineering offers significant improvements in Spark Dependency Management compared to traditional Spark-Submits outside of CDE. The Job Runs page in the CDE UI can be used to monitor JAR dependencies applied to each job execution. Cloudera Data Engineering presents substantial advancements in Spark Observability and Troubleshooting compared to traditional Spark-Submits outside of CDE. References & Useful Articles CDE Concepts Using the CDE CLI CDE CLI Command Reference

pauldefusco · ‎08-28-2024

Objective Cloudera Machine Learning (CML) is a platform designed to help organizations build, deploy, and manage machine learning models at scale. It is part of Cloudera’s suite of enterprise data platforms and solutions, focusing on providing a robust environment for data scientists, analysts, and engineers to collaborate on end-to-end machine learning workflows. PyGurobi is a Python interface for the Gurobi Optimizer, a powerful and widely used solver for mathematical optimization problems. Gurobi is known for its high performance in solving a variety of optimization problems, including linear programming (LP), quadratic programming (QP), mixed-integer programming (MIP), and others. In this tutorial, you will use PyGurobi on CML to optimize product prices and maximize enterprise revenue. Requirements The following are required to reproduce this example: CML Workspace in AWS, Azure, OCP, or ECS. Basic knowledge of Python for Machine Learning including Sci-Kit Learn, Spark, Iceberg, and XGBoost. You should have basic familiarity with linear and nonlinear programming. If you are new to mathematical optimization, please visit this link for a quick introduction. Step by Step Instructions Supporting code for reproducing the tutorial can be found in this Git repository. Launch a CML Session with the following runtime and resource profile: Editor: JupyterLab Kernel: Python 3.10 Edition: Standard Version: 2024.05 Enable Spark: Spark 3.2 or above Resource Profile: 2 CPU / 4 GB Mem / 0 GPU Runtime Image: docker.repository.cloudera.com/cloudera/cdsw/ml-runtime-jupyterlab-python3.10-standard:2024.05.1-b8 Open the terminal and install the requirements: pip3 install -r requirements.txt Part 0: Data Generation Run notebook ```00_datagen_iceberg_pyspark.ipynb``` and observe the following: A Spark dataframe with 10000 synthetic product price transactions is created. 
The P1 and P2 columns represent the prices for two products sold, and the N1 column represents the quantity of Product 1 sold. The dataframe is stored as an Iceberg table. Part 1: Pricing Optimization with Gurobi Run notebook ```01_price_optimization_with_competing_products.ipynb``` and observe the following: An MLFlow Experiment Context is created with the name "Price Optimization Experiment". An initial regressor is built to predict prices using the data stored in the Iceberg table. The data is read using PandasOnSpark which is included in the Spark Runtime AddOn by default. A Price Optimization model is instantiated with an Objective Function and associated Constraints. The model is trained on the data. Its outputs include an optimal price recommendation for the two products, with an associated product quantity, and finally, a revenue estimate. In other words, revenue is maximized at 70347.77 when prices are 400 and 300 for the two products, respectively. Part 2: Deploy Optimization Model in an API Endpoint Run notebook ```02_price_optimization_model_deployment.ipynb``` and observe the following: CML APIv2 allows you to programmatically execute actions within CML Workspaces. In this example, the API is used to create a small Python Interface to manage model deployments. In particular, the interface was used to create a separate CML Project to host an API Endpoint. The API Endpoint is used to allocate a dedicated container for the model and provide an entry point for prediction requests. Navigate back to the CML workspace and notice a new project named ```CML Project for Optimization Model``` has been created. Open it and notice a new Endpoint has been created in the Model Deployments section. Open the model deployment and, once it has been completed, enter the following sample payload in the Test Request window. Observe the output response. 
Test Input: {"p[1]": [354,353,352,351,354,353,312,311,314,313,352,351], "p[2]": [110,120,320,220,101,100,101,260,355,140,300,299], "n[1]": [54,53,112,151,154,153,52,51,4,53,92,71]} Sample Test Output: { "model_deployment_crn": "crn:cdp:ml:us-west-1:558bc1d2-8867-4357-8524-311d51259233:workspace:f76bd7eb-adde-43eb-9bd9-e16ec2cb0238/c152a438-6449-465e-8685-e1cc0b9988fa", "prediction": { "data": { "n[1]": [ 54, 53, 112, 151, 154, 153, 52, 51, 4, 53, 92, 71 ], "p[1]": [ 354, 353, 352, 351, 354, 353, 312, 311, 314, 313, 352, 351 ], "p[2]": [ 110, 120, 320, 220, 101, 100, 101, 260, 355, 140, 300, 299 ] }, "optimal prices": [ 400, 300 ], "optimal product quantities": [ 80, 120 ], "total revenue": 68032.83 }, "uuid": "e6700d88-f4e7-4705-988b-89e9c8092194" } Summary In this tutorial, you used PyGurobi in Cloudera Machine Learning to maximize product revenue by identifying optimal prices and sales quantities for two products. The PyGurobi library allows you to solve complex linear and nonlinear programming such as the above. Cloudera on Cloud provides the tooling necessary to use libraries such as PyGurobi in an enterprise setting. With CML you can easily leverage Spark on Kubernetes, Runtime Add-Ons, Iceberg, Python, MLFlow, and more, to install and containerize workloads and machine learning models at scale, without any custom installations. Related Articles and Resources Here are some useful articles about Cloudera Machine Learning (CML) that can help you better understand its features and capabilities: Cloudera Machine Learning - What You Should Know: This article on the Cloudera Community provides an overview of CML, explaining how it enables teams to deploy machine learning workspaces that auto-scale to fit their needs using Kubernetes. It highlights key features like cost-saving through auto-suspend capabilities and offers a consistent experience across an organization. 
The article is a good starting point for understanding CML's role within the Cloudera Data Platform (CDP). Introduction to CML.
How to Use Experiments in Cloudera Machine Learning: This guide walks through using experiments in CML, which allow users to run scripts with different inputs and compare metrics, and are particularly useful for tasks like hyperparameter optimization. The article includes practical examples that illustrate how experiments can be applied in real-world scenarios. MLFlow Experiments in CML.
Cloudera Machine Learning Documentation: This hands-on guide from Datafloq provides a detailed checklist for managing CML projects effectively, with a focus on optimizing productivity and data quality. It discusses essential components such as data cleansing and how these contribute to improved decision-making, which is crucial for successful machine learning outcomes. CML Documentation.
Getting Started with the Gurobi Python API: This tutorial provides a comprehensive introduction to using PyGurobi, from creating a model and adding variables and constraints to setting objectives and optimizing. It explains the use of key functions such as `Model.addVar`, `Model.addConstr`, and `Model.setObjective`, making it an excellent starting point for beginners interested in mathematical optimization using Gurobi in Python. Gurobi Help Center.
Python API Overview: This overview explains the various types of models that can be handled by Gurobi, such as Mixed Integer Linear Programs (MILP), Mixed Integer Quadratic Programs (MIQP), and Non-Linear Programs (NLP). It also covers the environments used within the Gurobi Python interface and gives guidance on solving models, managing multiple solutions, and handling infeasible models. Python API Overview.
Gurobi Optimizer Python Environment: This resource outlines how to set up and start using the Gurobi Python environment, including installing the Gurobi package via Anaconda or pip.
It also highlights the different types of licenses available, including free academic licenses and evaluation licenses for commercial users. Gurobi Optimizer Python Environment.
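To make the idea behind the pricing model in Part 1 concrete, here is a minimal sketch that maximizes revenue for two competing products. It deliberately swaps in an exhaustive grid search in plain Python instead of Gurobi, and the linear demand coefficients are invented for illustration; they are not the regressor or the data used in the tutorial notebooks.

```python
from itertools import product


def demand(p1, p2):
    """Hypothetical linear demand: own price lowers sales, rival price raises them."""
    n1 = 100 - 2 * p1 + p2
    n2 = 100 - 2 * p2 + p1
    return n1, n2


def best_prices(price_grid):
    """Exhaustively search the grid for the price pair with the highest revenue."""
    best = None
    for p1, p2 in product(price_grid, repeat=2):
        n1, n2 = demand(p1, p2)
        if n1 < 0 or n2 < 0:
            continue  # constraint: quantities sold cannot be negative
        revenue = p1 * n1 + p2 * n2  # objective function
        if best is None or revenue > best[2]:
            best = (p1, p2, revenue)
    return best


print(best_prices(range(0, 101)))  # (50, 50, 5000)
```

A solver such as Gurobi reaches the same kind of answer without enumerating every candidate, which matters once the number of products or the price resolution grows.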

pauldefusco · ‎08-27-2024

Objective
Cloudera Data Engineering (CDE) is a cloud-native service for Cloudera Data Platform that allows you to submit batch jobs to auto-scaling virtual clusters. CDE enables you to spend more time on your applications, and less time on infrastructure.
Wheels allow for faster installations and more stability in the package distribution process. In the context of PySpark, Wheels allow you to make the Python modules your application depends on available to executors without having to pip install dependencies on every node, and to use application source code as a package.
In this tutorial you will create a CDE Spark Job using a Wheel file via the CDE CLI.
Project Requirements
In order to execute the Hands On Labs you need:
* A Spark 3 and Iceberg-enabled CDE Virtual Cluster (Azure, AWS and Private Cloud ok).
* The CDE CLI installed on your local machine. If you need to install it for the first time please follow these steps.
* Familiarity with Python, PySpark and the CDE CLI is highly recommended.
* No script code changes are required.
Project Setup
Clone this GitHub repository to your local machine or the VM where you will be running the script.
mkdir ~/Documents/cde_wheel_jobs
cd ~/Documents/cde_wheel_jobs
git clone https://github.com/pdefusco/CDE_Wheel_Jobs.git
Alternatively, if you don't have git installed on your machine, create a folder on your local computer, navigate to this URL, and manually download the files.
Step by Step Instructions
The Spark Job code can be found in the ```mywheel/__main__.py``` file but it does not require modifications. For demo purposes we have chosen to use a simple Spark SQL job. The Wheel has already been created for you and will be available in the ```dist``` directory on your local machine once you clone this project.
Using the Wheel with a CDE Spark Submit
A CDE Spark Submit is the fastest way to prototype a Spark Job. In this example we will run a CDE Spark Submit with the Wheel file.
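The Wheel shipped with this project is named mywheel-0.0.1-py3-none-any.whl. As a minimal sketch, the helper below (the name `parse_wheel_filename` is hypothetical) splits such a file name into the fields of the standard wheel naming convention: distribution, version, optional build tag, Python tag, ABI tag, and platform tag.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its naming-convention fields."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(suffix)].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel file name: " + filename)
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = parse_wheel_filename("mywheel-0.0.1-py3-none-any.whl")
print(info["name"], info["version"], info["python"], info["abi"], info["platform"])
# mywheel 0.0.1 py3 none any
```

The `py3-none-any` tags indicate a pure-Python wheel with no compiled extensions and no platform restriction, which is why the same file can be shipped to every executor.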
Once the CDE CLI is installed, you can launch a CDE Job from your local terminal with the ```cde spark submit``` command. Copy the following command and execute it in your terminal:
cde spark submit --py-files dist/mywheel-0.0.1-py3-none-any.whl mywheel/__main__.py
In the terminal, validate that the Spark Job has launched successfully and note the Job Run ID. Next, navigate to the CDE Job Runs UI and validate job execution.
Open the Job Configuration tab and notice that the Wheel has been uploaded to a File Resource for you. However, notice that the Job Configuration tab does not provide a means to edit or reschedule the job definition. In other words, the entries in the Configuration tab are final. In order to be able to change the definition we will need to create a CDE Spark Job.
Using the Wheel with a CDE Spark Job
Like a CDE Spark Submit, a CDE Spark Job runs Spark application code in a CDE Virtual Cluster. However, the CDE Job allows you to easily define, edit and reuse configurations and resources in future runs. Jobs can be run on demand or scheduled. An individual job execution is called a job run.
In this example we will create a CDE Resource of type File and upload the Spark Application code and the Wheel dependency. Then, we will run the Job. Execute the following CDE CLI commands in your local terminal.
Create the File Resource:
cde resource create --name mywheels
Upload Application Code and Wheel to the File Resource:
cde resource upload --name mywheels --local-path dist/mywheel-0.0.1-py3-none-any.whl
cde resource upload --name mywheels --local-path mywheel/__main__.py
Navigate to the CDE Resource tab and validate that the Resource and the corresponding files are now available.
Create the CDE Spark Job definition:
cde job create --name cde_wheel_job --type spark --py-files mywheel-0.0.1-py3-none-any.whl --application-file __main__.py --mount-1-resource mywheels
Navigate to the CDE Jobs UI and notice a new CDE Spark Job has been created. The job hasn't run yet, so only the Configuration tab is populated with the Spark Job definition.
Finally, run the job. The Job Runs page will now include a new entry reflecting the job execution:
cde job run --name cde_wheel_job
Notice that the CDE Job definition can now be edited. This allows you to make changes to files and dependencies, create or change the job execution schedule, and more. For example, the CDE Job can now be executed again.
Conclusions & Next Steps
CDE is the Cloudera Data Engineering Service, a containerized managed service for Spark and Airflow. If you are exploring CDE you may find the following tutorials relevant:
Spark 3 & Iceberg: A quick intro of Time Travel Capabilities with Spark 3.
Simple Intro to the CDE CLI: An introduction to the CDE CLI for the CDE beginner.
CDE CLI Demo: A more advanced CDE CLI reference with additional details for the CDE user who wants to move beyond the basics.
CDE Resource 2 ADLS: An example integration between ADLS and CDE Resource. This pattern is applicable to AWS S3 as well and can be used to pass execution scripts, dependencies, and virtually any file from CDE to 3rd party systems and vice versa.
Using CDE Airflow: A guide to Airflow in CDE including examples to integrate with 3rd party systems via Airflow Operators such as BashOperator, HttpOperator, PythonOperator, and more.
GitLab2CDE: A CI/CD pipeline to orchestrate Cross-Cluster Workflows for Hybrid/Multicloud Data Engineering.
Postman2CDE: An example of the Postman API to bootstrap CDE Services with the CDE API.
Oozie2CDEAirflow API: An API to programmatically convert Oozie workflows and dependencies into CDE Airflow and CDE Jobs. This API is designed to easily migrate from Oozie to CDE Airflow and not just Open Source Airflow.
For more information on the Cloudera Data Platform and its form factors please visit this site. For more information on migrating Spark jobs to CDE, please reference this guide.
If you have any questions about CDE or would like to see a demo, please reach out to your Cloudera Account Team or send a message through this portal and we will be in contact with you soon.

pauldefusco · ‎05-30-2024

CDP Data Engineering (CDE) is the only cloud-native service purpose-built for enterprise data engineering teams. Building on Apache Spark, Data Engineering is an all-inclusive data engineering toolset that enables orchestration automation with Apache Airflow, advanced pipeline monitoring, visual troubleshooting, and comprehensive management tools to streamline ETL processes across enterprise analytics teams.
Apache Iceberg is an open table format for huge analytic datasets. It provides features that, coupled with Spark as the compute engine, allow you to build data processing pipelines with dramatic gains in terms of scalability, performance, and overall developer productivity.
CDE provides native Iceberg support. With the release of CDE 1.20, the Spark Runtime has been updated with Apache Iceberg 1.3. This version introduces new features that provide great benefits to Data Engineers. This article will familiarize you with Iceberg Table Branching and Tagging in CDE.
Table Branching: the ability to create independent lineages of snapshots, each with its own lifecycle.
Table Tagging: the ability to tag an Iceberg table snapshot.
Requirements
CDE Virtual Cluster of type "All-Purpose" running in CDE Service with version 1.20 or above.
A working installation of the CDE CLI.
The supporting code and associated files and resources are available in this git repository.
Step by Step Instructions
Prerequisites
Create CDE Files Resource:
cde resource create --name myFiles --type files
cde resource upload --name myFiles --local-path resources/cell_towers_1.csv --local-path resources/cell_towers_2.csv
Launch CDE Session & Run Spark Commands:
cde session create --name icebergSession --type pyspark --mount-1-resource myFiles
cde session interact --name icebergSession
Create Iceberg Table:
USERNAME = "pauldefusco"
df = spark.read.csv("/app/mount/cell_towers_1.csv", header=True, inferSchema=True)
df.writeTo("CELL_TOWERS_{}".format(USERNAME)).using("iceberg").tableProperty("write.format.default", "parquet").createOrReplace()
Working with Iceberg Table Branches
Insert Data into Branch:
# LOAD NEW TRANSACTION BATCH
batchDf = spark.read.csv("/app/mount/cell_towers_2.csv", header=True, inferSchema=True)
batchDf.printSchema()
batchDf.createOrReplaceTempView("BATCH_TEMP_VIEW")
# CREATE TABLE BRANCH
spark.sql("ALTER TABLE CELL_TOWERS_{} CREATE BRANCH ingestion_branch".format(USERNAME))
# WRITE DATA OPERATION ON TABLE BRANCH
batchDf.write.format("iceberg").option("branch", "ingestion_branch").mode("append").save("CELL_TOWERS_{}".format(USERNAME))
Notice that a simple SELECT query against the table still returns the original data.
spark.sql("SELECT * FROM CELL_TOWERS_{};".format(USERNAME)).show()
If you want to access the data in the branch, you can specify the branch name in your SELECT query.
spark.sql("SELECT * FROM CELL_TOWERS_{} VERSION AS OF 'ingestion_branch';".format(USERNAME)).show()
Track table snapshots after the append operation on the branch:
# QUERY ICEBERG METADATA HISTORY TABLE
spark.sql("SELECT * FROM CELL_TOWERS_{}.snapshots".format(USERNAME)).show(20, False)
Cherrypicking Snapshots
The cherrypick_snapshot procedure creates a new snapshot incorporating the changes from another snapshot in a metadata-only operation (no new data files are created).
To run the cherrypick_snapshot procedure you need to provide two parameters: the name of the table you’re updating and the ID of the snapshot the table should be updated based on. This transaction will return the snapshot IDs before and after the cherry-pick operation as source_snapshot_id and current_snapshot_id.
You will use the cherrypick operation to commit to the table the changes that have been staged in the 'ingestion_branch' branch up until now.
# SHOW PAST BRANCH SNAPSHOT IDS
spark.sql("SELECT * FROM SPARK_CATALOG.DEFAULT.CELL_TOWERS_{}.refs;".format(USERNAME)).show()
# SAVE THE SNAPSHOT ID CORRESPONDING TO THE CREATED BRANCH
branchSnapshotId = spark.sql("SELECT snapshot_id FROM SPARK_CATALOG.DEFAULT.CELL_TOWERS_{}.refs WHERE NAME == 'ingestion_branch';".format(USERNAME)).collect()[0][0]
# USE THE PROCEDURE TO CHERRY-PICK THE SNAPSHOT
# THIS IMPLICITLY SETS THE CURRENT TABLE STATE TO THE STATE DEFINED BY THE CHOSEN PRIOR SNAPSHOT ID
spark.sql("CALL spark_catalog.system.cherrypick_snapshot('SPARK_CATALOG.DEFAULT.CELL_TOWERS_{0}',{1})".format(USERNAME, branchSnapshotId))
# VALIDATE THE CHANGES
# THE ROW COUNT IN THE CURRENT TABLE STATE NOW REFLECTS THE APPEND OPERATION - PREVIOUSLY THE NEW ROWS WERE VISIBLE ONLY WHEN SELECTING THE BRANCH
spark.sql("SELECT COUNT(*) FROM CELL_TOWERS_{};".format(USERNAME)).show()
Working with Iceberg Table Tags
Tags are immutable labels for Iceberg Snapshot IDs and can be used to reference a particular table version via a simple tag rather than having to work with Snapshot IDs directly.
Create Table Tag: spark.sql("ALTER TABLE SPARK_CATALOG.DEFAULT.CELL_TOWERS_{} CREATE TAG businessOrg RETAIN 365 DAYS".format(USERNAME)).show() Select your table snapshot as of a particular tag: spark.sql("SELECT * FROM SPARK_CATALOG.DEFAULT.CELL_TOWERS_{} VERSION AS OF 'businessOrg';".format(USERNAME)).show() The refs Metadata Table The refs metadata table helps you understand and manage your table’s snapshot history and retention policy, making it a crucial part of maintaining data versioning and ensuring that your table’s size is under control. Among its many use cases, the table provides a list of all the named references within an Iceberg table such as Branch names and corresponding Snapshot IDs. spark.sql("SELECT * FROM SPARK_CATALOG.DEFAULT.CELL_TOWERS_{}.refs;".format(USERNAME)).show() Summary & Next Steps CDE supports Apache Iceberg which provides a table format for huge analytic datasets in the cloud. Iceberg enables you to work with large tables, especially on object stores and supports concurrent reads and writes on all storage media. You can use Cloudera Data Engineering virtual clusters running Spark 3 to interact with Apache Iceberg tables. The Iceberg Metadata Layer can track snapshots under different paths or give particular snapshots a name. These features are respectively called table branching and tagging. Thanks to them, Iceberg Data Engineers can implement data pipelines with advanced isolation, reproducibility, and experimentation capabilities.
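To make the mechanics of branches, cherry-picks, and tags concrete, here is a toy in-memory model in plain Python. It is a conceptual sketch only and not how Iceberg is implemented: the class `ToyTable` and its methods are hypothetical, snapshots hold rows directly instead of data files, and a tag is modeled as a named reference that is simply never moved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    added_rows: List[str]


@dataclass
class ToyTable:
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    refs: Dict[str, int] = field(default_factory=dict)  # ref name -> snapshot id
    next_id: int = 1

    def append(self, rows, ref="main"):
        """Commit new rows as a snapshot on the given ref and advance that ref."""
        snap = Snapshot(self.next_id, self.refs.get(ref), list(rows))
        self.snapshots[snap.snapshot_id] = snap
        self.refs[ref] = snap.snapshot_id
        self.next_id += 1
        return snap.snapshot_id

    def create_ref(self, name, from_ref="main"):
        """Create a branch or tag pointing at the current snapshot of from_ref."""
        self.refs[name] = self.refs[from_ref]

    def read(self, ref="main"):
        """Collect rows by walking the snapshot lineage behind a ref."""
        rows, snap_id = [], self.refs.get(ref)
        while snap_id is not None:
            snap = self.snapshots[snap_id]
            rows = snap.added_rows + rows
            snap_id = snap.parent_id
        return rows

    def cherrypick(self, snapshot_id):
        """Apply the changes of one snapshot on top of main as a new snapshot."""
        return self.append(self.snapshots[snapshot_id].added_rows, ref="main")


table = ToyTable()
table.append(["tower_1", "tower_2"])
table.create_ref("ingestion_branch")
table.append(["tower_3"], ref="ingestion_branch")
print(len(table.read()), len(table.read("ingestion_branch")))  # 2 3

table.cherrypick(table.refs["ingestion_branch"])
table.create_ref("businessOrg")  # tag the current state of main
table.append(["tower_4"])
print(len(table.read()), len(table.read("businessOrg")))  # 4 3
```

The first print mirrors the observation that a plain SELECT still returns the original data until the branch snapshot is cherry-picked; the second shows a tag continuing to point at the state it labeled while the table moves on.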
