Created on 09-07-202312:06 PM - edited 09-07-202312:08 PM
Pandas is a fast, powerful and flexible Python data analysis and manipulation tool. It has gained incredible popularity in the Data Science community as one of the most important packages for data wrangling and preparation. With the Pandas DataFrame at the core of its API, it is very simple to use and performant on relatively small datasets.
On the other hand, Apache Spark is an open-source, distributed processing system used for big data workloads. It has also gained extreme popularity as the go-to engine for prototyping and deploying Data Engineering and Machine Learning pipelines at scale. Due to the maturity and reliability that the Spark project has achieved in recent years it is widely used for production use cases in the enterprise.
Like Pandas, Spark features an API with the DataFrame as a foundational data structure for analyzing data at scale. However, while it is dramatically more performant at scale than Pandas it has never been quite as easy and intuitive to use.
The Koalas project was created to address this gap. The idea was simple: provide users with the ability to run their existing Pandas code on Spark. With Spark 3.2 the project has been fully incorporated in PySpark as the "Pandas API on Spark".
Cloudera Machine Learning (CML) is Cloudera’s new cloud-native machine learning service, built for CDP. The CML service provisions clusters, also known asML workspaces, that run natively on Kubernetes. ML workspaces support fully-containerized execution of Python, R, Scala, and Spark workloads through flexible and extensibleengines.
The Pandas on Spark API is a common choice among CML users. The attached notebook provides a simple Quickstart for a CML Session. To use it in CML you must have:
A CML Session with the Python Kernel and the Spark Add on enabled (Spark Version 3.2 or above only).
Even though Pandas on Spark does not require a SparkSession or SparkContext object, use the CML Spark Data Connection to launch a SparkSession object and set custom configurations. For example, this will allow you to read Iceberg tables into a Pandas On Spark DataFrame with a single line of code.
The same Spark performance concepts such as trying to avoid shuffles apply to Pandas on Spark. For example you should leverage checkpointing when utilizing the same DataFrame repeatedly, shuffle with caution, and review the Spark Plan when needed: