Created on 01-06-2025 02:12 PM - edited on 01-07-2025 10:00 PM by VidyaSargur
Spark GraphFrames is a package within Apache Spark that allows users to perform graph processing operations on data using a DataFrame-based approach. It will enable you to perform various graph operations like finding connected components, calculating shortest paths, identifying triangles, and more. Real world applications include social network analysis, recommendation systems, web link structures, flight connections, and other scenarios where relationships between data points are crucial.
Cloudera AI (CAI) is a cloud-native service within the Cloudera Data Platform (CDP) that enables enterprise data science teams to collaborate across the full data lifecycle. It provides immediate access to enterprise data pipelines, scalable compute resources, and preferred tools, streamlining the process of moving analytic workloads from research to production.
Using Spark GraphFrames in Cloudera AI requires minimum configuration effort. This quickstart provides a basic example so you can get started quickly.
You can use Spark GraphFrames in CAI Workbench with Spark Runtime Addon versions 3.2 or above. When you create the SparkSession object, make sure to select the right version of the package reflecting the Spark Runtime Addon you set during the CAI Session launch from SparkPackages.
This example was created with the following platform and system versions, but it will also work in other Workbench versions (below or above 2.0.46).
Create a CAI Workbench Session with the following settings:
In the Session, run the script. Notice the following:
spark = SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages", "graphframes:graphframes:0.8.4-spark3.5-s_2.12") \
.getOrCreate()
from graphframes import *
# Vertex DataFrame
v = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)
], ["id", "name", "age"])
# Edge DataFrame
e = spark.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend"),
("a", "e", "friend")
], ["src", "dst", "relationship"])
# Create a GraphFrame
g = GraphFrame(v, e)
> g.vertices.show()
+---+-------+---+
| id| name|age|
+---+-------+---+
| a| Alice| 34|
| b| Bob| 36|
| c|Charlie| 30|
| d| David| 29|
| e| Esther| 32|
| f| Fanny| 36|
| g| Gabby| 60|
+---+-------+---+
> g.edges.show()
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
| a| b| friend|
| b| c| follow|
| c| b| follow|
| f| c| follow|
| e| f| follow|
| e| d| friend|
| d| a| friend|
| a| e| friend|
+---+---+------------+
#Find the youngest user’s age in the graph. This queries the vertex DataFrame.
> g.vertices.groupBy().min("age").show()
+--------+
|min(age)|
+--------+
| 29|
+--------+
Cloudera AI (CAI) is a cloud-native service within the Cloudera Data Platform (CDP) that enables enterprise data science teams to collaborate across the full data lifecycle. It provides immediate access to enterprise data pipelines, scalable compute resources, and preferred tools, streamlining the process of moving analytic workloads from research to production. In particular:
Finally, you can learn more about Cloudera AI Workbench with the following recommended blogs and community articles: