
DenseRank() in Spark DataFrame usage?



I have a requirement to use denseRank to order elements based on date.

JavaRDD<PageData> rddX = connector.getSparkContext().parallelize(pageData,20);

DataFrame processedData= context.createDataFrame(rddX, PageData.class);

processedData = processedData.select(
        processedData.col("date_publication"),
        processedData.col("date_application"),
        processedData.col("id_ref"),
        processedData.col("id_unite"),
        processedData.col("libelle"),
        processedData.col("valeur"),
        processedData.col("zone"),
        processedData.col("tableau"),
        org.apache.spark.sql.functions.denseRank()
                .over(org.apache.spark.sql.expressions.Window
                        .partitionBy(processedData.col("tableau"))
                        .orderBy(processedData.col("libelle")))
                .alias("ordre"));

The above code gives an error. Can anyone please help me with this?
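As background on what denseRank computes (leaving the PageData class and column names aside): tied values share a rank, and ranks stay consecutive with no gaps, unlike rank(). A small plain-Java sketch of that semantics, independent of Spark:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DenseRankDemo {

    // Dense rank over an already-sorted list: equal values share a rank,
    // and each new distinct value gets the next consecutive rank.
    static Map<String, Integer> denseRank(List<String> sorted) {
        Map<String, Integer> ranks = new LinkedHashMap<>();
        int rank = 0;
        for (String v : sorted) {
            if (!ranks.containsKey(v)) {
                ranks.put(v, ++rank);
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Values already sorted, as an ORDER BY on the column would produce
        List<String> libelles = Arrays.asList("a", "a", "b", "d");
        System.out.println(denseRank(libelles)); // prints {a=1, b=2, d=3}
    }
}
```

This is what `dense_rank() OVER (...)` produces within each window partition; with `rank()` the "d" above would instead get rank 4, because the tie on "a" would consume two rank positions.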


Cloudera Employee

Hi @Rambabu Chamakuri

It might be easier to express it as an SQL statement:

// sc is an existing JavaSparkContext
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

JavaRDD<PageData> rddX = sc.parallelize(pageData,20);
// Apply the schema to the RDD to create a DataFrame
DataFrame processedData = sqlContext.createDataFrame(rddX, PageData.class);

// Register the DataFrame as a table so it can be queried by name
processedData.registerTempTable("data");

// Use SQL to express your queries
DataFrame result = sqlContext.sql("SELECT date_publication, date_application, id_ref, id_unite, libelle, valeur, zone, tableau, dense_rank() OVER (PARTITION BY tableau ORDER BY libelle DESC) as ordre FROM data");

You may have already read them, but here are a few good resources to help you out:

Databricks' "Introducing Window Functions in Spark SQL" blog post

Apache Spark programming guide