
DenseRank() in Spark DataFrame usage?





I need to use a dense rank to order the elements based on date.

JavaRDD<PageData> rddX = connector.getSparkContext().parallelize(pageData,20);

DataFrame processedData= context.createDataFrame(rddX, PageData.class);

DataFrame processedData = processedData.select(
    processedData.col("date_publication"),
    processedData.col("date_application"),
    processedData.col("id_ref"),
    processedData.col("id_unite"),
    processedData.col("libelle"),
    processedData.col("valeur"),
    processedData.col("zone"),
    processedData.col("tableau"),
    org.apache.spark.sql.functions.denseRank()
        .over(org.apache.spark.sql.expressions.Window
            .partitionBy(processedData.col("tableau"))
            .orderBy(processedData.col("libelle")))
        .alias("ordre"));

The code above gives an error. Can anyone please help me with this?


Re: DenseRank() in Spark DataFrame usage?

Cloudera Employee

Hi @Rambabu Chamakuri

It might be easier to express it as an SQL statement:

// sc is an existing JavaSparkContext
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

JavaRDD<PageData> rddX = sc.parallelize(pageData,20);
// Apply the schema to the RDD to create a DataFrame
DataFrame processedData = sqlContext.createDataFrame(rddX, PageData.class);

// Register the DataFrame as a table.
processedData.registerTempTable("data");

// Use SQL to express your queries
DataFrame result = sqlContext.sql("SELECT date_publication, date_application, id_ref, id_unite, libelle, valeur, zone, tableau, dense_rank() OVER (PARTITION BY tableau ORDER BY libelle DESC) as ordre FROM data");
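
To make the semantics of the query above concrete, here is a minimal plain-Java sketch (no Spark involved) of what dense_rank() OVER (PARTITION BY tableau ORDER BY libelle DESC) computes. The class DenseRankSketch, its Row type and the sample values are hypothetical stand-ins for illustration only; the real PageData bean has more columns, and this is not how Spark implements the function.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DenseRankSketch {

    // Hypothetical two-field stand-in for PageData (assumption, not the real bean).
    static final class Row {
        final String tableau;
        final String libelle;
        int ordre; // filled in by denseRank

        Row(String tableau, String libelle) {
            this.tableau = tableau;
            this.libelle = libelle;
        }
    }

    // Mimics dense_rank() OVER (PARTITION BY tableau ORDER BY libelle DESC).
    static void denseRank(List<Row> rows) {
        // PARTITION BY tableau: group the rows by their tableau value.
        Map<String, List<Row>> partitions = new LinkedHashMap<>();
        for (Row r : rows) {
            partitions.computeIfAbsent(r.tableau, k -> new ArrayList<>()).add(r);
        }
        for (List<Row> partition : partitions.values()) {
            // ORDER BY libelle DESC within each partition.
            partition.sort(Comparator.comparing((Row r) -> r.libelle).reversed());
            int rank = 0;
            String previous = null;
            for (Row r : partition) {
                // The rank only increases when the ordering value changes,
                // so ties share a rank and no rank numbers are skipped.
                if (!r.libelle.equals(previous)) {
                    rank++;
                    previous = r.libelle;
                }
                r.ordre = rank;
            }
        }
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("t1", "b"));
        rows.add(new Row("t1", "c"));
        rows.add(new Row("t1", "b"));
        rows.add(new Row("t1", "a"));
        rows.add(new Row("t2", "x"));
        denseRank(rows);
        for (Row r : rows) {
            System.out.println(r.tableau + " " + r.libelle + " " + r.ordre);
        }
    }
}
```

With these sample rows, both "b" rows in partition t1 get ordre 2 and "a" gets 3. A plain rank() would give "a" the value 4, because it leaves a gap after ties.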

You may have already read them, but here are a few good resources to help you out:

Databricks' "Introducing Window Functions in Spark SQL" blog article

Apache Spark programming guide
