Support Questions

kctan6 · ‎06-17-2016

For example, I have a Spark DataFrame with three columns 'Domain', 'ReturnCode', and 'RequestType'

Example Starting Dataframe

www.google.com,200,GET
www.google.com,300,GET
www.espn.com,200,POST

I would like to pivot on Domain and get aggregate counts for the various ReturnCodes and RequestTypes. Do I have to pivot each table separately and then join them back together or is it possible to do it in one step?

Desired Dataframe

Domain,200,300,GET,POST
www.google.com,1,1,2,0
www.espn.com,1,0,0,1

Example of Pivot Code with Join

val dfa = df.groupby('Domain').pivot('ReturnCode').count()
val dfb = df.groupby('Domain').pivot('RequestType').count()
dfa.join(dfb, Seq("ReferrerDomain", "ReferrerDomain")).show()

GeeKay2015 · ‎06-18-2016

Since we can do pivoting on only one column so one way of doing in one go is combine the 2 columns to a new column and use that new column as pivot column. The output is some what close to what you are expecting. Hope this helps.

The Input:

Domain,ReturnCode,RequestTyp
ewww.google.com,200,GET
www.google.com,300,GET
www.espn.com,200,POST

The code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.functions.udf

object pivotDF {
  // Define the application Name
  val AppName: String = "pivotDF"

  // Set the logging level
  Logger.getLogger("org.apache").setLevel(Level.ERROR)

  // Define a udf to concatenate two passed in string values
  val concat = udf( (first: String, second: String) => { first + " " + second } )

  def main (args: Array[String]) {

    // define the input parameters
    val input_file = "/Users/gangadharkadam/myapps/pivot/src/main/resources/domains.csv"

    // Create the Spark configuration and the spark context
    println("Initializing the Spark Context...")

    val conf = new SparkConf().setAppName(AppName).setMaster("local")

    // Define the Spark Context
    val sc = new SparkContext(conf)

    // Define the SQL context
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    //Load and parse the Engine Information data into Spark DataFrames
    val domainDF = sqlContext

      //Define the format
      .read.format("com.databricks.spark.csv")

      //use the first line as header
      .option("header", "true")
      
      //Automatically infer the data types and schema
      .option("inferSchema", "true")
      
      //load the file
      .load(input_file)
    
    // pivot using concatenated column
    domainDF.withColumn("combColumn", concat($"ReturnCode",$"RequestType"))
      .groupBy("domain").pivot("combColumn").agg(count).show()
  }
}

The output:

domain	200 GET	200 POST	300 GET
www.espn.com	0	1	0
www.google.com	1	0	1

kctan6 · ‎06-20-2016

Thanks @Gangadhar Kadam, i understand that this is a possible workaround but I would like to know if I want the two columns separate is that possible or do I need to do the pivot twice and join the tables?

Cloudera Community

Support Questions

How do I run multiple pivots on a Spark DataFrame?

Spark DataFrame to Solr Cloud - runs on Sandbox 2....

Kafka producer running into multiple org.apache.ka...

Accessing Hbase tables and querying on Dataframes ...

TimestampType format for Spark DataFrames

Save Spark DataFrame table into Phoenix

Not able to split the column into multiple columns...

Spark Dataframe Query not working

Issues running Spark in CML

Spark RDD/Dataframe caching

Saving Spark 2.2 dataframs in Hive table