
How can one flatten arbitrary structs within a DataFrame in Spark / SparkR?

Contributor

I create dataframes from Parquet and JSON that contain nested structs that vary substantially from one file to the next.

I would like to flatten all of the columns present in every struct contained in the data frame. However, columns only gives the top-level column names, and I cannot find a way to iterate over the nested fields without supplying their names up front.

%sh
hdfs dfs -cat input.json

{"UID":3463,"well":{"UID":3463,"wellbore":{"UID":1242,"Value":1}}}
%spark2
data.columns
data.printSchema

res9: Array[String] = Array(UID, well)
root
 |-- UID: long (nullable = true)
 |-- well: struct (nullable = true)
 |    |-- UID: long (nullable = true)
 |    |-- wellbore: struct (nullable = true)
 |    |    |-- UID: long (nullable = true)
 |    |    |-- Value: long (nullable = true)

Ideally I would like

data.columns.flatten

res9: Array[String] = Array(UID, well, well.UID, well.wellbore, well.wellbore.UID, well.wellbore.Value, ...) and so on
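
One way to produce a list like that is to walk data.schema recursively; this is only a minimal sketch, where fieldPaths is a hypothetical helper written against the schema printed above:

import org.apache.spark.sql.types.StructType

// Collect every field path in a schema, descending into nested structs
def fieldPaths(schema: StructType, prefix: String = ""): Array[String] =
  schema.fields.flatMap { f =>
    val name = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
    f.dataType match {
      case st: StructType => name +: fieldPaths(st, name)
      case _              => Array(name)
    }
  }

fieldPaths(data.schema)
// Array(UID, well, well.UID, well.wellbore, well.wellbore.UID, well.wellbore.Value)
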
2 Replies

Explorer

Have you tried using the SQL explode function? I haven't tried this with SparkR, but I've used it previously to flatten a hierarchical JSON structure.

import org.apache.spark.sql.functions.explode

// wholeTextFiles reads each file as a single record, so multi-line JSON parses as one document
val jsonRDD = sc.wholeTextFiles("mydataset.json")
val jsonDF = sqlContext.read.json(jsonRDD.values)
jsonDF.createOrReplaceTempView("jd")

// explode() expands each element of an array (or map) column into one row per element
// ("data" here is a placeholder for whichever array or map column your JSON contains)
val sqlDF = spark.sql("SELECT explode(data) FROM jd")
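
For reference, this is roughly what explode does on an array column; a toy sketch, not from the original reply, with a made-up arrayDF and letters column:

import org.apache.spark.sql.functions.explode
import spark.implicits._

// Each element of the "letters" array becomes its own row
val arrayDF = Seq((1, Seq("a", "b"))).toDF("id", "letters")
arrayDF.select($"id", explode($"letters").as("letter")).show()
// +---+------+
// | id|letter|
// +---+------+
// |  1|     a|
// |  1|     b|
// +---+------+
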

Contributor

explode is only implemented for map and array columns. When you pass in a struct, Spark throws: "data type mismatch: input to function explode should be array or map type"
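
Since explode only covers arrays and maps, struct columns have to be flattened by selecting their leaf fields instead. A minimal sketch under that assumption, where flattenSelect is a hypothetical helper written against the schema from the question:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Build a select list that replaces each struct column with its leaf fields,
// aliasing e.g. "well.wellbore.UID" as "well_wellbore_UID"
def flattenSelect(schema: StructType, prefix: String = ""): Array[Column] =
  schema.fields.flatMap { f =>
    val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
    f.dataType match {
      case st: StructType => flattenSelect(st, path)
      case _              => Array(col(path).alias(path.replace(".", "_")))
    }
  }

val flatDF: DataFrame = data.select(flattenSelect(data.schema): _*)
flatDF.printSchema
// root
//  |-- UID: long (nullable = true)
//  |-- well_UID: long (nullable = true)
//  |-- well_wellbore_UID: long (nullable = true)
//  |-- well_wellbore_Value: long (nullable = true)

Replacing the dots with underscores in the aliases keeps the flattened column names valid and unambiguous at the top level.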