I create DataFrames from Parquet and JSON files that contain nested structs which vary substantially from one file to the next.
I would like to flatten out the names of all of the columns present in every struct contained in the DataFrame. However, columns only returns the top-level column names, and I cannot find a way to iterate over the nested fields without supplying the column names explicitly.
%sh
hdfs dfs -cat input.json
{"UID":3463,"well":{"UID":3463,"wellbore":{"UID":1242,"Value":1}}}
%spark2
data.columns
data.printSchema
res9: Array[String] = Array(UID, well)
root
|-- UID: long (nullable = true)
|-- well: struct (nullable = true)
| |-- UID: long (nullable = true)
| |-- wellbore: struct (nullable = true)
| | |-- UID: long (nullable = true)
| | |-- Value: long (nullable = true)
Ideally, I would like something like:
data.columns.flatten
res9: Array[String] = Array(UID, well, well.UID, well.wellbore, well.wellbore.UID, well.wellbore.Value) ... and so on
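I imagine a recursive walk over data.schema could produce this list. Here is a minimal sketch of what I have in mind (flattenSchema is a name I made up, and I have not verified that this is the idiomatic approach):

%spark2
import org.apache.spark.sql.types.StructType

// Recursively collect every field path in the schema, including the
// intermediate struct columns themselves (e.g. both "well" and "well.UID").
def flattenSchema(schema: StructType, prefix: String = ""): Array[String] =
  schema.fields.flatMap { field =>
    val name = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case st: StructType => name +: flattenSchema(st, name)
      case _              => Array(name)
    }
  }

flattenSchema(data.schema)
// expected: Array(UID, well, well.UID, well.wellbore, well.wellbore.UID, well.wellbore.Value)

Is there a built-in way to do this, or is a recursive schema walk like the above the way to go?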