
Transform DataFrame with Array Column



In Spark 2.2 I have a DataFrame like this:

id	|	ValueList (Array<Struct>)
1	|	[(z, 1), (y, 2), (x, 3)]
2	|	[(y, 3), (x, 1), (u, 5)]

I want to transform this DataFrame by taking each distinct key found in the structs of the ValueList column and generating a new column with that name, holding the corresponding value (or null if the key does not exist in that row). The final DataFrame should look like this:

id	|	ValueList			|	u	|	x	| 	y	|	z
1	|	[(z, 1), (y, 2), (x, 3)]	|	null	|	3	|	2	|	1
2	|	[(y, 3), (x, 1), (u, 5)]	|	5	|	1	|	3	|	null

How can I do this? I couldn't find a suitable function for reading the values out of the Array<Struct> column, so I don't see how to transform the ValueList column into this new format.

What I have done so far is read the distinct keys (here u, x, y, z) from the DataFrame by collecting the data and iterating over the resulting list of Rows. But how can I now use this information (I guess in combination with DF.withColumn) to fill the new DataFrame and set the missing values to null?

Any help would be appreciated, thank you!
