DataFrame explode maintain structure

I've seen this post: https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Explode-function-in-Data-Frames/td...

It was helpful for that example, but for a more complicated schema (loaded from a JSON file, for example), the "one level deep" String examples don't help much.

How would one maintain the structure/schema of the exploded items in a schema like this (exploding on the LineItems array inside ContractLineItems)?

Example schema:

ProcurementDocument: struct (nullable = true)
|-- AwardInstrument: struct (nullable = true)
|    |-- ContractLineItems: struct (nullable = true)
|    |    |-- LineItems: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- LineItemBasicInformation: struct (nullable = true)
|    |    |    |    |    |-- PricingArrangement: struct (nullable = true)
|    |    |    |    |    |    |-- PricingArrangementBase: string (nullable = true)

LineItems is an array (of other stuff/structs). I've tried with sqlContext.sql("select .... lateral view explode()") but I can't get it to work.

Expected result: when I explode the line items under ContractLineItems, if there are 5 LineItems, I want the full structure and all fields of each one to be available in the new column.

The other examples simply explode on Strings, e.g. people.explode("words", "word") { c: String => c.split(" ") }

Anything more complex, and that's where I'm stuck.

I'm looking for something like procurementDocs.explode("ContractLineItems", "lineItems") { a: Seq[Row]? Array? List? _? => the complete structure? }

and to have the exploded line items still contain LineItemBasicInformation, PricingArrangement, PricingArrangementBase.
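
One direction that might fit, offered as a sketch under assumptions rather than something confirmed against the real data: instead of DataFrame.explode with a closure, apply the `explode` column function from `org.apache.spark.sql.functions` to the full dotted path of the array, so each element keeps its struct type. This assumes Spark 2.2 or later (SparkSession, and `spark.read.json` on a `Dataset[String]`). The object name `ExplodeSketch`, the helpers `sampleDocs` and `explodeLineItems`, the output column `lineItem`, and the sample values "A" and "B" are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeSketch {

  // Hypothetical sample data: one document with two line items,
  // shaped like the schema in the question.
  def sampleDocs(spark: SparkSession): DataFrame = {
    import spark.implicits._
    def item(base: String): String =
      s"""{"LineItemBasicInformation":{"PricingArrangement":{"PricingArrangementBase":"$base"}}}"""
    val json =
      s"""{"ProcurementDocument":{"AwardInstrument":{"ContractLineItems":{"LineItems":[${item("A")},${item("B")}]}}}}"""
    spark.read.json(Seq(json).toDS())
  }

  // Adds one row per array element; the new column holds the whole element struct.
  def explodeLineItems(docs: DataFrame): DataFrame =
    docs.withColumn(
      "lineItem",
      explode(col("ProcurementDocument.AwardInstrument.ContractLineItems.LineItems")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-sketch")
      .getOrCreate()

    val exploded = explodeLineItems(sampleDocs(spark))

    // The nested fields stay addressable through the new column.
    exploded.printSchema()
    exploded
      .select("lineItem.LineItemBasicInformation.PricingArrangement.PricingArrangementBase")
      .show()

    spark.stop()
  }
}
```

Note that `explode` produces no output row for a document whose LineItems array is null or empty, so documents without line items would drop out of the result.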