New Contributor
Posts: 1
Registered: ‎09-04-2015

DataFrame explode maintain structure

I've seen this post https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Explode-function-in-Data-Frames/td...

It was helpful for that example, but for a more complicated schema (one read from a JSON file, say), the "one level deep" String examples don't help much.

How would one maintain the structure/schema of the exploded items in a schema like this (exploding the LineItems array under ContractLineItems)?

e.g. schema

ProcurementDocument: struct (nullable = true)
|-- AwardInstrument: struct (nullable = true)
|    |-- ContractLineItems: struct (nullable = true)
|    |    |-- LineItems: array (nullable = true)
|    |    |    |    |-- LineItemBasicInformation: struct (nullable = true)
|    |    |    |    |    |-- PricingArrangement: struct (nullable = true)
|    |    |    |    |    |    |-- PricingArrangementBase: string (nullable = true)

LineItems is an array of structs. I've tried sqlContext.sql("select .... lateral view explode()"), but I can't get it to work.
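For reference, here is a rough sketch of what a LATERAL VIEW query over this schema could look like. It assumes Spark 2.2 or later (SparkSession, and spark.read.json on a Dataset[String]); on Spark 1.x, LATERAL VIEW likely needs a HiveContext rather than a plain SQLContext. The object name, the view name procurementDocs, the aliases, and the sample values "FIXED" and "COST" are all made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LateralViewLineItems {
  // Hypothetical sample: one document with two line items, shaped like the schema above.
  def sampleDocs(spark: SparkSession): DataFrame = {
    import spark.implicits._
    def lineItem(base: String): String =
      s"""{"LineItemBasicInformation":{"PricingArrangement":{"PricingArrangementBase":"$base"}}}"""
    val items = Seq("FIXED", "COST").map(lineItem).mkString(",")
    val doc =
      s"""{"ProcurementDocument":{"AwardInstrument":{"ContractLineItems":{"LineItems":[$items]}}}}"""
    spark.read.json(Seq(doc).toDS())
  }

  // explode() is given the full dotted path to the array. Each output row
  // carries one array element as a struct column named lineItem.
  def explodeWithSql(spark: SparkSession, docs: DataFrame): DataFrame = {
    docs.createOrReplaceTempView("procurementDocs")
    spark.sql(
      """SELECT lineItem
        |FROM procurementDocs
        |LATERAL VIEW explode(
        |  ProcurementDocument.AwardInstrument.ContractLineItems.LineItems
        |) exploded_items AS lineItem""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("lateral-view-sketch").getOrCreate()
    val exploded = explodeWithSql(spark, sampleDocs(spark))
    exploded.printSchema() // lineItem keeps its nested struct fields
    exploded
      .selectExpr("lineItem.LineItemBasicInformation.PricingArrangement.PricingArrangementBase")
      .show()
    spark.stop()
  }
}
```

The point of the sketch is that explode() gets the whole path down to LineItems, not just the ContractLineItems struct, and that the resulting lineItem column is a struct whose fields are reachable with dot notation.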

Expected result: if a document has 5 LineItems, exploding should give 5 rows, with the full structure and all fields of each line item available in the new column.

The other examples simply explode Strings, e.g. people.explode("words","word"){c: String => c.split(" ") }

Anything more complex, and that's where I'm stuck.

I'm looking for something like procurementDocs.explode("ContractLineItems","lineItems"){ a: Seq[Row] (or Array? List? _?) => the complete structure? }

and to have the exploded line items still contain LineItemBasicInformation, PricingArrangement, PricingArrangementBase.
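One way this might be done without a typed lambda is the column function org.apache.spark.sql.functions.explode, applied to the dotted path of the array. The sketch below assumes Spark 2.2 or later; the object name, the helper names, the output column name lineItem, and the sample values "FIXED" and "COST" are hypothetical. As far as I know, the DataFrame.explode method that takes a function was deprecated in Spark 2.0 in favor of this form.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeLineItems {
  // Hypothetical sample: one document with two line items, shaped like the schema in the post.
  def sampleDocs(spark: SparkSession): DataFrame = {
    import spark.implicits._
    def lineItem(base: String): String =
      s"""{"LineItemBasicInformation":{"PricingArrangement":{"PricingArrangementBase":"$base"}}}"""
    val items = Seq("FIXED", "COST").map(lineItem).mkString(",")
    val doc =
      s"""{"ProcurementDocument":{"AwardInstrument":{"ContractLineItems":{"LineItems":[$items]}}}}"""
    spark.read.json(Seq(doc).toDS())
  }

  // Adds a lineItem column holding one array element per row. The element is
  // a struct, so LineItemBasicInformation and everything below it is kept.
  // The original ProcurementDocument column is kept alongside it.
  def explodeLineItems(docs: DataFrame): DataFrame =
    docs.withColumn(
      "lineItem",
      explode(col("ProcurementDocument.AwardInstrument.ContractLineItems.LineItems")))

  // Reaches into the exploded struct with ordinary dot notation.
  def pricingBases(exploded: DataFrame): DataFrame =
    exploded.select(
      col("lineItem.LineItemBasicInformation.PricingArrangement.PricingArrangementBase")
        .as("PricingArrangementBase"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("explode-sketch").getOrCreate()
    val exploded = explodeLineItems(sampleDocs(spark))
    exploded.printSchema()
    pricingBases(exploded).show()
    spark.stop()
  }
}
```

Because explode here works on the column rather than on a Scala function, there is no need to pick between Seq[Row], Array, or List for the lambda argument; the element type is taken from the DataFrame schema.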