Created on 03-20-2017 06:42 PM - edited 09-16-2022 04:17 AM
Hello -
I have a complex JSON file that I need to parse, store the results in a staging table and then move it to ORC table by applying some transformations. The problem is that the JSON has a lot of objects and arrays. I was able to parse the file and store it in a stage JSONSerDe table by defining the various objects and arrays. However, I would like to denormalize the data and store it in the target ORC table. How do I map the JSON array in 1 row and create multiple rows out of it?
Sample Data -
{"eventHeader":{"eventType":"x", "eventOutcome":"declined"}, "eventPayload":{"nr":"01", "Flag":false, "from":"45", "to":"67", "amts":[{"type":"A","amt":1,"impPrcsnAmt":4}, {"type":"B","amt":23,"impPrcsnAmt":4}, {"type":"C","amt":98,"impPrcsnAmt":4} ] }}
This shows up as one row in the stage table. As mentioned above, I would like to move it into 3 rows in the target table.
Can anyone help and advise?
Created 03-20-2017 09:39 PM
You can use nested types in Hive with ORC (struct, lists, maps). This way you don't have to explode each record into multiple rows. This also gives you the benefit in denormalization as the data that is part of same array will be physically sitting together on the disk - so it has read benefits.
Created 03-21-2017 11:31 PM
Actually, we have implemented an example of how to do it in the ORC project. In particular, ORC-150 added both a JSON schema discovery tool and JSON to ORC converter.
Given the single row of data above, the schema discovery tool produces the schema below. Obviously given more data, it would produce a better schema. The JSON conversion tool uses a provided schema (or runs the schema discovery tool) to convert the data.
struct< eventHeader:struct< eventOutcome:string, eventType:string>, eventPayload:struct< Flag:boolean, amts:array<struct<amt:tinyint,impPrcsnAmt:tinyint,type:string>>, from:binary, nr:binary, to:binary>>
Created 12-07-2017 04:08 PM
Nice work!
How well does this work with the numerous ORC performance enhancements?
Is HIVE-14565 a concern with respect to using deeply nested structures?
Thanks.
Created 03-27-2017 03:02 PM
You can assemble in NiFi and then store to ORC. I recommend breaking your JSON down into simpler structures since you will have to query it and use it with other data.
Can you make it a wide table? Duplicate data is not a big deal for Hadoop.