
Parsing JSON data and Storing it in ORC Hive

New Contributor

Hello -

I have a complex JSON file that I need to parse, store in a staging table, and then move into an ORC table after applying some transformations. The problem is that the JSON contains many nested objects and arrays. I was able to parse the file and load it into a staging JSONSerDe table by defining the various objects and arrays. However, I would now like to denormalize the data and store it in the target ORC table. How do I take the JSON array that sits in one row and turn it into multiple rows?

Sample Data -

{"eventHeader":{"eventType":"x", "eventOutcome":"declined"}, "eventPayload":{"nr":"01", "Flag":false, "from":"45", "to":"67", "amts":[{"type":"A","amt":1,"impPrcsnAmt":4}, {"type":"B","amt":23,"impPrcsnAmt":4}, {"type":"C","amt":98,"impPrcsnAmt":4} ] }}

This shows up as one row in the staging table. As mentioned above, I would like to turn it into 3 rows (one per element of the amts array) in the target table.
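
For reference, the staging table is defined roughly along these lines (simplified sketch only: the table name, location, and SerDe class are illustrative, and I've left out the from/to keys here since FROM and TO are reserved words in Hive and need quoting or remapping):

-- Staging table over the raw JSON; one document per line becomes one row,
-- and the amts array stays nested.
CREATE EXTERNAL TABLE stg_events (
  eventheader  STRUCT<eventtype:STRING, eventoutcome:STRING>,
  eventpayload STRUCT<
    nr:STRING,
    flag:BOOLEAN,
    amts:ARRAY<STRUCT<type:STRING, amt:INT, impprcsnamt:INT>>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/raw/events';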

Can anyone help and advise?

4 Replies

Super Guru
@G P

You can use nested types (structs, arrays, maps) in Hive with ORC. That way you don't have to explode each record into multiple rows. It also gives you a denormalization benefit: the data belonging to the same array sits physically together on disk, which helps reads.
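
For example, an ORC table that keeps the nested shape of your sample could look something like this (just a sketch; names are illustrative, and stg_events is the hypothetical staging table sketched in the question):

-- ORC target that preserves the nested structure, so no explode is needed.
CREATE TABLE events_orc (
  eventheader  STRUCT<eventtype:STRING, eventoutcome:STRING>,
  eventpayload STRUCT<
    nr:STRING,
    flag:BOOLEAN,
    amts:ARRAY<STRUCT<type:STRING, amt:INT, impprcsnamt:INT>>>
)
STORED AS ORC;

-- Complex types copy across directly from the JSON staging table.
INSERT INTO TABLE events_orc
SELECT eventheader, eventpayload
FROM stg_events;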

Cloudera Employee

Actually, we have implemented an example of this in the ORC project. In particular, ORC-150 added both a JSON schema discovery tool and a JSON to ORC converter.

Given the single row of data above, the schema discovery tool produces the schema below. Obviously given more data, it would produce a better schema. The JSON conversion tool uses a provided schema (or runs the schema discovery tool) to convert the data.

struct<
  eventHeader:struct<
    eventOutcome:string,
    eventType:string>,
  eventPayload:struct<
    Flag:boolean,
    amts:array<struct<amt:tinyint,impPrcsnAmt:tinyint,type:string>>,
    from:binary,
    nr:binary,
    to:binary>>

New Contributor

Nice work!
How well does this work with the numerous ORC performance enhancements?
Is HIVE-14565 a concern with respect to using deeply nested structures?

Thanks.

Master Guru

You can assemble the data in NiFi and then store it as ORC. I recommend breaking your JSON down into simpler structures, since you will have to query it and use it alongside other data.

Can you make it a wide table? Duplicate data is not a big deal for Hadoop.
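
To flatten it into a wide table with one row per amts element, something along these lines should work (sketch only; stg_events is the hypothetical staging table sketched in the question, and the reserved-word from/to fields are again omitted):

-- Wide, flattened ORC target: header/payload fields repeat on each row.
CREATE TABLE events_flat (
  eventtype    STRING,
  eventoutcome STRING,
  nr           STRING,
  flag         BOOLEAN,
  amt_type     STRING,
  amt          INT,
  impprcsnamt  INT
)
STORED AS ORC;

-- LATERAL VIEW explode() turns the 3-element amts array into 3 rows,
-- duplicating the header and payload scalars on each of them.
INSERT INTO TABLE events_flat
SELECT
  e.eventheader.eventtype,
  e.eventheader.eventoutcome,
  e.eventpayload.nr,
  e.eventpayload.flag,
  a.type,
  a.amt,
  a.impprcsnamt
FROM stg_events e
LATERAL VIEW explode(e.eventpayload.amts) amt_tbl AS a;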