
Parsing JSON data and Storing it in ORC Hive

New Contributor

Hello -

I have a complex JSON file that I need to parse, store in a staging table, and then move into an ORC table after applying some transformations. The problem is that the JSON contains many nested objects and arrays. I was able to parse the file and load it into a staging JSONSerDe table by defining the various objects and arrays. However, I would now like to denormalize the data and store it in the target ORC table. How do I take the JSON array that sits in one row and turn it into multiple rows?

Sample Data -

{"eventHeader":{"eventType":"x", "eventOutcome":"declined"}, "eventPayload":{"nr":"01", "Flag":false, "from":"45", "to":"67", "amts":[{"type":"A","amt":1,"impPrcsnAmt":4}, {"type":"B","amt":23,"impPrcsnAmt":4}, {"type":"C","amt":98,"impPrcsnAmt":4} ] }}

This shows up as one row in the staging table. As mentioned above, I would like to turn it into 3 rows (one per element of the amts array) in the target table.
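
For reference, the staging table is defined roughly along these lines (simplified sketch only: the table name, location, and SerDe class are illustrative, and I've left out the from/to keys here since FROM and TO are reserved words in Hive and need quoting or remapping):

-- Staging table over the raw JSON; one document per line becomes one row,
-- and the amts array stays nested.
CREATE EXTERNAL TABLE stg_events (
  eventheader  STRUCT<eventtype:STRING, eventoutcome:STRING>,
  eventpayload STRUCT<
    nr:STRING,
    flag:BOOLEAN,
    amts:ARRAY<STRUCT<type:STRING, amt:INT, impprcsnamt:INT>>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/raw/events';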

Can anyone help and advise?

4 Replies

Super Guru
@G P

You can use nested types (structs, arrays, maps) in Hive with ORC. That way you don't have to explode each record into multiple rows. It also gives you a denormalization benefit: the data belonging to the same array sits physically together on disk, which helps reads.
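
For example, an ORC table that keeps the nested shape of your sample could look something like this (just a sketch; names are illustrative, and stg_events is the hypothetical staging table sketched in the question):

-- ORC target that preserves the nested structure, so no explode is needed.
CREATE TABLE events_orc (
  eventheader  STRUCT<eventtype:STRING, eventoutcome:STRING>,
  eventpayload STRUCT<
    nr:STRING,
    flag:BOOLEAN,
    amts:ARRAY<STRUCT<type:STRING, amt:INT, impprcsnamt:INT>>>
)
STORED AS ORC;

-- Complex types copy across directly from the JSON staging table.
INSERT INTO TABLE events_orc
SELECT eventheader, eventpayload
FROM stg_events;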

Cloudera Employee

Actually, we have implemented an example of this in the ORC project. In particular, ORC-150 added both a JSON schema discovery tool and a JSON to ORC converter.

Given the single row of data above, the schema discovery tool produces the schema below. Obviously given more data, it would produce a better schema. The JSON conversion tool uses a provided schema (or runs the schema discovery tool) to convert the data.

struct<
  eventHeader:struct<
    eventOutcome:string,
    eventType:string>,
  eventPayload:struct<
    Flag:boolean,
    amts:array<struct<amt:tinyint,impPrcsnAmt:tinyint,type:string>>,
    from:binary,
    nr:binary,
    to:binary>>

New Contributor

Nice work!
How well does this work with the numerous ORC performance enhancements?
Is HIVE-14565 a concern with respect to using deeply nested structures?

Thanks.

Master Guru

You can assemble the data in NiFi and then store it as ORC. I recommend breaking your JSON down into simpler structures, since you will have to query it and use it alongside other data.

Can you make it a wide table? Duplicate data is not a big deal for Hadoop.
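
To flatten it into a wide table with one row per amts element, something along these lines should work (sketch only; stg_events is the hypothetical staging table sketched in the question, and the reserved-word from/to fields are again omitted):

-- Wide, flattened ORC target: header/payload fields repeat on each row.
CREATE TABLE events_flat (
  eventtype    STRING,
  eventoutcome STRING,
  nr           STRING,
  flag         BOOLEAN,
  amt_type     STRING,
  amt          INT,
  impprcsnamt  INT
)
STORED AS ORC;

-- LATERAL VIEW explode() turns the 3-element amts array into 3 rows,
-- duplicating the header and payload scalars on each of them.
INSERT INTO TABLE events_flat
SELECT
  e.eventheader.eventtype,
  e.eventheader.eventoutcome,
  e.eventpayload.nr,
  e.eventpayload.flag,
  a.type,
  a.amt,
  a.impprcsnamt
FROM stg_events e
LATERAL VIEW explode(e.eventpayload.amts) amt_tbl AS a;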