We have huge (4 GB) schemaless nested JSON files, and we need to create Hive tables on top of them to enable reporting.
What is the best way to do this?
Can we create Hive tables directly on the JSON files as they are and use those for reporting?
Or do we need to flatten the files first and then create tables?
Or can we read the files line by line and insert into Hive? Will that work on huge files with millions of rows?
Or can it be done in a better way?
Hi @Saikrishna Tarapareddy, I'm most familiar with two ways to ingest JSON. The first is to use the built-in Hive UDF "get_json_object". You create a table with a single string column and load your JSON file into the table. You then run a select on the table and call the UDF for each field you want to extract. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_objec.... The problem with this approach is that the UDF runs separately for each column you extract, and I had issues sorting through heavily nested JSON files.
Another approach is to use a JSON SerDe. The one I'm familiar with (though it's old) is this one: https://github.com/rcongiu/Hive-JSON-Serde. This will be much faster, but you'll need to build the schema first. This tool can help: https://github.com/quux00/hive-json-schema. Here is an article I referred to in the past: http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html. Granted, all these solutions are a bit dated, so maybe someone else has a more up-to-date approach.
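To give an idea of what "building the schema first" involves, here is a minimal Python sketch that derives Hive column types from one sample record and prints a CREATE TABLE statement. It is not how hive-json-schema is implemented; the function names, the table name, and the location are made up for illustration. It assumes arrays are homogeneous and that a single sample record shows every field, which may not hold for schemaless data. The SerDe class name is the one I recall from the Hive-JSON-Serde project, so check it against the version you install.

```python
import json

def hive_type(value):
    """Map a parsed JSON value to a Hive type string (illustrative only)."""
    if isinstance(value, bool):   # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict) and value:
        fields = ",".join(f"{k}:{hive_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        # Assumption: every element has the same shape as the first one.
        return f"array<{hive_type(value[0])}>" if value else "array<string>"
    return "string"               # strings, nulls, and empty objects

def create_table_ddl(table, sample, location):
    """Build a CREATE EXTERNAL TABLE statement from one sample record."""
    cols = ",\n  ".join(f"{k} {hive_type(v)}" for k, v in sample.items())
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        # Class name as recalled from the Hive-JSON-Serde README; verify it.
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )

if __name__ == "__main__":
    # Hypothetical record; replace with one line from your own file.
    record = json.loads(
        '{"id": 1, "name": "a", "tags": ["x"], '
        '"addr": {"zip": "1", "geo": {"lat": 1.5}}}'
    )
    print(create_table_ddl("events_json", record, "/data/events"))
```

Field names that collide with Hive reserved words would still need to be renamed or quoted by hand, and a text-based SerDe like this generally expects one JSON record per line in the file.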
In any case, once the JSON-backed table is built, you will want to create a second table stored as ORC and populate it from the first one. You do not want to run your reporting queries against a table backed by raw JSON files.
@Scott Shaw, thank you, I will look into those.
I am also looking to break the nested JSON files down into multiple files based on tags and then create Hive tables on top of each file. The source of these huge JSON files is a traditional RDBMS, where they join multiple files and send us the result as one JSON. I would like to split the JSON into one file for each of those source files. Would that be a good approach?
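A rough Python sketch of the splitting idea described above, in case it helps the discussion. It assumes the input has one JSON record per line, that each top-level tag corresponds to one source, and that each record carries a key field (here called "id", which is a made-up name) that should be copied onto every output row so the pieces can be joined again in Hive. If the 4 GB file is a single giant JSON document instead, this will not work as is and a streaming parser would be needed.

```python
import json
import os

def split_by_tag(lines, key_field="id"):
    """Yield (tag, json_line) pairs, one per nested row, for each input record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        key = record.get(key_field)
        for tag, value in record.items():
            if tag == key_field:
                continue
            # A tag may hold one object or a list of objects.
            rows = value if isinstance(value, list) else [value]
            for row in rows:
                if not isinstance(row, dict):
                    row = {"value": row}  # wrap scalars so every row is an object
                yield tag, json.dumps({f"parent_{key_field}": key, **row})

def split_file(src_path, out_dir, key_field="id"):
    """Stream src_path and write one <tag>.json file per top-level tag."""
    handles = {}
    try:
        with open(src_path, encoding="utf-8") as src:
            for tag, out_line in split_by_tag(src, key_field):
                if tag not in handles:
                    path = os.path.join(out_dir, f"{tag}.json")
                    handles[tag] = open(path, "w", encoding="utf-8")
                handles[tag].write(out_line + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Because the file is read one line at a time, memory use stays roughly constant regardless of file size. Each output file is itself one JSON record per line, so a separate Hive table could be defined on top of each.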
@Scott Shaw, I am not able to access your GitHub URL. I have a requirement to create a Hive table on a nested JSON file. Please help me with how to do it using the JSON SerDe. Thank you in advance.