
How to create a Hive table out of a JSON schema

New Contributor

Hi,

I have a JSON schema that is very deeply nested. How can we automatically generate Hive DDL from a JSON schema?

I did some googling, and all I am finding is how to create a Hive table from JSON data, not from a JSON schema.

Thanks,

Jai

2 REPLIES

Expert Contributor

You can use the JSON SerDe. You have to create the table with a structure that maps the structure of the JSON.

For example:

data.json

{"X": 134, "Y": 55, "labels": ["L1", "L2"]}
{"X": 11, "Y": 166, "labels": ["L1", "L3", "L4"]}

create table

CREATE TABLE Point
(
    X INT,
    Y INT,
    labels ARRAY<STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'path/to/table';

Then upload your JSON file to the table's location path, give it the right permissions, and you are good to go.
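
As a quick check that the nested array is queryable, here is a minimal PySpark sketch; it assumes the Point table above exists in the default database and that the Spark session has Hive support enabled:

from pyspark.sql import SparkSession

# Minimal sketch: query the JSON-backed Hive table defined above.
# Assumes the Spark session can reach the Hive metastore and the
# Point table lives in the default database.
spark = (SparkSession.builder
         .appName("query-point")
         .enableHiveSupport()
         .getOrCreate())

# Flatten the labels array so each (X, Y, label) combination is a row.
spark.sql("""
    SELECT X, Y, label
    FROM Point
    LATERAL VIEW explode(labels) t AS label
""").show()

Each element of labels comes back as its own row, so the two sample records above yield five rows.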


New Contributor

Hi Ludof,

First of all, thanks a lot for the response; second, my apologies that I could not respond to you sooner.

Actually, I have very complex XSDs with more than 2,000 elements in nested XSD complex types, so the above solution would not work in my case. I cannot create a Hive table manually with that many elements, with objects nested ten levels deep.

Sorry, I cannot share the code here, but this is how I implemented the project.

 

Goal: Ingest XML data into HDFS and query it using Hive/Impala.

Solution: Convert the XSDs into a Hive Avro table and keep pumping XML -> Avro into HDFS.

  1. I took all the XSDs into the XML Spy tool and generated a sample XML.
    • I still had to fix some elements' default values, because Spark's type inference is quite smart. For example, "0000" was being inferred as a long, which is correct for the value, but since it is in double quotes I would expect it as a string; that is just how XML Spy generated default values for alphanumeric fields.
    • Now I have a fully curated sample XML file.
  2. Wrote Spark-xml code (see the first sketch after this list):
    • Gave the sample XML as input and converted it into an Avro file. We know an Avro file carries its schema inside it.
    • Took the Avro schema and created a Hive table on top of it.
  3. Finally, wrote the Spark job (see the second sketch after this list):
    • It reads XML files from HDFS.
    • At read time I tell Spark to parse the XML against my custom schema, which I derived from the sample XML.
    • It converts the XML into an Avro file.
    • It writes the Avro file to an HDFS location.
  4. Query using Hive/Impala.
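
For anyone attempting the same pipeline, here is roughly what step 2 looks like as a minimal PySpark sketch. The row tag, paths, and app name are placeholders rather than the values from my project, and it assumes the spark-xml package (and spark-avro, shipped with Spark 2.4+ as an external module) is on the classpath:

from pyspark.sql import SparkSession

# One-off schema-derivation job (step 2). Submit with spark-xml
# available, e.g.:
#   spark-submit --packages com.databricks:spark-xml_2.12:0.14.0 ...
spark = SparkSession.builder.appName("derive-schema").getOrCreate()

# Read the curated sample XML and let Spark infer the full nested schema.
# "record" is a placeholder row tag; use the real repeating element name.
sample = (spark.read.format("xml")
          .option("rowTag", "record")
          .load("/tmp/sample.xml"))

# Write it out as Avro; the Avro file embeds the schema, which is what
# the Hive Avro table gets created from.
sample.write.format("avro").save("/tmp/sample_avro")

# Save the inferred schema so the recurring job can reuse it verbatim.
with open("/tmp/sample_schema.json", "w") as f:
    f.write(sample.schema.json())

From the Avro output you can extract the schema (for example with avro-tools getschema) and create the Hive table with CREATE TABLE ... STORED AS AVRO, pointing the avro.schema.url table property at the schema file.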
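
And a sketch of the recurring job from step 3, under the same assumptions; the schema JSON it loads is the one saved by the previous sketch:

import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

# Recurring XML -> Avro job (step 3).
spark = SparkSession.builder.appName("xml-to-avro").getOrCreate()

# Reload the schema inferred from the curated sample XML instead of
# letting Spark re-infer it on every batch of incoming files.
with open("/tmp/sample_schema.json") as f:
    schema = StructType.fromJson(json.load(f))

# Parse the incoming XML files against that fixed schema.
incoming = (spark.read.format("xml")
            .option("rowTag", "record")
            .schema(schema)
            .load("hdfs:///data/incoming/*.xml"))

# Append Avro files into the directory the Hive Avro table points at,
# so each new batch becomes queryable from Hive.
(incoming.write
 .mode("append")
 .format("avro")
 .save("hdfs:///warehouse/mytable"))

Impala caches table metadata, so after each batch you would run REFRESH on the table before the new files show up in queries.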


Thanks,

Jai