First of all, thanks a lot for the response, and my apologies for not replying sooner.
I have very complex XSDs, with more than 2000 elements in nested XSD complex types, so the solution above would not work in my case. I cannot create a Hive table manually with that many elements, and some objects are nested 10 levels deep.
Sorry, I cannot share the code here, but this is how I implemented the project.
Goal: Ingest XML data into HDFS and query it using Hive/Impala.
Solution: Convert the XSDs into a Hive Avro table and keep pumping XML -> Avro into HDFS.
I loaded all the XSDs into the XML Spy tool and generated a sample XML file.
I still had to fix some elements that carried default values, because Spark infers types from the values it sees. For example, “0000” was inferred as long, which is correct for the value itself, but since it is in double quotes I expected a String; this is how XML Spy generates default values for alphanumeric fields.
At this point I had a fully curated sample XML file.
Wrote the spark-xml code.
Gave it the sample XML as input and converted it into an Avro file. An Avro file carries its schema with it.
Took the Avro schema and created a Hive table on top of it.
Finally, wrote the Spark job:
It reads XML files from HDFS.
At read time, I tell Spark to use my custom schema (the one obtained from the sample XML) instead of inferring a new schema from each input.
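Since I cannot post the real code, here is a rough Python sketch of the steps above, not the actual implementation. It assumes the spark-xml package is on the classpath (data source name "xml") and that Avro output is available under the name "avro" (Spark 2.4+ with spark-avro; older setups used "com.databricks.spark.avro"). The function names, the row tag, the paths and the table name are all made up for illustration. The Hive part relies on the avro.schema.url table property, so no column list has to be typed by hand; the .avsc file could be extracted from the sample Avro file, for example with avro-tools getschema.

```python
def infer_sample_schema(spark, sample_xml_path, row_tag):
    """Let spark-xml infer a schema once, from the curated sample file."""
    sample_df = (
        spark.read.format("xml")        # assumes the spark-xml package is loaded
        .option("rowTag", row_tag)      # XML element that maps to one row
        .load(sample_xml_path)
    )
    return sample_df.schema


def xml_to_avro(spark, schema, xml_path, avro_path, row_tag):
    """Read XML with the fixed schema (no inference) and append Avro files."""
    df = (
        spark.read.format("xml")
        .option("rowTag", row_tag)
        .schema(schema)                 # reuse the schema from the sample XML
        .load(xml_path)
    )
    # assumes the Avro data source is registered under the name "avro"
    df.write.format("avro").mode("append").save(avro_path)


def hive_avro_ddl(table, data_location, schema_url):
    """Build Hive DDL for an external Avro table; columns come from the .avsc file."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table}\n"
        f"STORED AS AVRO\n"
        f"LOCATION '{data_location}'\n"
        f"TBLPROPERTIES ('avro.schema.url'='{schema_url}')"
    )


# Hypothetical usage, given an existing SparkSession named `spark`:
#   schema = infer_sample_schema(spark, "hdfs:///xml/sample.xml", "Record")
#   xml_to_avro(spark, schema, "hdfs:///xml/incoming/", "hdfs:///avro/records/", "Record")
#   print(hive_avro_ddl("mydb.records", "hdfs:///avro/records/",
#                       "hdfs:///schemas/records.avsc"))
```

Fixing the schema once from the sample keeps every run's Avro output compatible with the Hive table, whereas per-file inference could produce a different type for the same field depending on the values in that file.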