We are ingesting a very complex XML file (deeply nested, unordered elements, etc.). We have HDP and HDF available for ingestion and are considering a few options:
1. XML SerDe directly on the file (not the most intuitive for really complex structures)
2. Split the XML into child splits and re-merge into 1:M Hive tables (a little better than option 1, but it still gets a little crazy)
3. Convert XML to JSON with XSLT and use a Hive JSON SerDe. I found the JSON SerDe a little more flexible, and it was able to deal with the deeply nested, unordered entities okay.
4. Convert XML directly to Avro with Spark
5. Read XML and pull out only relevant entities/attributes
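To make option 5 concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The sample document and all element and attribute names (`orders`, `order`, `item`, `sku`, `qty`, `customer`) are hypothetical and not our real schema, and `extract_rows` is just an illustrative helper. The idea is to pull only the selected fields into flat parent and child rows that could then be loaded into 1:M Hive tables.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; names are made up for illustration.
# Note that <customer> and <items> appear in a different order per <order>.
SAMPLE = """
<orders>
  <order id="1">
    <items>
      <item sku="A1"><qty>2</qty></item>
      <item sku="B7"><qty>1</qty></item>
    </items>
    <customer><name>Ann</name></customer>
  </order>
  <order id="2">
    <customer><name>Bob</name></customer>
    <items>
      <item sku="C3"><qty>5</qty></item>
    </items>
  </order>
</orders>
"""


def extract_rows(xml_text):
    """Return (orders, items): flat parent rows and child rows keyed by order id."""
    root = ET.fromstring(xml_text.strip())
    orders, items = [], []
    for order in root.iter("order"):
        order_id = order.get("id")
        # Lookups are by path, so the order of sibling elements does not matter.
        orders.append({
            "order_id": order_id,
            "customer": order.findtext("customer/name"),
        })
        # Each nested <item> becomes one child row carrying the parent key.
        for item in order.iter("item"):
            items.append({
                "order_id": order_id,
                "sku": item.get("sku"),
                "qty": int(item.findtext("qty", default="0")),
            })
    return orders, items


if __name__ == "__main__":
    parent_rows, child_rows = extract_rows(SAMPLE)
    print(parent_rows)
    print(child_rows)
```

This loads the whole document into memory; for very large files, `ET.iterparse` could probably stream the same logic, though I have not tried that on our data.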
Any recommendations, or other approaches people have used successfully?