I have a requirement to process XML data stored in a database table, where the XML lives in a BLOB column.
The XML is complex, with multiple nested structures and one-to-many relationships. I am looking for ways to ingest this data and make it available for consumption in Hive. The ingestion may also have to support evolution of the source XML schema (more elements can be added to the XML).
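To make the "nested with one-to-many relationships" part concrete, here is a minimal sketch of the kind of shredding any of the options below ultimately has to do: flatten one nested document into a parent row plus child rows linked by a foreign key. The element names (`order`, `item`, etc.) are hypothetical, not from my actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested XML with a one-to-many relationship (order -> items).
XML = """
<order id="1">
  <customer>Acme</customer>
  <items>
    <item sku="A1" qty="2"/>
    <item sku="B2" qty="5"/>
  </items>
</order>
"""

def shred(xml_text):
    """Flatten the nested XML into one row set per logical table,
    linking children back to the parent via a foreign key (order_id)."""
    root = ET.fromstring(xml_text)
    order_row = {"order_id": root.get("id"),
                 "customer": root.findtext("customer")}
    item_rows = [
        {"order_id": root.get("id"), "sku": i.get("sku"), "qty": int(i.get("qty"))}
        for i in root.findall("./items/item")
    ]
    return order_row, item_rows

order, items = shred(XML)
print(order)   # one parent row
print(items)   # one child row per <item>
```

In a real pipeline this logic would run inside the MapReduce job rather than standalone, but the shape of the problem is the same.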
I see the following options:

Option 1: Sqoop import into HDFS -> convert the XML to multiple text files (using MapReduce) -> create Hive tables (A) for each of these text files -> create an Avro schema for each of these text files -> create Hive tables (B) with these Avro schemas -> load from A to B -> merge the new (if one exists) and the old Avro schemas.
In this option quite a bit of manual intervention is required to create the Hive tables and Avro schemas.
The merged schemas become the schemas for the Hive tables.
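The "merge the new and the old Avro schema" step above can be sketched like this. Avro schemas are just JSON, so a simplified field-union merge needs only the standard library; new fields are made nullable with a `null` default so data written with the old schema still resolves. This is a deliberately simplified sketch (flat record schemas, no type changes), not the full Avro schema-resolution algorithm.

```python
import json

def merge_record_schemas(old, new):
    """Union the fields of two Avro record schemas (parsed .avsc JSON).
    Fields that only exist in the new schema are appended as nullable
    with a null default, so old data still reads under the merged schema.
    Simplified: assumes flat records and no per-field type changes."""
    merged = dict(old)
    merged["fields"] = list(old["fields"])
    old_names = {f["name"] for f in old["fields"]}
    for f in new["fields"]:
        if f["name"] not in old_names:
            # Wrap in a union with null (unless already a union) to allow a default.
            nullable = f["type"] if isinstance(f["type"], list) else ["null", f["type"]]
            merged["fields"].append({"name": f["name"], "type": nullable, "default": None})
    return merged

old = json.loads('{"type":"record","name":"Order","fields":'
                 '[{"name":"id","type":"string"}]}')
new = json.loads('{"type":"record","name":"Order","fields":'
                 '[{"name":"id","type":"string"},{"name":"customer","type":"string"}]}')
print(json.dumps(merge_record_schemas(old, new), indent=2))
```

The merged .avsc would then become the schema for the Hive table, as described above.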
Option 2: Sqoop import into HDFS -> convert the XML to Avro (using MapReduce and https://github.com/elodina/xml-avro) -> create a Hive table on top of this Avro data, using the .avsc generated in the previous step -> merge the new (if one exists) and the old Avro schema.
It is cumbersome to define a Hive table with all the columns that map to the XML when the XML is big, with many elements and relationships.
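One thing worth noting on the "define all the columns by hand" pain point: if the table is Avro-backed, Hive's AvroSerDe can derive the columns from the .avsc itself, so the DDL never lists them. A small sketch that emits such a DDL from a schema path (the table and path names are hypothetical; assumes Hive 0.14+ for `STORED AS AVRO` and an .avsc already uploaded to HDFS):

```python
def hive_ddl(table, avsc_hdfs_path):
    """Emit Hive DDL for an Avro-backed external table whose columns are
    derived from the referenced .avsc, so no column list is written by hand.
    Schema evolution then only requires replacing the .avsc on HDFS."""
    return (
        f"CREATE EXTERNAL TABLE {table}\n"
        f"STORED AS AVRO\n"
        f"TBLPROPERTIES ('avro.schema.url'='{avsc_hdfs_path}');"
    )

print(hive_ddl("orders", "hdfs:///schemas/order.avsc"))
```

Because the columns come from `avro.schema.url`, merging in a new schema version (as in the previous step) updates the table's columns without an `ALTER TABLE` per added element.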
Option 3: Sqoop import into HDFS -> pre-generate the binding classes and the .avsc using https://github.com/nokia/Avro-Schema-Generator -> convert XML to Avro using MapReduce (bind the XML to the JAXB-generated objects, then create the Avro data by copying from the JAXB objects; not sure how to do this yet) -> create a Hive table on top of the .avsc and the generated Avro data.
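For the "copying from JAXB objects" step I am unsure about, the shape I have in mind is a field-by-field copy from the bound object graph into records matching the pre-generated Avro schema, with nested one-to-many children becoming an array of nested records. A language-neutral sketch (the `Order`/`Item` classes stand in for the JAXB-generated binding classes; in the real MapReduce job this would populate an Avro `GenericRecord` against the .avsc rather than a plain dict):

```python
# Stand-ins for JAXB-generated binding classes (hypothetical names/fields).
class Item:
    def __init__(self, sku, qty):
        self.sku, self.qty = sku, qty

class Order:
    def __init__(self, id, customer, items):
        self.id, self.customer, self.items = id, customer, items

def to_avro_record(order):
    """Copy field-by-field from the bound object into a record shaped like
    the pre-generated Avro schema; the one-to-many children map to an
    Avro array of nested records."""
    return {
        "id": order.id,
        "customer": order.customer,
        "items": [{"sku": i.sku, "qty": i.qty} for i in order.items],
    }

rec = to_avro_record(Order("1", "Acme", [Item("A1", 2)]))
print(rec)
```

Whether this copy can be generated rather than hand-written (e.g. driven by the schema) is exactly the part I have not worked out yet.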