
Hadoop: XML processing and schema evolution

I have a requirement to process XML data stored in a database table; the XML lives in a single BLOB column.

The XML is complex, with multiple nested structures and one-to-many relationships. I am looking at ways to ingest this data and make it available for consumption in Hive. The ingestion may also have to support evolution of the source XML schema (more elements can be added to the XML over time).

I see the following options:

Option 1:

Sqoop import into HDFS -> Convert the XML to multiple text files (using MapReduce) -> Create Hive tables (A) for each of these text files -> Create an Avro schema for each of these text files -> Create Hive tables (B) with these Avro schemas -> Load from A to B -> Merge the new (if any) and the old Avro schemas

This option requires quite a bit of manual intervention to create the Hive tables and Avro schemas.

The merged schema becomes the schema for the Hive tables.
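
For that merge step, a minimal sketch (with hypothetical file names old.avsc and merged.avsc) of how Avro's built-in compatibility checker can confirm that the merged schema can still read data written with the old one, which is what Hive needs if the merged schema becomes the table schema:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaPairCompatibility;

public class SchemaEvolutionCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical file names: the schema used so far and the newly merged one.
        Schema oldSchema = new Schema.Parser().parse(new File("old.avsc"));
        Schema mergedSchema = new Schema.Parser().parse(new File("merged.avsc"));

        // Can a reader using the merged schema still read data written with the old schema?
        SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(mergedSchema, oldSchema);
        System.out.println("Compatibility: " + result.getType());
    }
}

Note that any element added through schema evolution needs a default value in the merged schema; otherwise the check reports INCOMPATIBLE and the existing Avro files cannot be read through the new Hive table definition.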

Option 2:

Sqoop import into HDFS -> Convert the XML to Avro (using MapReduce and https://github.com/elodina/xml-avro) -> Create a Hive table on top of this Avro data, using the avsc generated in the previous step -> Merge the new (if any) and the old Avro schema

The tool https://github.com/elodina/xml-avro needs an XSD for the conversion.

The merged schema becomes the schema for the Hive tables.

The problem with this option is that the tool does not work if an XML element has both an attribute and text content (for example, <price currency="USD">10.00</price>).
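
For reference, a minimal sketch of how such an element could be represented in Avro if it has to be mapped by hand: the attribute becomes one field and the text content another. The Price record and its field names are hypothetical.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class AttributeAndTextExample {
    public static void main(String[] args) {
        // Hypothetical mapping for <price currency="USD">10.00</price>.
        Schema priceSchema = SchemaBuilder.record("Price")
                .namespace("example.hypothetical")
                .fields()
                .requiredString("currency")  // from the currency attribute
                .requiredString("value")     // from the element's text content
                .endRecord();

        GenericRecord price = new GenericRecordBuilder(priceSchema)
                .set("currency", "USD")
                .set("value", "10.00")
                .build();

        System.out.println(priceSchema.toString(true));
        System.out.println(price);
    }
}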

Option 3:

Sqoop import into HDFS -> Lay a Hive table on top of the XML data (https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources)

It is cumbersome to define a Hive table with all of the columns that map to the XML, since the XML is big, with many elements and relationships.
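
To make that concrete, a minimal sketch of what such a table definition could look like, issued here through Hive JDBC so it can be scripted. The table, columns, XPaths and HiveServer2 URL are hypothetical, and the SerDe and input-format class names are my reading of the linked Hive-XML-SerDe project, so double-check them against its wiki:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateXmlTable {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver:10000/default";  // hypothetical HiveServer2 endpoint

        // Every column needs its own XPath mapping; with a large XML schema this
        // DDL grows to hundreds of lines, which is the cumbersome part.
        String ddl =
            "CREATE EXTERNAL TABLE customers_xml (id STRING, name STRING) " +
            "ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' " +
            "WITH SERDEPROPERTIES (" +
            "  'column.xpath.id'='/customer/@id'," +
            "  'column.xpath.name'='/customer/name/text()'" +
            ") " +
            "STORED AS " +
            "INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' " +
            "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' " +
            "LOCATION '/data/customers_xml' " +
            "TBLPROPERTIES ('xmlinput.start'='<customer', 'xmlinput.end'='</customer>')";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute(ddl);
        }
    }
}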

Option 4:

Sqoop import into HDFS -> Pre-generate the binding classes and the avsc using https://github.com/nokia/Avro-Schema-Generator -> Convert XML to Avro using MapReduce (bind the XML to the JAXB-generated objects, then create Avro data by copying from the JAXB objects; not sure how to do this yet) -> Create a Hive table on top of the avsc and the generated Avro data.
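
For the part I am not sure about yet, a minimal sketch of copying from a JAXB-bound object into an Avro GenericRecord and writing an Avro data file. Customer, its fields and customer.avsc are hypothetical stand-ins for whatever the XSD and the schema generator actually produce, and in the real job this copy logic would sit inside the Mapper:

import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class JaxbToAvro {

    // Stand-in for the JAXB class normally generated by xjc from the XSD.
    @XmlRootElement(name = "customer")
    public static class Customer {
        private String id;
        private String name;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical Avro schema produced by the schema generator from the same XSD.
        Schema schema = new Schema.Parser().parse(new File("customer.avsc"));

        // Bind the XML to the JAXB object.
        Unmarshaller unmarshaller = JAXBContext.newInstance(Customer.class).createUnmarshaller();
        Customer customer = (Customer) unmarshaller.unmarshal(new File("customer.xml"));

        // Copy field by field from the JAXB object into an Avro GenericRecord.
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", customer.getId());
        record.put("name", customer.getName());

        // Write an Avro data file that the Hive table (defined with the same avsc) can read.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("customer.avro"));
            writer.append(record);
        }
    }
}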

 

Your thoughts would greatly help.
