12-18-2018 11:33 PM - last edited on 12-19-2018 08:46 AM by cjervis
We have a file in HDFS that contains multiple JSON events, each with a different schema, and we want to batch process that file. The structure varies from event to event: one event can have 10 fields, another can have 8 fields with a nested structure, and another can have 5 nested fields, each with its own underlying fields. In other words, the schema is not fixed.
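To illustrate the kind of variation I mean, here is a small Python sketch with made-up events. The sample lines, field names, and helper functions (`field_paths`, `schemas_per_event`) are all hypothetical and not taken from our actual data; it assumes one JSON event per line and simply lists the flattened field paths found in each event.

```python
import json

# Hypothetical sample events, one JSON object per line (not real data).
SAMPLE_LINES = [
    '{"id": 1, "type": "click", "ts": "2018-12-18T23:33:00", "page": "/home"}',
    '{"id": 2, "type": "order", "customer": {"name": "a", "address": {"city": "x"}}}',
    '{"id": 3, "payload": {"items": [{"sku": "s1", "qty": 2}]}}',
]


def field_paths(value, prefix=""):
    """Return the set of dotted leaf paths found in a parsed JSON value."""
    if isinstance(value, dict):
        paths = set()
        for key, child in value.items():
            paths |= field_paths(child, f"{prefix}.{key}" if prefix else key)
        return paths
    if isinstance(value, list):
        # Array elements share one path, marked with "[]".
        paths = set()
        for child in value:
            paths |= field_paths(child, prefix + "[]")
        return paths
    # Scalar leaf (string, number, boolean, or null).
    return {prefix}


def schemas_per_event(lines):
    """Parse each non-empty line and return its sorted list of field paths."""
    return [sorted(field_paths(json.loads(line))) for line in lines if line.strip()]


if __name__ == "__main__":
    for paths in schemas_per_event(SAMPLE_LINES):
        print(paths)
```

Each event yields a different set of paths, some flat and some nested, which is the situation we need to handle.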
What is the best strategy for processing events in such a scenario? We are open to any tool, such as Spark or Hive, for batch processing the events. The end goal is to give these events a structured format so that we can analyse them by combining them with other Hive tables/datasets.