The original sources are mainframe systems and RDBMSs, ingested into Hive tables in Hadoop. The application then reads these Hive tables (ORC format) as its source data. We need only a subset of columns for our v1 release, and the schema will be fixed for v1. In future releases (v2 and beyond) another set of columns will come in, but since this is a batch application the business will notify us ahead of v2 about any new columns, or about existing columns being renamed or deleted. I have previously worked on streaming applications where we receive files from different organisations and the schema is not fixed.
For this application, do we need to consider schema evolution as a future-proofing measure, or should I consider Avro instead of ORC? The application's requirement is to read a subset of columns and aggregate on a few of them.
Data storage is Hive (1.2) ORC format with Snappy compression, processed by Spark 1.6 on HDP 2.5. Adding new columns to ORC tables is supported.
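To illustrate why selecting the v1 subset by name largely insulates the batch job from columns added later, here is a minimal sketch of a defensive column-selection step. All column and table names are hypothetical, and the Spark calls are shown only as comments so the helper itself stays self-contained:

```python
# Hypothetical v1 subset of columns the batch job needs.
V1_COLUMNS = ["acct_id", "txn_dt", "txn_amt"]

def columns_to_select(expected, actual):
    """Return the expected columns that actually exist in the source table
    (preserving order), plus any that are missing (renamed or deleted).
    Columns newly added in v2+ simply never appear in `expected`, so the
    v1 job ignores them."""
    actual_set = set(actual)
    present = [c for c in expected if c in actual_set]
    missing = [c for c in expected if c not in actual_set]
    return present, missing

# In a Spark 1.6 job this would be applied roughly as (sketch only):
#   df = sqlContext.table("src_db.src_table")          # Hive ORC table
#   cols, missing = columns_to_select(V1_COLUMNS, df.columns)
#   if missing:
#       raise ValueError("Schema drift, missing columns: %s" % missing)
#   result = df.select(*cols).groupBy("acct_id").sum("txn_amt")
```

With this guard, added columns are a non-event and a rename/delete fails fast with a clear message, which matches the "business notifies us per release" workflow better than full Avro-style schema evolution would.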