Created 02-21-2018 06:14 AM
Created 02-21-2018 09:56 AM
@Pavan Kumar Konda It depends on a lot of constraints, such as compression, serialization, and whether the storage format is splittable.
I think ORC is just for Hive.
Avro
Created 02-21-2018 06:45 PM
ORC is for more than Hive. It is a separate project now and has support from Spark and NiFi.
Created 02-23-2018 07:52 AM
Hey, I am quite confused about which storage format is suited to which type of data. You said "Parquet is well suited for data warehouse kind of solutions where aggregations are required on certain column over a huge set of data." But I think that is true for ORC too. And as @owen said, ORC contains indexes at 3 levels (2 levels in Parquet), so shouldn't ORC be faster than Parquet for aggregations?
Created 02-21-2018 06:59 PM
Only ORC and Parquet have the necessary features
ORC can use predicate pushdown based on either:
Parquet only has min/max. ORC can filter at the file level, stripe level, or 10k row level. Parquet can only filter at the file level or row group level (the row group is Parquet's counterpart to an ORC stripe).
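To make the difference between filtering levels concrete, here is a rough sketch in plain Python of how min/max statistics let a reader skip data. It is not how ORC or Parquet are actually implemented: the names `build_file`, `might_contain`, and `scan_equal` are made up for illustration, the groups hold 10 rows instead of 10k, and the statistics are recomputed on the fly, whereas a real file stores them in its metadata.

```python
def might_contain(values, target):
    # Min/max check: if target is outside [min, max], the block cannot match.
    # Real formats read these statistics from metadata instead of recomputing.
    return min(values) <= target <= max(values)


def build_file(values, rows_per_group, groups_per_stripe):
    # Hypothetical layout: a file is a list of stripes,
    # and each stripe is a list of fixed-size row groups.
    groups = [values[i:i + rows_per_group]
              for i in range(0, len(values), rows_per_group)]
    return [groups[i:i + groups_per_stripe]
            for i in range(0, len(groups), groups_per_stripe)]


def scan_equal(stripes, target, use_row_index=True):
    # Returns (matching values, number of rows actually read).
    matches, rows_read = [], 0

    # Level 1: file-level statistics.
    file_values = [v for stripe in stripes for group in stripe for v in group]
    if not might_contain(file_values, target):
        return matches, rows_read

    for stripe in stripes:
        # Level 2: stripe-level statistics.
        stripe_values = [v for group in stripe for v in group]
        if not might_contain(stripe_values, target):
            continue
        for group in stripe:
            # Level 3: the finer-grained index inside a stripe.
            if use_row_index and not might_contain(group, target):
                continue
            rows_read += len(group)
            matches.extend(v for v in group if v == target)
    return matches, rows_read


stripes = build_file(list(range(100)), rows_per_group=10, groups_per_stripe=5)
print(scan_equal(stripes, 42))                       # three levels of filtering
print(scan_equal(stripes, 42, use_row_index=False))  # two levels only
```

With sorted data like this, the third level cuts the rows read from a whole stripe down to a single group. On unsorted data the min/max ranges overlap much more, so the benefit depends heavily on how the data is laid out.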
The previous answer mentions some of Avro's properties that are shared by ORC and Parquet: