Between Avro, Parquet, and RC/ORC which is useful for accessing only a few rows out of billions?

New Contributor
 
4 REPLIES

Expert Contributor

@Pavan Kumar Konda It depends on a lot of constraints, such as compression, serialization, and whether the storage format is splittable.
I think ORC is just for Hive.
Avro

  • Avro is a language-neutral data serialization system.
  • Writables have the drawback that they do not provide language portability.
  • Avro-formatted data can be described through a language-independent schema, so it can be shared across applications written in different languages.
  • Avro stores the schema in the file header, so the data is self-describing.
  • Avro files are splittable and compressible, which makes Avro a good candidate for data storage in the Hadoop ecosystem.
  • Schema evolution – the schema used to read an Avro file need not be the same as the schema that was used to write it, which makes it possible to add new fields (see the sketch after this list).
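
A minimal sketch of the schema-evolution point, using the fastavro Python library; the file name, record fields, and default value are hypothetical:

```python
# Write with a v1 schema, read back with a v2 schema that adds a field.
from fastavro import writer, reader, parse_schema

writer_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"}],
})

with open("users.avro", "wb") as out:
    writer(out, writer_schema, [{"id": 1}, {"id": 2}])

# The new field carries a default, so records written before the
# schema change remain readable and get the default filled in.
reader_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string", "default": ""},
    ],
})

with open("users.avro", "rb") as inp:
    for record in reader(inp, reader_schema=reader_schema):
        print(record)  # {'id': 1, 'email': ''}, {'id': 2, 'email': ''}
```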

Parquet

  • Parquet is a columnar format. Columnar formats work well when only a few columns are needed in a query or analysis.
  • Only the required columns are fetched and read, which reduces disk I/O.
  • Parquet is well suited for data-warehouse-style workloads where aggregations are run on certain columns over a huge set of data.
  • Parquet provides very good compression, up to 75%, when used with codecs like Snappy.
  • Parquet can be read and written using the Avro API and an Avro schema.
  • It also provides predicate pushdown, which reduces disk I/O further (see the sketch after this list).
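
A minimal sketch of column projection and predicate pushdown with the pyarrow Python library; the file and column names are hypothetical:

```python
import pyarrow.parquet as pq

# Only the two listed columns are read from disk, and row groups whose
# min/max statistics rule out amount > 100 are skipped entirely.
table = pq.read_table(
    "sales.parquet",
    columns=["customer_id", "amount"],
    filters=[("amount", ">", 100)],
)
print(table.num_rows)
```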

Cloudera Employee

ORC is for more than Hive. It is a separate Apache project now and has support from Spark and NiFi.

Contributor

Hey, I am pretty confused about which storage format is suited for which type of data. You said "Parquet is well suited for data-warehouse-style workloads where aggregations are run on certain columns over a huge set of data.", but I think that is true for ORC too. And as @owen said, ORC contains indexes at 3 levels (2 levels in Parquet), so shouldn't ORC be faster than Parquet for aggregations?

Cloudera Employee

Only ORC and Parquet have the necessary features:

  • Predicate pushdown, where a condition is checked against the metadata to see whether the rows need to be read at all.
  • Column projection, which reads only the bytes for the necessary columns.

ORC can use predicate pushdown based on either:

  • the min and max values for each column, or
  • an optional bloom filter for finding particular values.

Parquet only has min/max. ORC can filter at the file level, the stripe level, or the 10,000-row level; Parquet can only filter at the file level or the row-group level (its counterpart of ORC's stripe).
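
To make the bloom-filter point concrete, here is a minimal sketch of writing an ORC file with a bloom filter on a lookup column, using the pyarrow Python library (the bloom_filter_columns parameter requires a reasonably recent pyarrow). The file and column names are hypothetical, and whether a reader actually exploits the filter depends on the engine:

```python
import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({
    "user_id": pa.array(range(1_000_000), type=pa.int64()),
    "amount": pa.array([i % 500 for i in range(1_000_000)]),
})

# The bloom filter lets a reader skip whole stripes when a point
# lookup such as user_id = 123456 cannot possibly match them.
orc.write_table(
    table,
    "events.orc",
    bloom_filter_columns=["user_id"],
    bloom_filter_fpp=0.05,  # acceptable false-positive probability
)
```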

The previous answer mentions some of Avro's properties that are shared by ORC and Parquet:

  • They are both language-neutral, with C++, Java, and other language implementations.
  • They are both self-describing.
  • They are both splittable, even when compressed.
  • They both support schema evolution.