Member since: 12-07-2017
Posts: 3
Kudos Received: 2
Solutions: 0
08-13-2018
02:20 PM
2 Kudos
ORC and Parquet are optimized for OLAP queries, since those queries typically use only a subset of the columns from the source tables. Avro and other row-based formats perform better if you have to read entire records. Converting from one format to another (the multi-Hive-table approach) is a common practice for determining which format performs best for your use case. My recommendation is to performance test all three formats. There is no one size fits all.
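To make the row vs. column trade-off concrete, here is a rough toy sketch in plain Python. It is not how ORC, Parquet, or Avro are actually implemented (no compression, encoding, stripes, or row groups), and the sample table and function names are made up for illustration. It only counts how many values a scan has to touch in each layout, and how many separate reads a single-record lookup needs.

```python
# Toy model only: real ORC/Parquet/Avro files add compression, encoding,
# and predicate pushdown, none of which is modeled here.
# The table contents and all names below are hypothetical.

rows = [
    {"id": i, "name": f"user{i}", "country": "US" if i % 2 else "DE", "amount": i * 10}
    for i in range(5)
]

# Row-oriented layout (Avro-like): one complete record after another.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented layout (ORC/Parquet-like): one list per column.
col_store = {key: [r[key] for r in rows] for key in rows[0]}

AMOUNT_INDEX = 3  # position of "amount" inside each row tuple


def sum_amount_row_store(store):
    """OLAP-style aggregate on a row layout: every full record is read."""
    values_touched = sum(len(record) for record in store)
    total = sum(record[AMOUNT_INDEX] for record in store)
    return total, values_touched


def sum_amount_col_store(store):
    """Same aggregate on a column layout: only the needed column is read."""
    column = store["amount"]
    return sum(column), len(column)


def fetch_record_row_store(store, idx):
    """Whole-record lookup on a row layout: one contiguous read."""
    return store[idx], 1


def fetch_record_col_store(store, idx):
    """Whole-record lookup on a column layout: one read per column."""
    record = tuple(column[idx] for column in store.values())
    return record, len(store)


print(sum_amount_row_store(row_store))
print(sum_amount_col_store(col_store))
print(fetch_record_row_store(row_store, 2))
print(fetch_record_col_store(col_store, 2))
```

In this toy model the aggregate touches far fewer values in the column layout, while the whole-record lookup needs one read in the row layout and one per column in the column layout. Real numbers depend on your data and queries, which is why testing each format against your own workload still matters.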
08-12-2018
01:50 PM
Great post, Binu! What storage format would you suggest if you plan on loading the Hive table into a DataFrame and running an iterative process (machine learning algorithm x) against the data? I’m hard pressed to find any discussion of this concept.