Support Questions

marshall_felder · ‎08-13-2018

Assuming a data pipeline will be loading Hive tables as Spark DataFrames, which storage format is optimal for training machine learning models and running iterative processes: row-based (text, Avro) or column-based (ORC, Parquet) files?

sunile_manjee · ‎08-13-2018

ORC and Parquet are optimized for OLAP-style queries, where only a subset of the columns from the source tables is read. Avro and other row-based formats perform better if you have to read entire records. Converting the data from one format to another (the multi-Hive-table approach) is a common practice to determine which format performs best for your use case. My recommendation is to performance-test all three formats (ORC, Parquet, and Avro). There is no one-size-fits-all answer.
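As a rough illustration of that kind of performance test, here is a minimal sketch in PySpark. It is not from the original answer: the function names, the table name, the paths, and the column list are all hypothetical. It assumes an existing SparkSession, that the spark-avro module is on the classpath for the "avro" format, and Spark 3.0 or later for the "noop" sink, which forces a full read without writing any output.

```python
import time
from statistics import median


def time_action(action, repeats=3):
    """Call a zero-argument function several times and return the median seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return median(timings)


def benchmark_formats(spark, source_table, base_path, columns,
                      formats=("orc", "parquet", "avro"), repeats=3):
    """Write one copy of a Hive table per format, then time reading a column subset.

    Hypothetical helper: `spark` is an existing SparkSession, `source_table` is a
    Hive table name, `base_path` is a scratch directory you are free to overwrite.
    """
    df = spark.table(source_table)
    results = {}
    for fmt in formats:
        path = f"{base_path}/{fmt}"
        # One copy of the same data per storage format (the multi-table approach).
        df.write.mode("overwrite").format(fmt).save(path)

        def read_subset(fmt=fmt, path=path):
            # Read only the columns the model needs and push every row through
            # the "noop" sink (assumes Spark 3.0+), so nothing is skipped lazily.
            subset = spark.read.format(fmt).load(path).select(*columns)
            subset.write.format("noop").mode("overwrite").save()

        results[fmt] = time_action(read_subset, repeats=repeats)
    return results


# Example call (placeholders, adjust to your environment):
# timings = benchmark_formats(spark, "db.features", "/tmp/format_test",
#                             ["label", "f1", "f2"])
# for fmt, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
#     print(fmt, round(seconds, 2))
```

To compare against the "entire record" case, pass the full column list (for example `spark.table(...).columns`) instead of a subset. Timings from a sketch like this depend heavily on cluster size, file sizes, and compression settings, so treat them as a starting point for your own measurements.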


Cloudera Community

