Is Hive on Tez with ORC really better than Spark SQL for ETL performance?


I have little experience with Hive and am currently learning Spark with Scala. I am also working with HDP 2.6. I am curious to know whether Hive on Tez is really faster than Spark SQL. I searched many forums with test results, but they compare older versions of Spark and most of them were written in 2015. The main points are summarized below:

  • ORC (in Hive) performs about the same as Parquet does in Spark
  • The Tez engine gives performance comparable to the Spark engine
  • Joins are better/faster in Hive than in Spark

I feel that Hortonworks puts more support behind Hive than Spark, and Cloudera the other way around.

Initially I thought Spark would be faster than anything else because of its in-memory execution, but after reading some articles I see that Hive is also being improved with new features such as Tez, ORC, and LLAP.

We currently run PL/SQL on Oracle and are migrating to big data since volumes are increasing. My requirement is ETL-style batch processing; the data involved in every weekly batch run is described below. Data volumes will grow considerably soon.

  • Input/lookup data arrive as CSV/text files and are loaded into tables
  • Two input tables, each with 5 million rows and 30 columns
  • Around 30 lookup tables are used to generate each column of the output table, which contains around 10 million rows and 220 columns
  • Multiple joins are involved, both inner and left outer, since many lookup tables are used

Please advise which of the two methods below I should choose for better performance, readability, and ease of making minor column-level changes in future production deployments.

Method 1:

  • Hive on Tez with ORC tables
  • Python UDFs via the TRANSFORM clause
  • Joins with performance tuning such as map joins
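
Roughly what I have in mind for Method 1 (an untested sketch; the table names and the derive_cols.py script are just placeholders):

  -- Run on the Tez engine and let small lookup tables be map-joined
  SET hive.execution.engine=tez;
  SET hive.auto.convert.join=true;

  -- Land the CSV input (an external table over the raw files) into an ORC table
  CREATE TABLE input_orc STORED AS ORC AS
  SELECT * FROM input_csv_ext;

  -- Apply a Python UDF through the TRANSFORM clause
  ADD FILE hdfs:///scripts/derive_cols.py;
  SELECT TRANSFORM (id, col1, col2)
         USING 'python derive_cols.py'
         AS (id STRING, derived_col STRING)
  FROM input_orc;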

Method 2:

  • Spark SQL with Parquet tables converted from the text/CSV inputs
  • Scala for UDFs
  • Multiple inner and left outer joins, which I hope Spark handles well
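
And roughly what I have in mind for Method 2, written as Spark SQL statements (an untested sketch; the paths and lookup table names are placeholders, and the Scala UDFs would be registered separately with spark.udf.register before these run):

  -- Expose the raw CSV as a view, then convert it to Parquet once
  CREATE TEMPORARY VIEW input_raw
  USING csv
  OPTIONS (path 'hdfs:///data/input/', header 'true', inferSchema 'true');

  CREATE TABLE input_parquet USING parquet AS
  SELECT * FROM input_raw;

  -- Inner and left outer joins against the lookup tables
  SELECT i.id, l1.descr, l2.category
  FROM input_parquet i
  JOIN lookup1 l1 ON i.code1 = l1.code
  LEFT OUTER JOIN lookup2 l2 ON i.code2 = l2.code;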


@Gopalakrishnan Veeran

As a starting point, only Hive will give you ACID capabilities, so if you want to perform updates, merges, or any other CDC operations, then Hive is where you want to start.
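
For example, an update/merge pattern in Hive looks roughly like this (a minimal sketch; it needs a bucketed, transactional ORC table and a Hive version with MERGE support, and the table names are made up):

  -- Target must be a bucketed, transactional ORC table
  CREATE TABLE target_tbl (id INT, val STRING)
  CLUSTERED BY (id) INTO 8 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

  -- Upsert the weekly delta from a staging table
  MERGE INTO target_tbl AS t
  USING staging_tbl AS s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET val = s.val
  WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);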

A combination of Hive, LLAP, Tez, and ORC will give you the best performance with the best flexibility. LLAP handles ad-hoc query patterns by using a shared, distributed cache. For longer-running queries at scale, Hive with Tez has proven most reliable. In addition, Hive is the only SQL-on-Hadoop tool able to run all 99 TPC-DS queries with only trivial syntax changes, which matters when you are migrating from existing RDBMS systems.

Though not quite ready for primetime, you may want to take a look at HPL/SQL (http://www.hplsql.org/). We plan to begin introducing it into the product in future releases.

You can also read text files directly with LLAP, which eliminates the need to convert the data to ORC, a step that can be time-consuming for large files.
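
For example, something along these lines would let you query the raw files in place (a sketch; the column names, delimiter, and path are placeholders):

  -- External table over the raw CSV files; no up-front ORC conversion needed
  CREATE EXTERNAL TABLE input_csv_ext (id INT, col1 STRING, col2 STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs:///data/input/';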


@Scott Shaw Is it possible to call a stored procedure in HPL/SQL directly from BI tools (like Crystal) over a JDBC/ODBC connection? We need some data federation/orchestration (currently managed through Oracle stored procedures) and are wondering what the recommended way to achieve the same would be in the Hadoop world. Thanks.