
Is Hive on Tez with ORC really faster than Spark SQL for ETL?


I have a little experience with Hive and am currently learning Spark with Scala. I am working with HDP 2.6. I am curious to know whether Hive on Tez is really faster than Spark SQL. I searched many forums for test results, but they compare older versions of Spark and most were written in 2015. The main points are summarized below:

  • ORC in Hive performs about the same as Parquet in Spark
  • The Tez engine gives performance comparable to the Spark engine
  • Joins are better/faster in Hive than in Spark

I feel like Hortonworks puts more support behind Hive than Spark, and Cloudera the other way around.

Initially I thought Spark would be faster than anything else because of its in-memory execution, but after reading some articles I see that Hive is also improving with new concepts like Tez, ORC, LLAP, etc.

We currently run on Oracle PL/SQL and are migrating to big data since volumes are increasing. My requirement is ETL-style batch processing; details of the data involved in each weekly batch run are below. Data volume will grow significantly soon.

  • Input/lookup data arrive as CSV/text files and are loaded into tables
  • Two input tables with 5 million rows and 30 columns
  • About 30 lookup tables are used to generate the columns of the output table, which contains around 10 million rows and 220 columns
  • Multiple joins (inner and left outer) are involved, since many lookup tables are used

Please advise which of the methods below I should choose for better performance, readability, and ease of making minor column changes for future production deployments.

Method 1 (a rough sketch follows the list):

  • Hive on Tez with ORC tables
  • Python UDFs via the TRANSFORM option
  • Joins with performance tuning such as map joins
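
For illustration only, a minimal HiveQL sketch of what Method 1 could look like; the table names, columns, and the my_udf.py script are all made up for this example:

  -- land the raw input into an ORC-backed table (names are placeholders)
  CREATE TABLE input_orc STORED AS ORC AS SELECT * FROM input_csv;

  -- run a Python script row by row through TRANSFORM (my_udf.py is hypothetical)
  ADD FILE /tmp/my_udf.py;
  SELECT TRANSFORM (id, col_a, col_b)
         USING 'python my_udf.py'
         AS (id, derived_col)
  FROM input_orc;

  -- hint a map-side join against a small lookup table
  SELECT /*+ MAPJOIN(lkp) */ i.id, lkp.description
  FROM input_orc i
  JOIN lookup_dim lkp ON i.code = lkp.code;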

Method 2 (a rough sketch follows the list):

  • Spark SQL with Parquet tables converted from the text/CSV input
  • Scala for UDFs
  • Hopefully multiple inner and left outer joins can be performed in Spark
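
Purely as an illustration again, roughly the same flow in Spark SQL (paths and table names are placeholders; a Scala UDF would be registered separately, e.g. via spark.udf.register, before being referenced in the query):

  -- expose the CSV file as a view, then persist it as Parquet
  CREATE TEMPORARY VIEW input_raw
  USING csv OPTIONS (path '/data/input.csv', header 'true', inferSchema 'true');

  CREATE TABLE input_parquet USING parquet AS SELECT * FROM input_raw;

  -- multiple inner and left outer joins are expressed just as in Hive
  SELECT i.id, l1.description, l2.category
  FROM input_parquet i
  JOIN lookup1 l1 ON i.code1 = l1.code
  LEFT OUTER JOIN lookup2 l2 ON i.code2 = l2.code;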


@Gopalakrishnan Veeran

As a starting point, only Hive will give you ACID capabilities, so if you want to perform updates, merges, or any other CDC capability, then Hive is where you want to start.
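
For example, Hive supports MERGE-style upserts on transactional ORC tables. A minimal sketch with made-up table names, assuming ACID is enabled and your Hive version supports MERGE:

  CREATE TABLE target (id INT, val STRING)
  CLUSTERED BY (id) INTO 4 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

  MERGE INTO target t USING staging s ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET val = s.val
  WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);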

A combination of Hive, LLAP, Tez, and ORC will give you the best performance with the best flexibility. LLAP handles ad-hoc query patterns by using a shared, distributed cache. For longer-running queries at scale, Hive with Tez has proven most reliable. In addition, Hive is the only SQL-on-Hadoop tool able to run all 99 TPC-DS queries with only trivial syntax changes, which is important when you are migrating from existing RDBMS systems.
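
As a rough idea of the knobs involved (property names are from memory and may vary by version), a session would typically enable Tez, vectorization, and LLAP along these lines:

  SET hive.execution.engine=tez;
  SET hive.vectorized.execution.enabled=true;
  SET hive.llap.execution.mode=all;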

Though not quite ready for prime time, you may want to take a look at HPL/SQL (http://www.hplsql.org/). We plan to begin introducing it into the product in future releases.

You can also read text files directly with LLAP, which eliminates the need to transform the data into ORC format, a step that can be time-consuming for large files.
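
For instance, a plain external table over delimited files can be queried as-is; the columns and location below are made up for the example:

  CREATE EXTERNAL TABLE input_ext (id INT, name STRING, amount DECIMAL(10,2))
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/data/incoming/input';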


@Scott Shaw Is it possible to call a stored procedure in HPL/SQL directly from BI tools (like Crystal) over a JDBC/ODBC connection? We need some data federation/orchestration (currently managed through Oracle stored procedures) and are wondering what the recommendation in the Hadoop world would be to achieve the same. Thanks.