I have little experience with Hive and am currently learning Spark with Scala. I am working with HDP 2.6. I am curious to know whether Hive on Tez is really faster than SparkSQL. I searched many forums for test results, but they compare older versions of Spark and most of them were written in 2015. I have summarized the main points below.
I feel like Hortonworks supports Hive more than Spark, and Cloudera vice versa.
Initially I thought Spark would be faster than anything else because of its in-memory execution. After reading some articles, I understand that Hive is also being improved with newer concepts like Tez, ORC, LLAP, etc.
We currently run on Oracle PL/SQL and are migrating to big data because volumes are increasing. My requirements are ETL-style batch processing, and I have included details of the data involved in each weekly batch run. Data volumes will grow significantly soon.
Please advise which of the methods below I should choose for better performance, readability, and ease of making minor column updates in a future production deployment.
As a starting point, only Hive provides ACID capabilities, so if you want to perform updates, merges, or any other CDC operations, then Hive is where you want to start.
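To illustrate the kind of update and merge workload this refers to, here is a minimal HiveQL sketch. The table and column names (`customer`, `customer_staging`, `id`, `name`, `city`) are hypothetical, and it assumes a Hive version that supports `MERGE` (Apache Hive 2.2 and later) with transactions enabled on the cluster.

```sql
-- Session settings commonly required for ACID tables (often already set cluster-wide).
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical target table. In Hive 2.x, ACID tables must be stored as ORC,
-- bucketed, and flagged as transactional.
CREATE TABLE customer (
  id   INT,
  name STRING,
  city STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Hypothetical staging table holding the rows from the latest weekly batch.
CREATE TABLE customer_staging (
  id   INT,
  name STRING,
  city STRING
)
STORED AS ORC;

-- Upsert the batch: update rows that already exist, insert the rest.
MERGE INTO customer AS t
USING customer_staging AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET name = s.name, city = s.city
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.city);
```

The bucket count of 4 is only a placeholder; a suitable value depends on your data volume.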
A combination of Hive, LLAP, Tez, and ORC will give you the best performance with the best flexibility. LLAP will handle your ad-hoc query patterns by using a shared, distributed cache. For longer-running queries at scale, Hive with Tez has proven the most reliable. In addition, Hive is the only SQL-on-Hadoop tool able to run all 99 TPC-DS queries with only trivial syntax changes. This is important when you are migrating from existing RDBMS systems.
Though it is not quite ready for primetime, you may want to take a look at HPL/SQL (http://www.hplsql.org/). We plan to begin introducing this into the product in future releases.
You are also able to read text files directly with LLAP, which eliminates the need to transform the data to the ORC format, a step that can be time-consuming for large files.
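As a rough sketch of querying text files in place, the HiveQL below defines an external table over a directory of comma-delimited files and queries it without converting to ORC first. The table name, columns, and HDFS path are hypothetical, and whether the query is actually served by LLAP depends on how your cluster and session are configured.

```sql
-- Hypothetical external table over raw comma-delimited text files.
-- Dropping an external table leaves the underlying files untouched.
CREATE EXTERNAL TABLE weekly_batch_raw (
  id   INT,
  name STRING,
  city STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/landing/weekly_batch';

-- Query the text data directly, with no prior conversion to ORC.
SELECT city, COUNT(*) AS row_count
FROM weekly_batch_raw
GROUP BY city;
```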
@Scott Shaw Is it possible to directly call a stored procedure in HPL/SQL from BI tools (like Crystal) over a JDBC/ODBC connection? We need some data federation/orchestration (currently managed through Oracle stored procedures) and are wondering what the recommendation would be in the Hadoop world to achieve the same. Thanks