12-13-2016 08:17 AM
Yes, Cloudera's recommendation is to use the SQL engines already provided within the ecosystem. Spark, Hive, and Impala can share the same metastore, so any data source created or modified by one engine should be accessible from the others as well.
Spark's Thrift server has been found to have some concurrency issues, whereas Hive and Impala were designed to provide better concurrency. You can connect Tableau, Excel, and many other tools to Hive or Impala via JDBC or ODBC. In particular, Impala provides lower latency and possibly a better user experience with BI tools, while Hive can provide better fault tolerance and throughput, making it more suitable for ETL processes.
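As a rough illustration of the JDBC side, here is a minimal sketch that only builds the connection strings a BI tool or client would be pointed at. It assumes an unsecured cluster on the stock ports (10000 for HiveServer2, 21050 for Impala's JDBC endpoint) and that the `jdbc:impala://` scheme is served by Cloudera's Impala JDBC driver; the host name and the `jdbc_url` helper are made up for the example, and a Kerberos or SSL setup would need extra URL parameters.

```python
# Sketch only: builds JDBC connection URLs for Hive and Impala.
# Assumptions: unsecured cluster, stock ports, hypothetical host name.

# engine -> (JDBC URL scheme, default port)
DEFAULTS = {
    "hive": ("hive2", 10000),     # HiveServer2
    "impala": ("impala", 21050),  # Impala, via Cloudera's Impala JDBC driver
}


def jdbc_url(engine, host, database="default", port=None):
    """Return a JDBC URL for the given engine ('hive' or 'impala')."""
    try:
        scheme, default_port = DEFAULTS[engine.lower()]
    except KeyError:
        raise ValueError(f"unsupported engine: {engine!r}") from None
    chosen_port = port if port is not None else default_port
    return f"jdbc:{scheme}://{host}:{chosen_port}/{database}"


if __name__ == "__main__":
    # gateway.example.com is a placeholder, not a real host.
    for engine in ("hive", "impala"):
        print(jdbc_url(engine, "gateway.example.com"))
```

Because the engines share the metastore, the same database and table names should resolve through either URL, so switching a tool from Hive to Impala is mostly a matter of changing the driver and connection string.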
12-13-2016 01:43 PM
Classic Hive is just too slow for applications requiring Spark-level performance. And Cloudera's glaring foot-dragging on keeping Hive up to date (v1.1 is over a year and a half old) in comparison with other vendors is increasingly becoming a deal-breaker. Hive is still the core SQL/RDBMS engine for Hadoop, and not everyone wants to move to Impala. If it's not performant, your product is out of the enterprise data warehouse.
12-13-2016 01:46 PM - edited 12-13-2016 01:47 PM
Cloudera backports a lot of Hive JIRAs/patches with every CDH release.
Not sure if this helps in your scenario.
But I totally agree, Hive 1.1 is way too old.
Hope to see Hive 2.1 by CDH 6, I guess?
12-13-2016 08:18 PM - last edited on 12-14-2016 05:04 AM by cjervis
Cloudera does indeed backport many fixes into the distribution. Cloudera is committed to delivering an enterprise-grade platform and tests rigorously to ensure that released software is supportable and meets certain criteria. This may mean that upstream releases are not included until they have been fully vetted.
This community article may also be helpful in understanding component versioning compared to competitors: https://community.cloudera.com/t5/Cloudera-Manager-Installation/CDH-components-version-compared-with...