Support Questions

Find answers, ask questions, and share your expertise

Wich sql engine best solution to use with CDP ?

avatar
New Contributor

Hi,

 

For a CDP Data Warehouse, we need to know wich is the best solution recommended to use ? Hive, Hawq or Impala ?

 

Thanks in advance for your feedback.

2 REPLIES 2

avatar
Master Mentor

@anass 

Hive and Impala have 2 distinct use-cases Hive, a data warehouse system is used for analyzing structured data. Uses HQL or the Hive Query Language which gets internally converted to MapReduce jobs which is fault-tolerant and a very good candidate for ETLs and batch-processing.
On the other hand, Impala executes faster using an engine designed especially for the mission of interactive SQL over HDFS although unlike Hive, Impala is not fault-tolerance. But a fantastic MPP (Massive Parallel Processing) engine.

 

- Hive generates query expressions at compile time whereas Impala does runtime code generation for “big loops” with no need for data movement and data transformation for storing data on Hadoop, .

- Impala no java knowledge is required programmatically accessing the data in HDFS or HBase a basic knowledge of SQL querying can do the work.

- Impala performs best when it queries files stored as Parquet format. It's good for sampling data

- Apache Hive is not ideal for interactive computing query whereas Impala is meant for interactive computing.

- Hive is batch-based Hadoop MapReduce whereas Impala is more like MPP database.

- Hive supports complex types but Impala does not.

- Apache Hive is fault-tolerant whereas Impala does not support fault tolerance. So its the best candidate for batch processing which is prone to failures When a hive query is run and if the DataNode goes down while the query is being executed, the output of the query will be produced as Hive is fault-tolerant. However, that is not the case with Impala. If a query execution fails in Impala it has to be started all over again.

- Hive can transform SQL queries into Spark or MR jobs making it a good choice for long-running ETL jobs for which it is desirable to have fault tolerance because developers do not want to re-run a long-running job after executing it for several hours.

 

For better comparison here is the benchmark HAWQ,Hive and Impala

Hope that helps

avatar

I think the reality is now that both are great technologies and the overlap in use cases is pretty big - there are a lot of SQL workloads where either can work.

 

I just wanted to clarify a few points.

 

Impala does support querying complex types from Parquet - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_complex_types.html

 

We also are working on a transparent query retry feature in Impala that should be released soon.