For a CDP Data Warehouse, we need to know wich is the best solution recommended to use ? Hive, Hawq or Impala ?
Thanks in advance for your feedback.
Hive and Impala have 2 distinct use-cases Hive, a data warehouse system is used for analyzing structured data. Uses HQL or the Hive Query Language which gets internally converted to MapReduce jobs which is fault-tolerant and a very good candidate for ETLs and batch-processing.
On the other hand, Impala executes faster using an engine designed especially for the mission of interactive SQL over HDFS although unlike Hive, Impala is not fault-tolerance. But a fantastic MPP (Massive Parallel Processing) engine.
- Hive generates query expressions at compile time whereas Impala does runtime code generation for “big loops” with no need for data movement and data transformation for storing data on Hadoop, .
- Impala no java knowledge is required programmatically accessing the data in HDFS or HBase a basic knowledge of SQL querying can do the work.
- Impala performs best when it queries files stored as Parquet format. It's good for sampling data
- Apache Hive is not ideal for interactive computing query whereas Impala is meant for interactive computing.
- Hive is batch-based Hadoop MapReduce whereas Impala is more like MPP database.
- Hive supports complex types but Impala does not.
- Apache Hive is fault-tolerant whereas Impala does not support fault tolerance. So its the best candidate for batch processing which is prone to failures When a hive query is run and if the DataNode goes down while the query is being executed, the output of the query will be produced as Hive is fault-tolerant. However, that is not the case with Impala. If a query execution fails in Impala it has to be started all over again.
- Hive can transform SQL queries into Spark or MR jobs making it a good choice for long-running ETL jobs for which it is desirable to have fault tolerance because developers do not want to re-run a long-running job after executing it for several hours.
For better comparison here is the benchmark HAWQ,Hive and Impala
Hope that helps
I think the reality is now that both are great technologies and the overlap in use cases is pretty big - there are a lot of SQL workloads where either can work.
I just wanted to clarify a few points.
Impala does support querying complex types from Parquet - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_complex_types.html
We also are working on a transparent query retry feature in Impala that should be released soon.