Hive and Impala have 2 distinct use-cases Hive, a data warehouse system is used for analyzing structured data. Uses HQL or the Hive Query Language which gets internally converted to MapReduce jobs which is fault-tolerant and a very good candidate for ETLs and batch-processing. On the other hand, Impala executes faster using an engine designed especially for the mission of interactive SQL over HDFS although unlike Hive, Impala is not fault-tolerance. But a fantastic MPP (Massive Parallel Processing) engine.
- Hive generates query expressions at compile time whereas Impala does runtime code generation for “big loops” with no need for data movement and data transformation for storing data on Hadoop, .
- Impala no java knowledge is required programmatically accessing the data in HDFS or HBase a basic knowledge of SQL querying can do the work.
- Impala performs best when it queries files stored as Parquet format. It's good for sampling data
- Apache Hive is not ideal for interactive computing query whereas Impala is meant for interactive computing.
- Hive is batch-based Hadoop MapReduce whereas Impala is more like MPP database.
- Hive supports complex types but Impala does not.
- Apache Hive is fault-tolerant whereas Impala does not support fault tolerance. So its the best candidate for batch processing which is prone to failures When a hive query is run and if the DataNode goes down while the query is being executed, the output of the query will be produced as Hive is fault-tolerant. However, that is not the case with Impala. If a query execution fails in Impala it has to be started all over again.
- Hive can transform SQL queries into Spark or MR jobs making it a good choice for long-running ETL jobs for which it is desirable to have fault tolerance because developers do not want to re-run a long-running job after executing it for several hours.