question Re: Wich sql engine best solution to use with CDP ? in Support Questions

https://community.cloudera.com/t5/Support-Questions/Wich-sql-engine-best-solution-to-use-with-CDP/m-p/301530#M220727

<a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/80819">@anass</a>

Hive and Impala have two distinct use cases.

Hive is a data warehouse system used for analyzing structured data. It uses HQL (the Hive Query Language), which is converted internally into MapReduce jobs (or, in more recent releases, Tez jobs). That execution model is fault-tolerant, which makes Hive a very good candidate for ETL and batch processing.

Impala, on the other hand, executes faster, using an engine designed specifically for interactive SQL over HDFS. Unlike Hive, Impala is not fault-tolerant, but it is a fantastic MPP (Massively Parallel Processing) engine.

- Hive generates query expressions at compile time, whereas Impala does runtime code generation for “big loops”. Impala also needs no data movement or data transformation for data already stored on Hadoop.
- With Impala, no Java knowledge is required to access data in HDFS or HBase programmatically; a basic knowledge of SQL querying is enough.
- Impala performs best when it queries files stored in Parquet format. It is good for sampling data.
- Apache Hive is not ideal for interactive queries, whereas Impala is meant for interactive computing.
- Hive is batch-based (Hadoop MapReduce), whereas Impala is more like an MPP database.
- Hive has broader support for complex types; Impala's support is more limited.
- Apache Hive is fault-tolerant, whereas Impala does not support fault tolerance, so Hive is the better candidate for batch processing, which is prone to failures. If a DataNode goes down while a Hive query is running, the query still produces its output. That is not the case with Impala: if a query fails during execution, it has to be started all over again.
- Hive can transform SQL queries into Spark or MR jobs, making it a good choice for long-running ETL jobs where fault tolerance is desirable, because developers do not want to re-run a job that has already been executing for several hours.

For a better comparison, here is the <A href="https://dataintoresults.com/post/big-data-benchmark-impala-vs-hawq-vs-hive/" target="_blank" rel="noopener">benchmark of HAWQ, Hive and Impala</A>.

Hope that helps.

Shelton, Sat, 15 Aug 2020 10:16:40 GMT
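
To illustrate the point that a failed Impala query has to be started all over again, here is a minimal client-side sketch of a restart loop in Python. It is not part of Impala or of any Impala client library: `run_query` is a hypothetical caller-supplied function that submits the whole query, and `QueryFailed` is a hypothetical exception standing in for whatever error your client raises when a query aborts.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class QueryFailed(Exception):
    """Hypothetical error: stands in for a client error raised when a query aborts."""


def run_with_restart(
    run_query: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> T:
    """Re-submit the whole query from scratch each time it fails.

    Nothing is resumed: every attempt calls run_query() again in full,
    which mirrors an engine without mid-query fault tolerance.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[QueryFailed] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except QueryFailed as exc:
            last_error = exc
            if attempt < max_attempts:
                # Simple linear backoff before the next full restart.
                time.sleep(backoff_seconds * attempt)

    raise QueryFailed(f"query failed after {max_attempts} attempts") from last_error
```

Every retry pays the full cost of the query again, so this is reasonable for short interactive queries but is exactly what you want to avoid for a multi-hour ETL job, which is where Hive's fault tolerance is the better fit.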