<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Which SQL engine is the best solution to use with CDP? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Wich-sql-engine-best-solution-to-use-with-CDP/m-p/301530#M220727</link>
    <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/80819"&gt;@anass&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hive and Impala have two distinct use cases. Hive is a data warehouse system for analyzing structured data. It uses HQL (Hive Query Language), which is internally converted to batch jobs (classically MapReduce; Tez in current CDP releases). That execution model is fault-tolerant, which makes Hive a very good candidate for ETL and batch processing.&lt;BR /&gt;Impala, on the other hand, executes queries faster, using an engine designed specifically for interactive SQL over HDFS. Unlike Hive, Impala is not fault-tolerant, but it is a fantastic MPP (Massively Parallel Processing) engine.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;- Hive generates query expressions at compile time, whereas Impala does runtime code generation for “big loops”. Impala also requires no data movement or transformation for data already stored on Hadoop.&lt;/P&gt;&lt;P&gt;- With Impala, no Java knowledge is required to access data in HDFS or HBase programmatically; a basic knowledge of SQL is enough.&lt;/P&gt;&lt;P&gt;- Impala performs best when it queries files stored in Parquet format. It is also good for sampling data.&lt;/P&gt;&lt;P&gt;- Apache Hive is not ideal for interactive queries, whereas Impala is built for them.&lt;/P&gt;&lt;P&gt;- Hive is batch-oriented (originally built on Hadoop MapReduce), whereas Impala is more like an MPP database.&lt;/P&gt;&lt;P&gt;- Hive supports complex types broadly, whereas Impala's support is more limited (for example, tied to specific file formats such as Parquet).&lt;/P&gt;&lt;P&gt;- Apache Hive is fault-tolerant, whereas Impala does not support fault tolerance. That makes Hive the better candidate for batch processing, which is prone to failures. If a DataNode goes down while a Hive query is running, the query still produces its output, because Hive is fault-tolerant. That is not the case with Impala.
If a query fails in Impala, it has to be started all over again.&lt;/P&gt;&lt;P&gt;- Hive can transform SQL queries into MapReduce, Spark or Tez jobs, depending on the platform version. That makes it a good choice for long-running ETL jobs where fault tolerance is desirable, because developers do not want to re-run a job from scratch after it has already been executing for several hours.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For a more detailed comparison, here is a &lt;A href="https://dataintoresults.com/post/big-data-benchmark-impala-vs-hawq-vs-hive/" target="_blank" rel="noopener"&gt;benchmark of HAWQ, Hive and Impala&lt;/A&gt;.&lt;BR /&gt;&lt;BR /&gt;Hope that helps.&lt;/P&gt;</description>
    <pubDate>Sat, 15 Aug 2020 10:16:40 GMT</pubDate>
    <dc:creator>Shelton</dc:creator>
    <dc:date>2020-08-15T10:16:40Z</dc:date>
  </channel>
</rss>

