New Contributor
Posts: 1
Registered: 11-28-2016
Accepted Solution

Hive on Spark or Impala in batch Process (ETL)


Hi All,


I have a question about performance and usability in batch processing (ETL): Impala or Hive on Spark (HoS)?


I've read that Impala performs better than HoS, but that using it for batch processing (ETL) is not considered "best practice" (or at least is not usual).

Why? If it's the fastest, why not use it for everything?



Rodrigo Carvalho

Cloudera Employee
Posts: 307
Registered: 10-16-2013

Re: Hive on Spark or Impala in batch Process (ETL)


Some thoughts on your question:

- Hive is more flexible in terms of data formats that it can scan

- You may find Hive to be more feature rich in terms of SQL language support and built-in functions

- Hive will most likely complete your query even if there are node failures (this makes it suitable for long-running jobs); this is true for both Hive on MR and Hive on Spark

- If Impala can run your ETL, then it will probably be faster

- Impala will fail/abort a query if a node goes down during query execution

- The last point may make Impala less suitable for long-running jobs. On the other hand, the failure window is shorter because queries finish faster, so Impala may very well suit your ETL needs if you can tolerate the failure behavior
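
If you do go with Impala, one common way to tolerate the abort-on-node-failure behavior is to re-run the whole step from your ETL driver. Here is a minimal sketch of that idea, not an Impala API: `run_with_retry` and its parameters are hypothetical names, and `step` stands for whatever function submits your query through the client you already use. It assumes the step is safe to repeat (for example an `INSERT OVERWRITE` into a target partition rather than a plain append).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(
    step: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> T:
    """Run an ETL step, re-running it from scratch if it raises.

    Hypothetical helper: `step` is assumed to submit one query and to
    raise an exception when the query is aborted (e.g. a node went down).
    The step must be idempotent, since a failed attempt is simply repeated.
    """
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as err:  # narrow this to your client's error type
            last_error = err
            if attempt < max_attempts:
                # Linear backoff: wait longer after each failed attempt.
                time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error
```

Because each attempt restarts the query from the beginning, this only pays off when the individual queries are short. For a multi-hour job, Hive's ability to recover from node failures mid-query is the safer choice.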


You may also find this article interesting: