Hive on Spark or Impala in batch Process (ETL)

New Contributor

 

Hi All,

 

I have a question about performance and usability in batch processing (ETL): Impala versus Hive on Spark (HoS).

I've read that Impala performs better than HoS, but that using it for batch processing (ETL) is not considered "best practice" (or at least is not usual).

Why? If it's the fastest, why not use it for everything?

 

hugs,

Rodrigo Carvalho

1 ACCEPTED SOLUTION


Some thoughts on your question:

- Hive is more flexible in terms of data formats that it can scan

- You may find Hive to be more feature rich in terms of SQL language support and built-in functions

- Hive will most likely complete your query even if there are node failures (this makes it suitable for long-running jobs); this is true for both Hive on MR and Hive on Spark

- If Impala can run your ETL, then it will probably be faster

- Impala will fail/abort a query if a node goes down during query execution

- The last point may make Impala less suitable for long-running jobs. On the other hand, because queries are faster, the window in which a failure can hit them is shorter, so Impala may very well suit your ETL needs if you can tolerate the failure behavior
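
If you do go with Impala, one way to tolerate the abort-on-node-failure behavior is to retry the whole ETL step from your scheduler or driver script. Below is a minimal sketch of that idea in Python, not an official pattern. `run_with_retries` and the `run_query` callable are hypothetical names made up for illustration; `run_query` stands in for whatever you already use to submit the statement. It assumes the step is safe to re-run from scratch (for example, an INSERT OVERWRITE into a target partition).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(
    run_query: Callable[[], T],          # hypothetical: submits one ETL statement
    max_attempts: int = 3,
    backoff_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Re-run an ETL step from the start each time it fails.

    Assumption: the step is idempotent, so a full re-run after an
    aborted query cannot leave duplicate or partial data behind.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as exc:  # narrow this to your client's error type
            last_error = exc
            if attempt < max_attempts:
                # Linear backoff: give the cluster time to recover.
                sleep(backoff_seconds * attempt)

    raise RuntimeError(
        f"query failed after {max_attempts} attempts"
    ) from last_error
```

Note that a retry repeats all the work of the failed attempt, so this only pays off when the individual queries are short. For a job that runs for hours, the fault tolerance you get from Hive is likely the better fit.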

 

You may also find this article interesting:

https://vision.cloudera.com/sql-on-apache-hadoop-choosing-the-right-tool-for-the-right-job/

 


2 REPLIES


Explorer

- Hive is more flexible in terms of the data formats it can scan

- You may find Hive to be more feature-rich in terms of SQL language support and built-in functions

- Hive will most likely complete your query even if there are node failures (this makes it suitable for long-running jobs); this is true for both Hive on MR and Hive on Spark

- If Impala can run your ETL, then it will probably be faster

- Impala will fail/abort a query if a node goes down during query execution

- The last point may make Impala less suitable for long-running jobs. On the other hand, there is also a shorter failure window because queries are faster, so Impala may well suit your ETL needs if you can tolerate the failure behavior