05-02-2017 07:40 AM
I have a question about performance and suitability for batch processing (ETL): Impala versus Hive on Spark (HoS).
I've read that Impala performs better than HoS, but that using it for batch processing (ETL) is not considered "best practice" (or at least is not usual).
Why? If it's the fastest, why not use it for everything?
05-03-2017 06:08 PM
Some thoughts on your question:
- Hive is more flexible in terms of data formats that it can scan
- You may find Hive to be more feature rich in terms of SQL language support and built-in functions
- Hive will most likely complete your query even if there are node failures (this makes it suitable for long-running jobs); this is true for both Hive on MR and Hive on Spark
- If Impala can run your ETL, then it will probably be faster
- Impala will fail/abort a query if a node goes down during query execution
- The last point may make Impala less suitable for long-running jobs. On the other hand, because its queries finish faster, the window in which a failure can hit a query is also shorter, so Impala may very well suit your ETL needs if you can tolerate the failure behavior
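If you go with Impala and want to soften the "query aborts when a node goes down" behavior, one common approach is to retry the failed step from the client side. Below is a minimal sketch of that idea, not anything built into Impala. `run_with_retries` and its parameters are names I made up for illustration, and `step` stands for whatever function submits your query through your own client. It also assumes each ETL step is safe to re-run from scratch (for example, a statement that overwrites its target rather than appending to it).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    step: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run an ETL step, re-running it from the start if it raises.

    `step` is a hypothetical callable that submits one query and raises
    on failure. It must be safe to re-run, because an aborted query
    leaves nothing to resume from.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            # Out of attempts: surface the last error to the scheduler.
            if attempt == max_attempts:
                raise
            # Linear backoff gives the cluster time to drop the dead node.
            sleep(backoff_seconds * attempt)

    # Not reachable: the loop either returns or re-raises.
    raise RuntimeError("unreachable")
```

You would call it as `run_with_retries(lambda: my_submit(sql))`, where `my_submit` is your own function. In practice you would probably narrow the `except` to the error types your client raises for aborted queries, so that a genuine SQL error is not retried.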
You may also find this article interesting: