Created 11-22-2022 10:54 PM
I am about to upgrade from CDH to CDP and I have some questions regarding the new version of Hive.
Until now I have used Hive as the ETL service because it is more stable, though slower, than Impala. The tables that BI users see are in Impala.
My questions are:
1) Is Hive 3 fast enough to compete with Impala?
2) For BI use, is it more appropriate to point users at Hive or at Impala? (I read that Hive 3 uses a cache that makes repeated BI requests faster.)
3) For a Kafka flow, is it appropriate to create an ACID table in Hive 3 and store the fetched data in it live?
Created 11-23-2022 01:00 AM
1. Impala is generally faster for interactive queries. Impala does not use YARN. Impala caches catalog metadata locally, so it fetches table information faster. The Impala backend is written in C++, which is very fast.
2. Impala is not fault tolerant, so it is best suited for ad hoc queries, while ETL is best suited for Hive because Hive is fault tolerant. If a query fails due to a network or disk failure, Hive will retry it but Impala will fail.
3. For streaming/ingestion such as a Kafka flow, you should land the data in EXTERNAL tables, not in managed (ACID) tables. Managed tables can be used if you want to alter the data afterwards with UPDATE/DELETE.
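To illustrate point 3, here is a rough sketch of the two kinds of table definitions. It is only a small Python helper that builds HiveQL strings. The table names, columns, and the landing path are made up for the example, and `build_ddl` is a hypothetical helper, not part of any Hive client. Adjust the storage format and location to your own setup.

```python
def build_ddl(table, columns, needs_update_delete, location=None):
    """Return a HiveQL CREATE TABLE statement as a string (illustrative only)."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    if needs_update_delete:
        # Managed ACID table: full ACID (UPDATE/DELETE) requires ORC storage.
        return (
            f"CREATE TABLE {table} ({cols}) "
            "STORED AS ORC TBLPROPERTIES ('transactional'='true')"
        )
    # External table: Hive only tracks the metadata, the files stay where
    # the ingestion job writes them.
    ddl = f"CREATE EXTERNAL TABLE {table} ({cols}) STORED AS PARQUET"
    if location:
        ddl += f" LOCATION '{location}'"
    return ddl


# Hypothetical schema for events read from Kafka.
columns = [
    ("event_id", "STRING"),
    ("payload", "STRING"),
    ("event_ts", "TIMESTAMP"),
]

# Landing table for the stream (external), and a curated table that
# needs UPDATE/DELETE (managed ACID).
landing_ddl = build_ddl(
    "kafka_events_raw", columns, False, "/data/landing/kafka_events"
)
curated_ddl = build_ddl("kafka_events_curated", columns, True)

print(landing_ddl)
print(curated_ddl)
```

The idea is that the stream lands in the external table, and a periodic Hive ETL job moves rows into the ACID table only if you really need UPDATE/DELETE on them.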
Please let me know if you have any further questions. Please click "Accept As Solution" if your query is answered.