Created 08-08-2016 08:19 PM
All,
If I turn on ACID for Hive, is there a performance impact? And if so, are there best practices to mitigate or address it, such as tuning Tez or the number of mappers?
The goal is to update tables/records and to determine whether it makes more sense to do nightly batches or incremental updates throughout the day.
Created 08-09-2016 01:05 PM
See https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-BasicDesign for the basic design. ACID transactions shouldn't impact analytical queries while inserts are happening: a read can take a version of the data, and a write can put down a new version without waiting on locks. But this adds the overhead of reading the delta files along with the base ORC file. As delta files accumulate, you'll need to compact/consolidate the edits, which uses cluster resources. The impact depends on the volume of updates. Hive ACID should be reserved for low-concurrency workloads (roughly 50 or fewer concurrent users).
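For reference, here is a minimal sketch of the session settings and DDL needed to get an ACID table working on a 2016-era HDP/Hive 1.x stack; the table name and columns are hypothetical, and the metastore-side compactor settings are noted in the comments:

```sql
-- Session-level settings required for ACID reads/writes. The compactor
-- itself runs on the metastore and is enabled there via
-- hive.compactor.initiator.on and hive.compactor.worker.threads
-- in hive-site.xml.
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.enforce.bucketing = true;   -- needed on Hive 1.x; removed in Hive 2.0
SET hive.exec.dynamic.partition.mode = nonstrict;

-- ACID tables must be managed, bucketed, stored as ORC,
-- and flagged transactional. Table and columns are illustrative.
CREATE TABLE web_events (
  event_id BIGINT,
  user_id  BIGINT,
  payload  STRING
)
CLUSTERED BY (event_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```

Note that the bucketing requirement ties back to your mapper question: write parallelism is bounded by the bucket count, so it is worth sizing the buckets for your update volume up front.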
In general, I recommend following the Hive query performance best practices: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_performance_tuning/content/ch_hive_hi_per...
Created 08-09-2016 12:55 PM
Hi @devers, ACID does carry some performance considerations, especially if you have a high volume of inserts and deletes. Performance will slowly degrade over time until a compaction is executed; after compaction, performance normalizes. Based on your use case, you'll want to adjust the compaction frequency to find your performance sweet spot (see the sketch after the link below).
https://community.hortonworks.com/questions/15095/hive-compaction-for-acid-transactions.html
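As a hedged illustration of the above, you can watch delta buildup and trigger compaction by hand rather than waiting for the automatic thresholds; the table name is hypothetical:

```sql
-- List queued, running, and recently completed compactions
SHOW COMPACTIONS;

-- Queue a minor compaction (merges the delta files into a single delta)
ALTER TABLE web_events COMPACT 'minor';

-- Queue a major compaction (rewrites base + deltas into a new base)
ALTER TABLE web_events COMPACT 'major';
```

On a partitioned table you would add a PARTITION (...) clause, since compactions run per partition.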
Created 08-09-2016 01:13 PM
Good info. The concurrency figure you state is in line with other recommendations I have seen for dealing with large numbers of updates. Thank you!
Created 08-09-2016 01:11 PM
Thanks Scott! Is there a working example or a set of published metrics to help admins determine the compaction frequency or identify a threshold?
I see several generalized statements, but not much in the way of test numbers. Maybe a follow-up HCC article is in the making.
Created 08-10-2016 06:10 PM
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Configuration has the list of configuration options. For example, hive.compactor.delta.num.threshold lets you control how often minor compaction runs relative to the number of SQL operations: each SQL insert/update/delete generates one delta file, and each minor compaction combines whatever delta files it finds into a single new delta that includes all of the edits. There is no specific guidance available, since the right values depend on your particular setup and requirements and have to be tested for. The other options listed there can help as well.
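As a sketch of where those knobs live: the global thresholds are set in hive-site.xml on the metastore host (hive.compactor.initiator.on, hive.compactor.worker.threads, hive.compactor.delta.num.threshold, hive.compactor.delta.pct.threshold). Newer Hive releases also accept per-table overrides via TBLPROPERTIES, sketched below with a hypothetical table and illustrative values:

```sql
-- Assumption: per-table compaction overrides (compactorthreshold.*) are
-- available in your Hive version; otherwise set the global equivalents
-- in hive-site.xml on the metastore host.

-- Trigger minor compaction after 4 delta directories instead of the default 10
ALTER TABLE web_events SET TBLPROPERTIES (
  'compactorthreshold.hive.compactor.delta.num.threshold' = '4'
);

-- Trigger major compaction once deltas reach 25% of the base size (default 10%)
ALTER TABLE web_events SET TBLPROPERTIES (
  'compactorthreshold.hive.compactor.delta.pct.threshold' = '0.25'
);
```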