HIVE and ACID table performance for updates

Rising Star

All,

If I turn on ACID for HIVE, is there a performance impact? And if so, are there best practices to mitigate or address it, such as tuning TEZ or the number of mappers?

The goal is to update tables/records and determine whether it makes more sense to do nightly batch updates or incremental updates throughout the day.

1 ACCEPTED SOLUTION


https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-BasicDesignfor ... ACID transactions shouldn't impact analytical queries while inserts are happening. A read can take a version and a write can put out a new version of the data without waiting on locks. But this adds the overhead of reading the delta files along with the main ORC file. As delta files accumulate, you'll need to compact/consolidate the edits, which will use cluster resources. The impact depends on the number of updates. Hive ACID should be used for low concurrency, i.e. 50 or fewer concurrent users.

In general, I recommend following the Hive query performance best practices - http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_performance_tuning/content/ch_hive_hi_per...
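For reference, here is a minimal sketch of what an ACID table looks like in practice on a Hive 1.x / HDP 2.x-era cluster. The table name and values are illustrative examples, not recommendations:

SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.enforce.bucketing = true;                -- required for ACID writes on Hive 1.x
SET hive.exec.dynamic.partition.mode = nonstrict;

-- In this generation of Hive, ACID tables must be bucketed, stored as ORC,
-- and flagged transactional.
CREATE TABLE customer_acid (
  id INT,
  name STRING,
  status STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Each UPDATE/DELETE writes a new delta directory next to the base files;
-- readers merge base + deltas at query time until compaction consolidates them.
UPDATE customer_acid SET status = 'inactive' WHERE id = 42;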


5 REPLIES


Hi @devers, ACID does come with performance considerations, especially if you have a high number of inserts and deletes. Performance will slowly degrade over time until a compaction is executed; after compaction, performance returns to normal. Based on your use case, you'll want to adjust the compaction frequency to find your performance sweet spot.

https://community.hortonworks.com/questions/15095/hive-compaction-for-acid-transactions.html
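If you want to see where a table stands, or force a compaction yourself (e.g. right after a nightly batch of updates), Hive exposes this directly. customer_acid here is just an example table name:

SHOW COMPACTIONS;                            -- lists queued, running, and completed compactions

ALTER TABLE customer_acid COMPACT 'minor';   -- merge the delta files into a single delta
ALTER TABLE customer_acid COMPACT 'major';   -- rewrite base + deltas into a new base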


Rising Star

Good info. The concurrency metric you state is in line with other recommendations I have seen when dealing with large numbers of updates. Thank you!

Rising Star

Thanks Scott! Is there a working example or a set of published metrics to help admins determine the compaction frequency or identify a threshold?

I see several generalized statements but not much in the way of test numbers. Maybe a follow-up HCC article in the making.

Super Collaborator

https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Configuration has a list of configuration options. For example, hive.compactor.delta.num.threshold lets you control how often minor compaction runs with respect to the number of SQL operations. Each SQL insert/update/delete generates a delta file, and each minor compaction combines whatever delta files it finds into a single new delta that includes all the info. There is no specific guidance available, since the right frequency depends on your specific setup and requirements and has to be tested for. The other options listed there can help as well.
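As a sketch of the main knobs (values here are illustrative, not tuned recommendations; the compactor runs inside the metastore service, so in practice these belong in hive-site.xml there rather than in a client session):

SET hive.compactor.initiator.on = true;        -- enable automatic compaction scheduling
SET hive.compactor.worker.threads = 2;         -- how many compactions may run in parallel
SET hive.compactor.delta.num.threshold = 10;   -- minor compaction after this many deltas (default 10)
SET hive.compactor.delta.pct.threshold = 0.1;  -- major compaction once deltas reach 10% of base size
SET hive.compactor.check.interval = 300;       -- seconds between initiator checks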