Posts: 8
Registered: 09-13-2015
Accepted Solution

Large hive metastore db size when using streaming API


Hi, I'm using the Hive Streaming API to write data to Hive. Recently I looked into the metastore db and found that the tables COMPLETED_TXN_COMPONENTS, TXNS, and TXN_COMPONENTS take up a lot of space; COMPLETED_TXN_COMPONENTS alone is almost 3 GB.



I'm concerned about the growing size of these tables. Could anyone tell me what they are for?

I looked at the data in COMPLETED_TXN_COMPONENTS; the rows don't seem meaningful, other than being records of used transaction ids.
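For reference, on a MySQL-backed metastore the per-table sizes can be checked with something like the following (the schema name `metastore` is an assumption; adjust it for your install):

```sql
-- Approximate on-disk size of each metastore table, largest first
-- (MySQL; the schema name 'metastore' is an assumption)
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'metastore'
ORDER BY (data_length + index_length) DESC;
```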

1. Is it safe to clear these tables?

2. If I migrate data from one Hive cluster to another one, do I have to keep these 3 tables identical with the metastore db in the new cluster?

Posts: 1,892
Kudos: 432
Solutions: 302
Registered: 07-31-2013

Re: Large hive metastore db size when using streaming API

The Hive "Streaming" feature is built upon Hive's unsupported [1] transactional (ACID) features.

This feature (the ACID one) uses the tables you've mentioned when DbTxnManager is in use, as per the suggested configs.
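For context, the transactional setup that brings those tables into play usually looks roughly like this in hive-site.xml (illustrative values; verify the exact properties against the docs for your Hive version):

```xml
<!-- hive-site.xml: typical settings for Hive ACID / streaming ingest -->
<!-- (illustrative sketch; verify against the docs for your Hive version) -->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <!-- run the compaction Initiator thread in the metastore -->
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <!-- at least one worker thread, or compactions never actually run -->
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
```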

Cloudera does not currently recommend the use of the ACID features, because the feature is still experimental upstream in terms of stability and quality [1].

In any case, going by the code [2], if all data in your table has been compacted then the entries under COMPLETED_TXN_COMPONENTS should be deleted. Do you see any messages such as "Unable to delete compaction record" in your HMS log, or any WARN-or-higher messages from the CompactionTxnHandler class in general? Finding that error and working through it should help you solve this.
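If the cleanup isn't happening on its own, you can also inspect compaction state from the Hive shell and request one manually; a rough sketch (the table and partition names are placeholders):

```sql
-- List queued, running, and completed compactions across all tables
SHOW COMPACTIONS;

-- Request a major compaction on the streamed table
-- (table name is a placeholder)
ALTER TABLE my_streamed_table COMPACT 'major';

-- Streaming tables are typically partitioned; to target one partition:
ALTER TABLE my_streamed_table PARTITION (ds = '2015-09-13') COMPACT 'major';
```

Once a major compaction succeeds, the metastore's cleanup thread should be able to purge the corresponding COMPLETED_TXN_COMPONENTS rows.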

[1] -, specific quote:
Hive ACID is not supported
Hive ACID is an experimental feature and Cloudera does not currently support it.
[2] - etc.