
Large hive metastore db size when using streaming API


Explorer

Hi, I'm using the Hive Streaming API to write data to Hive. Recently I looked into the metastore db and found that the COMPLETED_TXN_COMPONENTS, TXNS, and TXN_COMPONENTS tables take up a lot of space; COMPLETED_TXN_COMPONENTS alone is almost 3 GB.
I'm concerned about the growing size of these tables. Could anyone tell me what they are for?

I looked at the data in COMPLETED_TXN_COMPONENTS, and the rows don't seem to be meaningful other than being records of used transaction ids.

1. Is it safe to clear these tables?

2. If I migrate data from one Hive cluster to another one, do I have to keep these 3 tables identical with the metastore db in the new cluster?

Accepted Solution

Re: Large hive metastore db size when using streaming API

Master Guru
The Hive "Streaming" feature is built on Hive's transactional features, which are unsupported [1]: https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest

This feature (the ACID one) uses the tables you've mentioned when DbTxnManager is in use, as per the suggested configs.

Cloudera does not currently recommend the use of the ACID features, because they are still experimental upstream in terms of stability and quality [1].

That said, going by the code [2], once all data in your table has been compacted, the corresponding entries in COMPLETED_TXN_COMPONENTS should be deleted. Do you see any messages such as "Unable to delete compaction record" in your HMS log, or any WARN-or-higher log output from the CompactionTxnHandler class in general? Finding and then working through that error should help you resolve the table growth.
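The log check above can be sketched as a quick grep. The log path is an assumption here (it varies by install and by how the metastore service is managed), so point HMS_LOG at your actual Hive Metastore log:

```shell
# Sketch: scan the Hive Metastore (HMS) log for the compaction-cleanup
# failures discussed above. HMS_LOG is an assumed path -- adjust it to
# wherever your metastore service writes its log.
HMS_LOG="${HMS_LOG:-/var/log/hive/hivemetastore.log}"

# 1) Failed cleanup of COMPLETED_TXN_COMPONENTS entries after compaction:
grep -n "Unable to delete compaction record" "$HMS_LOG" || true

# 2) Any WARN/ERROR lines from CompactionTxnHandler in general:
grep -nE "(WARN|ERROR).*CompactionTxnHandler" "$HMS_LOG" || true
```

Any hits from either search would point at the underlying error preventing the cleanup thread from pruning those tables.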

[1] - http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_rn_hive_ki.html, specific quote:
"""
Hive ACID is not supported
Hive ACID is an experimental feature and Cloudera does not currently support it.
"""
[2] - https://github.com/cloudera/hive/blob/cdh5.5.2-release/metastore/src/java/org/apache/hadoop/hive/met... etc.