Community Articles

nbalaji-elangov · ‎10-02-2017

The use case is to get the updates using timestamp column from multiple Databases and do some simple transformations in nifi and stream the data into hive transactional tables using puthivestreaming processor in nifi.

If not careful, this can easily lead to platform wide instability for hive.

ACID is for slowly changing tables and not for 100s of concurrent queries trying to update the same partition.
ACID tables are bucketed tables. Have correct number of buckets and have uniform data distribution among buckets. Can easily lead to data skew and only one CPU core writes to a single bucket.
Transactional manager and lock manager are stored in hive metastore. Transactional manager keeps the transactional state(open, commit and abort) and lock manager maintains the necessary locks for transactional tables. The recommendation is to separate Hive, oozie and ambari database and configure high availability for databases.
Nifi can overwhelm hive ACID tables. Nifi will stream data using hive streaming API available with puthivestreaming processor. Default value for timer driven scheduling in nifi processors is 0 which will cause a hit on the hive metastore. The recommendation is to microbatch the data from nifi with scheduling time around 1 min or more(higher the better). Batching 5000-10000 records gave the best throughput.
Compaction of the table necessary for ACID read performance tuning. Compaction can be automatic or on-demand.
Make sure to enable email alerting on hive database when the count reached a threshold of around 100,000 transactions.

hive=# select count(*) from TXNS where txn_state='a' and TXN_ID not in (select tc_txnid from txn_components); 

count
------
 3992
(1 row)

One of the data source was overwhelming the metastore. After proper batching and scheduling, the metastore was able to clean by itself. Optimize the hive metastore DB for performance tuning.

10/02/2017 (7pm):
hive=# select txn_user, count(*) from txns where txn_state='a' group by txn_user ;

txn_user |  count
hive     |   73763
nifi     | 1241297
(2 rows)

10/02/2017 (9am):
hive=# select txn_user, count(*) from txns where txn_state='a' group by txn_user ; 

txn_user | count
hive     | 58794 
nifi     | 26962
(2 rows)

Cloudera Community

Community Articles

Lessons learnt from nifi streaming data to hive transactional tables

Apache Hive

Apache NiFi