NiFi -- PutHiveStreaming processor throughput improvement


Explorer

Hello,

I am using NiFi's PutHiveStreaming processor with all the recommended settings, but am getting very low performance -- a few hundred rows per second. I'm using single Avro files with thousands of records for each insertion as 2 transactions. I also tried increasing the number of concurrent tasks, but that doesn't help either.

Can someone shed some light on how to improve performance (to around one hundred thousand rows per second)?

4 REPLIES

Re: NiFi -- PutHiveStreaming processor throughput improvement

@Opao E

What value do you have set for the Transactions per Batch property? Have you tried increasing this value?


Re: NiFi -- PutHiveStreaming processor throughput improvement

Explorer

Earlier it was 2, and when I increase this value the throughput does improve. But the standard Hive API can insert this many rows in a single transaction, so why does NiFi need more transactions?


Re: NiFi -- PutHiveStreaming processor throughput improvement

@Opao E

I'm not sure, but it could be due to the interaction between NiFi and the Hive client libraries.

Re: NiFi -- PutHiveStreaming processor throughput improvement

Explorer

In theory, Hive Streaming should run as fast as you can write to HDFS, with the overhead coming from committing transactions. So to improve performance, reduce the number of transactions happening against a table. Here are a few knobs to turn.
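The effect of batching can be sketched with a toy throughput model. All the numbers here are illustrative assumptions, not NiFi or Hive measurements; the point is only that a fixed per-commit pause dominates when each transaction carries few rows:

```python
# Toy model: rows/s as a function of records per transaction, assuming a
# fixed streaming write rate and a fixed pause per transaction commit.
# Both constants are made-up illustrative values, not measured figures.
def rows_per_second(records_per_txn, write_rate=100_000, commit_overhead_s=0.5):
    write_time_s = records_per_txn / write_rate  # time spent writing rows
    return records_per_txn / (write_time_s + commit_overhead_s)

# A few hundred rows per transaction: commit overhead dominates.
print(rows_per_second(500))      # ~990 rows/s
# Tens of thousands of rows per transaction: the overhead is amortized away.
print(rows_per_second(50_000))   # 50000.0 rows/s
```

This matches the behavior reported above: raising records per transaction (and, within limits, transactions per batch) spreads the commit cost over more rows.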

1) Check the NiFi version. There was a bug fix in NiFi 1.2 / HDF 3.0, and the processor gained a new configuration property to set records per transaction.

2) Set the number of records per transaction high to improve throughput.

3) Increase the number of transactions per batch, but not too high. If the data stream does not have enough data to quickly use up the transactions in the batch, they will be created and time out needlessly. As you are streaming, use the SHOW TRANSACTIONS command to monitor whether transactions are timing out. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTransactio...

4) On the Hive side, increase the number of threads doing compaction (hive.compactor.worker.threads). Data is streamed into new files, and Hive runs minor and major compactions to merge the new data into the existing ORC files.
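For the compaction knob, the settings live in hive-site.xml on the metastore host. A minimal sketch (the thread count of 4 is just an example value, not a recommendation):

```xml
<!-- hive-site.xml: enable the compaction initiator and give it more workers.
     The value 4 is illustrative; tune it for your cluster. -->
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>4</value>
</property>
```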
