I am using NiFi's PutHiveStreaming processor with all the recommended settings, but I am getting very low performance: a few hundred rows per second. I am sending single Avro files with thousands of records each, with each insertion going through as 2 transactions. I also tried increasing the number of concurrent tasks, but that does not help either.
Can someone shed some light on how to improve performance to around a hundred thousand rows per second?
Earlier it was 2, and when I increase it the throughput improves. But the standard Hive API can write this number of rows in a single transaction, so why does it take more transactions in NiFi?
In theory, Hive Streaming should run as fast as you can write to HDFS, with the overhead occurring when transactions are committed. So to improve performance, reduce the number of transactions happening against the table. Here are a few knobs to turn.
1) Check the NiFi version. There was a bug fix in NiFi 1.2 / HDF 3.0, and the processor gained a new configuration property for the number of records per transaction.
2) Set the records-per-transaction value high to improve throughput.
3) Increase the number of transactions per batch, but not too high: if the data stream does not deliver enough data to use up the transactions in the batch quickly, they will be created and then time out needlessly. While you are streaming, use the SHOW TRANSACTIONS command to monitor whether transactions are timing out. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTransactio...
4) On the Hive side, increase the number of threads doing compaction (hive.compactor.worker.threads). Data is streamed into new delta files, and Hive runs minor and major compactions to merge the new data into the existing ORC files.
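For the monitoring suggested in point 3, the statement below is standard Hive DDL and can be run from any Hive client such as beeline while the stream is active; the exact output columns vary by Hive version.

```sql
-- Run periodically while streaming. Watch for transactions that stay
-- in OPEN state for a long time or for accumulating ABORTED entries,
-- both of which suggest the batch size is outrunning the data stream.
SHOW TRANSACTIONS;
```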
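For point 4, a hive-site.xml sketch of the compactor setting is below. The value 4 is only an example; the right number depends on how many tables receive streamed data concurrently on your cluster.

```xml
<!-- hive-site.xml: run more compactor worker threads so the delta
     files created by streaming are merged into base ORC files sooner. -->
<property>
  <name>hive.compactor.worker.threads</name>
  <value>4</value> <!-- example value; tune per cluster -->
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value> <!-- compaction must be initiated on the metastore -->
</property>
```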
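To see why knobs 1-3 matter, it helps to do the arithmetic on commit counts. This is a minimal back-of-the-envelope sketch, not NiFi code; the function name and numbers are illustrative only.

```python
def transactions_needed(total_records: int, records_per_txn: int) -> int:
    """How many Hive transactions it takes to commit total_records
    when each transaction carries at most records_per_txn records."""
    # Ceiling division: a final partial transaction still counts.
    return -(-total_records // records_per_txn)

# One stream of 100,000 records: since the per-commit overhead dominates,
# fewer, larger transactions is the main lever on throughput.
print(transactions_needed(100_000, 1_000))    # small transactions -> many commits
print(transactions_needed(100_000, 50_000))   # large transactions -> few commits
```

The same arithmetic explains the follow-up question above: the standard Hive Streaming API batches however the caller chooses, while the processor's transaction count is driven by its records-per-transaction setting.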