
Tuning PutHive3Streaming NiFi processor

Contributor

We are using the PutHive3Streaming processor to send data to Hive from NiFi. I have an issue where we are getting lots of small delta files on our busier feeds, which is causing issues with compaction etc.


I have used a series of merges in NiFi to ensure each FlowFile contains many thousands of records, but it still creates many delta files.
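(Roughly, each merge step is a MergeRecord processor configured along these lines; the values below are illustrative only, the right thresholds depend on the feed, and a Record Reader/Writer must be configured to match the data format:

MergeRecord (example settings)
    Merge Strategy               Bin-Packing Algorithm
    Minimum Number of Records    10000
    Maximum Number of Records    100000
    Max Bin Age                  5 min)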


I wondered if anyone had any advice on tuning the 'Records Per Transaction' and 'Transactions per Batch' options on the PutHive3Streaming processor. I believe this could help with my issue, but I have had mixed/confusing results from testing, and I haven't found a great deal of information on best practice.


Has anyone else had similar issues or found adjustments helpful?

1 ACCEPTED SOLUTION

Master Guru

I'm not a Hive expert, but I did author the original PutHive3Streaming processor for NiFi. My recommendation is to set Records Per Transaction greater than the number of records in a FlowFile (unless we are talking about super-huge files) and Transactions per Batch to 1. This makes the transaction semantics similar to how NiFi FlowFile sessions work (rollback, failure, success, etc.). If the number of records is huge and is causing throughput problems, try dividing that number by 100 and making Transactions per Batch 100. The product of the two values should be greater than the total number of records in the FlowFile, so you avoid the Hive Metastore overhead of requesting a large number of batches/transactions.
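To make the arithmetic concrete, here is a rough sketch of that sizing rule in plain Python (not NiFi code; the helper name and the 10,000-record cutoff are invented for the example):

# Illustrative sketch only (plain Python, not NiFi code). The helper name and
# the 10,000-record cutoff are invented for the example; the sizing rule is
# the one described above.

def suggest_settings(records_per_flowfile, small_file_cutoff=10_000):
    # Small FlowFiles: one transaction per batch covers every record,
    # mirroring NiFi session semantics (all-or-nothing per FlowFile).
    if records_per_flowfile <= small_file_cutoff:
        return records_per_flowfile + 1, 1
    # Huge FlowFiles: spread the records over 100 transactions, sizing each
    # one so that the product still exceeds the total record count.
    txns_per_batch = 100
    records_per_txn = records_per_flowfile // txns_per_batch + 1
    return records_per_txn, txns_per_batch

# Returns (Records Per Transaction, Transactions per Batch)
print(suggest_settings(5_000))      # -> (5001, 1)
print(suggest_settings(1_000_000))  # -> (10001, 100); 10001 * 100 > 1,000,000

As noted above, keeping the product above the FlowFile's record count avoids the Metastore overhead of requesting additional batches of transactions for the same FlowFile.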


2 REPLIES


Community Manager

@Griggsy, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.



Regards,

Vidya Sargur,
Community Manager

